Solved, see:
http://stackoverflow.com/questions/38470114/how-to-connect-hbase-and-spark-using-python/38575095
So it appears to be possible to use HBase's new hbase-spark module, provided
you follow this pattern:
https://hbase.apache.org/book.html#_sparksql_dataframes
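
For reference, a minimal sketch of what that pattern might look like from
PySpark. The table name and column mapping are made-up placeholders; the
option names follow the example in the HBase book section linked above:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="hbase-spark-sql-sketch")
    sqlContext = SQLContext(sc)

    # Map HBase cells onto DataFrame columns: the row key plus two
    # qualifiers from column family "c" (table name and mapping are
    # hypothetical).
    df = (sqlContext.read
          .format("org.apache.hadoop.hbase.spark")
          .option("hbase.table", "t1")
          .option("hbase.columns.mapping",
                  "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b")
          .load())

    # Once loaded, the table can be queried with plain Spark SQL.
    df.registerTempTable("t1")
    sqlContext.sql("SELECT KEY_FIELD, A_FIELD FROM t1").show()

This assumes the hbase-spark jar (and an hbase-site.xml pointing at the
cluster) is on the classpath, e.g. passed via spark-submit --jars.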
Unfortunately, when I run my example from PySpark, I get the following
exception:
> py4j.protocol.Py4JJavaError: An error occurred while
I'd like to know whether there's any way to query HBase with Spark SQL via
the PySpark interface. See my question on SO:
http://stackoverflow.com/questions/38470114/how-to-connect-hbase-and-spark-using-python
The new HBase-Spark module in HBase, which introduces the
HBaseContext/JavaHBaseContext, only seems to be usable from Scala and Java so
far, not from Python.
After deserialization, something seems to be wrong with my pandas DataFrames.
It looks like the timezone information is lost, and subsequent errors ensue.
Serializing and deserializing a timezone-aware DataFrame on its own works just
fine, so it must be Spark that somehow changes the data.
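
A minimal sketch of that comparison (the data, timezone, and app name are
invented): the first round trip is plain pickle, the second goes through a
Spark RDD, which is where the timezone is said to get lost:

    import pickle
    import pandas as pd
    from pyspark import SparkContext

    # A timezone-aware DataFrame (contents made up).
    idx = pd.date_range("2016-01-01", periods=3, freq="H",
                        tz="Europe/Amsterdam")
    df = pd.DataFrame({"value": [1.0, 2.0, 3.0]}, index=idx)

    # A plain pickle round trip keeps the timezone ...
    restored = pickle.loads(pickle.dumps(df))
    print(restored.index.tz)  # Europe/Amsterdam

    # ... while a round trip through Spark is where the timezone
    # reportedly goes missing:
    sc = SparkContext(appName="tz-roundtrip-check")
    back = sc.parallelize([df]).first()
    print(back.index.tz)  # reported: None instead of Europe/Amsterdam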
My program runs a number of operations on existing RDDs. I want several of the
resulting RDDs to match the partitioning of an existing RDD, since they will
all be joined together in the end. Do I understand correctly that I would
benefit from using a custom partitioner applied to all of these RDDs?
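
A minimal sketch of what that could look like in PySpark (the names, data, and
partition count are made up): every RDD that feeds the final join is
partitioned with the same partition function and partition count, so the join
can avoid re-shuffling co-partitioned inputs:

    from pyspark import SparkContext

    sc = SparkContext(appName="copartition-sketch")
    NUM_PARTS = 8

    def by_key(key):
        # Shared custom partition function: PySpark only treats two RDDs
        # as co-partitioned when both the partition count and the
        # partition function match.
        return hash(key) % NUM_PARTS

    existing = (sc.parallelize([(k, k * k) for k in range(100)])
                .partitionBy(NUM_PARTS, by_key))

    # mapValues preserves the partitioner, so derived RDDs stay aligned
    # with the existing one.
    derived = existing.mapValues(lambda v: v + 1)

    # An RDD produced elsewhere is explicitly repartitioned the same way.
    other = (sc.parallelize([(k, -k) for k in range(100)])
             .partitionBy(NUM_PARTS, by_key))

    # With matching partitioners, the join does not need a full shuffle.
    joined = derived.join(other, numPartitions=NUM_PARTS)
    print(joined.take(5))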