Hi there,

I have a lot of raw data in several Hive tables, and we built a workflow to
"join" those records together and restructure them into HBase. It was done
with plain MapReduce jobs that generate HFiles, which we then incrementally
bulk-load into HBase to get the best performance.

However, we now need to do some time-series analysis on each record in
HBase, and the analysis is implemented in Python (pandas, scikit-learn),
which would be pretty time-consuming to reproduce in Java or Scala.

I am thinking PySpark is probably the best approach, if it works.
Can PySpark read from an HFile directory? Or can it read from HBase in
parallel?
I don't see many examples out there, so any help or guidance would be
appreciated.
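To make the question concrete, here is an untested sketch of what I had in mind for the parallel-read case, based on SparkContext.newAPIHadoopRDD with HBase's TableInputFormat (the class names are from the HBase MapReduce API; the ZooKeeper quorum and table name are placeholders):

```python
# Untested sketch: scanning an HBase table in parallel from PySpark via
# newAPIHadoopRDD + TableInputFormat. Host and table names are placeholders.

def hbase_input_conf(zk_quorum, table):
    """Hadoop configuration for scanning one HBase table."""
    return {
        "hbase.zookeeper.quorum": zk_quorum,   # ZooKeeper ensemble for HBase
        "hbase.mapreduce.inputtable": table,   # table to scan
    }

def hbase_scan_rdd(sc, zk_quorum, table):
    """Build an RDD over HBase rows; each partition should map to an
    HBase region, so the scan itself runs in parallel."""
    return sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=hbase_input_conf(zk_quorum, table),
    )
```

The idea would then be to map each row's Result into something pandas can consume inside a mapPartitions call. Is this roughly the right direction, or is there a better-supported path?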

Also, we are using Cloudera Hadoop, so there might be a slight delay before
we can pick up the latest Spark release.

Best regards,

Bin
