Hi there,

I have lots of raw data in several Hive tables, and we built a workflow that joins those records together and restructures them into HBase. It was done with plain MapReduce to generate HFiles, which we then bulk-load incrementally into HBase to get the best write performance.
However, we now need to run some time-series analysis on each record in HBase, and our implementation was done in Python (pandas, scikit-learn), which would be time-consuming to reproduce in Java or Scala. I am thinking PySpark is probably the best approach, if it works. Can PySpark read directly from an HFile directory? Or can it read from HBase in parallel? I don't see many examples out there, so any help or guidance would be appreciated. Also, we are using Cloudera Hadoop, so we may lag slightly behind the latest Spark release.

Best regards,
Bin
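P.S. In case it helps frame the question, here is roughly what I was hoping would work for the "read from HBase in parallel" path: a sketch using `SparkContext.newAPIHadoopRDD` with HBase's `TableInputFormat` (one Spark partition per HBase region). The converter class names are taken from Spark's bundled `hbase_inputformat.py` example and need the spark-examples jar on the classpath; the table name and ZooKeeper quorum below are placeholders for our cluster, so please treat this as a sketch, not something I've verified end to end.

```python
# Sketch: reading an HBase table into PySpark as an RDD of (row key, result) pairs.
# Placeholders (assumptions): the ZK quorum host and the table name "my_table".

HBASE_CONF = {
    "hbase.zookeeper.quorum": "zk1.example.com",  # placeholder: your ZK quorum
    "hbase.mapreduce.inputtable": "my_table",     # placeholder: your HBase table
}

# Hadoop InputFormat plus key/value classes that TableInputFormat produces.
INPUT_FORMAT = "org.apache.hadoop.hbase.mapreduce.TableInputFormat"
KEY_CLASS = "org.apache.hadoop.hbase.io.ImmutableBytesWritable"
VALUE_CLASS = "org.apache.hadoop.hbase.client.Result"

def read_hbase(sc):
    """Return an RDD of (row key, result) pairs; TableInputFormat splits
    by region, so the read is parallel across the cluster."""
    return sc.newAPIHadoopRDD(
        INPUT_FORMAT,
        KEY_CLASS,
        VALUE_CLASS,
        # Converters from the spark-examples jar that turn the HBase
        # Writable/Result objects into plain Python strings.
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=HBASE_CONF,
    )

if __name__ == "__main__":
    from pyspark import SparkContext
    sc = SparkContext(appName="hbase-timeseries")
    rows = read_hbase(sc)
    print(rows.take(5))  # sanity check: first few (row key, result) pairs
```

If that's roughly the right approach, my follow-up question is whether the per-row pandas/scikit-learn code can then just run inside a `mapPartitions` over that RDD.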