Have you looked at Phoenix? https://phoenix.apache.org/joins.html

On Fri, Sep 29, 2017 at 3:25 AM, wenxing zheng <[email protected]> wrote:

> Dear all,
>
> I have 3 big HBase tables, which all have millions of rows(rows are synced
> from MySQL DB via Bin log) and for each HBase table, we have an external
> table on Hive correspondingly with the storage by
> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. The advantage is that
> we can always keep sync up with the production DB and provides random
> access by key.
>
> Now our business needs to do some analysis on those tables with Join query.
> What's the best practice to make it?
>
> From my experiment, I found that with the Spark SQL on HBase or Hive, the
> job ran very slowly and will saturate the network bandwidth. But it works
> very well for the Hive SQL directly against Hive from HDFS files(make a
> copy of the data to HDFS files).
>
> Appreciated for any advice on what would be the problem here? and the way
> to optimize the job.
> Regards, Wenxing
>
