Dear all,

I have three large HBase tables, each with millions of rows (the rows are synced from a MySQL DB via the binlog). For each HBase table we have a corresponding external table in Hive, stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. The advantage of this setup is that the data always stays in sync with the production DB and supports random access by key.
Now our business needs to run some analysis on those tables using join queries. What is the best practice for doing this? From my experiments, I found that with Spark SQL against HBase or against the Hive external tables, the job runs very slowly and saturates the network bandwidth. However, Hive SQL works very well when run against Hive tables backed by HDFS files (after making a copy of the data as HDFS files).

I would appreciate any advice on what the problem might be here, and on how to optimize the job.

Regards,
Wenxing