Dear all,

I have three large HBase tables, each with millions of rows (the rows are
synced from a MySQL DB via the binlog). For each HBase table, we have a
corresponding external table in Hive, stored by
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. The advantage is that
the tables always stay in sync with the production DB and provide random
access by key.

Now our business needs to run some analysis on those tables using join
queries. What is the best practice for doing this?

From my experiments, I found that with Spark SQL on HBase or Hive, the job
runs very slowly and saturates the network bandwidth. However, Hive SQL
works very well when run directly against Hive tables backed by HDFS files
(after making a copy of the data to HDFS files).

I would appreciate any advice on what the problem might be here, and on how
to optimize the job.
Regards, Wenxing
