Have you considered running Hive/Spark over snapshots of your HBase tables?
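
A rough sketch of the Hive side, assuming your Hive build supports reading
HBase snapshots through the storage handler. The table names (orders_hbase,
orders_orc), the snapshot name, and the restore directory are placeholders
made up for illustration. The snapshot itself would be taken beforehand in
the HBase shell with: snapshot 'orders', 'orders_snapshot'

```sql
-- Placeholder values: replace with your own snapshot name and a scratch
-- directory on HDFS that the Hive user can write to.
SET hive.hbase.snapshot.name=orders_snapshot;
SET hive.hbase.snapshot.restoredir=/tmp/hive_hbase_restore;

-- With the snapshot set, reads of the HBase-backed external table should go
-- to the snapshot's HFiles on HDFS rather than through the region servers.
-- Staging into a plain ORC table keeps the later join off HBase entirely.
CREATE TABLE orders_orc STORED AS ORC AS
SELECT * FROM orders_hbase;
```

The snapshot name is a session-level setting, so each of the three tables
would probably need to be staged this way on its own before joining the
copies.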

If you're seeing network saturation over HBase but not HDFS, that makes me
think data locality is not being honored. That might be worth investigating
as well.

On Fri, Sep 29, 2017 at 3:26 AM wenxing zheng <wenxing.zh...@gmail.com>
wrote:

> Dear all,
>
> I have 3 big HBase tables, each with millions of rows (the rows are synced
> from a MySQL DB via the binlog). For each HBase table, we have a
> corresponding external table in Hive, stored by
> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. The advantage is that
> we always stay in sync with the production DB and get random access by
> key.
>
> Now our business needs to run some analysis on those tables with join
> queries. What's the best practice for doing this?
>
> In my experiments, I found that with Spark SQL on HBase or Hive, the job
> ran very slowly and saturated the network bandwidth. But Hive SQL works
> very well when run directly against HDFS files (after making a copy of
> the data as HDFS files).
>
> I would appreciate any advice on what the problem might be here and on
> how to optimize the job.
> Regards, Wenxing
>
