Hi,

I want to use Spark with HBase and I'm confused about how to ingest my data
using HBase's HFileOutputFormat. The documentation recommends calling
configureIncrementalLoad, which does the following:

   - Inspects the table to configure a total order partitioner
   - Uploads the partitions file to the cluster and adds it to the
   DistributedCache
   - Sets the number of reduce tasks to match the current number of regions
   - Sets the output key/value class to match HFileOutputFormat2's
   requirements
   - Sets the reducer up to perform the appropriate sorting (either
   KeyValueSortReducer or PutSortReducer)
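
For reference, here's roughly how a plain MapReduce job would wire that up
(just a sketch; the exact signature varies with the HBase version, and
"my_table" is a placeholder):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.mapreduce.Job

    // One call covers all five steps above (0.98-era API; newer
    // HBase versions take a Table and a RegionLocator instead).
    val job = Job.getInstance(HBaseConfiguration.create(), "bulk-load")
    val table = new HTable(job.getConfiguration, "my_table") // placeholder
    HFileOutputFormat2.configureIncrementalLoad(job, table)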

But in Spark, it seems I have to do the sorting and partitioning myself, right?
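
Here's a minimal sketch of what I think that would look like. The table name,
staging directory, and input RDD shape are all placeholders, and I'm assuming
a 0.98-era HBase client API:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue}
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // rows: (rowkey, (family, qualifier, value)) -- placeholder input shape
    def bulkLoad(rows: RDD[(Array[Byte], (Array[Byte], Array[Byte], Array[Byte]))]): Unit = {
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, "my_table")   // placeholder table name
      val stagingDir = "/tmp/hbase-staging"      // placeholder HFile output dir

      // Reuse configureIncrementalLoad for the output key/value classes and
      // per-family compression/bloom settings; its total order partitioner
      // and reducer settings don't apply since Spark does the shuffle.
      val job = Job.getInstance(conf)
      job.setMapOutputKeyClass(classOf[ImmutableBytesWritable])
      job.setMapOutputValueClass(classOf[KeyValue])
      HFileOutputFormat2.configureIncrementalLoad(job, table)

      // HFileOutputFormat2 needs keys in sorted order within each output
      // file. Sort the raw byte arrays rather than Writables, so the shuffle
      // never has to serialize non-serializable Hadoop types.
      implicit val byteOrdering: Ordering[Array[Byte]] =
        new Ordering[Array[Byte]] {
          def compare(a: Array[Byte], b: Array[Byte]): Int = Bytes.compareTo(a, b)
        }

      rows.sortByKey()
        .map { case (row, (family, qualifier, value)) =>
          (new ImmutableBytesWritable(row), new KeyValue(row, family, qualifier, value))
        }
        .saveAsNewAPIHadoopFile(
          stagingDir,
          classOf[ImmutableBytesWritable],
          classOf[KeyValue],
          classOf[HFileOutputFormat2],
          job.getConfiguration)

      // Hand the generated HFiles off to the region servers.
      new LoadIncrementalHFiles(conf).doBulkLoad(new Path(stagingDir), table)
    }

One thing this doesn't reproduce is the region-aligned total order
partitioner: sortByKey range-partitions by sampling, so some HFiles may
straddle region boundaries and LoadIncrementalHFiles will split them at load
time. I suppose a custom Partitioner built from table.getStartKeys() would
line the files up with the regions exactly.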

Is the sketch above the proper way to do it? Is there a faster way to ingest
data into HBase from Spark?

Cheers,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
