You ask an interesting question… 

Let's set aside Spark and look at the overall ingestion pattern.

It's really an ingestion pattern where your input into the system comes from a
queue.

Are the events discrete or continuous? (This is kinda important.) 

If the events are continuous, then more than likely you're going to be ingesting
data where the key is somewhat sequential. If you use put(), you end up with
hot-spotting: every write lands on the region holding the tail of the key space,
and as those regions split you're left with regions that stay half full.
So you would be better off batching up the data and doing bulk imports, as in
the sketch below.
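
A minimal sketch of the bulk-import path, assuming the HBase 1.x client API and
Scala. The table name "events", column family "cf", qualifier "v", the RDD name
batchRdd, and the staging path are all hypothetical:

    import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
    import org.apache.hadoop.hbase.client.ConnectionFactory
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapreduce.Job

    val hConf = HBaseConfiguration.create()
    val tableName = TableName.valueOf("events")        // hypothetical table
    val conn = ConnectionFactory.createConnection(hConf)
    val job = Job.getInstance(hConf)
    HFileOutputFormat2.configureIncrementalLoad(
      job, conn.getTable(tableName), conn.getRegionLocator(tableName))

    // HFileOutputFormat2 requires rows in key order; sorting the string keys
    // works when string order matches byte order (e.g. ASCII keys).
    val hfileRdd = batchRdd
      .sortByKey()
      .map { case (key, value) =>
        (new ImmutableBytesWritable(Bytes.toBytes(key)),
         new KeyValue(Bytes.toBytes(key), Bytes.toBytes("cf"),
                      Bytes.toBytes("v"), Bytes.toBytes(value)))
      }

    hfileRdd.saveAsNewAPIHadoopFile(
      "/tmp/hfile-staging",                            // hypothetical path
      classOf[ImmutableBytesWritable],
      classOf[KeyValue],
      classOf[HFileOutputFormat2],
      job.getConfiguration)
    conn.close()

    // Then hand the staged HFiles to the region servers, e.g. with the
    // completebulkload tool (LoadIncrementalHFiles).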

If the events are discrete, then you'll want to use put(), because the odds are
you will not be using a sequential key. (You could, but then I'd suggest you
rethink your row key.) A sketch of that path follows.
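
The usual shape of that from a streaming app, again assuming the HBase 1.x
client API in Scala, with a DStream of (rowKey, value) pairs called stream
(all names hypothetical):

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        // Open the connection on the executor; HBase connections don't
        // serialize, so don't create them on the driver.
        val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("events"))
        try {
          iter.foreach { case (key, value) =>
            val p = new Put(Bytes.toBytes(key))
            p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"),
                        Bytes.toBytes(value))
            table.put(p)
          }
        } finally {
          table.close()
          conn.close()
        }
      }
    }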

Depending on the velocity of data to be ingested and your use case, you may
want to do a manual flush rather than sending one RPC per put.
(Remember what caching occurs, and where, when dealing with HBase: the client
can buffer mutations and ship them in batches.) A sketch follows.
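
One way to do that, assuming the HBase 1.x BufferedMutator API; this would
replace the per-record table.put() loop in the sketch above:

    import org.apache.hadoop.hbase.client.BufferedMutatorParams

    val params = new BufferedMutatorParams(TableName.valueOf("events"))
      .writeBufferSize(4 * 1024 * 1024)  // buffer ~4 MB client-side (tune this)
    val mutator = conn.getBufferedMutator(params)
    try {
      iter.foreach { case (key, value) =>
        val p = new Put(Bytes.toBytes(key))
        p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(value))
        mutator.mutate(p)                // buffered, not yet sent to the server
      }
      mutator.flush()                    // manual flush at a point you choose
    } finally {
      mutator.close()                    // close() flushes anything left over
    }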

A third option… Depending on how you use the data, you may want to avoid
storing the payload in HBase at all, and use HBase only as an index to where
you store the data files for quick access. Again, it depends on your data
ingestion flow and how you intend to use the data.
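
As a rough illustration of the index-only idea (the column family, qualifiers,
and file path here are all invented):

    // The payload lives in a file on HDFS; HBase stores only a pointer to it.
    val p = new Put(Bytes.toBytes(eventKey))
    p.addColumn(Bytes.toBytes("idx"), Bytes.toBytes("path"),
                Bytes.toBytes("hdfs://namenode/data/2015/07/15/batch-0042.avro"))
    p.addColumn(Bytes.toBytes("idx"), Bytes.toBytes("offset"),
                Bytes.toBytes(12345L))
    table.put(p)
    // Readers do a point get on the key, then seek into the file it names.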

So really this is less a Spark issue than an HBase issue when it comes to
design.

HTH

-Mike
> On Jul 15, 2015, at 11:46 AM, Shushant Arora <shushantaror...@gmail.com> 
> wrote:
> 
> Hi
> 
> I have a requirement of writing to an HBase table from a Spark streaming app
> after some processing.
> Is the HBase put operation the only way of writing to HBase, or is there any
> specialised connector or RDD in Spark for HBase writes?
> 
> Should bulk load to HBase from a streaming app be avoided if the output of
> each batch interval is just a few MBs?
> 
> Thanks
> 


