Thanks! My key is random (hexadecimal), so a hot spot should not be created.
Is there any concept of a bulk put? Say I want to issue one put request for a batch of 1000 rows, which will hit a region server, instead of an individual put for each key. Does HTable.put(List<Put>) handle batching of puts based on the region server where each will finally land? Say in my batch there are 10 puts: 5 for RS1, 3 for RS2, and 2 for RS3. Does it handle that?

On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <michael_se...@hotmail.com> wrote:

> You ask an interesting question…
>
> Let's set aside Spark and look at the overall ingestion pattern.
>
> It's really an ingestion pattern where your input into the system is from
> a queue.
>
> Are the events discrete or continuous? (This is kinda important.)
>
> If the events are continuous, then more than likely you're going to be
> ingesting data where the key is somewhat sequential. If you use put(), you
> end up with hot spotting, and you'll end up with regions half full.
> So you would be better off batching up the data and doing bulk imports.
>
> If the events are discrete, then you'll want to use put(), because the odds
> are you will not be using a sequential key. (You could, but I'd suggest
> that you rethink your primary key.)
>
> Depending on the rate of ingestion, you may want to do a manual flush. (It
> depends on the velocity of data to be ingested and your use case.)
> (Remember what caching occurs, and where, when dealing with HBase.)
>
> A third option… Depending on how you use the data, you may want to avoid
> storing the data in HBase, and only use HBase as an index to where you
> store the data files for quick access. Again, it depends on your data
> ingestion flow and how you intend to use the data.
>
> So really this is less a Spark issue than an HBase issue when it comes to
> design.
>
> HTH
>
> -Mike
>
> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
> >
> > Hi
> >
> > I have a requirement of writing to an HBase table from a Spark streaming app
> > after some processing.
> > Is the HBase put operation the only way of writing to HBase, or is there any
> > specialised connector or Spark RDD for HBase writes?
> >
> > Should bulk load to HBase from a streaming app be avoided if the output of
> > each batch interval is just a few MBs?
> >
> > Thanks
>
> The opinions expressed here are mine; while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
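To answer the batching question directly: yes, the HBase client groups the puts in HTable.put(List<Put>) by destination region (and hence region server) and sends one multi-operation RPC per server, so you do not need to split the list yourself. The sketch below illustrates just the grouping step with plain Java collections so it runs standalone; locateRegionServer is a hypothetical stand-in for the real lookup the client performs against the cached hbase:meta region locations, and the toy key-to-server scheme is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchPutSketch {

    // Hypothetical lookup: map a row key to the region server hosting its
    // region. The real client resolves this from cached hbase:meta entries.
    static String locateRegionServer(String rowKey) {
        char c = rowKey.charAt(0); // toy scheme: first hex digit picks the server
        if (c <= '4') return "RS1";
        if (c <= '9') return "RS2";
        return "RS3";
    }

    // Group row keys (standing in for Put objects) by region server,
    // mirroring what the client does internally for put(List<Put>):
    // one batch per server, one RPC per batch.
    static Map<String, List<String>> groupByServer(List<String> rowKeys) {
        Map<String, List<String>> batches = new HashMap<>();
        for (String key : rowKeys) {
            batches.computeIfAbsent(locateRegionServer(key), k -> new ArrayList<>())
                   .add(key);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("0a", "3f", "7b", "9c", "e1", "f2");
        // Each map entry corresponds to one multi-put RPC in the real client.
        System.out.println(groupByServer(keys));
    }
}
```

Note that in this flow a slow or failed region server only delays its own sub-batch; the client retries the failed operations and reports any permanent failures back per Put.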