Internally, AsyncProcess uses a Map keyed by server name:

    Map<ServerName, MultiAction<Row>> actionsByServer =
        new HashMap<ServerName, MultiAction<Row>>();

Here MultiAction groups the Puts in your example that are destined for
the same server.
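
In rough form, the grouping looks like this (a simplified sketch, not the
actual AsyncProcess code; locateServer() here is a stand-in for the client's
region lookup against meta):

    Map<ServerName, List<Put>> putsByServer =
        new HashMap<ServerName, List<Put>>();
    for (Put put : puts) {
        // resolve the row key to the server hosting its region
        ServerName server = locateServer(put.getRow());
        List<Put> group = putsByServer.get(server);
        if (group == null) {
            group = new ArrayList<Put>();
            putsByServer.put(server, group);
        }
        group.add(put);
    }
    // result: one batched RPC per server instead of one RPC per Put

So a single HTable.put(List<Put>) call with puts bound for several servers
goes out as a handful of server-level batches.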

Cheers

On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Thanks!
>
> My key is random (hexadecimal), so a hot spot should not be created.
>
> Is there any concept of a bulk put? Say I want to issue one put request
> for a batch of 1000 which will hit a region server, instead of an
> individual put for each key.
>
>
> Does HTable.put(List<Put>) handle batching of puts based on the region
> server where they will finally land? Say in my batch there are 10 puts:
> 5 for RS1, 3 for RS2, and 2 for RS3. Does it handle that?
>
> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <michael_se...@hotmail.com>
> wrote:
>
>> You ask an interesting question…
>>
>> Let's set aside Spark and look at the overall ingestion pattern.
>>
>> It's really an ingestion pattern where your input into the system comes
>> from a queue.
>>
>> Are the events discrete or continuous? (This is kinda important.)
>>
>> If the events are continuous, then more than likely you're going to be
>> ingesting data where the key is somewhat sequential. If you use put(), you
>> end up with hot spotting, and you'll end up with regions half full.
>> So you would be better off batching up the data and doing bulk imports.
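>>
>> (If you do want to stick with put() for sequential keys, one common trick
>> is to salt the row key; a minimal sketch, where SALT_BUCKETS is a
>> hypothetical pre-split bucket count:)
>>
>>     private static final int SALT_BUCKETS = 16; // assumes 16 pre-split regions
>>
>>     // prefix a one-byte bucket so sequential keys spread across regions
>>     static byte[] salt(byte[] key) {
>>         int bucket = (java.util.Arrays.hashCode(key) & 0x7fffffff) % SALT_BUCKETS;
>>         return org.apache.hadoop.hbase.util.Bytes.add(
>>             new byte[] { (byte) bucket }, key);
>>     }
>>
>> (The trade-off: reads then have to fan out across all the buckets.)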
>>
>> If the events are discrete, then you'll want to use put(), because the
>> odds are you will not be using a sequential key. (You could, but I'd
>> suggest that you rethink your primary key.)
>>
>> Depending on the rate of ingestion, you may want to do a manual flush.
>> (It depends on the velocity of the data to be ingested and your use case.)
>> (Remember what caching occurs, and where, when dealing with HBase.)
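>>
>> (For concreteness, a minimal sketch of a manual flush using the 1.x
>> client's BufferedMutator; the table name "events" and the batch collection
>> are placeholders:)
>>
>>     // buffer puts client-side, then flush once per micro-batch
>>     Configuration conf = HBaseConfiguration.create();
>>     try (Connection conn = ConnectionFactory.createConnection(conf);
>>          BufferedMutator mutator =
>>              conn.getBufferedMutator(TableName.valueOf("events"))) {
>>         for (Put put : batch) {   // 'batch' stands in for your puts
>>             mutator.mutate(put);  // buffered locally, not sent yet
>>         }
>>         mutator.flush();          // one round of batched RPCs per server
>>     }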
>>
>> A third option… Depending on how you use the data, you may want to avoid
>> storing the data in HBase, and only use HBase as an index to where you
>> store the data files for quick access. Again, it depends on your data
>> ingestion flow and how you intend to use the data.
>>
>> So really this is less a Spark issue than an HBase issue when it comes to
>> design.
>>
>> HTH
>>
>> -Mike
>>
>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <shushantaror...@gmail.com>
>> > wrote:
>> >
>> > Hi
>> >
>> > I have a requirement to write to an HBase table from a Spark streaming
>> > app after some processing.
>> > Is the HBase put operation the only way of writing to HBase, or is there
>> > a specialised connector or Spark RDD for HBase writes?
>> >
>> > Should bulk load to HBase from a streaming app be avoided if the output
>> > of each batch interval is just a few MBs?
>> >
>> > Thanks
>> >
>>
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com
