It resorts to the following method for finding the region location:

  private RegionLocations locateRegionInMeta(TableName tableName, byte[] row,
      boolean useCache, boolean retry, int replicaId) throws IOException {

Note: useCache is true in this call path, meaning the client-side cache
would be consulted to reduce RPCs to the server hosting hbase:meta.

Cheers
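As an illustration of that cached lookup, here is a minimal sketch using
the public RegionLocator API of the HBase 1.x client (the table name and
row key below are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HRegionLocation;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.RegionLocator;
  import org.apache.hadoop.hbase.util.Bytes;

  public class LocateRow {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           RegionLocator locator =
               conn.getRegionLocator(TableName.valueOf("t1"))) {
        // The first lookup may RPC to the server hosting hbase:meta;
        // later lookups for rows in the same region should be answered
        // from the client-side location cache.
        HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("00af"));
        System.out.println("row 00af is served by " + loc.getServerName());
      }
    }
  }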
On Fri, Jul 17, 2015 at 7:41 AM, Shushant Arora <shushantaror...@gmail.com>
wrote:

> Is this map creation happening on the client side?
>
> But how does it know which RS will contain that row key in a put
> operation without first asking the hbase:meta table? Does the HBase
> client first get the ranges of keys of each RegionServer and then group
> Put objects based on RegionServers?
>
> On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Internally AsyncProcess uses a Map which is keyed by server name:
>>
>>   Map<ServerName, MultiAction<Row>> actionsByServer =
>>       new HashMap<ServerName, MultiAction<Row>>();
>>
>> Here MultiAction would group the Puts in your example which are
>> destined for the same server.
>>
>> Cheers
>>
>> On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Thanks!
>>>
>>> My key is random (hexadecimal), so a hot spot should not be created.
>>>
>>> Is there any concept of a bulk put? Say I want to raise one put
>>> request for a batch of 1000, which will hit a region server, instead
>>> of an individual put for each key.
>>>
>>> Does HTable.put(List<Put>) handle batching of Puts based on the
>>> RegionServer they will finally land on? Say in my batch there are 10
>>> Puts: 5 for RS1, 3 for RS2 and 2 for RS3. Does it handle that?
>>>
>>> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <
>>> michael_se...@hotmail.com> wrote:
>>>
>>>> You ask an interesting question…
>>>>
>>>> Let's set aside Spark and look at the overall ingestion pattern.
>>>> It's really an ingestion pattern where your input into the system
>>>> comes from a queue.
>>>>
>>>> Are the events discrete or continuous? (This is kinda important.)
>>>>
>>>> If the events are continuous then more than likely you're going to
>>>> be ingesting data where the key is somewhat sequential. If you use
>>>> put(), you end up with hot spotting, and you'll end up with regions
>>>> half full. So you would be better off batching up the data and
>>>> doing bulk imports.
>>>>
>>>> If the events are discrete, then you'll want to use put(), because
>>>> the odds are you will not be using a sequential key. (You could,
>>>> but I'd suggest that you rethink your primary key.)
>>>>
>>>> Depending on the rate of ingestion, you may want to do a manual
>>>> flush. (It depends on the velocity of the data to be ingested and
>>>> your use case.) (Remember what caching occurs and where when
>>>> dealing with HBase.)
>>>>
>>>> A third option… Depending on how you use the data, you may want to
>>>> avoid storing the data in HBase, and only use HBase as an index to
>>>> where you store the data files, for quick access. Again, it depends
>>>> on your data ingestion flow and how you intend to use the data.
>>>>
>>>> So really this is less a Spark issue than an HBase issue when it
>>>> comes to design.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <
>>>> shushantaror...@gmail.com> wrote:
>>>> >
>>>> > Hi
>>>> >
>>>> > I have a requirement of writing to an HBase table from a Spark
>>>> > streaming app after some processing.
>>>> > Is the HBase put operation the only way of writing to HBase, or
>>>> > is there any specialised connector or RDD of Spark for HBase
>>>> > writes?
>>>> >
>>>> > Should bulk load to HBase from a streaming app be avoided if the
>>>> > output of each batch interval is just a few MBs?
>>>> >
>>>> > Thanks
>>>>
>>>> The opinions expressed here are mine; while they may reflect a
>>>> cognitive thought, that is purely accidental.
>>>> Use at your own risk.
>>>> Michael Segel
>>>> michael_segel (AT) hotmail.com
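To make the grouping described above concrete, here is a sketch that
buckets a batch of Puts by the server hosting each row's region, in the
spirit of the actionsByServer map quoted in the thread. It is an
illustration of the idea, not HBase's actual AsyncProcess code; this is
the grouping that HTable.put(List<Put>) performs for you internally.

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import org.apache.hadoop.hbase.ServerName;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.client.RegionLocator;

  public class GroupPutsByServer {
    // Bucket each Put under the server hosting its row's region.
    static Map<ServerName, List<Put>> groupByServer(
        RegionLocator locator, List<Put> puts) throws IOException {
      Map<ServerName, List<Put>> actionsByServer = new HashMap<>();
      for (Put put : puts) {
        ServerName server =
            locator.getRegionLocation(put.getRow()).getServerName();
        actionsByServer.computeIfAbsent(server, s -> new ArrayList<>())
            .add(put);
      }
      return actionsByServer;
    }
  }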
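On the original Spark streaming question, a common pattern with the 1.x
client is to write each partition of a micro-batch through a
BufferedMutator, which buffers Puts client-side and ships them to the
RegionServers in batches. A sketch (the Spark foreachPartition plumbing
is omitted, and the table and column names are made up):

  import java.util.Iterator;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.BufferedMutator;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class PartitionWriter {
    // Invoked per partition, e.g. from rdd.foreachPartition(...) in the
    // streaming job; each element is assumed to be a [rowKey, value] pair.
    static void writePartition(Iterator<String[]> rows) throws Exception {
      try (Connection conn =
               ConnectionFactory.createConnection(HBaseConfiguration.create());
           BufferedMutator mutator =
               conn.getBufferedMutator(TableName.valueOf("events"))) {
        while (rows.hasNext()) {
          String[] kv = rows.next();
          Put put = new Put(Bytes.toBytes(kv[0]));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"),
              Bytes.toBytes(kv[1]));
          mutator.mutate(put); // buffered, flushed in batches
        }
      } // close() flushes any remaining buffered Puts
    }
  }

In practice the Connection is expensive to create and would be shared per
executor rather than opened per partition; it is opened inline here only
to keep the sketch self-contained.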