Re: Using Scans in parallel

Himanshu Vashishtha Sun, 09 Oct 2011 19:04:28 -0700

Interesting.

Hey Bryan, can you please share the stats about: how many Regions, how
many Region Servers, time taken by Serial scanner and with 8 parallel
scanners.


Himanshu

On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <brya...@gmail.com> wrote:
> This is 100% reproducible for me, so I doubt it is related to random number 
> generation.
>
> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>
>> How frequently does this happen?
>> I did notice a while ago in the code that scanner ids are drawn just from a 
>> Random number generator.
>>
>> So in theory it would be possible that multiple concurrent scans draw the 
>> same scanner id.
>>
>> Since these are longs, this is astronomically unlikely, though (picking the 
>> same number out of 2^64 just does not happen :) ).
>>
>>
>>
>> ________________________________
>> From: Bryan Keller <brya...@gmail.com>
>> To: user@hbase.apache.org
>> Sent: Sunday, October 9, 2011 2:40 PM
>> Subject: Re: Using Scans in parallel
>>
>> This is just scanning (reads). I'll need to do more testing to find a cause, 
>> hopefully it is something with my test.
>>
>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>
>>> Which version of HBase?
>>> Are there concurrent inserts? If so, do you see splits in the log files 
>>> happening while you do the scanning?
>>>
>>> I am pretty sure this has nothing to do with concurrent scans.
>>>
>>> From: Bryan Keller <brya...@gmail.com>
>>> To: Bryan Keller <brya...@gmail.com>
>>> Cc: user@hbase.apache.org
>>> Sent: Sunday, October 9, 2011 11:03 AM
>>> Subject: Re: Using Scans in parallel
>>>
>>> On further thought, it seems this might be a serious issue, as two 
>>> unrelated processes within an application may be scanning the same table at 
>>> the same time.
>>>
>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>
>>>> I was not able to get consistent results using multiple scanners in 
>>>> parallel on a table. I implemented a counter test that used 8 scanners in 
>>>> parallel on a table with 2m rows with 2k+ columns each, and the results 
>>>> were not consistent. There were no errors thrown, but the count was off by 
>>>> as much as 2%. Using a single thread gave the same (correct) result every 
>>>> run.
>>>>
>>>> I tried various approaches, such as creating an HTable and opening a 
>>>> connection per thread, but I was not able to get stable results. I would 
>>>> do some testing before using parallel scanners as described here.
>>>>
>>>>
>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>
>>>>> That's part of it, the other part is to get the region demarcations.
>>>>> You can also just get the smallest and largest key of the table and pick 
>>>>> other demarcations for your scans. Then your individual scans will likely 
>>>>> cover multiple regions and regionservers.
>>>>>
>>>>>
>>>>> Your threading model depends on your needs. If you are interested in lowest 
>>>>> latency you want to keep your regionservers busy for each query.
>>>>> What exactly that means depends on your setup. Maybe you split up the 
>>>>> overall scan so that no more than N scans are active at any regionserver.
>>>>>
>>>>> If you're more interested in overall predictability, you might not want 
>>>>> to parallelize each scan too much.
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Sam Seigal <selek...@yahoo.com>
>>>>> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
>>>>> Cc: "hbase-u...@hadoop.apache.org" <hbase-u...@hadoop.apache.org>
>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>> Subject: Re: Using Scans in parallel
>>>>>
>>>>> So the whole point of getting the region locations is to ensure that
>>>>> there is one thread per region server ?
>>>>>
>>>>>
>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
>>>>>> Hi Sam,
>>>>>>
>>>>>>
>>>>>> There were some attempts to build this in. In the end I think the exact 
>>>>>> patterns are different based on what one is trying to achieve.
>>>>>> Currently what you can do is get all the region locations 
>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>> get the regions' start and end keys.
>>>>>> Now you can issue parallel scans for as many regions as you want (by 
>>>>>> creating a Scan object with start and stop row set to the region's
>>>>>> start and end key).
>>>>>> You probably want to group the regions by regionserver and have one 
>>>>>> thread per region server, or something.
>>>>>>
>>>>>>
>>>>>> -- Lars
>>>>>> ________________________________
>>>>>> From: Sam Seigal <selek...@yahoo.com>
>>>>>> To: hbase-u...@hadoop.apache.org
>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>> Subject: Using Scans in parallel
>>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> Is there a known way to do Scans in parallel (in different
>>>>>> threads even) and then sort/combine the output?
>>>>>>
>>>>>> For a row key like:
>>>>>>
>>>>>> prefix-event_type-event_id
>>>>>> prefix-event_type-event_id
>>>>>>
>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>
>>>>>> Scan 1 =>  0-foo
>>>>>> Scan 2 =>  1-foo
>>>>>>
>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>> then merge the results ?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Sam
>>>>>>
>>>>>
>>>>
>>>
>>>
>
>
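
The approach discussed in this thread (split the key space into ranges, run one scan per range in its own thread, then compare the combined count against a single serial pass) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses an in-memory TreeMap as a stand-in for a table and does not call any HBase API. The class and method names (ParallelRangeCount, buildTable, splitKeys, countRange, countInParallel) are hypothetical. The one point it tries to show is that ranges are half-open, [start, stop), so a boundary row is counted exactly once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelRangeCount {

    // Hypothetical stand-in for a table: sorted row keys mapped to a dummy value.
    static NavigableMap<String, Integer> buildTable(int rows) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            // Zero-padded so lexicographic order matches numeric order.
            table.put(String.format("row-%07d", i), i);
        }
        return table;
    }

    // Picks (parts - 1) boundary keys from a non-empty table. Consecutive
    // boundaries define half-open ranges [start, stop).
    static List<String> splitKeys(NavigableMap<String, Integer> table, int parts) {
        List<String> keys = new ArrayList<>(table.keySet());
        List<String> splits = new ArrayList<>();
        for (int p = 1; p < parts; p++) {
            splits.add(keys.get((int) ((long) keys.size() * p / parts)));
        }
        return splits;
    }

    // Counts rows in [start, stop). A null bound means "unbounded" on that side.
    static long countRange(NavigableMap<String, Integer> table, String start, String stop) {
        NavigableMap<String, Integer> view = table;
        if (start != null && stop != null) {
            view = table.subMap(start, true, stop, false);
        } else if (start != null) {
            view = table.tailMap(start, true);
        } else if (stop != null) {
            view = table.headMap(stop, false);
        }
        long n = 0;
        for (String ignored : view.keySet()) {
            n++;
        }
        return n;
    }

    // Runs one range count per thread and sums the partial counts.
    // The map is only read here, never modified, so sharing it is safe.
    static long countInParallel(NavigableMap<String, Integer> table, int threads) throws Exception {
        List<String> splits = splitKeys(table, threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i <= splits.size(); i++) {
                final String start = (i == 0) ? null : splits.get(i - 1);
                final String stop = (i == splits.size()) ? null : splits.get(i);
                Callable<Long> task = () -> countRange(table, start, stop);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : parts) {
                total += f.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, Integer> table = buildTable(10_000);
        long serial = countRange(table, null, null);
        long parallel = countInParallel(table, 8);
        System.out.println("serial=" + serial + " parallel=" + parallel);
    }
}
```

With ranges partitioned this way, the serial and parallel counts should agree on every run. If a real test shows them diverging, as reported above, overlapping or gapped range boundaries are one thing to rule out before suspecting the scanners themselves.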
