Re: Using Scans in parallel

Bryan Keller Sun, 09 Oct 2011 10:59:54 -0700

I was not able to get consistent results using multiple scanners in parallel on 
a table. I implemented a counter test that used 8 scanners in parallel on a 
table with 2m rows with 2k+ columns each, and the results were not consistent. 
There were no errors thrown, but the count was off by as much as 2%. Using a 
single thread gave the same (correct) result every run.


I tried various approaches, such as creating an HTable and opening a connection 
per thread, but I was not able to get stable results. I would do some testing 
before using parallel scanners as described here.


On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:

> That's part of it, the other part is to get the region demarcations.
> You can also just get the smallest and largest key of the table and pick 
> other demarcations for your scans. Then your individual scans will likely 
> cover multiple regions and regionservers.
> 
> 
> Your threading model depends on your needs. If you interested in lowest 
> latency you want to keep your regionservers busy for each query.
> What exactly that means depends on your setup. Maybe you split up the overall 
> scan so that no more than N scans are active at any regionserver.
> 
> If you're more interested in overall predictability, you might not want 
> parallelize each scan too much.
> 
> 
> 
> ----- Original Message -----
> From: Sam Seigal <selek...@yahoo.com>
> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> Cc: "hbase-u...@hadoop.apache.org" <hbase-u...@hadoop.apache.org>
> Sent: Wednesday, October 5, 2011 6:18 PM
> Subject: Re: Using Scans in parallel
> 
> So the whole point of getting the region locations is to ensure that
> there is one thread per region server ?
> 
> 
> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
>> Hi Sam,
>> 
>> 
>> There were some attempts to build this in. In the end I think the exact 
>> patterns are different based on what one is trying to achieve.
>> Currently what you can do is getting all the region locations 
>> (HTable.getRegionLocations). From the HRegionInfos you can
>> get the regions start and end keys.
>> Now you can issue parallel scan for as many regions as you want (by create a 
>> Scan object with start and row set to the region's
>> start and end key).
>> You probably want to group the regions by regionserver and have one thread 
>> per region server, or something.
>> 
>> 
>> -- Lars
>> ________________________________
>> From: Sam Seigal <selek...@yahoo.com>
>> To: hbase-u...@hadoop.apache.org
>> Sent: Wednesday, October 5, 2011 4:29 PM
>> Subject: Using Scans in parallel
>> 
>> Hi ,
>> 
>> Is there a known way to be able to do Scan's in parallel (in different
>> threads even) and then sort/combine the output ?
>> 
>> For a row key like:
>> 
>> prefix-event_type-event_id
>> prefix-event_type-event_id
>> 
>> I want to declare two scan objects (for say event_id_type foo)
>> 
>> Scan 1 =>  0-foo
>> Scan 2 =>  1-foo
>> 
>> execute the scans in parallel (maybe even in different threads) and
>> then merge the results ?
>> 
>> Thank you,
>> 
>> Sam
>> 
>

Re: Using Scans in parallel

Reply via email to