Interesting. Hey Bryan, can you please share some stats: how many regions, how many region servers, and the time taken by the serial scanner versus the 8 parallel scanners?
Himanshu

On Sun, Oct 9, 2011 at 6:49 PM, Bryan Keller <brya...@gmail.com> wrote:
> This is 100% reproducible for me, so I doubt it is related to random number
> generation.
>
> On Oct 9, 2011, at 2:53 PM, lars hofhansl wrote:
>
>> How frequently does this happen?
>> I did notice a while ago in the code that scanner ids are drawn just from a
>> Random number generator.
>>
>> So in theory it would be possible that multiple concurrent scans draw the
>> same scanner id.
>>
>> Since these are longs, this is astronomically unlikely, though (picking the
>> same number out of 2^64 just does not happen :) ).
>>
>> ________________________________
>> From: Bryan Keller <brya...@gmail.com>
>> To: user@hbase.apache.org
>> Sent: Sunday, October 9, 2011 2:40 PM
>> Subject: Re: Using Scans in parallel
>>
>> This is just scanning (reads). I'll need to do more testing to find a cause,
>> hopefully it is something with my test.
>>
>> On Oct 9, 2011, at 1:13 PM, lars hofhansl wrote:
>>
>>> Which version of HBase?
>>> Are there concurrent inserts? If so, do you see splits in the log files
>>> happening while you do the scanning?
>>>
>>> I am pretty sure this has nothing to do with concurrent scans.
>>>
>>> From: Bryan Keller <brya...@gmail.com>
>>> To: Bryan Keller <brya...@gmail.com>
>>> Cc: user@hbase.apache.org
>>> Sent: Sunday, October 9, 2011 11:03 AM
>>> Subject: Re: Using Scans in parallel
>>>
>>> On further thought, it seems this might be a serious issue, as two
>>> unrelated processes within an application may be scanning the same table at
>>> the same time.
>>>
>>> On Oct 9, 2011, at 10:59 AM, Bryan Keller wrote:
>>>
>>>> I was not able to get consistent results using multiple scanners in
>>>> parallel on a table. I implemented a counter test that used 8 scanners in
>>>> parallel on a table with 2m rows with 2k+ columns each, and the results
>>>> were not consistent. There were no errors thrown, but the count was off by
>>>> as much as 2%. Using a single thread gave the same (correct) result every
>>>> run.
>>>>
>>>> I tried various approaches, such as creating an HTable and opening a
>>>> connection per thread, but I was not able to get stable results. I would
>>>> do some testing before using parallel scanners as described here.
>>>>
>>>>
>>>> On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote:
>>>>
>>>>> That's part of it, the other part is to get the region demarcations.
>>>>> You can also just get the smallest and largest key of the table and pick
>>>>> other demarcations for your scans. Then your individual scans will likely
>>>>> cover multiple regions and regionservers.
>>>>>
>>>>>
>>>>> Your threading model depends on your needs. If you are interested in lowest
>>>>> latency, you want to keep your regionservers busy for each query.
>>>>> What exactly that means depends on your setup. Maybe you split up the
>>>>> overall scan so that no more than N scans are active at any regionserver.
>>>>>
>>>>> If you're more interested in overall predictability, you might not want
>>>>> to parallelize each scan too much.
>>>>>
>>>>>
>>>>>
>>>>> ----- Original Message -----
>>>>> From: Sam Seigal <selek...@yahoo.com>
>>>>> To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
>>>>> Cc: "hbase-u...@hadoop.apache.org" <hbase-u...@hadoop.apache.org>
>>>>> Sent: Wednesday, October 5, 2011 6:18 PM
>>>>> Subject: Re: Using Scans in parallel
>>>>>
>>>>> So the whole point of getting the region locations is to ensure that
>>>>> there is one thread per region server?
>>>>>
>>>>>
>>>>> On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
>>>>>> Hi Sam,
>>>>>>
>>>>>>
>>>>>> There were some attempts to build this in. In the end I think the exact
>>>>>> patterns are different based on what one is trying to achieve.
>>>>>> Currently what you can do is get all the region locations
>>>>>> (HTable.getRegionLocations). From the HRegionInfos you can
>>>>>> get the regions' start and end keys.
>>>>>> Now you can issue parallel scans for as many regions as you want (by
>>>>>> creating a Scan object with start and stop row set to the region's
>>>>>> start and end key).
>>>>>> You probably want to group the regions by regionserver and have one
>>>>>> thread per region server, or something.
>>>>>>
>>>>>>
>>>>>> -- Lars
>>>>>> ________________________________
>>>>>> From: Sam Seigal <selek...@yahoo.com>
>>>>>> To: hbase-u...@hadoop.apache.org
>>>>>> Sent: Wednesday, October 5, 2011 4:29 PM
>>>>>> Subject: Using Scans in parallel
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is there a known way to do Scans in parallel (in different
>>>>>> threads even) and then sort/combine the output?
>>>>>>
>>>>>> For a row key like:
>>>>>>
>>>>>> prefix-event_type-event_id
>>>>>> prefix-event_type-event_id
>>>>>>
>>>>>> I want to declare two scan objects (for say event_id_type foo)
>>>>>>
>>>>>> Scan 1 => 0-foo
>>>>>> Scan 2 => 1-foo
>>>>>>
>>>>>> execute the scans in parallel (maybe even in different threads) and
>>>>>> then merge the results?
>>>>>>
>>>>>> Thank you,
>>>>>>
>>>>>> Sam
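
For anyone who wants to experiment with the pattern discussed in this thread (pick demarcation keys, scan each range in its own thread, then combine the output), here is a minimal sketch. It does not use the HBase client API: a sorted in-memory map stands in for the table, and the names ParallelScanSketch, sampleTable, splitRanges, scanRange and parallelScan are made up for illustration. Against a real table, the ranges would come from the region start and end keys that Lars mentions, and each task would run a Scan bounded by a start and stop row. It also compares the serial and parallel row counts, in the spirit of Bryan's counter test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative stand-in only: a TreeMap plays the role of a sorted table.
// This is NOT the HBase client API; all names here are hypothetical.
final class ParallelScanSketch {

    // Builds a sample table whose zero-padded row keys sort lexicographically.
    static NavigableMap<String, String> sampleTable(int rows) {
        NavigableMap<String, String> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            table.put(String.format("row-%08d", i), "value-" + i);
        }
        return table;
    }

    // Picks demarcation keys so that the ranges [start, stop) are disjoint and
    // together cover the whole table. A null stop key means "to the end".
    static List<String[]> splitRanges(NavigableMap<String, String> table, int parts) {
        List<String> keys = new ArrayList<>(table.keySet());
        List<String[]> ranges = new ArrayList<>();
        int n = keys.size();
        if (n == 0) {
            return ranges;
        }
        int p = Math.max(1, Math.min(parts, n));
        for (int i = 0; i < p; i++) {
            String start = keys.get((int) ((long) i * n / p));
            String stop = (i + 1 < p) ? keys.get((int) ((long) (i + 1) * n / p)) : null;
            ranges.add(new String[] {start, stop});
        }
        return ranges;
    }

    // "Scans" one range: start key inclusive, stop key exclusive.
    static List<String> scanRange(NavigableMap<String, String> table, String start, String stop) {
        NavigableMap<String, String> view = (stop == null)
                ? table.tailMap(start, true)
                : table.subMap(start, true, stop, false);
        return new ArrayList<>(view.keySet());
    }

    // Runs one scan per range on a thread pool and concatenates the results in
    // range order. Because the ranges are sorted and disjoint, the combined
    // list is globally sorted without an extra merge step.
    static List<String> parallelScan(NavigableMap<String, String> table, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, threads));
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String[] range : splitRanges(table, threads)) {
                Callable<List<String>> task = () -> scanRange(table, range[0], range[1]);
                futures.add(pool.submit(task));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                merged.addAll(future.get());
            }
            return merged;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, String> table = sampleTable(100_000);
        int serial = scanRange(table, table.firstKey(), null).size();
        int parallel = parallelScan(table, 8).size();
        // The two counts should agree; a mismatch would point at the ranges.
        System.out.println("serial=" + serial + " parallel=" + parallel);
    }
}
```

The map is only read while the scans run, so sharing it across threads is safe in this sketch. If the ranges overlapped or left gaps, the parallel count would drift from the serial one, so comparing the two is a cheap sanity check on the demarcations before looking elsewhere for a cause.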