I was not able to get consistent results using multiple scanners in parallel on a table. I implemented a counter test that used 8 scanners in parallel on a table with 2m rows with 2k+ columns each, and the results were not consistent. There were no errors thrown, but the count was off by as much as 2%. Using a single thread gave the same (correct) result every run.
I tried various approaches, such as creating an HTable and opening a connection per thread, but I was not able to get stable results. I would do some testing before using parallel scanners as described here. On Oct 5, 2011, at 10:11 PM, lars hofhansl wrote: > That's part of it, the other part is to get the region demarcations. > You can also just get the smallest and largest key of the table and pick > other demarcations for your scans. Then your individual scans will likely > cover multiple regions and regionservers. > > > Your threading model depends on your needs. If you interested in lowest > latency you want to keep your regionservers busy for each query. > What exactly that means depends on your setup. Maybe you split up the overall > scan so that no more than N scans are active at any regionserver. > > If you're more interested in overall predictability, you might not want > parallelize each scan too much. > > > > ----- Original Message ----- > From: Sam Seigal <selek...@yahoo.com> > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com> > Cc: "hbase-u...@hadoop.apache.org" <hbase-u...@hadoop.apache.org> > Sent: Wednesday, October 5, 2011 6:18 PM > Subject: Re: Using Scans in parallel > > So the whole point of getting the region locations is to ensure that > there is one thread per region server ? > > > On Wed, Oct 5, 2011 at 4:42 PM, lars hofhansl <lhofha...@yahoo.com> wrote: >> Hi Sam, >> >> >> There were some attempts to build this in. In the end I think the exact >> patterns are different based on what one is trying to achieve. >> Currently what you can do is getting all the region locations >> (HTable.getRegionLocations). From the HRegionInfos you can >> get the regions start and end keys. >> Now you can issue parallel scan for as many regions as you want (by create a >> Scan object with start and row set to the region's >> start and end key). >> You probably want to group the regions by regionserver and have one thread >> per region server, or something. >> >> >> -- Lars >> ________________________________ >> From: Sam Seigal <selek...@yahoo.com> >> To: hbase-u...@hadoop.apache.org >> Sent: Wednesday, October 5, 2011 4:29 PM >> Subject: Using Scans in parallel >> >> Hi , >> >> Is there a known way to be able to do Scan's in parallel (in different >> threads even) and then sort/combine the output ? >> >> For a row key like: >> >> prefix-event_type-event_id >> prefix-event_type-event_id >> >> I want to declare two scan objects (for say event_id_type foo) >> >> Scan 1 => 0-foo >> Scan 2 => 1-foo >> >> execute the scans in parallel (maybe even in different threads) and >> then merge the results ? >> >> Thank you, >> >> Sam >> >