Re: Scan vs Parallel scan.

Guillermo Ortiz Fri, 12 Sep 2014 07:06:40 -0700

For a partial scan, I guess that I call the RS to get data, and it starts
looking in the store files and collecting the data. (It doesn't write to
the blockcache in either case.) Once it has the data ready, it gives it to
the client step by step; I mean, that depends on the caching and batching
parameters.
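
As a rough illustration of why the caching parameter matters: the client
fetches rows in chunks of at most `caching` rows per round trip, so the
number of round trips for a full scan is roughly rows / caching. The sketch
below is plain Java arithmetic with no HBase calls; the class and method
names are made up for illustration, and it ignores batching and the extra
calls needed to open and close scanners.

```java
public class ScanRoundTrips {
    // Approximate number of "next" round trips needed to fetch `rows` rows
    // when each round trip returns at most `caching` rows (ceiling division).
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 100_000; // row count from the benchmark in this thread
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.println("caching=" + caching + " -> ~"
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

This is only meant to show the order of magnitude involved when going from
caching 1 to caching 100 or 1000; it says nothing about server-side cost.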


The big difference that I see...
I'm opening more connections to the table, one per region.

I should check the single table scan; it looks like it does partial scans
sequentially, since you can see on the HBase Master how the requests
increase one after another, not all at the same time.
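
To make the comparison concrete without a cluster, here is a minimal sketch
of the two client-side structures: one sequential pass over the whole key
space versus one task per key range on a shared thread pool. It is an
assumption-laden stand-in, not the code from this thread: a `TreeMap` plays
the role of the table, the split points are the partition boundaries quoted
below, all class and method names are hypothetical, and nothing here calls
HBase. Start keys are inclusive and stop keys exclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RangeScanSketch {
    // Sums the values in [startKey, stopKey); null means unbounded on that side.
    static long scanRange(NavigableMap<String, Long> table, String startKey, String stopKey) {
        NavigableMap<String, Long> view = table;
        if (startKey != null) view = view.tailMap(startKey, true);
        if (stopKey != null) view = view.headMap(stopKey, false);
        long sum = 0;
        for (long v : view.values()) sum += v;
        return sum;
    }

    // One task per key range, all submitted to a single shared pool.
    static long parallelScan(final NavigableMap<String, Long> table, List<String> splits)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(splits.size() + 1);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (int i = 0; i <= splits.size(); i++) {
                final String start = (i == 0) ? null : splits.get(i - 1);
                final String stop = (i == splits.size()) ? null : splits.get(i);
                Callable<Long> task = () -> scanRange(table, start, stop);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) total += f.get(); // join partial results
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Builds rows 000001..rows with value equal to the row number.
    static NavigableMap<String, Long> buildTable(int rows) {
        NavigableMap<String, Long> table = new TreeMap<>();
        for (int i = 1; i <= rows; i++) table.put(String.format("%06d", i), (long) i);
        return table;
    }

    public static void main(String[] args) throws Exception {
        NavigableMap<String, Long> table = buildTable(100_000);
        List<String> splits = Arrays.asList("016666", "033332", "049998", "066664", "083330");
        long single = scanRange(table, null, null);
        long parallel = parallelScan(table, splits);
        System.out.println("single=" + single + " parallel=" + parallel);
    }
}
```

The point of the sketch is the shape of the work: the ranges are disjoint,
the pool is created once and shared, and the partial sums are joined at the
end. Whether that shape is faster against a real cluster depends on
per-task setup cost, which this in-memory version does not model.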

2014-09-12 15:23 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:

> It doesn’t matter which RS, but that you have 1 thread for each region.
>
> So for each thread, what’s happening.
> Step by step, what is the code doing.
>
> Now you’re comparing this against a single table scan, right?
> What’s happening in the table scan…?
>
>
> On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>
> > Right. My table, for example, has keys between 0-9 in three regions:
> > 0-2,3-7,7-9
> > I launch three partial scans in parallel. The scans that I'm executing
> > are:
> > scan(0,2), scan(3,7), scan(7,9).
> > Each region is on a different RS, so each thread goes to a different RS.
> > It's not exactly like that, but in the benchmark case that's how it's
> > working.
> >
> > In reality the code runs one thread per Region, not per RegionServer.
> > But in the test I only have two regions per RegionServer. I don't think
> > that's an important point; there are two threads per RS.
> >
> > 2014-09-12 14:48 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
> >
> >> OK, let's again take a step back…
> >>
> >> So you are comparing your partial scan(s) against a full table scan?
> >>
> >> If I understood your question, you launch 3 partial scans where you set
> >> the start row and then end row of each scan, right?
> >>
> >> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <konstt2...@gmail.com>
> >> wrote:
> >>
> >>> Okay, then the partial scan doesn't work the way I thought.
> >>> How could it exceed the limits of a single region if I calculate the
> >>> limits?
> >>>
> >>>
> >>> The only bad point that I see is that if a region server has three
> >>> regions of the same table, I'm executing three partial scans against
> >>> this RS and they could compete for resources (network, etc.) on this
> >>> node. It'd be better to have one thread per RS. But that doesn't
> >>> answer your questions.
> >>>
> >>> I keep thinking...
> >>>
> >>> 2014-09-12 9:40 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
> >>>
> >>>> Hi,
> >>>>
> >>>> I wanted to take a step back from the actual code and to stop and
> >>>> think about what you are doing and what HBase is doing under the
> >>>> covers.
> >>>>
> >>>> So in your code, you are asking HBase to do 3 separate scans and then
> >>>> you take the result set back and join it.
> >>>>
> >>>> What does HBase do when it does a range scan?
> >>>> What happens when that range scan exceeds a single region?
> >>>>
> >>>> If you answer those questions… you’ll have your answer.
> >>>>
> >>>> HTH
> >>>>
> >>>> -Mike
> >>>>
> >>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> It's not all the code, I set things like these as well:
> >>>>> scan.setMaxVersions();
> >>>>> scan.setCacheBlocks(false);
> >>>>> ...
> >>>>>
> >>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
> >>>>>
> >>>>>> Yes, that's it. I have changed the HBase version to 0.98.
> >>>>>>
> >>>>>> I get the start and stop keys with this method:
> >>>>>> private List<RegionScanner> generatePartitions() {
> >>>>>>      List<RegionScanner> regionScanners =
> >>>>>>              new ArrayList<RegionScanner>();
> >>>>>>      byte[] startKey;
> >>>>>>      byte[] stopKey;
> >>>>>>      HConnection connection = null;
> >>>>>>      HBaseAdmin hbaseAdmin = null;
> >>>>>>      try {
> >>>>>>          connection = HConnectionManager.createConnection(
> >>>>>>                  HBaseConfiguration.create());
> >>>>>>          hbaseAdmin = new HBaseAdmin(connection);
> >>>>>>          List<HRegionInfo> regions = hbaseAdmin.getTableRegions(
> >>>>>>                  scanConfiguration.getTable());
> >>>>>>          RegionScanner regionScanner = null;
> >>>>>>          for (HRegionInfo region : regions) {
> >>>>>>
> >>>>>>              startKey = region.getStartKey();
> >>>>>>              stopKey = region.getEndKey();
> >>>>>>
> >>>>>>              regionScanner = new RegionScanner(startKey, stopKey,
> >>>>>>                      scanConfiguration);
> >>>>>>              // regionScanner = createRegionScanner(startKey, stopKey);
> >>>>>>              if (regionScanner != null) {
> >>>>>>                  regionScanners.add(regionScanner);
> >>>>>>              }
> >>>>>>          }
> >>>>>>
> >>>>>> And I execute the RegionScanner with this:
> >>>>>> public List<Result> call() throws Exception {
> >>>>>>      HConnection connection = HConnectionManager.createConnection(
> >>>>>>              HBaseConfiguration.create());
> >>>>>>      HTableInterface table =
> >>>>>>              connection.getTable(configuration.getTable());
> >>>>>>
> >>>>>>      Scan scan = new Scan(startKey, stopKey);
> >>>>>>      scan.setBatch(configuration.getBatch());
> >>>>>>      scan.setCaching(configuration.getCaching());
> >>>>>>      ResultScanner resultScanner = table.getScanner(scan);
> >>>>>>
> >>>>>>      List<Result> results = new ArrayList<Result>();
> >>>>>>      for (Result result : resultScanner) {
> >>>>>>          results.add(result);
> >>>>>>      }
> >>>>>>
> >>>>>>      // Release resources innermost first: scanner, table, connection.
> >>>>>>      resultScanner.close();
> >>>>>>      table.close();
> >>>>>>      connection.close();
> >>>>>>
> >>>>>>      return results;
> >>>>>>  }
> >>>>>>
> >>>>>> They implement Callable.
> >>>>>>
> >>>>>>
> >>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
> >>>>>>
> >>>>>>> Let's take a step back…
> >>>>>>>
> >>>>>>> Your parallel scan has the client create N threads, where in each
> >>>>>>> thread you’re doing a partial scan of the table, and each partial
> >>>>>>> scan takes the first and last row of each region?
> >>>>>>>
> >>>>>>> Is that correct?
> >>>>>>>
> >>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> I looked into this a little more. I checked the cluster and the
> >>>>>>>> data is stored on three different region servers, each one on a
> >>>>>>>> different node.
> >>>>>>>> So, I guess the threads go to different hard disks.
> >>>>>>>>
> >>>>>>>> Does anyone have an idea or suggestion as to why a single scan is
> >>>>>>>> faster than this implementation? I based it on this implementation:
> >>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
> >>>>>>>>
> >>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
> >>>>>>>>
> >>>>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98,
> >>>>>>>>> although there shouldn't be any difference.
> >>>>>>>>> I disabled the table, disabled the blockcache for that family,
> >>>>>>>>> and also set scan.setCacheBlocks(false) in both cases.
> >>>>>>>>>
> >>>>>>>>> I don't think it's possible that I'm executing a complete scan in
> >>>>>>>>> each thread, since my data look like this:
> >>>>>>>>> 000001 f:q value=1
> >>>>>>>>> 000002 f:q value=2
> >>>>>>>>> 000003 f:q value=3
> >>>>>>>>> ...
> >>>>>>>>>
> >>>>>>>>> I add up all the values and get the same result from the single
> >>>>>>>>> scan as from the distributed one, so I guess that DistributedScan
> >>>>>>>>> worked correctly.
> >>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
> >>>>>>>>> remember exactly, but about 4x the scan time.
> >>>>>>>>> I'm not using any filter for the scans.
> >>>>>>>>>
> >>>>>>>>> This is how I calculate the number of regions/scans:
> >>>>>>>>> private List<RegionScanner> generatePartitions() {
> >>>>>>>>>     List<RegionScanner> regionScanners =
> >>>>>>>>>             new ArrayList<RegionScanner>();
> >>>>>>>>>     byte[] startKey;
> >>>>>>>>>     byte[] stopKey;
> >>>>>>>>>     HConnection connection = null;
> >>>>>>>>>     HBaseAdmin hbaseAdmin = null;
> >>>>>>>>>     try {
> >>>>>>>>>         connection = HConnectionManager.createConnection(
> >>>>>>>>>                 HBaseConfiguration.create());
> >>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
> >>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(
> >>>>>>>>>                 scanConfiguration.getTable());
> >>>>>>>>>         RegionScanner regionScanner = null;
> >>>>>>>>>         for (HRegionInfo region : regions) {
> >>>>>>>>>
> >>>>>>>>>             startKey = region.getStartKey();
> >>>>>>>>>             stopKey = region.getEndKey();
> >>>>>>>>>
> >>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey,
> >>>>>>>>>                     scanConfiguration);
> >>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
> >>>>>>>>>             if (regionScanner != null) {
> >>>>>>>>>                 regionScanners.add(regionScanner);
> >>>>>>>>>             }
> >>>>>>>>>         }
> >>>>>>>>>
> >>>>>>>>> I did some tests on a tiny table and I think that the range for
> >>>>>>>>> each scan works fine. Still, I thought it was interesting that the
> >>>>>>>>> distributed scan takes about 6x as long.
> >>>>>>>>>
> >>>>>>>>> I'm going to check the hard disks, but I think that it's right.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
> >>>>>>>>>
> >>>>>>>>>> Which version of HBase?
> >>>>>>>>>> Can you show us the code?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the
> >>>>>>>>>> single scan, which is suspicious because you say you have 6 regions.
> >>>>>>>>>> Are you sure you're not accidentally scanning all the data in each
> >>>>>>>>>> of your parallel scans?
> >>>>>>>>>>
> >>>>>>>>>> -- Lars
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> ________________________________
> >>>>>>>>>> From: Guillermo Ortiz <konstt2...@gmail.com>
> >>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
> >>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
> >>>>>>>>>> Subject: Scan vs Parallel scan.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I developed a distributed scan that creates a thread for each
> >>>>>>>>>> region. After that, I tried to collect some timings for Scan vs
> >>>>>>>>>> DistributedScan.
> >>>>>>>>>> I have disabled the blockcache on my table. My cluster has 3 region
> >>>>>>>>>> servers with 2 regions each; in total there are 100,000 rows and I
> >>>>>>>>>> execute a complete scan.
> >>>>>>>>>>
> >>>>>>>>>> My partitions are
> >>>>>>>>>> -016666 -> request 16665
> >>>>>>>>>> 016666-033332 -> request 16666
> >>>>>>>>>> 033332-049998 -> request 16666
> >>>>>>>>>> 049998-066664 -> request 16666
> >>>>>>>>>> 066664-083330 -> request 16666
> >>>>>>>>>> 083330- -> request 16671
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
> >>>>>>>>>>
> >>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
> >>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
> >>>>>>>>>>
> >>>>>>>>>> The parallel scan performs much worse than the simple scan, and I
> >>>>>>>>>> don't know why the simple scan is so fast. It's really much faster
> >>>>>>>>>> than executing a "count" from the hbase shell, which doesn't look
> >>>>>>>>>> normal. The only case where the parallel scan does better is
> >>>>>>>>>> against a normal scan with caching 1.
> >>>>>>>>>>
> >>>>>>>>>> Any clue about it?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
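
For reference, the timings quoted in the original message can be put side
by side as ratios for the caching values that were measured in both modes
(100 and 1000). This is a small sketch in plain Java; the class and method
names are made up for illustration, and the numbers are copied from the
quoted log lines.

```java
public class ScanTimingRatios {
    // Ratio of parallel-scan wall time to single-scan wall time.
    static double slowdown(long parallelMs, long normalMs) {
        return (double) parallelMs / normalMs;
    }

    public static void main(String[] args) {
        // Timings copied from the log lines quoted in this thread.
        System.out.printf("caching 100:  %.2fx%n", slowdown(16598, 2646));
        System.out.printf("caching 1000: %.2fx%n", slowdown(16497, 3903));
    }
}
```

The caching 10 (parallel) and caching 1 (normal) runs have no counterpart
in the other mode, so they are left out of the comparison.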
