It doesn’t matter which RS; what matters is that you have one thread for each region.

So for each thread, what’s happening?
Step by step, what is the code doing?

Now you’re comparing this against a single table scan, right?
What’s happening in the table scan…?
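
For reference, a minimal sketch of the single-scan baseline being compared against (0.94-style client API, as used in the snippets later in this thread; the table name is a placeholder):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    // A single full-table scan: one client-side scanner; the client
    // library walks from region to region under the covers.
    HTable table = new HTable(HBaseConfiguration.create(), "testTable"); // placeholder name
    Scan scan = new Scan();   // no start/stop row: scan the whole table
    scan.setCaching(100);     // rows fetched per RPC
    ResultScanner scanner = table.getScanner(scan);
    List<Result> results = new ArrayList<Result>();
    for (Result result : scanner) {
        results.add(result);
    }
    scanner.close();
    table.close();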


On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> Right. My table, for example, has keys between 0-9, in three regions:
> 0-2, 3-7, 7-9.
> I launch three partial scans in parallel. The scans that I'm executing are:
> scan(0,2), scan(3,7), scan(7,9).
> Each region is in a different RS, so each thread goes to a different RS.
> It's not exactly like that, but in the benchmark case that's how it works.
> 
> Really the code will execute a thread for each region, not for each
> RegionServer. But in the test I only have two regions per RegionServer. I
> don't think that's an important point; there are two threads per RS.
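> 
> Roughly, the driver side looks like this (a sketch; RegionScanner is my own
> Callable wrapper shown further down in this thread, not HBase's internal
> interface of the same name):
> 
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.Future;
>     import org.apache.hadoop.hbase.client.Result;
> 
>     // One thread per region: submit each partial-scan Callable, then join.
>     ExecutorService pool = Executors.newFixedThreadPool(regionScanners.size());
>     List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();
>     for (RegionScanner scanner : regionScanners) {
>         futures.add(pool.submit(scanner));
>     }
>     List<Result> allResults = new ArrayList<Result>();
>     for (Future<List<Result>> future : futures) {
>         allResults.addAll(future.get()); // blocks until that region's scan finishes
>     }
>     pool.shutdown();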
> 
> 2014-09-12 14:48 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
> 
>> Ok, let's take a step back again…
>> 
>> So you are comparing your partial scan(s) against a full table scan?
>> 
>> If I understood your question, you launch 3 partial scans where you set
>> the start row and then end row of each scan, right?
>> 
>> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>> 
>>> Okay, then the partial scan doesn't work as I thought.
>>> How could it exceed the limits of a single region if I calculate the
>>> limits myself?
>>> 
>>> 
>>> The only bad point that I see is that if a region server has three
>>> regions of the same table, I'm executing three partial scans against this
>>> RS and they could compete for resources (network, etc.) on this node. It'd
>>> be better to have one thread per RS. But that doesn't answer your
>>> questions.
>>> 
>>> I keep thinking...
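>>> 
>>> (One thread per RS could look roughly like this, a sketch that assumes
>>> 0.94's HTable#getRegionLocations(); each server's regions would then be
>>> scanned serially inside a single Callable:)
>>> 
>>>     import java.util.ArrayList;
>>>     import java.util.HashMap;
>>>     import java.util.List;
>>>     import java.util.Map;
>>>     import java.util.NavigableMap;
>>>     import org.apache.hadoop.hbase.HBaseConfiguration;
>>>     import org.apache.hadoop.hbase.HRegionInfo;
>>>     import org.apache.hadoop.hbase.ServerName;
>>>     import org.apache.hadoop.hbase.client.HTable;
>>> 
>>>     // Group the table's regions by hosting server, then launch one
>>>     // thread per server instead of one per region.
>>>     HTable table = new HTable(HBaseConfiguration.create(), "testTable"); // placeholder name
>>>     NavigableMap<HRegionInfo, ServerName> locations = table.getRegionLocations();
>>>     Map<ServerName, List<HRegionInfo>> regionsByServer =
>>>             new HashMap<ServerName, List<HRegionInfo>>();
>>>     for (Map.Entry<HRegionInfo, ServerName> entry : locations.entrySet()) {
>>>         List<HRegionInfo> regions = regionsByServer.get(entry.getValue());
>>>         if (regions == null) {
>>>             regions = new ArrayList<HRegionInfo>();
>>>             regionsByServer.put(entry.getValue(), regions);
>>>         }
>>>         regions.add(entry.getKey());
>>>     }
>>>     // then: one Callable per server, scanning its regions one after another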
>>> 
>>> 2014-09-12 9:40 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>>> 
>>>> Hi,
>>>> 
>>>> I wanted to take a step back from the actual code and to stop and think
>>>> about what you are doing and what HBase is doing under the covers.
>>>> 
>>>> So in your code, you are asking HBase to do 3 separate scans, and then
>>>> you take the result sets back and join them.
>>>> 
>>>> What does HBase do when it does a range scan?
>>>> What happens when that range scan exceeds a single region?
>>>> 
>>>> If you answer those questions… you’ll have your answer.
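>>>> 
>>>> A hint in code form (just a sketch, assuming an HTable named `table` as
>>>> in the code below): even a single Scan that spans all of your regions is
>>>> still one client-side scanner; the client opens each region in turn as
>>>> the scan crosses a region boundary.
>>>> 
>>>>     // One Scan over the whole key space: a single scanner to the
>>>>     // client, served sequentially, region by region.
>>>>     Scan scan = new Scan(Bytes.toBytes("0"), Bytes.toBytes("9"));
>>>>     ResultScanner scanner = table.getScanner(scan);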
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> 
>>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>>> 
>>>>> That's not all the code; I set things like these as well:
>>>>> scan.setMaxVersions();
>>>>> scan.setCacheBlocks(false);
>>>>> ...
>>>>> 
>>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
>>>>> 
>>>>>> Yes, that's it. I have changed the HBase version to 0.98.
>>>>>> 
>>>>>> I got the start and stop keys with this method:
>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>     byte[] startKey;
>>>>>>     byte[] stopKey;
>>>>>>     HConnection connection = null;
>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>     try {
>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>         RegionScanner regionScanner = null;
>>>>>>         for (HRegionInfo region : regions) {
>>>>>>             startKey = region.getStartKey();
>>>>>>             stopKey = region.getEndKey();
>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>             if (regionScanner != null) {
>>>>>>                 regionScanners.add(regionScanner);
>>>>>>             }
>>>>>>         }
>>>>>> 
>>>>>> And I execute the RegionScanner with this:
>>>>>> public List<Result> call() throws Exception {
>>>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>>>>>> 
>>>>>>     Scan scan = new Scan(startKey, stopKey);
>>>>>>     scan.setBatch(configuration.getBatch());
>>>>>>     scan.setCaching(configuration.getCaching());
>>>>>>     ResultScanner resultScanner = table.getScanner(scan);
>>>>>> 
>>>>>>     List<Result> results = new ArrayList<Result>();
>>>>>>     for (Result result : resultScanner) {
>>>>>>         results.add(result);
>>>>>>     }
>>>>>> 
>>>>>>     resultScanner.close();
>>>>>>     table.close();
>>>>>>     connection.close();
>>>>>> 
>>>>>>     return results;
>>>>>> }
>>>>>> 
>>>>>> They implement Callable.
>>>>>> 
>>>>>> 
>>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>>>>>> 
>>>>>>> Let's take a step back…
>>>>>>> 
>>>>>>> Your parallel scan has the client create N threads, where in each
>>>>>>> thread you're doing a partial scan of the table, and each partial scan
>>>>>>> takes the first and last row of one region?
>>>>>>> 
>>>>>>> Is that correct?
>>>>>>> 
>>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> I was checking a little bit more: I checked the cluster, and the data
>>>>>>>> is stored in three different region servers, each one on a different
>>>>>>>> node. So I guess the threads go to different hard disks.
>>>>>>>> 
>>>>>>>> If someone has an idea or suggestion about why a single scan is faster
>>>>>>>> than this implementation... I based mine on this implementation:
>>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>>>>>> 
>>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
>>>>>>>> 
>>>>>>>>> I'm working with HBase 0.94 for this case; I'll try with 0.98,
>>>>>>>>> although there is no difference.
>>>>>>>>> I disabled the table and disabled the block cache for that family, and
>>>>>>>>> I put scan.setCacheBlocks(false) as well in both cases.
>>>>>>>>> 
>>>>>>>>> I don't think it's possible that I'm executing a complete scan in each
>>>>>>>>> thread, since my data is of the type:
>>>>>>>>> 000001 f:q value=1
>>>>>>>>> 000002 f:q value=2
>>>>>>>>> 000003 f:q value=3
>>>>>>>>> ...
>>>>>>>>> 
>>>>>>>>> I add up all the values and get the same result with a single scan as
>>>>>>>>> with a distributed one, so I guess the DistributedScan did well.
>>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
>>>>>>>>> remember exactly, but something like 4x the scan time.
>>>>>>>>> I'm not using any filter for the scans.
>>>>>>>>> 
>>>>>>>>> This is the way I calculate the number of regions/scans:
>>>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>>>     byte[] startKey;
>>>>>>>>>     byte[] stopKey;
>>>>>>>>>     HConnection connection = null;
>>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>>>     try {
>>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>>>             startKey = region.getStartKey();
>>>>>>>>>             stopKey = region.getEndKey();
>>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>>>             if (regionScanner != null) {
>>>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>> 
>>>>>>>>> I did some tests on a tiny table, and I think the ranges for each
>>>>>>>>> scan work fine. Still, I thought it was interesting that the time when
>>>>>>>>> I execute the distributed scan is about 6x.
>>>>>>>>> 
>>>>>>>>> I'm going to check the hard disks, but I think they're fine.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
>>>>>>>>> 
>>>>>>>>>> Which version of HBase?
>>>>>>>>>> Can you show us the code?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the
>>>>>>>>>> single scan, which is suspicious because you say you have 6 regions.
>>>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>>>>>>>>>> your parallel scans?
>>>>>>>>>> 
>>>>>>>>>> -- Lars
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Guillermo Ortiz <konstt2...@gmail.com>
>>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>>>>>> Subject: Scan vs Parallel scan.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> I developed a distributed scan; I create a thread for each region.
>>>>>>>>>> After that, I've tried to get some timings, Scan vs DistributedScan.
>>>>>>>>>> I have disabled the block cache on my table. My cluster has 3 region
>>>>>>>>>> servers with 2 regions each; in total there are 100,000 rows, and I
>>>>>>>>>> execute a complete scan.
>>>>>>>>>> 
>>>>>>>>>> My partitions are:
>>>>>>>>>> -01666 -> request 16665
>>>>>>>>>> 016666-033332 -> request 16666
>>>>>>>>>> 033332-049998 -> request 16666
>>>>>>>>>> 049998-066664 -> request 16666
>>>>>>>>>> 066664-083330 -> request 16666
>>>>>>>>>> 083330- -> request 16671
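>>>>>>>>>> 
>>>>>>>>>> (These add up: 16665 + 4*16666 + 16671 = 100000 rows.)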
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>>>>>> 
>>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>>>>>> 
>>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>>>>>> 
>>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>>>>>> 
>>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>>>>>> 
>>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
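>>>>>>>>>> 
>>>>>>>>>> (For context: "caching" here is Scan#setCaching, the number of rows
>>>>>>>>>> the scanner returns per RPC. With caching 1, a 100,000-row scan makes
>>>>>>>>>> roughly 100,000 round trips, which presumably explains the 68 seconds
>>>>>>>>>> for the normal scan with caching 1. A sketch:)
>>>>>>>>>> 
>>>>>>>>>>     Scan scan = new Scan();
>>>>>>>>>>     // caching 1   => ~100,000 RPCs for 100,000 rows
>>>>>>>>>>     // caching 100 => ~1,000 RPCs
>>>>>>>>>>     scan.setCaching(100);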
>>>>>>>>>> 
>>>>>>>>>> The parallel scan works much worse than the simple scan, and I don't
>>>>>>>>>> know why the simple scan is so fast; it's really much faster than
>>>>>>>>>> executing a "count" from the hbase shell, which doesn't look very
>>>>>>>>>> normal. The only case where the parallel version wins is against a
>>>>>>>>>> normal scan with caching 1.
>>>>>>>>>> 
>>>>>>>>>> Any clue about it?
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
