Re: Scan vs Parallel scan.

Guillermo Ortiz Thu, 11 Sep 2014 23:37:29 -0700

I looked into this a little more. I checked the cluster and the data is
stored on three different region servers, each one on a different node.
So I guess the threads go to different hard disks.


Does anyone have an idea or suggestion as to why a single scan is faster than
this implementation? I based mine on this implementation:
https://github.com/zygm0nt/hbase-distributed-search
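
To make the check concrete (one thread per key range, then compare row counts and sums against a single full scan), here is a minimal sketch that does not touch HBase at all. A TreeMap stands in for the table, and the names ParallelRangeScanSketch, scanRange and parallelScan are made up for illustration; this is not the code from the repository above. It uses the partition boundaries from my first message quoted below and treats an empty key as "unbounded", the same way the first and last region boundaries behave.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRangeScanSketch {

    /** Result of scanning one key range: rows visited and the sum of their values. */
    static final class RangeResult {
        final long rows;
        final long sum;
        RangeResult(long rows, long sum) { this.rows = rows; this.sum = sum; }
    }

    /** In-memory stand-in for the table: rows 000001..n, value equal to the row number. */
    static NavigableMap<String, Long> buildTable(int n) {
        NavigableMap<String, Long> table = new TreeMap<>();
        for (int i = 1; i <= n; i++) {
            table.put(String.format("%06d", i), (long) i);
        }
        return table;
    }

    /** Scans [startKey, stopKey); an empty key means that side is unbounded. */
    static RangeResult scanRange(NavigableMap<String, Long> table, String startKey, String stopKey) {
        NavigableMap<String, Long> view = table;
        if (!startKey.isEmpty()) view = view.tailMap(startKey, true);
        if (!stopKey.isEmpty()) view = view.headMap(stopKey, false);
        long rows = 0;
        long sum = 0;
        for (long v : view.values()) {
            rows++;
            sum += v;
        }
        return new RangeResult(rows, sum);
    }

    /** Runs one scan per range, each on its own thread; results come back in range order. */
    static List<RangeResult> parallelScan(NavigableMap<String, Long> table, String[][] ranges) {
        ExecutorService pool = Executors.newFixedThreadPool(ranges.length);
        try {
            List<Future<RangeResult>> futures = new ArrayList<>();
            for (String[] r : ranges) {
                Callable<RangeResult> task = () -> scanRange(table, r[0], r[1]);
                futures.add(pool.submit(task));
            }
            List<RangeResult> results = new ArrayList<>();
            for (Future<RangeResult> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = buildTable(100_000);
        String[][] ranges = {
            {"", "016666"}, {"016666", "033332"}, {"033332", "049998"},
            {"049998", "066664"}, {"066664", "083330"}, {"083330", ""}
        };
        long rows = 0;
        long sum = 0;
        for (RangeResult r : parallelScan(table, ranges)) {
            System.out.println("range rows=" + r.rows + " sum=" + r.sum);
            rows += r.rows;
            sum += r.sum;
        }
        RangeResult single = scanRange(table, "", "");
        System.out.println("parallel: rows=" + rows + " sum=" + sum);
        System.out.println("single:   rows=" + single.rows + " sum=" + single.sum);
    }
}
```

If any thread were scanning the whole table, its row count would be 100000 instead of roughly one sixth of that, so logging the per-range row count is a cheaper sanity check than comparing the sums alone.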

2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:

> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
> I don't think there is any difference.
> I disabled the table, disabled the blockcache for that family, and set
> scan.setCacheBlocks(false) as well in both cases.
>
> I don't think it's possible that I'm executing a complete scan in each
> thread, since my data looks like this:
> 000001 f:q value=1
> 000002 f:q value=2
> 000003 f:q value=3
> ...
>
> I add up all the values and get the same result from a single scan as from
> the distributed one, so I guess that DistributedScan works correctly.
> The count from the hbase shell takes about 10-15 seconds, I don't remember
> exactly, but something like 4x the scan time.
> I'm not using any filter for the scans.
>
> This is how I calculate the number of regions/scans:
> private List<RegionScanner> generatePartitions() {
>         List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>         byte[] startKey;
>         byte[] stopKey;
>         HConnection connection = null;
>         HBaseAdmin hbaseAdmin = null;
>         try {
>             connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>             hbaseAdmin = new HBaseAdmin(connection);
>             List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>             RegionScanner regionScanner = null;
>             for (HRegionInfo region : regions) {
>
>                 startKey = region.getStartKey();
>                 stopKey = region.getEndKey();
>
>                 regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>                 // regionScanner = createRegionScanner(startKey, stopKey);
>                 if (regionScanner != null) {
>                     regionScanners.add(regionScanner);
>                 }
>             }
>
> I did some tests on a tiny table and I think that the range for each scan
> works fine. Still, I thought it was interesting that the time when I
> execute the distributed scan is about 6x as long.
>
> I'm going to check the hard disks, but I think that it's right.
>
>
>
>
> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
>
>> Which version of HBase?
>> Can you show us the code?
>>
>>
>> Your parallel scan with caching 100 takes about 6x as long as the single
>> scan, which is suspicious because you say you have 6 regions.
>> Are you sure you're not accidentally scanning all the data in each of
>> your parallel scans?
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Guillermo Ortiz <konstt2...@gmail.com>
>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>> Sent: Wednesday, September 10, 2014 1:40 AM
>> Subject: Scan vs Parallel scan.
>>
>>
>> Hi,
>>
>> I developed a distributed scan; I create a thread for each region. After
>> that, I tried to compare timings for Scan vs DistributedScan.
>> I have disabled the blockcache in my table. My cluster has 3 region servers
>> with 2 regions each; in total there are 100,000 rows and I execute a
>> complete scan.
>>
>> My partitions are
>> -016666 -> request 16665
>> 016666-033332 -> request 16666
>> 033332-049998 -> request 16666
>> 049998-066664 -> request 16666
>> 066664-083330 -> request 16666
>> 083330- -> request 16671
>>
>>
>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 ->
>> Caching 10
>>
>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 ->
>> Caching 100
>>
>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 ->
>> Caching 1000
>>
>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 ->
>> Caching 1
>>
>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 ->
>> Caching 100
>>
>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 ->
>> Caching 1000
>>
>> The parallel scan performs much worse than the simple scan, and I don't
>> know why the simple scan is so fast; it's really much faster than
>> executing a "count" from the hbase shell, which doesn't look normal. The
>> only time the parallel scan works better is when I execute the normal scan
>> with caching 1.
>>
>> Any clue about it?
>>
>
>
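
For reference, the "about 6x" figure can be reproduced from the timings in the original message above. The small sketch below only does that arithmetic; the class and method names are made up for illustration and the millisecond values are copied from the log lines.

```java
final class ScanTimingRatios {

    /** How many times slower the parallel scan was than the normal scan. */
    static double slowdown(long parallelMs, long normalMs) {
        return (double) parallelMs / normalMs;
    }

    public static void main(String[] args) {
        // Timings taken from the TimerUtil log lines (same caching value in each pair).
        System.out.printf("caching 100:  %.2fx%n", slowdown(16598, 2646));
        System.out.printf("caching 1000: %.2fx%n", slowdown(16497, 3903));
    }
}
```

So the parallel scan is a bit over 6x slower with caching 100 and a bit over 4x slower with caching 1000. The logs have no normal scan with caching 10 and no parallel scan with caching 1, so those two runs cannot be paired up.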
