Let's take a step back… In your parallel scan, the client creates N threads, and each thread does a partial scan of the table, where each partial scan covers the key range (first to last row) of one region?
Is that correct?

On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> I checked this a little bit more: the data is stored on three different
> region servers, each one on a different node. So I guess the threads go
> to different hard disks.
>
> If someone has an idea or suggestion... why is a single scan faster than
> this implementation? I based mine on this implementation:
> https://github.com/zygm0nt/hbase-distributed-search
>
> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
>
>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
>> there is no difference.
>> I disabled the table and disabled the block cache for that family, and
>> I set Scan.setCacheBlocks(false) as well, for both cases.
>>
>> I don't think it's possible that I'm executing a complete scan in each
>> thread, since my data are of the type:
>> 000001 f:q value=1
>> 000002 f:q value=2
>> 000003 f:q value=3
>> ...
>>
>> I add up all the values and get the same result with a single scan as
>> with the distributed one, so I guess the DistributedScan did its job.
>> The count from the hbase shell takes about 10-15 seconds, I don't
>> remember exactly, but roughly 4x the scan time.
>> I'm not using any filter for the scans.
>>
>> This is the way I calculate the number of regions/scans:
>>
>> private List<RegionScanner> generatePartitions() {
>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>     byte[] startKey;
>>     byte[] stopKey;
>>     HConnection connection = null;
>>     HBaseAdmin hbaseAdmin = null;
>>     try {
>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>         hbaseAdmin = new HBaseAdmin(connection);
>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>         RegionScanner regionScanner = null;
>>         for (HRegionInfo region : regions) {
>>             startKey = region.getStartKey();
>>             stopKey = region.getEndKey();
>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>             if (regionScanner != null) {
>>                 regionScanners.add(regionScanner);
>>             }
>>         }
>>
>> I did some tests with a tiny table and I think the range for each scan
>> works fine. Still, I thought it was interesting that the distributed
>> scan takes about 6x the time.
>>
>> I'm going to check the hard disks, but I think they're fine.
>>
>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
>>
>>> Which version of HBase?
>>> Can you show us the code?
>>>
>>> Your parallel scan with caching 100 takes about 6x as long as the
>>> single scan, which is suspicious because you say you have 6 regions.
>>> Are you sure you're not accidentally scanning all the data in each of
>>> your parallel scans?
>>>
>>> -- Lars
>>>
>>> ________________________________
>>> From: Guillermo Ortiz <konstt2...@gmail.com>
>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>> Subject: Scan vs Parallel scan.
>>>
>>> Hi,
>>>
>>> I developed a distributed scan that creates a thread for each region.
>>> After that, I tried to time Scan vs DistributedScan.
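[Editor's note] The fan-out described in the thread — one worker per region, each scanning only its region's [startKey, stopKey) range, with the partial results merged at the client — can be sketched roughly as below. This is an illustration, not the code from the thread: `ParallelScanSketch` and `scanRange` are made-up names, and `scanRange` is a stub standing in for an HBase `Scan` bounded to one region's keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a per-region parallel scan: one worker per
// region, each covering only its own key range; results merged at the end.
public class ParallelScanSketch {

    // Stub for a Scan bounded to one region's [startRow, stopRow) range.
    // In real code this would iterate a ResultScanner and count rows.
    static long scanRange(int startRow, int stopRow) {
        long rows = 0;
        for (int row = startRow; row < stopRow; row++) {
            rows++;
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        int totalRows = 100_000;
        int regions = 6; // the thread reports 3 region servers x 2 regions
        ExecutorService pool = Executors.newFixedThreadPool(regions);
        List<Future<Long>> partials = new ArrayList<>();
        int step = totalRows / regions;
        for (int i = 0; i < regions; i++) {
            final int start = i * step;
            final int stop = (i == regions - 1) ? totalRows : (i + 1) * step;
            partials.add(pool.submit(() -> scanRange(start, stop)));
        }
        long total = 0;
        for (Future<Long> f : partials) {
            total += f.get(); // merge the partial counts
        }
        pool.shutdown();
        System.out.println(total); // prints 100000
    }
}
```

The key property Lars is probing for is that the per-worker ranges partition the table: if each worker instead scanned the whole table, the total work would be N times a full scan, which matches the ~6x slowdown reported.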
>>> I have disabled the block cache on my table. My cluster has 3 region
>>> servers with 2 regions each; in total there are 100,000 rows and I
>>> execute a complete scan.
>>>
>>> My partitions are:
>>>       -01666  -> request 16665
>>> 016666-033332 -> request 16666
>>> 033332-049998 -> request 16666
>>> 049998-066664 -> request 16666
>>> 066664-083330 -> request 16666
>>> 083330-       -> request 16671
>>>
>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>
>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>
>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>
>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>
>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>
>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>
>>> The parallel scan performs much worse than the simple scan, and I
>>> don't know why the simple scan is so fast; it's much faster than
>>> executing a "count" from the hbase shell, which doesn't look normal.
>>> The only case where the parallel version does better is against a
>>> normal scan with caching 1.
>>>
>>> Any clue about it?
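[Editor's note] The huge gap between the caching-1 and caching-100 normal scans (68288ms vs 2646ms) is consistent with a simple round-trip model: in the 0.94 client, `Scan.setCaching(n)` controls how many rows come back per `next()` RPC, so a full scan costs roughly rows/caching round trips. A back-of-the-envelope sketch (class and method names are made up for illustration; this models only RPC count, not server-side work):

```java
// Rough model of scanner-caching cost: each next() batch is one RPC,
// so a full scan needs ceil(rows / caching) client-server round trips.
public class RpcEstimate {

    // Ceiling division: round trips needed to fetch `rows` rows
    // when the scanner returns `caching` rows per RPC.
    static long roundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 100_000;
        for (int caching : new int[] {1, 10, 100, 1000}) {
            System.out.println("caching " + caching + " -> "
                    + roundTrips(rows, caching) + " RPCs");
        }
    }
}
```

Under this model, caching 1 means 100,000 RPCs versus 1,000 at caching 100 — which is why the only configuration the parallel scan beats is the caching-1 normal scan, where per-RPC latency dominates everything else.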