I looked into this a bit more. I checked the cluster, and the data is stored on three different region servers, each one on a different node. So I guess the threads go to different hard disks.
If anyone has an idea or suggestion as to why a single scan is faster than this implementation, I'd appreciate it. I based mine on this implementation: https://github.com/zygm0nt/hbase-distributed-search

2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:

> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
> there is no difference.
> I disabled the table, disabled the block cache for that family, and set
> scan.setCacheBlocks(false) as well for both cases.
>
> I don't think it's possible that I'm executing a complete scan in each
> thread, since my data looks like this:
> 000001 f:q value=1
> 000002 f:q value=2
> 000003 f:q value=3
> ...
>
> I add up all the values and get the same result from a single scan as from
> the distributed one, so I guess that DistributedScan worked correctly.
> The count from the hbase shell takes about 10-15 seconds, I don't remember
> exactly, but about 4x the scan time.
> I'm not using any filter for the scans.
>
> This is how I calculate the number of regions/scans:
> private List<RegionScanner> generatePartitions() {
>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>     byte[] startKey;
>     byte[] stopKey;
>     HConnection connection = null;
>     HBaseAdmin hbaseAdmin = null;
>     try {
>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>         hbaseAdmin = new HBaseAdmin(connection);
>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>         RegionScanner regionScanner = null;
>         for (HRegionInfo region : regions) {
>
>             startKey = region.getStartKey();
>             stopKey = region.getEndKey();
>
>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>             // regionScanner = createRegionScanner(startKey, stopKey);
>             if (regionScanner != null) {
>                 regionScanners.add(regionScanner);
>             }
>         }
>     }
>
> I did some tests on a tiny table and I think that the range for each scan
> works fine.
> Although, I thought it was interesting that the time when I execute the
> distributed scan is about 6x.
>
> I'm going to check the hard disks, but I think that it's right.
>
> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
>
>> Which version of HBase?
>> Can you show us the code?
>>
>> Your parallel scan with caching 100 takes about 6x as long as the single
>> scan, which is suspicious because you say you have 6 regions.
>> Are you sure you're not accidentally scanning all the data in each of
>> your parallel scans?
>>
>> -- Lars
>>
>> ________________________________
>> From: Guillermo Ortiz <konstt2...@gmail.com>
>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>> Sent: Wednesday, September 10, 2014 1:40 AM
>> Subject: Scan vs Parallel scan.
>>
>> Hi,
>>
>> I developed a distributed scan that creates a thread for each region.
>> After that, I tried to compare timings for Scan vs DistributedScan.
>> I have disabled the block cache in my table. My cluster has 3 region
>> servers with 2 regions each. In total there are 100,000 rows, and I
>> execute a complete scan.
>>
>> My partitions are
>> -01666 -> request 16665
>> 016666-033332 -> request 16666
>> 033332-049998 -> request 16666
>> 049998-066664 -> request 16666
>> 066664-083330 -> request 16666
>> 083330- -> request 16671
>>
>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>
>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>
>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>
>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>
>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>
>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>
>> The parallel scan performs much worse than the simple scan, and I don't
>> know why the simple scan is so fast. It's much faster than executing a
>> "count" from the hbase shell, which doesn't look normal. The only time
>> the parallel scan does better is when I execute the normal scan with
>> caching 1.
>>
>> Any clue about it?
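
One way to rule out the overlap Lars asks about, without touching the cluster, is to check the key ranges offline. What follows is only a minimal sketch in plain Java with no HBase dependency. PartitionCheck, Range, fromBoundaries and rowsPerRange are hypothetical names, not HBase classes or anything from the code quoted above. It assumes the row keys are the zero-padded strings 000001 to 100000, that start keys are inclusive and stop keys exclusive, that an empty key means "unbounded" (as with region start/end keys), and that the first boundary is 016666 (the list above shows it as 01666). Plain String comparison stands in for HBase's byte-wise key comparison, which gives the same order for ASCII digits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of HBase.
class PartitionCheck {

    // One [start, stop) key range; an empty string means "unbounded" on that side.
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }

        boolean contains(String rowKey) {
            boolean afterStart = start.isEmpty() || rowKey.compareTo(start) >= 0;
            boolean beforeStop = stop.isEmpty() || rowKey.compareTo(stop) < 0;
            return afterStart && beforeStop;
        }
    }

    // Builds consecutive ranges from the sorted inner boundaries,
    // with an open first start key and an open last stop key.
    static List<Range> fromBoundaries(String... boundaries) {
        List<Range> ranges = new ArrayList<Range>();
        String start = "";
        for (String boundary : boundaries) {
            ranges.add(new Range(start, boundary));
            start = boundary;
        }
        ranges.add(new Range(start, ""));
        return ranges;
    }

    // Counts how many of the keys 000001..totalRows fall into each range.
    // A key matched by more than one range is counted once per range.
    static int[] rowsPerRange(List<Range> ranges, int totalRows) {
        int[] counts = new int[ranges.size()];
        for (int i = 1; i <= totalRows; i++) {
            String key = String.format(Locale.ROOT, "%06d", i);
            for (int r = 0; r < ranges.size(); r++) {
                if (ranges.get(r).contains(key)) {
                    counts[r]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Range> ranges =
                fromBoundaries("016666", "033332", "049998", "066664", "083330");
        int[] counts = rowsPerRange(ranges, 100000);
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        System.out.println(Arrays.toString(counts) + " total=" + sum);
    }
}
```

Under those assumptions the per-range counts come out as 16665, 16666, 16666, 16666, 16666 and 16671, which sum to 100000 and match the request counts listed in the original message. If any thread were handed an unbounded range by mistake, the sum would exceed the row count, so comparing the sum against NUM ROWS is a cheap sanity check before looking at disks or caching.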