It doesn’t matter which RS; what matters is that you have 1 thread for each region. So for each thread, what’s happening? Step by step, what is the code doing?
Now you’re comparing this against a single table scan, right?
What’s happening in the table scan…?

On Sep 12, 2014, at 2:04 PM, Guillermo Ortiz <konstt2...@gmail.com> wrote:

> Right. My table, for example, has keys between 0-9 in three regions:
> 0-2, 3-7, 7-9.
> I launch three partial scans in parallel. The scans that I'm executing are:
> scan(0,2), scan(3,7), scan(7,9).
> Each region is in a different RS, so each thread goes to a different RS. It's
> not exactly like that, but in the benchmark case that's how it works.
>
> Really, the code will execute a thread for each region, not for each
> RegionServer. But in the test I only have two regions per RegionServer. I
> don't think that's an important point; there are two threads per RS.
>
> 2014-09-12 14:48 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>
>> Ok, let's again take a step back…
>>
>> So you are comparing your partial scan(s) against a full table scan?
>>
>> If I understood your question, you launch 3 partial scans where you set
>> the start row and the end row of each scan, right?
>>
>> On Sep 12, 2014, at 9:16 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>
>>> Okay, then the partial scan doesn't work as I think.
>>> How could it exceed the limit of a single region if I calculate the limits?
>>>
>>> The only bad point that I see is that if a region server has three
>>> regions of the same table, I'm executing three partial scans against this RS
>>> and they could compete for resources (network, etc.) on this node. It'd be
>>> better to have one thread per RS. But that doesn't answer your questions.
>>>
>>> I keep thinking...
>>>
>>> 2014-09-12 9:40 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>>>
>>>> Hi,
>>>>
>>>> I wanted to take a step back from the actual code and to stop and think
>>>> about what you are doing and what HBase is doing under the covers.
>>>>
>>>> So in your code, you are asking HBase to do 3 separate scans and then you
>>>> take the result set back and join it.
>>>>
>>>> What does HBase do when it does a range scan?
>>>> What happens when that range scan exceeds a single region?
>>>>
>>>> If you answer those questions… you’ll have your answer.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> On Sep 12, 2014, at 8:34 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>>>
>>>>> It's not all the code, I set things like these as well:
>>>>> scan.setMaxVersions();
>>>>> scan.setCacheBlocks(false);
>>>>> ...
>>>>>
>>>>> 2014-09-12 9:33 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
>>>>>
>>>>>> Yes, that is correct. I have changed the HBase version to 0.98.
>>>>>>
>>>>>> I got the start and stop keys with this method:
>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>     byte[] startKey;
>>>>>>     byte[] stopKey;
>>>>>>     HConnection connection = null;
>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>     try {
>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>         RegionScanner regionScanner = null;
>>>>>>         for (HRegionInfo region : regions) {
>>>>>>
>>>>>>             startKey = region.getStartKey();
>>>>>>             stopKey = region.getEndKey();
>>>>>>
>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>             if (regionScanner != null) {
>>>>>>                 regionScanners.add(regionScanner);
>>>>>>             }
>>>>>>         }
>>>>>>
>>>>>> And I execute the RegionScanner with this:
>>>>>> public List<Result> call() throws Exception {
>>>>>>     HConnection connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>     HTableInterface table = connection.getTable(configuration.getTable());
>>>>>>
>>>>>>     Scan scan = new Scan(startKey, stopKey);
>>>>>>     scan.setBatch(configuration.getBatch());
>>>>>>     scan.setCaching(configuration.getCaching());
>>>>>>     ResultScanner resultScanner = table.getScanner(scan);
>>>>>>
>>>>>>     List<Result> results = new ArrayList<Result>();
>>>>>>     for (Result result : resultScanner) {
>>>>>>         results.add(result);
>>>>>>     }
>>>>>>
>>>>>>     // Release resources in reverse order of acquisition.
>>>>>>     resultScanner.close();
>>>>>>     table.close();
>>>>>>     connection.close();
>>>>>>
>>>>>>     return results;
>>>>>> }
>>>>>>
>>>>>> They implement Callable.
>>>>>>
>>>>>> 2014-09-12 9:26 GMT+02:00 Michael Segel <michael_se...@hotmail.com>:
>>>>>>
>>>>>>> Let's take a step back…
>>>>>>>
>>>>>>> Your parallel scan is having the client create N threads where in each
>>>>>>> thread, you’re doing a partial scan of the table where each partial scan
>>>>>>> takes the first and last row of each region?
>>>>>>>
>>>>>>> Is that correct?
>>>>>>>
>>>>>>> On Sep 12, 2014, at 7:36 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I was checking a little bit more. I checked the cluster, and the data is
>>>>>>>> stored in three different region servers, each one on a different node.
>>>>>>>> So I guess the threads go to different hard disks.
>>>>>>>>
>>>>>>>> If someone has an idea or suggestion about why a single scan is faster
>>>>>>>> than this implementation... I based it on this implementation:
>>>>>>>> https://github.com/zygm0nt/hbase-distributed-search
>>>>>>>>
>>>>>>>> 2014-09-11 12:05 GMT+02:00 Guillermo Ortiz <konstt2...@gmail.com>:
>>>>>>>>
>>>>>>>>> I'm working with HBase 0.94 for this case. I'll try with 0.98, although
>>>>>>>>> there is no difference.
>>>>>>>>> I disabled the table and disabled the blockcache for that family, and I
>>>>>>>>> put scan.setCacheBlocks(false) as well for both cases.
>>>>>>>>>
>>>>>>>>> I don't think it's possible that I'm executing a complete scan in each
>>>>>>>>> thread, since my data is of this type:
>>>>>>>>> 000001 f:q value=1
>>>>>>>>> 000002 f:q value=2
>>>>>>>>> 000003 f:q value=3
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> I add up all the values and get the same result with a single scan as
>>>>>>>>> with a distributed one, so I guess that DistributedScan did well.
>>>>>>>>> The count from the hbase shell takes about 10-15 seconds, I don't
>>>>>>>>> remember exactly, but around 4x the scan time.
>>>>>>>>> I'm not using any filter for the scans.
>>>>>>>>>
>>>>>>>>> This is the way I calculate the number of regions/scans:
>>>>>>>>> private List<RegionScanner> generatePartitions() {
>>>>>>>>>     List<RegionScanner> regionScanners = new ArrayList<RegionScanner>();
>>>>>>>>>     byte[] startKey;
>>>>>>>>>     byte[] stopKey;
>>>>>>>>>     HConnection connection = null;
>>>>>>>>>     HBaseAdmin hbaseAdmin = null;
>>>>>>>>>     try {
>>>>>>>>>         connection = HConnectionManager.createConnection(HBaseConfiguration.create());
>>>>>>>>>         hbaseAdmin = new HBaseAdmin(connection);
>>>>>>>>>         List<HRegionInfo> regions = hbaseAdmin.getTableRegions(scanConfiguration.getTable());
>>>>>>>>>         RegionScanner regionScanner = null;
>>>>>>>>>         for (HRegionInfo region : regions) {
>>>>>>>>>
>>>>>>>>>             startKey = region.getStartKey();
>>>>>>>>>             stopKey = region.getEndKey();
>>>>>>>>>
>>>>>>>>>             regionScanner = new RegionScanner(startKey, stopKey, scanConfiguration);
>>>>>>>>>             // regionScanner = createRegionScanner(startKey, stopKey);
>>>>>>>>>             if (regionScanner != null) {
>>>>>>>>>                 regionScanners.add(regionScanner);
>>>>>>>>>             }
>>>>>>>>>         }
>>>>>>>>>
>>>>>>>>> I did some tests on a tiny table and I think that the range for each
>>>>>>>>> scan works fine. Still, I thought it was interesting that the time when
>>>>>>>>> I execute the distributed scan is about 6x.
>>>>>>>>>
>>>>>>>>> I'm going to check the hard disks, but I think that it's right.
>>>>>>>>>
>>>>>>>>> 2014-09-11 7:50 GMT+02:00 lars hofhansl <la...@apache.org>:
>>>>>>>>>
>>>>>>>>>> Which version of HBase?
>>>>>>>>>> Can you show us the code?
>>>>>>>>>>
>>>>>>>>>> Your parallel scan with caching 100 takes about 6x as long as the single
>>>>>>>>>> scan, which is suspicious because you say you have 6 regions.
>>>>>>>>>> Are you sure you're not accidentally scanning all the data in each of
>>>>>>>>>> your parallel scans?
>>>>>>>>>>
>>>>>>>>>> -- Lars
>>>>>>>>>>
>>>>>>>>>> ________________________________
>>>>>>>>>> From: Guillermo Ortiz <konstt2...@gmail.com>
>>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>>>>>> Sent: Wednesday, September 10, 2014 1:40 AM
>>>>>>>>>> Subject: Scan vs Parallel scan.
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I developed a distributed scan; I create a thread for each region. After
>>>>>>>>>> that, I tried to compare timings for Scan vs DistributedScan.
>>>>>>>>>> I have disabled the blockcache in my table. My cluster has 3 region
>>>>>>>>>> servers with 2 regions each; in total there are 100,000 rows, and I
>>>>>>>>>> execute a complete scan.
>>>>>>>>>>
>>>>>>>>>> My partitions are
>>>>>>>>>> -016666 -> request 16665
>>>>>>>>>> 016666-033332 -> request 16666
>>>>>>>>>> 033332-049998 -> request 16666
>>>>>>>>>> 049998-066664 -> request 16666
>>>>>>>>>> 066664-083330 -> request 16666
>>>>>>>>>> 083330- -> request 16671
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:15:47 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:15:47 INFO util.TimerUtil: SCAN PARALLEL:22089ms,Counter:2 -> Caching 10
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:16:04 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:16:04 INFO util.TimerUtil: SCAN PARALLEL:16598ms,Counter:2 -> Caching 100
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:16:22 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:16:22 INFO util.TimerUtil: SCAN PARALLEL:16497ms,Counter:2 -> Caching 1000
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:17:41 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:41 INFO util.TimerUtil: SCAN NORMAL:68288ms,Counter:2 -> Caching 1
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:17:48 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:48 INFO util.TimerUtil: SCAN NORMAL:2646ms,Counter:2 -> Caching 100
>>>>>>>>>>
>>>>>>>>>> 14/09/10 09:17:58 INFO hbase.HbaseScanTest: NUM ROWS 100000
>>>>>>>>>> 14/09/10 09:17:58 INFO util.TimerUtil: SCAN NORMAL:3903ms,Counter:2 -> Caching 1000
>>>>>>>>>>
>>>>>>>>>> The parallel scan performs much worse than the simple scan, and I don't
>>>>>>>>>> know why the simple scan is so fast; it's really much faster than
>>>>>>>>>> executing a "count" from the hbase shell, which doesn't look very normal.
>>>>>>>>>> The only time the parallel scan works better is when I execute the
>>>>>>>>>> normal scan with caching 1.
>>>>>>>>>>
>>>>>>>>>> Any clue about it?
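
For reference, the partition arithmetic from the first message can be checked without a cluster. What follows is only a sketch under stated assumptions, not HBase code: a TreeMap stands in for the table, and the class and method names (ParallelScanSketch, buildTable, partialScan, parallelScan) are hypothetical. It relies on one property of HBase scans, that the start row is inclusive and the stop row is exclusive, so scanning each region's [startKey, endKey) range should visit every row exactly once. It runs one Callable per region and compares the row counts and summed values against a single full scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelScanSketch {

    // Region boundaries copied from the partition list in the first message.
    static final String[] SPLITS = {"016666", "033332", "049998", "066664", "083330"};

    // Stand-in for the table (assumption): rows 000001..100000, value = row number.
    static NavigableMap<String, Long> buildTable() {
        NavigableMap<String, Long> table = new TreeMap<>();
        for (int i = 1; i <= 100000; i++) {
            table.put(String.format("%06d", i), (long) i);
        }
        return table;
    }

    // One partial scan over [startKey, stopKey); an empty key means an open end.
    // Returns {rowCount, sumOfValues}.
    static long[] partialScan(NavigableMap<String, Long> table, String startKey, String stopKey) {
        NavigableMap<String, Long> range = table;
        if (!startKey.isEmpty()) {
            range = range.tailMap(startKey, true);   // start row inclusive
        }
        if (!stopKey.isEmpty()) {
            range = range.headMap(stopKey, false);   // stop row exclusive
        }
        long rows = 0;
        long sum = 0;
        for (long value : range.values()) {
            rows++;
            sum += value;
        }
        return new long[] {rows, sum};
    }

    // Submits one Callable per region and collects the results in region order.
    static List<long[]> parallelScan(NavigableMap<String, Long> table, String[] splits) {
        ExecutorService pool = Executors.newFixedThreadPool(splits.length + 1);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            for (int i = 0; i <= splits.length; i++) {
                final String start = (i == 0) ? "" : splits[i - 1];
                final String stop = (i == splits.length) ? "" : splits[i];
                Callable<long[]> task = () -> partialScan(table, start, stop);
                futures.add(pool.submit(task));
            }
            List<long[]> results = new ArrayList<>();
            for (Future<long[]> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = buildTable();
        long[] full = partialScan(table, "", "");
        long rows = 0;
        long sum = 0;
        for (long[] part : parallelScan(table, SPLITS)) {
            System.out.println("region rows=" + part[0] + " sum=" + part[1]);
            rows += part[0];
            sum += part[1];
        }
        System.out.println("parallel rows=" + rows + " single rows=" + full[0]);
        System.out.println("sums match: " + (sum == full[1]));
    }
}
```

With the six ranges above, the per-region row counts come out as 16665, 16666, 16666, 16666, 16666 and 16671, which agrees with the request counts in the original post and adds up to 100000. The sketch only checks that the ranges do not overlap or leave gaps; it says nothing about where the time goes on a real cluster.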