Doesn't this use the 6 batch scanners serially?
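
If they are created and then drained one after another, the second scanner does not start fetching until the first one is exhausted. Something along these lines might be worth trying (untested sketch, reusing the instance, ARTIFACTS, auths and ranges names from the snippet below and assuming the default scala.concurrent ExecutionContext), so that all six scanners are consumed concurrently:

    import scala.collection.JavaConverters._
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Create one BatchScanner per group of 500 ranges, all up front.
    val scanners = ranges.asScala.grouped(500).map { group =>
      val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
      bscan.setRanges(group.asJava)
      bscan
    }.toList

    // Drain each scanner in its own Future so the lookups overlap,
    // instead of exhausting one scanner before the next one starts.
    val futures = scanners.map { bscan =>
      Future {
        try bscan.asScala.map(_.getKey).toList
        finally bscan.close()
      }
    }

    val keys = Await.result(Future.sequence(futures), Duration.Inf).flatten

The pool sizing is a detail; the point is only that the six scanners are being read at the same time rather than in sequence.
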
From: "Sven Hodapp" <[email protected]> To: "user" <[email protected]> Sent: Wednesday, August 24, 2016 11:56:14 AM Subject: Re: Accumulo Seek performance Hi Josh, thanks for your reply! I've tested your suggestion with a implementation like that: val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created time("mult-scanner") { for (ranges <- ranges500) { val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1) bscan.setRanges(ranges.asJava) for (entry <- bscan.asScala) yield { entry.getKey() } } } And the result is a bit disappointing: background log: info: mult-scanner time: 18064.969281 ms background log: info: single-scanner time: 6527.482383 ms I'm doing something wrong here? Regards, Sven -- Sven Hodapp, M.Sc., Fraunhofer Institute for Algorithms and Scientific Computing SCAI, Department of Bioinformatics Schloss Birlinghoven, 53754 Sankt Augustin, Germany [email protected] www.scai.fraunhofer.de ----- Ursprüngliche Mail ----- > Von: "Josh Elser" <[email protected]> > An: "user" <[email protected]> > Gesendet: Mittwoch, 24. August 2016 16:33:37 > Betreff: Re: Accumulo Seek performance > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710 > > I don't feel like 3000 ranges is too many, but this isn't quantitative. > > IIRC, the BatchScanner will take each Range you provide, bin each Range > to the TabletServer(s) currently hosting the corresponding data, clip > (truncate) each Range to match the Tablet boundaries, and then does an > RPC to each TabletServer with just the Ranges hosted there. > > Inside the TabletServer, it will then have many Ranges, binned by Tablet > (KeyExtent, to be precise). This will spawn a > org.apache.accumulo.tserver.scan.LookupTask will will start collecting > results to send back to the client. > > The caveat here is that those ranges are processed serially on a > TabletServer. Maybe, you're swamping one TabletServer with lots of > Ranges that it could be processing in parallel. > > Could you experiment with using multiple BatchScanners and something > like Guava's Iterables.concat to make it appear like one Iterator? > > I'm curious if we should put an optimization into the BatchScanner > itself to limit the number of ranges we send in one RPC to a > TabletServer (e.g. one BatchScanner might open multiple > MultiScanSessions to a TabletServer). > > Sven Hodapp wrote: >> Hi there, >> >> currently we're experimenting with a two node Accumulo cluster (two tablet >> servers) setup for document storage. >> This documents are decomposed up to the sentence level. >> >> Now I'm using a BatchScanner to assemble the full document like this: >> >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // ARTIFACTS >> table >> currently hosts ~30GB data, ~200M entries on ~45 tablets >> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the >> ranges-list >> for (entry<- bscan.asScala) yield { >> val key = entry.getKey() >> val value = entry.getValue() >> // etc. >> } >> >> For larger full documents (e.g. 3000 exact ranges), this operation will take >> about 12 seconds. >> But shorter documents are assembled blazing fast... >> >> Is that to much for a BatchScanner / I'm misusing the BatchScaner? >> Is that a normal time for such a (seek) operation? >> Can I do something to get a better seek performance? >> >> Note: I have already enabled bloom filtering on that table. >> >> Thank you for any advice! >> >> Regards, >> Sven
