Doesn't this use the 6 batch scanners serially?
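
If they are created and then drained one after another, the second scanner does not start fetching until the first one is exhausted. Something along these lines might be worth trying (untested sketch, reusing the instance, ARTIFACTS, auths and ranges names from the snippet below and assuming the default scala.concurrent ExecutionContext), so that all six scanners are consumed concurrently:

    import scala.collection.JavaConverters._
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    // Create one BatchScanner per group of 500 ranges, all up front.
    val scanners = ranges.asScala.grouped(500).map { group =>
      val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
      bscan.setRanges(group.asJava)
      bscan
    }.toList

    // Drain each scanner in its own Future so the lookups overlap,
    // instead of exhausting one scanner before the next one starts.
    val futures = scanners.map { bscan =>
      Future {
        try bscan.asScala.map(_.getKey).toList
        finally bscan.close()
      }
    }

    val keys = Await.result(Future.sequence(futures), Duration.Inf).flatten

The pool sizing is a detail; the point is only that the six scanners are being read at the same time rather than in sequence.
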
From: "Sven Hodapp" <[email protected]> To: "user" <[email protected]> Sent: Wednesday, August 24, 2016 11:56:14 AM Subject: Re: Accumulo Seek performance Hi Josh, thanks for your reply! I've tested your suggestion with a implementation like that: val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created time("mult-scanner") { for (ranges <- ranges500) { val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1) bscan.setRanges(ranges.asJava) for (entry <- bscan.asScala) yield { entry.getKey() } } } And the result is a bit disappointing: background log: info: mult-scanner time: 18064.969281 ms background log: info: single-scanner time: 6527.482383 ms I'm doing something wrong here? Regards, Sven -- Sven Hodapp, M.Sc., Fraunhofer Institute for Algorithms and Scientific Computing SCAI, Department of Bioinformatics Schloss Birlinghoven, 53754 Sankt Augustin, Germany [email protected] www.scai.fraunhofer.de ----- Ursprüngliche Mail ----- > Von: "Josh Elser" <[email protected]> > An: "user" <[email protected]> > Gesendet: Mittwoch, 24. August 2016 16:33:37 > Betreff: Re: Accumulo Seek performance > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710 > > I don't feel like 3000 ranges is too many, but this isn't quantitative. > > IIRC, the BatchScanner will take each Range you provide, bin each Range > to the TabletServer(s) currently hosting the corresponding data, clip > (truncate) each Range to match the Tablet boundaries, and then does an > RPC to each TabletServer with just the Ranges hosted there. > > Inside the TabletServer, it will then have many Ranges, binned by Tablet > (KeyExtent, to be precise). This will spawn a > org.apache.accumulo.tserver.scan.LookupTask will will start collecting > results to send back to the client. > > The caveat here is that those ranges are processed serially on a > TabletServer. Maybe, you're swamping one TabletServer with lots of > Ranges that it could be processing in parallel. > > Could you experiment with using multiple BatchScanners and something > like Guava's Iterables.concat to make it appear like one Iterator? > > I'm curious if we should put an optimization into the BatchScanner > itself to limit the number of ranges we send in one RPC to a > TabletServer (e.g. one BatchScanner might open multiple > MultiScanSessions to a TabletServer). > > Sven Hodapp wrote: >> Hi there, >> >> currently we're experimenting with a two node Accumulo cluster (two tablet >> servers) setup for document storage. >> This documents are decomposed up to the sentence level. >> >> Now I'm using a BatchScanner to assemble the full document like this: >> >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // ARTIFACTS >> table >> currently hosts ~30GB data, ~200M entries on ~45 tablets >> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the >> ranges-list >> for (entry<- bscan.asScala) yield { >> val key = entry.getKey() >> val value = entry.getValue() >> // etc. >> } >> >> For larger full documents (e.g. 3000 exact ranges), this operation will take >> about 12 seconds. >> But shorter documents are assembled blazing fast... >> >> Is that to much for a BatchScanner / I'm misusing the BatchScaner? >> Is that a normal time for such a (seek) operation? >> Can I do something to get a better seek performance? >> >> Note: I have already enabled bloom filtering on that table. >> >> Thank you for any advice! >> >> Regards, >> Sven
