Hi Dave,

toList will exhaust the iterator. But all 6 iterators will be exhausted concurrently, each inside its own Future (http://docs.scala-lang.org/overviews/core/futures.html).
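For reference, the pattern under discussion looks roughly like this (a minimal sketch; instance, ARTIFACTS, auths and ranges are the names from the snippets quoted below, and the thread pool is the implicit ExecutionContext):

import scala.collection.JavaConverters._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// ~3000 ranges split into groups of 500 -> 6 BatchScanners
val futures = ranges.asScala.toList.grouped(500).toList.map { group =>
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(group.asJava)
  // each Future starts running on the pool as soon as it is created,
  // so all six scanners are drained concurrently
  Future {
    try bscan.asScala.toList   // toList exhausts this scanner's iterator
    finally bscan.close()      // release the scanner once it is drained
  }
}

Future.sequence(futures) can then be awaited if all the entries are needed in one place.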
Regards,
Sven

--
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[email protected]
www.scai.fraunhofer.de

----- Original Message -----
> From: [email protected]
> To: "user" <[email protected]>
> Sent: Thursday, August 25, 2016 16:22:35
> Subject: Re: Accumulo Seek performance

> But does toList exhaust the first iterator() before going to the next?
>
> - Dave
>
> ----- Original Message -----
>
> From: "Sven Hodapp" <[email protected]>
> To: "user" <[email protected]>
> Sent: Thursday, August 25, 2016 9:42:00 AM
> Subject: Re: Accumulo Seek performance
>
> Hi dlmarion,
>
> toList should also call iterator(), and that is done independently for each
> batch scanner iterator in the context of the Future.
>
> Regards,
> Sven
>
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> [email protected]
> www.scai.fraunhofer.de
>
> ----- Original Message -----
>> From: [email protected]
>> To: "user" <[email protected]>
>> Sent: Thursday, August 25, 2016 14:34:39
>> Subject: Re: Accumulo Seek performance
>
>> Calling BatchScanner.iterator() is what starts the work on the server side. You
>> should do this first for all 6 batch scanners, then iterate over all of them in
>> parallel.
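A minimal sketch of that advice (hypothetical; instance, ARTIFACTS, auths and ranges are the same names used in the snippets below) - the iterators are created eagerly before anything is drained:

import scala.collection.JavaConverters._
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// one BatchScanner per group of 500 ranges
val scanners = ranges.asScala.toList.grouped(500).toList.map { group =>
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(group.asJava)
  bscan
}

// iterator() is what kicks off the lookups on the tablet servers,
// so grab all six iterators before draining any of them
val iterators = scanners.map(_.iterator())

// then drain them in parallel on the local thread pool
val futures = iterators.map(it => Future(it.asScala.toList))
val results = Await.result(Future.sequence(futures), Duration.Inf)
scanners.foreach(_.close())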
>>
>> ----- Original Message -----
>>
>> From: "Sven Hodapp" <[email protected]>
>> To: "user" <[email protected]>
>> Sent: Thursday, August 25, 2016 4:53:41 AM
>> Subject: Re: Accumulo Seek performance
>>
>> Hi,
>>
>> I've changed the code a little bit, so that it uses a thread pool (via the
>> Future):
>>
>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>>
>> for (ranges <- ranges500) {
>>   val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
>>   bscan.setRanges(ranges.asJava)
>>   Future {
>>     time("mult-scanner") {
>>       bscan.asScala.toList // toList forces the iteration of the iterator
>>     }
>>   }
>> }
>>
>> Here are the results:
>>
>> background log: info: mult-scanner time: 4807.289358 ms
>> background log: info: mult-scanner time: 4930.996522 ms
>> background log: info: mult-scanner time: 9510.010808 ms
>> background log: info: mult-scanner time: 11394.152391 ms
>> background log: info: mult-scanner time: 13297.247295 ms
>> background log: info: mult-scanner time: 14032.704837 ms
>>
>> background log: info: single-scanner time: 15322.624393 ms
>>
>> Every Future completes independently, but in return every batch scanner
>> iterator needs more time to complete. :(
>> This means the batch scanners aren't really processed in parallel on the
>> server side?
>> Should I reconfigure something? Maybe the tablet servers haven't/can't
>> allocate enough threads or memory? (Each of the two nodes has 8 cores and
>> 64GB memory and storage with ~300MB/s...)
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> [email protected]
>> www.scai.fraunhofer.de
>>
>> ----- Original Message -----
>>> From: "Josh Elser" <[email protected]>
>>> To: "user" <[email protected]>
>>> Sent: Wednesday, August 24, 2016 18:36:42
>>> Subject: Re: Accumulo Seek performance
>>
>>> Ahh duh. Bad advice from me in the first place :)
>>>
>>> Throw 'em in a threadpool locally.
>>>
>>> [email protected] wrote:
>>>> Doesn't this use the 6 batch scanners serially?
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From: *"Sven Hodapp" <[email protected]>
>>>> *To: *"user" <[email protected]>
>>>> *Sent: *Wednesday, August 24, 2016 11:56:14 AM
>>>> *Subject: *Re: Accumulo Seek performance
>>>>
>>>> Hi Josh,
>>>>
>>>> thanks for your reply!
>>>>
>>>> I've tested your suggestion with an implementation like this:
>>>>
>>>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>>>>
>>>> time("mult-scanner") {
>>>>   for (ranges <- ranges500) {
>>>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
>>>>     bscan.setRanges(ranges.asJava)
>>>>     for (entry <- bscan.asScala) yield {
>>>>       entry.getKey()
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> And the result is a bit disappointing:
>>>>
>>>> background log: info: mult-scanner time: 18064.969281 ms
>>>> background log: info: single-scanner time: 6527.482383 ms
>>>>
>>>> Am I doing something wrong here?
>>>>
>>>> Regards,
>>>> Sven
>>>>
>>>> --
>>>> Sven Hodapp, M.Sc.,
>>>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>>>> Department of Bioinformatics
>>>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>>>> [email protected]
>>>> www.scai.fraunhofer.de
>>>>
>>>> ----- Original Message -----
>>>> > From: "Josh Elser" <[email protected]>
>>>> > To: "user" <[email protected]>
>>>> > Sent: Wednesday, August 24, 2016 16:33:37
>>>> > Subject: Re: Accumulo Seek performance
>>>>
>>>> > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
>>>> >
>>>> > I don't feel like 3000 ranges is too many, but this isn't quantitative.
>>>> >
>>>> > IIRC, the BatchScanner will take each Range you provide, bin each Range
>>>> > to the TabletServer(s) currently hosting the corresponding data, clip
>>>> > (truncate) each Range to match the Tablet boundaries, and then do an
>>>> > RPC to each TabletServer with just the Ranges hosted there.
>>>> >
>>>> > Inside the TabletServer, it will then have many Ranges, binned by Tablet
>>>> > (KeyExtent, to be precise). This will spawn an
>>>> > org.apache.accumulo.tserver.scan.LookupTask which will start collecting
>>>> > results to send back to the client.
>>>> >
>>>> > The caveat here is that those Ranges are processed serially on a
>>>> > TabletServer. Maybe you're swamping one TabletServer with lots of
>>>> > Ranges that it could be processing in parallel.
>>>> >
>>>> > Could you experiment with using multiple BatchScanners and something
>>>> > like Guava's Iterables.concat to make it appear like one Iterator?
>>>> >
>>>> > I'm curious if we should put an optimization into the BatchScanner
>>>> > itself to limit the number of ranges we send in one RPC to a
>>>> > TabletServer (e.g. one BatchScanner might open multiple
>>>> > MultiScanSessions to a TabletServer).
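A rough sketch of the Iterables.concat suggestion (hypothetical; Guava on the classpath, and the same instance, ARTIFACTS, auths and ranges as elsewhere in this thread):

import java.util.Map.Entry
import com.google.common.collect.Iterables
import org.apache.accumulo.core.data.{Key, Value}
import scala.collection.JavaConverters._

// one BatchScanner per group of 500 ranges (6 groups for ~3000 ranges)
val scanners = ranges.asScala.toList.grouped(500).toList.map { group =>
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(group.asJava)
  bscan   // BatchScanner is itself an Iterable[Entry[Key, Value]]
}

// stitch them together so callers still see a single iterator; note that
// concat advances the scanners one after another, so it hides the grouping
// but does not by itself add any client-side parallelism
val combined: java.lang.Iterable[Entry[Key, Value]] =
  Iterables.concat[Entry[Key, Value]](scanners.asJava)

for (entry <- combined.asScala) {
  val key = entry.getKey()
  // etc.
}

scanners.foreach(_.close())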
>>>> > Sven Hodapp wrote:
>>>> >> Hi there,
>>>> >>
>>>> >> currently we're experimenting with a two-node Accumulo cluster (two tablet
>>>> >> servers) set up for document storage.
>>>> >> These documents are decomposed down to the sentence level.
>>>> >>
>>>> >> Now I'm using a BatchScanner to assemble the full document like this:
>>>> >>
>>>> >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>>>> >> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges-list
>>>> >> for (entry <- bscan.asScala) yield {
>>>> >>   val key = entry.getKey()
>>>> >>   val value = entry.getValue()
>>>> >>   // etc.
>>>> >> }
>>>> >>
>>>> >> For larger full documents (e.g. 3000 exact ranges), this operation will take
>>>> >> about 12 seconds.
>>>> >> But shorter documents are assembled blazing fast...
>>>> >>
>>>> >> Is that too much for a BatchScanner / am I misusing the BatchScanner?
>>>> >> Is that a normal time for such a (seek) operation?
>>>> >> Can I do something to get better seek performance?
>>>> >>
>>>> >> Note: I have already enabled bloom filtering on that table.
>>>> >>
>>>> >> Thank you for any advice!
>>>> >>
>>>> >> Regards,
>>>> >> Sven
