Hi Dave,

toList will exhaust the iterator. But all 6 iterators will be exhausted concurrently, each inside its own Future (http://docs.scala-lang.org/overviews/core/futures.html).
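For reference, the pattern under discussion looks roughly like this (a minimal sketch; instance, ARTIFACTS, auths and ranges are the names from the snippets quoted below, and the thread pool is the implicit ExecutionContext):

import scala.collection.JavaConverters._
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// ~3000 ranges split into groups of 500 -> 6 BatchScanners
val futures = ranges.asScala.toList.grouped(500).toList.map { group =>
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(group.asJava)
  // each Future starts running on the pool as soon as it is created,
  // so all six scanners are drained concurrently
  Future {
    try bscan.asScala.toList   // toList exhausts this scanner's iterator
    finally bscan.close()      // release the scanner once it is drained
  }
}

Future.sequence(futures) can then be awaited if all the entries are needed in one place.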
Regards,
Sven

--
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[email protected]
www.scai.fraunhofer.de

----- Original Message -----
> From: [email protected]
> To: "user" <[email protected]>
> Sent: Thursday, August 25, 2016 16:22:35
> Subject: Re: Accumulo Seek performance

> But does toList exhaust the first iterator() before going to the next?
>
> - Dave
>
> ----- Original Message -----
>
> From: "Sven Hodapp" <[email protected]>
> To: "user" <[email protected]>
> Sent: Thursday, August 25, 2016 9:42:00 AM
> Subject: Re: Accumulo Seek performance
>
> Hi dlmarion,
>
> toList should also call iterator(), and that is done independently for each
> batch scanner iterator in the context of the Future.
>
> Regards,
> Sven
>
> --
> Sven Hodapp, M.Sc.,
> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
> Department of Bioinformatics
> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
> [email protected]
> www.scai.fraunhofer.de
>
> ----- Original Message -----
>> From: [email protected]
>> To: "user" <[email protected]>
>> Sent: Thursday, August 25, 2016 14:34:39
>> Subject: Re: Accumulo Seek performance
>
>> Calling BatchScanner.iterator() is what starts the work on the server side. You
>> should do this first for all 6 batch scanners, then iterate over all of them in
>> parallel.
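A minimal sketch of that advice (hypothetical; instance, ARTIFACTS, auths and ranges are the same names used in the snippets below) - the iterators are created eagerly before anything is drained:

import scala.collection.JavaConverters._
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// one BatchScanner per group of 500 ranges
val scanners = ranges.asScala.toList.grouped(500).toList.map { group =>
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(group.asJava)
  bscan
}

// iterator() is what kicks off the lookups on the tablet servers,
// so grab all six iterators before draining any of them
val iterators = scanners.map(_.iterator())

// then drain them in parallel on the local thread pool
val futures = iterators.map(it => Future(it.asScala.toList))
val results = Await.result(Future.sequence(futures), Duration.Inf)
scanners.foreach(_.close())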
>>
>> ----- Original Message -----
>>
>> From: "Sven Hodapp" <[email protected]>
>> To: "user" <[email protected]>
>> Sent: Thursday, August 25, 2016 4:53:41 AM
>> Subject: Re: Accumulo Seek performance
>>
>> Hi,
>>
>> I've changed the code a little bit, so that it uses a thread pool (via the
>> Future):
>>
>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>>
>> for (ranges <- ranges500) {
>>   val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
>>   bscan.setRanges(ranges.asJava)
>>   Future {
>>     time("mult-scanner") {
>>       bscan.asScala.toList // toList forces the iteration of the iterator
>>     }
>>   }
>> }
>>
>> Here are the results:
>>
>> background log: info: mult-scanner time: 4807.289358 ms
>> background log: info: mult-scanner time: 4930.996522 ms
>> background log: info: mult-scanner time: 9510.010808 ms
>> background log: info: mult-scanner time: 11394.152391 ms
>> background log: info: mult-scanner time: 13297.247295 ms
>> background log: info: mult-scanner time: 14032.704837 ms
>>
>> background log: info: single-scanner time: 15322.624393 ms
>>
>> Every Future completes independently, but in return every batch scanner
>> iterator needs more time to complete. :(
>> This means the batch scanners aren't really processed in parallel on the
>> server side?
>> Should I reconfigure something? Maybe the tablet servers haven't/can't
>> allocate enough threads or memory? (Each of the two nodes has 8 cores and
>> 64GB memory and storage with ~300MB/s...)
>>
>> Regards,
>> Sven
>>
>> --
>> Sven Hodapp, M.Sc.,
>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>> Department of Bioinformatics
>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>> [email protected]
>> www.scai.fraunhofer.de
>>
>> ----- Original Message -----
>>> From: "Josh Elser" <[email protected]>
>>> To: "user" <[email protected]>
>>> Sent: Wednesday, August 24, 2016 18:36:42
>>> Subject: Re: Accumulo Seek performance
>>
>>> Ahh duh. Bad advice from me in the first place :)
>>>
>>> Throw 'em in a threadpool locally.
>>>
>>> [email protected] wrote:
>>>> Doesn't this use the 6 batch scanners serially?
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From: *"Sven Hodapp" <[email protected]>
>>>> *To: *"user" <[email protected]>
>>>> *Sent: *Wednesday, August 24, 2016 11:56:14 AM
>>>> *Subject: *Re: Accumulo Seek performance
>>>>
>>>> Hi Josh,
>>>>
>>>> thanks for your reply!
>>>>
>>>> I've tested your suggestion with an implementation like this:
>>>>
>>>> val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created
>>>>
>>>> time("mult-scanner") {
>>>>   for (ranges <- ranges500) {
>>>>     val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
>>>>     bscan.setRanges(ranges.asJava)
>>>>     for (entry <- bscan.asScala) yield {
>>>>       entry.getKey()
>>>>     }
>>>>   }
>>>> }
>>>>
>>>> And the result is a bit disappointing:
>>>>
>>>> background log: info: mult-scanner time: 18064.969281 ms
>>>> background log: info: single-scanner time: 6527.482383 ms
>>>>
>>>> Am I doing something wrong here?
>>>>
>>>> Regards,
>>>> Sven
>>>>
>>>> --
>>>> Sven Hodapp, M.Sc.,
>>>> Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
>>>> Department of Bioinformatics
>>>> Schloss Birlinghoven, 53754 Sankt Augustin, Germany
>>>> [email protected]
>>>> www.scai.fraunhofer.de
>>>>
>>>> ----- Original Message -----
>>>> > From: "Josh Elser" <[email protected]>
>>>> > To: "user" <[email protected]>
>>>> > Sent: Wednesday, August 24, 2016 16:33:37
>>>> > Subject: Re: Accumulo Seek performance
>>>>
>>>> > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
>>>> >
>>>> > I don't feel like 3000 ranges is too many, but this isn't quantitative.
>>>> >
>>>> > IIRC, the BatchScanner will take each Range you provide, bin each Range
>>>> > to the TabletServer(s) currently hosting the corresponding data, clip
>>>> > (truncate) each Range to match the Tablet boundaries, and then do an
>>>> > RPC to each TabletServer with just the Ranges hosted there.
>>>> >
>>>> > Inside the TabletServer, it will then have many Ranges, binned by Tablet
>>>> > (KeyExtent, to be precise). This will spawn an
>>>> > org.apache.accumulo.tserver.scan.LookupTask which will start collecting
>>>> > results to send back to the client.
>>>> >
>>>> > The caveat here is that those Ranges are processed serially on a
>>>> > TabletServer. Maybe you're swamping one TabletServer with lots of
>>>> > Ranges that it could be processing in parallel.
>>>> >
>>>> > Could you experiment with using multiple BatchScanners and something
>>>> > like Guava's Iterables.concat to make it appear like one Iterator?
>>>> >
>>>> > I'm curious if we should put an optimization into the BatchScanner
>>>> > itself to limit the number of ranges we send in one RPC to a
>>>> > TabletServer (e.g. one BatchScanner might open multiple
>>>> > MultiScanSessions to a TabletServer).
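A rough sketch of the Iterables.concat suggestion (hypothetical; Guava on the classpath, and the same instance, ARTIFACTS, auths and ranges as elsewhere in this thread):

import java.util.Map.Entry
import com.google.common.collect.Iterables
import org.apache.accumulo.core.data.{Key, Value}
import scala.collection.JavaConverters._

// one BatchScanner per group of 500 ranges (6 groups for ~3000 ranges)
val scanners = ranges.asScala.toList.grouped(500).toList.map { group =>
  val bscan = instance.createBatchScanner(ARTIFACTS, auths, 2)
  bscan.setRanges(group.asJava)
  bscan   // BatchScanner is itself an Iterable[Entry[Key, Value]]
}

// stitch them together so callers still see a single iterator; note that
// concat advances the scanners one after another, so it hides the grouping
// but does not by itself add any client-side parallelism
val combined: java.lang.Iterable[Entry[Key, Value]] =
  Iterables.concat[Entry[Key, Value]](scanners.asJava)

for (entry <- combined.asScala) {
  val key = entry.getKey()
  // etc.
}

scanners.foreach(_.close())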
>>>> > Sven Hodapp wrote:
>>>> >> Hi there,
>>>> >>
>>>> >> currently we're experimenting with a two-node Accumulo cluster (two tablet
>>>> >> servers) set up for document storage.
>>>> >> These documents are decomposed down to the sentence level.
>>>> >>
>>>> >> Now I'm using a BatchScanner to assemble the full document like this:
>>>> >>
>>>> >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10) // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>>>> >> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges-list
>>>> >> for (entry <- bscan.asScala) yield {
>>>> >>   val key = entry.getKey()
>>>> >>   val value = entry.getValue()
>>>> >>   // etc.
>>>> >> }
>>>> >>
>>>> >> For larger full documents (e.g. 3000 exact ranges), this operation will take
>>>> >> about 12 seconds.
>>>> >> But shorter documents are assembled blazing fast...
>>>> >>
>>>> >> Is that too much for a BatchScanner / am I misusing the BatchScanner?
>>>> >> Is that a normal time for such a (seek) operation?
>>>> >> Can I do something to get better seek performance?
>>>> >>
>>>> >> Note: I have already enabled bloom filtering on that table.
>>>> >>
>>>> >> Thank you for any advice!
>>>> >>
>>>> >> Regards,
>>>> >> Sven
