Ahh duh. Bad advice from me in the first place :)

Throw 'em in a threadpool locally.
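"Throw 'em in a threadpool" could look roughly like the sketch below. Here `scanGroup` is a hypothetical stand-in for opening one BatchScanner per group of 500 ranges and draining it (replace its body with the real `createBatchScanner`/`setRanges` calls from earlier in the thread); the pool sizing and all names are assumptions for illustration, not part of the original code.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical stand-in: in the real code this would open a BatchScanner
// for one group of ranges, drain it, and close it.
def scanGroup(group: Seq[Int]): Seq[Int] = group.map(_ * 2)

val ranges = (1 to 3000).toSeq
val groups = ranges.grouped(500).toSeq // 6 groups, as in the thread

// One thread per group so the 6 scanners run concurrently instead of serially.
val pool = Executors.newFixedThreadPool(groups.size)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val futures = groups.map(g => Future(scanGroup(g)))
// Future.sequence preserves group order; flatten stitches results back together.
val results = Await.result(Future.sequence(futures), Duration.Inf).flatten

pool.shutdown()
```

The key difference from the serial loop below is that each group's scan is submitted as a task before any group is drained, so the per-group latencies overlap instead of adding up.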

[email protected] wrote:
Doesn't this use the 6 batch scanners serially?

------------------------------------------------------------------------
*From: *"Sven Hodapp" <[email protected]>
*To: *"user" <[email protected]>
*Sent: *Wednesday, August 24, 2016 11:56:14 AM
*Subject: *Re: Accumulo Seek performance

Hi Josh,

thanks for your reply!

I've tested your suggestion with an implementation like this:

val ranges500 = ranges.asScala.grouped(500) // this means 6 BatchScanners will be created

time("mult-scanner") {
  for (ranges <- ranges500) {
    val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
    bscan.setRanges(ranges.asJava)
    for (entry <- bscan.asScala) yield {
      entry.getKey()
    }
    bscan.close() // release the scanner's server-side resources
  }
}

And the result is a bit disappointing:

background log: info: mult-scanner time: 18064.969281 ms
background log: info: single-scanner time: 6527.482383 ms

Am I doing something wrong here?


Regards,
Sven

--
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[email protected]
www.scai.fraunhofer.de

----- Original Message -----
 > From: "Josh Elser" <[email protected]>
 > To: "user" <[email protected]>
 > Sent: Wednesday, August 24, 2016 16:33:37
 > Subject: Re: Accumulo Seek performance

 > This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
 >
 > I don't feel like 3000 ranges is too many, but this isn't quantitative.
 >
 > IIRC, the BatchScanner will take each Range you provide, bin each Range
 > to the TabletServer(s) currently hosting the corresponding data, clip
 > (truncate) each Range to match the Tablet boundaries, and then do an
 > RPC to each TabletServer with just the Ranges hosted there.
 >
 > Inside the TabletServer, it will then have many Ranges, binned by Tablet
 > (KeyExtent, to be precise). This will spawn an
 > org.apache.accumulo.tserver.scan.LookupTask, which will start collecting
 > results to send back to the client.
 >
 > The caveat here is that those ranges are processed serially on a
 > TabletServer. Maybe, you're swamping one TabletServer with lots of
 > Ranges that it could be processing in parallel.
 >
 > Could you experiment with using multiple BatchScanners and something
 > like Guava's Iterables.concat to make it appear like one Iterator?
 >
 > I'm curious if we should put an optimization into the BatchScanner
 > itself to limit the number of ranges we send in one RPC to a
 > TabletServer (e.g. one BatchScanner might open multiple
 > MultiScanSessions to a TabletServer).
 >
 > Sven Hodapp wrote:
 >> Hi there,
 >>
 >> currently we're experimenting with a two-node Accumulo cluster (two
 >> tablet servers) set up for document storage.
 >> These documents are decomposed down to the sentence level.
 >>
 >> Now I'm using a BatchScanner to assemble the full document like this:
 >>
 >> val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
 >> // the ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
 >> bscan.setRanges(ranges) // there are like 3000 Range.exact's in the ranges-list
 >> for (entry <- bscan.asScala) yield {
 >>   val key = entry.getKey()
 >>   val value = entry.getValue()
 >>   // etc.
 >> }
 >>
 >> For larger full documents (e.g. 3000 exact ranges), this operation
 >> will take about 12 seconds.
 >> But shorter documents are assembled blazing fast...
 >>
 >> Is that too much for a BatchScanner / am I misusing the BatchScanner?
 >> Is that a normal time for such a (seek) operation?
 >> Can I do something to get a better seek performance?
 >>
 >> Note: I have already enabled bloom filtering on that table.
 >>
 >> Thank you for any advice!
 >>
 >> Regards,
 >> Sven
