Re: Accumulo Seek performance

Sven Hodapp Wed, 24 Aug 2016 08:56:54 -0700

Hi Josh,

thanks for your reply!


I've tested your suggestion with an implementation like this:

    val ranges500 = ranges.asScala.grouped(500)  // this means 6 BatchScanners will be created

    time("mult-scanner") {
      for (ranges <- ranges500) {
        val bscan = instance.createBatchScanner(ARTIFACTS, auths, 1)
        bscan.setRanges(ranges.asJava)
        for (entry <- bscan.asScala) yield {
          entry.getKey()
        }
      }
    }

And the result is a bit disappointing:

background log: info: mult-scanner time: 18064.969281 ms
background log: info: single-scanner time: 6527.482383 ms

Am I doing something wrong here?
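
One thing the timings may reflect, offered as an observation rather than a diagnosis: the loop above runs the six BatchScanners one after another, each with a single query thread, while the earlier single-scanner snippet used 10 threads. Below is a minimal sketch of running the groups concurrently instead. The helper `scanInParallel` and its parameters are hypothetical (not part of Accumulo), and the example uses plain integers in place of Ranges so that it runs without a cluster.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelBatches {
  /** Splits `ranges` into groups of `groupSize`, runs `scan` on each group
    * concurrently on a pool of `parallelism` threads, and returns the
    * results concatenated in the original group order.
    * `scan` is supplied by the caller; in the thread's setting it would
    * wrap one BatchScanner per group (an assumption, not shown here). */
  def scanInParallel[R, E](ranges: Seq[R], groupSize: Int, parallelism: Int)(
      scan: Seq[R] => Seq[E]): Seq[E] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // One Future per group; all groups are submitted before any is awaited.
      val futures = ranges.grouped(groupSize).toList.map(group => Future(scan(group)))
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally pool.shutdown()
  }
}

// Stand-in usage: 3000 "ranges" in groups of 500, scanned by 6 workers.
val doubled = ParallelBatches.scanInParallel((1 to 3000).toList, 500, 6)(group => group.map(_ * 2))
```

With Accumulo, `scan` would presumably create a BatchScanner for the group, call setRanges, copy out the keys, and close the scanner. Whether this beats a single BatchScanner with more query threads is something only a measurement on the cluster can show.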


Regards,
Sven

-- 
Sven Hodapp, M.Sc.,
Fraunhofer Institute for Algorithms and Scientific Computing SCAI,
Department of Bioinformatics
Schloss Birlinghoven, 53754 Sankt Augustin, Germany
[email protected]
www.scai.fraunhofer.de

----- Original Message -----
> From: "Josh Elser" <[email protected]>
> To: "user" <[email protected]>
> Sent: Wednesday, 24 August 2016 16:33:37
> Subject: Re: Accumulo Seek performance

> This reminded me of https://issues.apache.org/jira/browse/ACCUMULO-3710
> 
> I don't feel like 3000 ranges is too many, but this isn't quantitative.
> 
> IIRC, the BatchScanner will take each Range you provide, bin each Range
> to the TabletServer(s) currently hosting the corresponding data, clip
> (truncate) each Range to match the Tablet boundaries, and then make an
> RPC to each TabletServer with just the Ranges hosted there.
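
The binning step described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Accumulo's implementation: `Tablet`, `contains`, and `bin` are hypothetical names, rows are plain strings, and only exact-row lookups are handled (so no clipping is needed).

```scala
object RangeBinning {
  // Toy tablet holding rows in (prevEndRow, endRow]; None means unbounded.
  final case class Tablet(prevEndRow: Option[String], endRow: Option[String], server: String) {
    def contains(row: String): Boolean =
      prevEndRow.forall(row > _) && endRow.forall(row <= _)
  }

  /** Groups exact-row lookups first by hosting server, then by tablet.
    * Rows that match no tablet are dropped in this simplified model. */
  def bin(rows: Seq[String], tablets: Seq[Tablet]): Map[String, Map[Tablet, Seq[String]]] =
    rows
      .flatMap(row => tablets.find(_.contains(row)).map(t => (t, row)))
      .groupBy { case (tablet, _) => tablet.server }
      .map { case (server, pairs) =>
        server -> pairs.groupBy(_._1).map { case (t, ps) => t -> ps.map(_._2) }
      }
}

// Two tablets split at row "m", each hosted on a different (made-up) server.
val t1 = RangeBinning.Tablet(None, Some("m"), "tserver1")
val t2 = RangeBinning.Tablet(Some("m"), None, "tserver2")
val binned = RangeBinning.bin(Seq("a", "m", "n", "z"), Seq(t1, t2))
```

In this model each server entry corresponds to one RPC, so a skewed result (most rows under one server) would be consistent with the "swamping one TabletServer" concern raised below.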
> 
> Inside the TabletServer, it will then have many Ranges, binned by Tablet
> (KeyExtent, to be precise). This will spawn an
> org.apache.accumulo.tserver.scan.LookupTask, which will start collecting
> results to send back to the client.
> 
> The caveat here is that those ranges are processed serially on a
> TabletServer. Maybe, you're swamping one TabletServer with lots of
> Ranges that it could be processing in parallel.
> 
> Could you experiment with using multiple BatchScanners and something
> like Guava's Iterables.concat to make it appear like one Iterator?
> 
> I'm curious if we should put an optimization into the BatchScanner
> itself to limit the number of ranges we send in one RPC to a
> TabletServer (e.g. one BatchScanner might open multiple
> MultiScanSessions to a TabletServer).
> 
> Sven Hodapp wrote:
>> Hi there,
>>
>> currently we're experimenting with a two node Accumulo cluster (two tablet
>> servers) setup for document storage.
>> These documents are decomposed down to the sentence level.
>>
>> Now I'm using a BatchScanner to assemble the full document like this:
>>
>>      // ARTIFACTS table currently hosts ~30GB data, ~200M entries on ~45 tablets
>>      val bscan = instance.createBatchScanner(ARTIFACTS, auths, 10)
>>      bscan.setRanges(ranges)  // there are like 3000 Range.exact's in the ranges-list
>>      for (entry <- bscan.asScala) yield {
>>        val key = entry.getKey()
>>        val value = entry.getValue()
>>        // etc.
>>      }
>>
>> For larger full documents (e.g. 3000 exact ranges), this operation will take
>> about 12 seconds.
>> But shorter documents are assembled blazing fast...
>>
>> Is that too much for a BatchScanner, or am I misusing the BatchScanner?
>> Is that a normal time for such a (seek) operation?
>> Can I do something to get a better seek performance?
>>
>> Note: I have already enabled bloom filtering on that table.
>>
>> Thank you for any advice!
>>
>> Regards,
>> Sven
