Re: BatchScanner taking too much time to scan rows

Eric Newton Wed, 13 May 2015 07:14:24 -0700

This use case is one of the things Accumulo was designed to handle well.
It's the reason there is a BatchScanner.


I've created:

https://issues.apache.org/jira/browse/ACCUMULO-3813

so we can investigate and track down any problems or improvements.

Feel free to add any other details to the JIRA ticket.

-Eric


On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <[email protected]>
wrote:

>  It sounds like each of your ranges is an ID, i.e. a single row. I've
> found that scanning lots of non-sequential single-row ranges is pretty slow
> in Accumulo. Your best approach is probably to create an index table on
> whatever you are originally trying to query (assuming those 10000 ids came
> from some other query).
>
> Thanks,
>
> Emilio
>
>
> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>
>  The number of rf files per tablet varies between 2 and 5. The
> BatchScanner returned 460000 entries. The approximate average data rate
> is 0.5 MB/s, as seen on the Accumulo monitor page.
>
>  A simple scan on the table has an average data rate of about 7-8 MB/s.
>
>  All the ids exist in the Accumulo table.
>
> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>
>> Do you know how much data is being brought back (e.g. 100 megabytes)? I
>> am wondering what the data rate is in MB/s.  Do you know how many files per
>> tablet you have?  Do most of the 10,000 ids you are querying for exist?
>>
>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>> [email protected]> wrote:
>>
>>> I have 194 tablets. Currently I am passing 20 threads to the
>>> createBatchScanner method when creating the BatchScanner.
>>>  On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>
>>>>   How many tablets do you have?  The batch scanner does not
>>>> parallelize operations within a tablet.
>>>>
>>>>  If you give the batch scanner more threads than there are tservers,
>>>> it will make multiple parallel RPC calls to each tserver if the tserver
>>>> has multiple tablets.  Each RPC may include multiple tablets and ranges for
>>>> each tablet.
>>>>
>>>>  If the batch scanner has fewer threads than tservers, it will make one
>>>> RPC per tserver per thread.  Each RPC call will include all tablets and
>>>> associated ranges for that tserver.
>>>>
>>>>  Keith
>>>>
>>>>
>>>>
>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>  I am using a BatchScanner to scan rows from an Accumulo table. The
>>>>> table has around 187 million entries and I am using a 3-node cluster
>>>>> running Accumulo 1.6.1.
>>>>>
>>>>>  I pass 10000 ids, which are stored as row ids in my table, as a
>>>>> list of ranges to the setRanges() method.
>>>>>
>>>>>  This whole process takes around 50 seconds (from adding the ids to
>>>>> the list to finishing the scan of the table with the BatchScanner).
>>>>>
>>>>>  I tried switching on bloom filters but that didn't help.
>>>>>
>>>>>  Also, if anyone could briefly explain how a BatchScanner works and
>>>>> how it scans in parallel, it would help me better understand what I
>>>>> am doing.
>>>>>
>>>>>  Thanks
>>>>>  Vaibhav
>>>>>
>>>>>
>>>>>
>>>>
>>
>
>
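
Keith's description of how the batch scanner spreads work over threads can be modelled roughly as below. This is only a simplified sketch, not Accumulo's client code: the class RpcPlanSketch, its methods, the rule of "total tablets / threads" tablets per RPC, and the even 65/65/64 spread of the 194 tablets over 3 tservers are all assumptions made for illustration. The one property taken directly from the thread is that a tablet is never split across RPCs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model only; names and the splitting rule are hypothetical.
final class RpcPlanSketch {

    /** One planned RPC: the target tserver and the tablets it covers. */
    static final class Rpc {
        final String tserver;
        final List<String> tablets;

        Rpc(String tserver, List<String> tablets) {
            this.tserver = tserver;
            this.tablets = tablets;
        }
    }

    /** Groups each tserver's tablets into RPCs; a tablet is never split. */
    static List<Rpc> planRpcs(Map<String, List<String>> tabletsByServer, int numThreads) {
        int totalTablets = 0;
        for (List<String> tablets : tabletsByServer.values()) {
            totalTablets += tablets.size();
        }

        // Threads <= tservers: one RPC per tserver carrying all of its tablets.
        int maxTabletsPerRpc = Integer.MAX_VALUE;
        if (numThreads > tabletsByServer.size()) {
            // ASSUMPTION: with spare threads, aim for about
            // totalTablets / numThreads tablets in each RPC.
            maxTabletsPerRpc = Math.max(1, totalTablets / numThreads);
        }

        List<Rpc> plan = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tabletsByServer.entrySet()) {
            List<String> tablets = e.getValue();
            int step = Math.max(1, Math.min(maxTabletsPerRpc, tablets.size()));
            for (int i = 0; i < tablets.size(); i += step) {
                int end = Math.min(tablets.size(), i + step);
                plan.add(new Rpc(e.getKey(), new ArrayList<>(tablets.subList(i, end))));
            }
        }
        return plan;
    }

    /** Builds a made-up cluster with the given number of tablets per tserver. */
    static Map<String, List<String>> cluster(int... tabletCounts) {
        Map<String, List<String>> byServer = new LinkedHashMap<>();
        for (int s = 0; s < tabletCounts.length; s++) {
            List<String> tablets = new ArrayList<>();
            for (int t = 0; t < tabletCounts[s]; t++) {
                tablets.add("tablet-" + s + "-" + t);
            }
            byServer.put("tserver-" + s, tablets);
        }
        return byServer;
    }

    public static void main(String[] args) {
        // 194 tablets over 3 tservers (assumed even spread), as in the thread.
        Map<String, List<String>> byServer = cluster(65, 65, 64);
        System.out.println("RPCs with 20 threads: " + planRpcs(byServer, 20).size());
        System.out.println("RPCs with 2 threads:  " + planRpcs(byServer, 2).size());
    }
}
```

Under this model, adding threads beyond the number of tservers only helps while there are more tablets to split off; a single tablet holding many of the requested ranges is still served by one RPC.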

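For a sense of scale, the figures reported in the thread (10000 single-row ranges, 460000 entries returned, about 50 seconds end to end) can be turned into per-range numbers. The sketch below is plain arithmetic; the class and method names are made up, and it assumes the whole 50 seconds was spent scanning, which overstates the scan time since the original figure also includes building the range list.

```java
// Back-of-the-envelope figures from the numbers reported in this thread.
// Names are hypothetical; no Accumulo classes are involved.
final class ScanRateSketch {

    /** Average wall-clock milliseconds spent per range. */
    static double millisPerRange(double totalSeconds, int numRanges) {
        return totalSeconds * 1000.0 / numRanges;
    }

    /** Average number of entries returned per second. */
    static double entriesPerSecond(long entries, double totalSeconds) {
        return entries / totalSeconds;
    }

    /** Average number of entries returned per requested range. */
    static double entriesPerRange(long entries, int numRanges) {
        return (double) entries / numRanges;
    }

    public static void main(String[] args) {
        int numRanges = 10_000;      // ids passed to setRanges()
        long entries = 460_000L;     // entries returned by the BatchScanner
        double totalSeconds = 50.0;  // ASSUMPTION: all 50 s counted as scan time

        System.out.printf("ms per range:      %.1f%n", millisPerRange(totalSeconds, numRanges));
        System.out.printf("entries per second: %.0f%n", entriesPerSecond(entries, totalSeconds));
        System.out.printf("entries per range:  %.0f%n", entriesPerRange(entries, numRanges));
    }
}
```

That works out to about 5 ms per range, 9200 entries per second, and 46 entries per id.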