Re: BatchScanner taking too much time to scan rows

vaibhav thapliyal Wed, 13 May 2015 08:33:54 -0700

Thank you Eric.

One thing I would like to know: does pre-splitting the data affect query
performance in Accumulo?


I ask because I managed to decrease the query time somewhat by doing the
following. My table was around 1.47 GB, so I explicitly set the split
threshold to 256 MB instead of the default 1 GB.

So I had just 8 tablets. When I ran the same query again, it finished
in 15 s.

Is it because the split points are more evenly distributed?

The previous table on which the query took 50s had entries unevenly
distributed across the tablets.
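
To make the "evenly distributed" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not Accumulo code: the class name `SplitPointSketch`, the helper names, and the sample row ids are all made up for illustration. It assumes a split point acts as the inclusive end row of its tablet. It picks split points from a sorted sample of row ids and counts how many rows land in each tablet. It does not show that evenness alone explains the drop from 50 s to 15 s.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: plain Java, no Accumulo classes involved.
class SplitPointSketch {

    // Pick up to (tablets - 1) split points so that each tablet receives
    // roughly the same number of the sampled rows. Input must be sorted.
    static List<String> evenSplits(List<String> sortedRowIds, int tablets) {
        if (tablets < 1) {
            throw new IllegalArgumentException("tablets must be >= 1");
        }
        List<String> splits = new ArrayList<>();
        int n = sortedRowIds.size();
        for (int i = 1; i < tablets; i++) {
            // Index of the last row that belongs to the i-th slice.
            int idx = (int) ((long) i * n / tablets) - 1;
            if (idx < 0) {
                continue; // fewer sampled rows than requested tablets
            }
            String candidate = sortedRowIds.get(idx);
            // Skip duplicates so the split points stay strictly increasing.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    // Count how many rows fall into each tablet, treating a split point as
    // the inclusive end row of its tablet (assumption stated above).
    static int[] rowsPerTablet(List<String> sortedRowIds, List<String> splits) {
        int[] counts = new int[splits.size() + 1];
        int tablet = 0;
        for (String row : sortedRowIds) {
            while (tablet < splits.size() && row.compareTo(splits.get(tablet)) > 0) {
                tablet++;
            }
            counts[tablet]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample of sorted row ids.
        List<String> rows = Arrays.asList(
                "id01", "id02", "id03", "id04", "id05", "id06", "id07", "id08");

        List<String> even = evenSplits(rows, 4);
        System.out.println(even + " -> " + Arrays.toString(rowsPerTablet(rows, even)));

        // Skewed split points: most rows pile up in the last tablet.
        List<String> skewed = Arrays.asList("id01", "id02", "id03");
        System.out.println(skewed + " -> " + Arrays.toString(rowsPerTablet(rows, skewed)));
    }
}
```

With the even split points each of the four tablets holds two rows, while the skewed ones leave five of the eight rows in the last tablet. On a real table, split points chosen this way could be handed to `TableOperations.addSplits()` as a `SortedSet<Text>`; I have not tried that here.
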
Thanks
Vaibhav
On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:

> This use case is one of the things Accumulo was designed to handle well.
> It's the reason there is a BatchScanner.
>
> I've created:
>
> https://issues.apache.org/jira/browse/ACCUMULO-3813
>
> so we can investigate and track down any problems or improvements.
>
> Feel free to add any other details to the JIRA ticket.
>
> -Eric
>
>
> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <[email protected]>
> wrote:
>
>>  It sounds like each of your ranges is an ID, e.g. a single row. I've
>> found that scanning lots of non-sequential single-row ranges is pretty slow
>> in accumulo. Your best approach is probably to create an index table on
>> whatever you are originally trying to query (assuming those 10000 ids came
>> from some other query).
>>
>> Thanks,
>>
>> Emilio
>>
>>
>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>
>>  The number of rf files per tablet varies between 2 and 5. The BatchScanner
>> returned 460000 entries. The approximate average data rate is 0.5 MB/s, as
>> seen on the Accumulo monitor page.
>>
>>  A simple scan on the table has an average data rate of about 7-8 MB/s.
>>
>>  All the ids exist in the accumulo table.
>>
>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>
>>> Do you know how much data is being brought back (i.e. 100 megabytes)? I
>>> am wondering what the data rate is in MB/s.  Do you know how many files per
>>> tablet you have?  Do most of the 10,000 ids you are querying for exist?
>>>
>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>> [email protected]> wrote:
>>>
>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>> batchscanner inside the createBatchScanner method.
>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>>
>>>>>   How many tablets do you have?  The batch scanner does not
>>>>> parallelize operations within a tablet.
>>>>>
>>>>>  If you give the batch scanner more threads than there are tservers,
>>>>> it will make multiple parallel rpc calls to each tserver if the tserver
>>>>> has multiple tablets.  Each rpc may include multiple tablets and ranges
>>>>> for each tablet.
>>>>>
>>>>>  If the batch scanner has fewer threads than tservers, it will make one
>>>>> rpc per tserver per thread.  Each rpc call will include all tablets and
>>>>> associated ranges for that tserver.
>>>>>
>>>>>  Keith
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>  I am using BatchScanner to scan rows from an Accumulo table. The
>>>>>> table has around 187m entries and I am using a 3-node cluster running
>>>>>> Accumulo 1.6.1.
>>>>>>
>>>>>>  I pass 10000 ids, which are stored as row ids in my table, as a
>>>>>> list to the setRanges() method.
>>>>>>
>>>>>>  This whole process takes around 50 secs (from adding the ids to the
>>>>>> list to scanning the whole table using the BatchScanner).
>>>>>>
>>>>>>  I tried switching on bloom filters but that didn't work.
>>>>>>
>>>>>>  Also, if anyone could briefly explain how a BatchScanner works and
>>>>>> how it does parallel scanning, it would help me understand what I am
>>>>>> doing better.
>>>>>>
>>>>>>  Thanks
>>>>>>  Vaibhav
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>
