Re: BatchScanner taking too much time to scan rows

vaibhav thapliyal Thu, 14 May 2015 10:59:00 -0700

Dylan, could you elaborate on the average query time you had?
Thanks
Vaibhav
On 14-May-2015 11:03 pm, "Dylan Hutchison" <[email protected]> wrote:


> I think this is the same issue I found for ACCUMULO-3710
> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
> the tserver ran out of memory.  Accumulo doesn't handle large numbers of
> small, disjoint ranges well.  I bet there's room for improvement on both
> the client and tablet server.
> ~Dylan
>
> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <[email protected]>
> wrote:
>
>> Yes, hot-spotting does affect accumulo because you have fewer servers and
>> caches handling your request.
>>
>> Let's say your data is spread out evenly (a uniform distribution) from
>> "0".."9".
>>
>> What if you have only 1 split?  You would want it at "5", to divide the
>> data in half, and you could host the halves on different servers.  But if
>> you split at 1, now 10% of your queries go to one tablet, and 90% go to the
>> other.
>>
>> -Eric
>>
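To make the arithmetic in Eric's example concrete, here is a minimal sketch in plain Java (no Accumulo dependency). It assumes 100 hypothetical two-character row ids "00".."99" spread evenly, a single split point, and that a row belongs to the first tablet when it sorts at or before the split. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSkew {
    // Hypothetical data set: 100 two-character row ids "00".."99",
    // standing in for rows spread evenly over "0".."9".
    static List<String> evenRows() {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add(String.format("%02d", i));
        }
        return rows;
    }

    // Fraction of rows that sort at or before the split point, i.e. the
    // share of the data (and of evenly spread lookups) on the first tablet.
    static double firstTabletShare(List<String> rows, String split) {
        int count = 0;
        for (String row : rows) {
            if (row.compareTo(split) <= 0) {
                count++;
            }
        }
        return (double) count / rows.size();
    }

    public static void main(String[] args) {
        List<String> rows = evenRows();
        for (String split : new String[] {"5", "1"}) {
            double first = firstTabletShare(rows, split);
            System.out.printf("split at %s: %.0f%% / %.0f%%%n",
                    split, first * 100, (1 - first) * 100);
        }
    }
}
```

With these assumptions the split at "5" divides the rows 50/50, while the split at "1" leaves 10% on one tablet and 90% on the other, which is the imbalance described above.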
>>
>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
>> [email protected]> wrote:
>>
>>> Thank you Eric. I will surely do the same. Should uneven distribution
>>> across the tablets affect querying in Accumulo?  In this case, it does. Is
>>> this behaviour normal?
>>> On 13-May-2015 10:58 pm, "Eric Newton" <[email protected]> wrote:
>>>
>>>> Yes, that's a great way to split the data evenly.
>>>>
>>>> Also, since the data set is so small, turn on data caching for your
>>>> table:
>>>>
>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>
>>>> You may want to increase the size of your tserver JVM, and increase the
>>>> size of the cache:
>>>>
>>>> shell> config -s tserver.cache.data.size=1G
>>>>
>>>> This will help with repeated random look-ups.
>>>>
>>>> -Eric
>>>>
>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>>> [email protected]> wrote:
>>>>
>>>>> Thank you Eric.
>>>>>
>>>>> One thing I would like to know. Does pre-splitting the data play a
>>>>> part in querying accumulo?
>>>>>
>>>>> Because I managed to somewhat decrease the querying time.
>>>>> I did the following steps:
>>>>> My table was around 1.47 GB, so I explicitly set the split threshold
>>>>> (table.split.threshold) to 256 MB instead of the default 1 GB.
>>>>>
>>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>>> finished in 15s.
>>>>>
>>>>> Is it because the split points are more evenly distributed?
>>>>>
>>>>> The previous table on which the query took 50s had entries unevenly
>>>>> distributed across the tablets.
>>>>> Thanks
>>>>> Vaibhav
>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <[email protected]> wrote:
>>>>>
>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>
>>>>>> I've created:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>
>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>
>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>>
>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>>  It sounds like each of your ranges is an ID, i.e. a single row.
>>>>>>> I've found that scanning lots of non-sequential single-row ranges is
>>>>>>> pretty slow in Accumulo. Your best approach is probably to create an
>>>>>>> index table on whatever you are originally trying to query (assuming
>>>>>>> those 10000 ids came from some other query).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
>>>>>>>
>>>>>>>
>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>
>>>>>>>  The number of rf files per tablet varies between 2 and 5. The
>>>>>>> number of entries returned to me by the BatchScanner is 460000. The
>>>>>>> approximate average data rate is 0.5 MB/s, as seen on the Accumulo
>>>>>>> monitor page.
>>>>>>>
>>>>>>>  A simple scan on the table has an average data rate of about 7-8
>>>>>>> MB/s.
>>>>>>>
>>>>>>>  All the ids exist in the accumulo table.
>>>>>>>
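As a rough sanity check on the figures quoted in this thread (10000 ids, 460000 entries returned, about 50 seconds), the rates can be worked out as below. This is only a back-of-the-envelope sketch: the 50 seconds also covers building the id list on the client, so the true scan rates are somewhat higher, and the class and method names are hypothetical.

```java
class ScanRates {
    // Items processed per second over the given wall-clock time.
    static double perSecond(long count, double seconds) {
        return count / seconds;
    }

    // Average number of entries returned for each requested row id.
    static double entriesPerId(long entries, long ids) {
        return (double) entries / ids;
    }

    public static void main(String[] args) {
        long ids = 10_000;       // row ids passed to setRanges()
        long entries = 460_000;  // entries returned by the BatchScanner
        double seconds = 50.0;   // reported end-to-end time

        System.out.println("entries per id: " + entriesPerId(entries, ids));
        System.out.println("ids per second: " + perSecond(ids, seconds));
        System.out.println("entries per second: " + perSecond(entries, seconds));
    }
}
```

That works out to about 46 entries per id, 200 id lookups per second, and 9200 entries per second across the whole cluster.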
>>>>>>> On 12 May 2015 at 23:39, Keith Turner <[email protected]> wrote:
>>>>>>>
>>>>>>>> Do you know how much data is being brought back (e.g. 100
>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s.  Do you
>>>>>>>> know how many files per tablet you have?  Do most of the 10,000 ids
>>>>>>>> you are querying for exist?
>>>>>>>>
>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> I have 194 tablets. Currently I am passing 20 as the number of
>>>>>>>>> threads to the createBatchScanner method.
>>>>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>>   How many tablets do you have?  The batch scanner does not
>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>
>>>>>>>>>>  If you give the batch scanner more threads than there are
>>>>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver
>>>>>>>>>> if the tserver has multiple tablets.  Each rpc may include multiple
>>>>>>>>>> tablets and ranges for each tablet.
>>>>>>>>>>
>>>>>>>>>>  If the batch scanner has fewer threads than tservers, it will
>>>>>>>>>> make one rpc per tserver per thread.  Each rpc call will include all
>>>>>>>>>> tablets and associated ranges for that tserver.
>>>>>>>>>>
>>>>>>>>>>  Keith
>>>>>>>>>>
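The grouping Keith describes, where each rpc carries the tablets and ranges for one tserver, can be pictured with a small plain-Java model. This is a simplified sketch and not the Accumulo client's actual code: the tablet layout, the server names, the "~" key standing in for the last tablet, and the class itself are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RangeBinning {
    // Hypothetical layout: tablet end row -> hosting tserver. The key "~"
    // stands in for the last tablet and must sort after every row id.
    static TreeMap<String, String> exampleTablets() {
        TreeMap<String, String> tablets = new TreeMap<>();
        tablets.put("3", "tserver1");
        tablets.put("6", "tserver2");
        tablets.put("~", "tserver1");
        return tablets;
    }

    // Group single-row lookups by tserver, then by tablet: one outer entry
    // per tserver, holding every tablet and row that server must search.
    static Map<String, Map<String, List<String>>> bin(
            TreeMap<String, String> tablets, List<String> rowIds) {
        Map<String, Map<String, List<String>>> perServer = new TreeMap<>();
        for (String row : rowIds) {
            // A row lives in the first tablet whose end row is >= the row.
            String tablet = tablets.ceilingKey(row);
            if (tablet == null) {
                throw new IllegalArgumentException("no tablet for row " + row);
            }
            perServer.computeIfAbsent(tablets.get(tablet), s -> new TreeMap<>())
                     .computeIfAbsent(tablet, t -> new ArrayList<>())
                     .add(row);
        }
        return perServer;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("10", "25", "42", "70", "99");
        bin(exampleTablets(), ids).forEach((server, work) ->
                System.out.println(server + " -> " + work));
    }
}
```

In this toy layout tserver1 ends up with two tablets and four row ids while tserver2 gets one tablet and one row id. The model says nothing about how many threads or rpc calls are used per tserver; it only shows the per-server batching of tablets and ranges.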
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>>  I am using BatchScanner to scan rows from an Accumulo table.
>>>>>>>>>>> The table has around 187 million entries and I am using a 3-node
>>>>>>>>>>> cluster which has Accumulo 1.6.1.
>>>>>>>>>>>
>>>>>>>>>>>  I have passed 10000 ids, which are stored as row ids in my
>>>>>>>>>>> table, as a list to the setRanges() method.
>>>>>>>>>>>
>>>>>>>>>>>  This whole process takes around 50 secs (from adding the ids to
>>>>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>>
>>>>>>>>>>>  I tried switching on bloom filters but that didn't work.
>>>>>>>>>>>
>>>>>>>>>>>  Also, if anyone could briefly explain how a BatchScanner works
>>>>>>>>>>> and how it does parallel scanning, it would help me better
>>>>>>>>>>> understand what I am doing.
>>>>>>>>>>>
>>>>>>>>>>>  Thanks
>>>>>>>>>>>  Vaibhav
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>
>
