Re: BatchScanner taking too much time to scan rows

Dylan Hutchison Thu, 14 May 2015 11:58:30 -0700

Sorry, just remembered that my setup was to scan an index table and gather
rowIDs, then scan a main data table using the rowIDs as the BatchScan
ranges.  Effectively it is a join of part of the index table to a main data
table.
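
The two-step lookup described above can be sketched roughly as follows. This is only an illustration in plain Java: the two TreeMaps are in-memory stand-ins for the index table and the main data table (they are not Accumulo API calls), and the class and method names are hypothetical. In a real client, the second step would hand the gathered rowIDs to a BatchScanner as its ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IndexJoinSketch {

    /** Step 1: look up one term in the (stand-in) index table and gather its rowIDs. */
    public static List<String> lookupRowIds(TreeMap<String, List<String>> index, String term) {
        List<String> rowIds = new ArrayList<String>();
        List<String> hits = index.get(term);
        if (hits != null) {
            rowIds.addAll(hits);
        }
        return rowIds;
    }

    /** Step 2: fetch every gathered rowID from the (stand-in) main table, one single-row lookup each. */
    public static Map<String, String> fetchRows(TreeMap<String, String> mainTable, List<String> rowIds) {
        Map<String, String> result = new TreeMap<String, String>();
        for (String rowId : rowIds) {
            String value = mainTable.get(rowId);
            if (value != null) {
                result.put(rowId, value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical index: term -> rowIDs of matching records in the main table.
        TreeMap<String, List<String>> index = new TreeMap<String, List<String>>();
        List<String> ids = new ArrayList<String>();
        ids.add("row07");
        ids.add("row42");
        index.put("accumulo", ids);

        // Hypothetical main table: rowID -> record.
        TreeMap<String, String> mainTable = new TreeMap<String, String>();
        mainTable.put("row07", "record seven");
        mainTable.put("row13", "record thirteen");
        mainTable.put("row42", "record forty-two");

        List<String> rowIds = lookupRowIds(index, "accumulo");
        System.out.println(fetchRows(mainTable, rowIds));
    }
}
```

Because the rowIDs coming out of the index are generally scattered across the main table, the second step is a random-access workload, which is why the join is slower than a plain sequential scan.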


The scan rate I achieved is therefore double the value I cited previously:
I showed about 76k entries/second.  Still not the best but it is more
within Accumulo standards.


On Thu, May 14, 2015 at 2:15 PM, Dylan Hutchison <dhutc...@mit.edu> wrote:

> I didn't have an average query time-- the tablet server crashed.  A quick
> solution is to batch the ranges into groups of 50k (or 500k, I forgot which
> one) and do many BatchScans-- not ideal.  I think I achieved 33k
> entries/second retrieval on a single-node Accumulo.  Accumulo is better for
> sequential lookup than random.
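
A rough sketch of that batching workaround, in plain Java. The helper name `partition` and the chunk size used in `main` are assumptions for illustration only. Each chunk would then be converted to ranges and given to its own BatchScanner, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeBatcher {

    /** Splits items into consecutive chunks of at most chunkSize elements each. */
    public static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<List<T>>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<T>(items.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> rowIds = new ArrayList<String>();
        for (int i = 0; i < 120000; i++) {
            rowIds.add("row" + i);
        }
        // 50k per group is one of the two sizes mentioned above; tune as needed.
        List<List<String>> chunks = partition(rowIds, 50000);
        for (List<String> chunk : chunks) {
            // In a real client: build one Range per rowID in this chunk and
            // run a separate BatchScanner over just those ranges.
            System.out.println("would scan " + chunk.size() + " ranges");
        }
    }
}
```

Smaller groups bound how many ranges are in flight at once, at the cost of more scanner setup and teardown.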
>
> On Thu, May 14, 2015 at 1:57 PM, vaibhav thapliyal <
> vaibhav.thapliyal...@gmail.com> wrote:
>
>> Dylan could you elaborate on the average query time you had?
>> Thanks
>> Vaibhav
>> On 14-May-2015 11:03 pm, "Dylan Hutchison" <dhutc...@mit.edu> wrote:
>>
>>> I think this is the same issue I found for ACCUMULO-3710
>>> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
>>> the tserver ran out of memory.  Accumulo doesn't handle large numbers of
>>> small, disjoint ranges well.  I bet there's room for improvement on both
>>> the client and tablet server.
>>> ~Dylan
>>>
>>> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <eric.new...@gmail.com>
>>> wrote:
>>>
>>>> Yes, hot-spotting does affect accumulo because you have fewer servers
>>>> and caches handling your request.
>>>>
>>>> Let's say your data is spread out, in a uniform distribution from
>>>> "0".."9".
>>>>
>>>> What if you have only 1 split?  You would want it at "5", to divide the
>>>> data in half, and you could host the halves on different servers.  But if
>>>> you split at 1, now 10% of your queries go to one tablet, and 90% go to the
>>>> other.
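
To make the split example concrete, here is a small plain-Java sketch. It assumes rows are spread evenly over two-digit keys "00".."99" and that a tablet holds every row sorting at or before its split point. The class and method names are hypothetical, and nothing here calls Accumulo itself.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSkew {

    /** Fraction of rows that sort at or before the split, i.e. land in the first of two tablets. */
    public static double fractionInFirstTablet(List<String> rows, String split) {
        int count = 0;
        for (String row : rows) {
            if (row.compareTo(split) <= 0) {
                count++;
            }
        }
        return (double) count / rows.size();
    }

    /** Evenly spread sample rows "00", "01", ..., "99". */
    public static List<String> uniformRows() {
        List<String> rows = new ArrayList<String>();
        for (int i = 0; i < 100; i++) {
            rows.add(String.format("%02d", i));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> rows = uniformRows();
        System.out.println("split at \"5\": " + fractionInFirstTablet(rows, "5"));
        System.out.println("split at \"1\": " + fractionInFirstTablet(rows, "1"));
    }
}
```

With these sample rows, a split at "5" puts half of them in each tablet, while a split at "1" leaves one tablet with a tenth of the rows and the other with the rest.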
>>>>
>>>> -Eric
>>>>
>>>>
>>>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>
>>>>> Thank you Eric. I will surely do the same. Should uneven distribution
>>>>> across the tablets affect querying in accumulo?  In this case, it does. Is
>>>>> this behaviour normal?
>>>>> On 13-May-2015 10:58 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>>>>
>>>>>> Yes, that's a great way to split the data evenly.
>>>>>>
>>>>>> Also, since the data set is so small, turn on data caching for your
>>>>>> table:
>>>>>>
>>>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>>>
>>>>>> You may want to increase the size of your tserver JVM, and increase
>>>>>> the size of the cache:
>>>>>>
>>>>>> shell> config -s tserver.cache.data.size=1G
>>>>>>
>>>>>> This will help with repeated random look-ups.
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you Eric.
>>>>>>>
>>>>>>> One thing I would like to know. Does pre-splitting the data play a
>>>>>>> part in querying accumulo?
>>>>>>>
>>>>>>> Because I managed to somewhat decrease the querying time.
>>>>>>> I did the following steps:
>>>>>>> My table was around 1.47gb so I explicitly set the split parameter to
>>>>>>> 256mb instead of the default 1gb.
>>>>>>>
>>>>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>>>>> finished in 15s.
>>>>>>>
>>>>>>> Is it because the split points are more evenly distributed?
>>>>>>>
>>>>>>> The previous table on which the query took 50s had entries unevenly
>>>>>>> distributed across the tablets.
>>>>>>> Thanks
>>>>>>> Vaibhav
>>>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>>>
>>>>>>>> I've created:
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>>>
>>>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>>>
>>>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>>>
>>>>>>>> -Eric
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>>>>> elahrvi...@ccri.com> wrote:
>>>>>>>>
>>>>>>>>>  It sounds like each of your ranges is an ID, i.e. a single row.
>>>>>>>>> I've found that scanning lots of non-sequential single-row ranges is
>>>>>>>>> pretty slow in accumulo. Your best approach is probably to create an
>>>>>>>>> index table on whatever you are originally trying to query (assuming
>>>>>>>>> those 10000 ids came from some other query).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Emilio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>>>
>>>>>>>>>  The number of rf files per tablet varies between 2 and 5. The
>>>>>>>>> batchScanner returned 460000 entries. The approximate average data
>>>>>>>>> rate is 0.5 MB/s as seen on the accumulo monitor page.
>>>>>>>>>
>>>>>>>>>  A simple scan on the table has an average data rate of about 7-8
>>>>>>>>> MB/s.
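
For comparison with the entries/second figures quoted elsewhere in this thread, the numbers above can be converted with a short calculation. This sketch assumes the roughly 50 seconds reported in the original post is the elapsed time for the same query that returned 460000 entries. The class and method names are hypothetical.

```java
public class ScanRate {

    /** Average entries returned per second over the given elapsed time. */
    public static double entriesPerSecond(long entries, double seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return entries / seconds;
    }

    public static void main(String[] args) {
        // 460000 entries in about 50 seconds, both figures taken from this thread.
        System.out.printf("%.0f entries/second%n", entriesPerSecond(460000L, 50.0));
    }
}
```

Since the 50 seconds also covers building the list of ids, the scan itself may have been somewhat faster than this estimate suggests.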
>>>>>>>>>
>>>>>>>>>  All the ids exist in the accumulo table.
>>>>>>>>>
>>>>>>>>> On 12 May 2015 at 23:39, Keith Turner <ke...@deenlo.com> wrote:
>>>>>>>>>
>>>>>>>>>> Do you know how much data is being brought back (e.g. 100
>>>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s.  Do you
>>>>>>>>>> know how many files per tablet you have?  Do most of the 10,000 ids
>>>>>>>>>> you are querying for exist?
>>>>>>>>>>
>>>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have 194 tablets. Currently I am using 20 threads to create
>>>>>>>>>>> the batchscanner inside the createBatchScanner method.
>>>>>>>>>>>  On 12-May-2015 11:19 pm, "Keith Turner" <ke...@deenlo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>   How many tablets do you have?  The batch scanner does not
>>>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>>>
>>>>>>>>>>>>  If you give the batch scanner more threads than there are
>>>>>>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver
>>>>>>>>>>>> if the tserver has multiple tablets.  Each rpc may include multiple
>>>>>>>>>>>> tablets and ranges for each tablet.
>>>>>>>>>>>>
>>>>>>>>>>>>  If the batch scanner has fewer threads than tservers, it will
>>>>>>>>>>>> make one rpc per tserver per thread.  Each rpc call will include all
>>>>>>>>>>>> tablets and associated ranges for that tserver.
>>>>>>>>>>>>
>>>>>>>>>>>>  Keith
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>>  I am using BatchScanner to scan rows from an accumulo table.
>>>>>>>>>>>>> The table has around 187m entries and I am using a 3 node cluster
>>>>>>>>>>>>> which has accumulo 1.6.1.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  I have passed 10000 ids, which are stored as row ids in my
>>>>>>>>>>>>> table, as a list to the setRanges() method.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  This whole process takes around 50 secs (from adding the ids
>>>>>>>>>>>>> to the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>>>>
>>>>>>>>>>>>>  I tried switching on bloom filters but that didn't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Also, if anyone could briefly explain how a BatchScanner
>>>>>>>>>>>>> works and how it does parallel scanning, it would help me better
>>>>>>>>>>>>> understand what I am doing.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Thanks
>>>>>>>>>>>>>  Vaibhav
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>
