Sorry, just remembered that my setup was to scan an index table and gather rowIDs, then scan a main data table using the rowIDs as the BatchScan ranges. Effectively it is a join of part of the index table to a main data table.
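A minimal sketch of the index-then-fetch pattern described above, assuming small in-memory stand-ins for the two tables. Nothing here is Accumulo API: `INDEX_TABLE`, `MAIN_TABLE`, `scan_index`, `fetch_rows`, and `index_join` are hypothetical names used only for illustration. The first step is one sequential range scan over the index, and the second step is one point lookup per rowID, which is the part that corresponds to many small, disjoint BatchScan ranges.

```python
from bisect import bisect_left, bisect_right

# Hypothetical stand-ins for the two tables. A real Accumulo table is a sorted
# key/value store spread over tablet servers; a sorted list and a dict play
# those roles here.
INDEX_TABLE = sorted([
    ("color=blue", "row001"),
    ("color=blue", "row003"),
    ("color=red", "row002"),
])
MAIN_TABLE = {
    "row001": {"name": "alpha"},
    "row002": {"name": "beta"},
    "row003": {"name": "gamma"},
}


def scan_index(index, term):
    """Sequential range scan over the sorted index: rowIDs for one term."""
    lo = bisect_left(index, (term, ""))
    hi = bisect_right(index, (term, "\uffff"))
    return [row_id for _, row_id in index[lo:hi]]


def fetch_rows(main, row_ids):
    """One point lookup per rowID, standing in for one single-row range each."""
    return {row_id: main[row_id] for row_id in row_ids if row_id in main}


def index_join(index, main, term):
    """Join part of the index to the main table: scan first, then fetch."""
    return fetch_rows(main, scan_index(index, term))
```

For example, `index_join(INDEX_TABLE, MAIN_TABLE, "color=blue")` gathers two rowIDs from the index and then fetches those two rows from the main table.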
The scan rate I achieved is therefore double the value I cited previously: I showed about 76k entries/second. Still not the best but it is more within Accumulo standards.

On Thu, May 14, 2015 at 2:15 PM, Dylan Hutchison <dhutc...@mit.edu> wrote:

> I didn't have an average query time-- the tablet server crashed. A quick
> solution is to batch the ranges into groups of 50k (or 500k, I forgot which
> one) and do many BatchScans-- not ideal. I think I achieved 33k
> entries/second retrieval on a single-node Accumulo. Accumulo is better for
> sequential lookup than random.
>
> On Thu, May 14, 2015 at 1:57 PM, vaibhav thapliyal <
> vaibhav.thapliyal...@gmail.com> wrote:
>
>> Dylan could you elaborate on the average query time you had?
>> Thanks
>> Vaibhav
>> On 14-May-2015 11:03 pm, "Dylan Hutchison" <dhutc...@mit.edu> wrote:
>>
>>> I think this is the same issue I found for ACCUMULO-3710
>>> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
>>> the tserver ran out of memory. Accumulo doesn't handle large numbers of
>>> small, disjoint ranges well. I bet there's room for improvement on both
>>> the client and tablet server.
>>> ~Dylan
>>>
>>> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <eric.new...@gmail.com>
>>> wrote:
>>>
>>>> Yes, hot-spotting does affect accumulo because you have fewer servers
>>>> and caches handling your request.
>>>>
>>>> Let's say your data is spread out, in a normal distribution from
>>>> "0".."9".
>>>>
>>>> What if you have only 1 split? You would want it at "5", to divide the
>>>> data in half, and you could host the halves on different servers. But if
>>>> you split at 1, now 10% of your queries go to one tablet, and 90% go to the
>>>> other.
>>>>
>>>> -Eric
>>>>
>>>>
>>>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>
>>>>> Thank you Eric. I will surely do the same. Should uneven distribution
>>>>> across the tablets affect querying in accumulo? If this case, it is.
>>>>> Is this behaviour normal?
>>>>> On 13-May-2015 10:58 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>>>>
>>>>>> Yes, that's a great way to split the data evenly.
>>>>>>
>>>>>> Also, since the data set is so small, turn on data caching for your
>>>>>> table:
>>>>>>
>>>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>>>
>>>>>> You may want to increase the size of your tserver JVM, and increase
>>>>>> the size of the cache:
>>>>>>
>>>>>> shell> config -s tserver.cache.data.size=1G
>>>>>>
>>>>>> This will help with repeated random look-ups.
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>
>>>>>>> Thank you Eric.
>>>>>>>
>>>>>>> One thing I would like to know. Does pre-splitting the data play a
>>>>>>> part in querying accumulo?
>>>>>>>
>>>>>>> Because I managed to somewhat decrease the querying time.
>>>>>>> I did the following steps:
>>>>>>> My table was around 1.47gb so I explicitly set the split parameter to
>>>>>>> 256mb instead of the default 1gb.
>>>>>>>
>>>>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>>>>> finished in 15s.
>>>>>>>
>>>>>>> Is it because the split points are more evenly distributed?
>>>>>>>
>>>>>>> The previous table on which the query took 50s had entries unevenly
>>>>>>> distributed across the tablets.
>>>>>>> Thanks
>>>>>>> Vaibhav
>>>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>>>>>>
>>>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>>>
>>>>>>>> I've created:
>>>>>>>>
>>>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>>>
>>>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>>>
>>>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>>>
>>>>>>>> -Eric
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>>>>> elahrvi...@ccri.com> wrote:
>>>>>>>>
>>>>>>>>> It sounds like each of your ranges is an ID, e.g. a single row.
>>>>>>>>> I've found that scanning lots of non-sequential single-row ranges is
>>>>>>>>> pretty slow in accumulo. Your best approach is probably to create an
>>>>>>>>> index table on whatever you are originally trying to query (assuming
>>>>>>>>> those 10000 ids came from some other query).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Emilio
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>>>
>>>>>>>>> The rf files per tablet vary between 2 to 5 per tablet. The
>>>>>>>>> entries returned to me by the batchScanner is 460000. The approx.
>>>>>>>>> average data rate is 0.5 MB/s as seen on the accumulo monitor page.
>>>>>>>>>
>>>>>>>>> A simple scan on the table has an average data rate of about 7-8
>>>>>>>>> MB/s.
>>>>>>>>>
>>>>>>>>> All the ids exist in the accumulo table.
>>>>>>>>>
>>>>>>>>> On 12 May 2015 at 23:39, Keith Turner <ke...@deenlo.com> wrote:
>>>>>>>>>
>>>>>>>>>> Do you know how much data is being brought back (i.e. 100
>>>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s. Do you
>>>>>>>>>> know how many files per tablet you have? Do most of the 10,000 ids
>>>>>>>>>> you are querying for exist?
>>>>>>>>>>
>>>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I have 194 tablets. Currently I am using 20 threads to create
>>>>>>>>>>> the batchscanner inside the createBatchScanner method.
>>>>>>>>>>> On 12-May-2015 11:19 pm, "Keith Turner" <ke...@deenlo.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> How many tablets do you have? The batch scanner does not
>>>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>>>
>>>>>>>>>>>> If you give the batch scanner more threads than there are
>>>>>>>>>>>> tservers, it will make multiple parallel rpc calls to each
>>>>>>>>>>>> tserver if the tserver has multiple tablets. Each rpc may include
>>>>>>>>>>>> multiple tablets and ranges for each tablet.
>>>>>>>>>>>>
>>>>>>>>>>>> If the batch scanner has less threads than tservers, it will
>>>>>>>>>>>> make one rpc per tserver per thread. Each rpc call will include
>>>>>>>>>>>> all tablets and associated ranges for that tserver.
>>>>>>>>>>>>
>>>>>>>>>>>> Keith
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am using BatchScanner to scan rows from an accumulo table.
>>>>>>>>>>>>> The table has around 187m entries and I am using a 3 node cluster
>>>>>>>>>>>>> which has accumulo 1.6.1.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have passed 10000 ids which are stored as row id in my
>>>>>>>>>>>>> table as a list in the setRanges() method.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This whole process takes around 50 secs (from adding the ids
>>>>>>>>>>>>> in the list to scanning the whole table using the BatchScanner).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also if anyone could briefly explain how a BatchScanner
>>>>>>>>>>>>> works, how it does parallel scanning it would help me understand
>>>>>>>>>>>>> what I am doing better.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>> Vaibhav
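
The workaround mentioned earlier in the thread, batching the ranges into groups of 50k and running many BatchScans, can be sketched as follows. This is only an illustration of the chunking step, not Accumulo client code: `chunked`, `lookup_in_batches`, and the `run_batch_scan` callable are hypothetical names, and `run_batch_scan` stands in for whatever function actually issues one BatchScan for a list of ranges.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def lookup_in_batches(row_ids, run_batch_scan, batch_size=50_000):
    """Run one scan per group of row IDs and concatenate the results.

    `run_batch_scan` is a hypothetical callable that takes a list of row IDs
    and returns an iterable of entries for them.
    """
    results = []
    for chunk in chunked(row_ids, batch_size):
        results.extend(run_batch_scan(chunk))
    return results
```

Keeping each group bounded limits how many ranges any single scan carries at once, at the cost of more round trips, which matches the "not ideal" caveat in the thread.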