This use case is one of the things Accumulo was designed to handle well. It's the reason there is a BatchScanner.

I've created https://issues.apache.org/jira/browse/ACCUMULO-3813 so we can
investigate and track down any problems or improvements. Feel free to add
any other details to the JIRA ticket.

-Eric

On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com> wrote:

> It sounds like each of your ranges is an ID, e.g. a single row. I've
> found that scanning lots of non-sequential single-row ranges is pretty slow
> in Accumulo. Your best approach is probably to create an index table on
> whatever you are originally trying to query (assuming those 10000 ids came
> from some other query).
>
> Thanks,
>
> Emilio
>
>
> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>
> The number of rf files per tablet varies between 2 and 5. The number of
> entries returned to me by the BatchScanner is 460000. The approximate
> average data rate is 0.5 MB/s, as seen on the Accumulo monitor page.
>
> A simple scan on the table has an average data rate of about 7-8 MB/s.
>
> All the ids exist in the Accumulo table.
>
> On 12 May 2015 at 23:39, Keith Turner <ke...@deenlo.com> wrote:
>
>> Do you know how much data is being brought back (e.g. 100 megabytes)? I
>> am wondering what the data rate is in MB/s. Do you know how many files per
>> tablet you have? Do most of the 10,000 ids you are querying for exist?
>>
>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>> vaibhav.thapliyal...@gmail.com> wrote:
>>
>>> I have 194 tablets. Currently I am creating the BatchScanner with 20
>>> threads via the createBatchScanner method.
>>> On 12-May-2015 11:19 pm, "Keith Turner" <ke...@deenlo.com> wrote:
>>>
>>>> How many tablets do you have? The batch scanner does not
>>>> parallelize operations within a tablet.
>>>>
>>>> If you give the batch scanner more threads than there are tservers,
>>>> it will make multiple parallel rpc calls to each tserver if the tserver
>>>> has multiple tablets. Each rpc may include multiple tablets and ranges for
>>>> each tablet.
>>>>
>>>> If the batch scanner has fewer threads than tservers, it will make one
>>>> rpc per tserver per thread. Each rpc call will include all tablets and
>>>> associated ranges for that tserver.
>>>>
>>>> Keith
>>>>
>>>>
>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am using BatchScanner to scan rows from an Accumulo table. The
>>>>> table has around 187m entries and I am using a 3-node cluster running
>>>>> Accumulo 1.6.1.
>>>>>
>>>>> I have passed 10000 ids, which are stored as row ids in my table, as a
>>>>> list to the setRanges() method.
>>>>>
>>>>> This whole process takes around 50 secs (from adding the ids to the
>>>>> list to scanning the whole table using the BatchScanner).
>>>>>
>>>>> I tried switching on bloom filters but that didn't help.
>>>>>
>>>>> Also, if anyone could briefly explain how a BatchScanner works and how
>>>>> it does parallel scanning, it would help me understand what I am doing
>>>>> better.
>>>>>
>>>>> Thanks
>>>>> Vaibhav
>>>>>
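
For readers following the thread, Keith's description of how the batch scanner spreads work over tservers can be sketched as a small model. This is only an illustration under stated assumptions, not Accumulo's client code: the class BatchScanRpcModel, its methods, the round-robin tablet layout, and the rule for how many groups a tserver's tablets are split into are all invented here. It keeps the two properties stated in the thread: a tablet is never split across RPCs, and with no more threads than tservers each tserver gets a single RPC covering all of its tablets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the RPC fan-out described in the thread.
// This is NOT Accumulo's client code; all names here are invented for illustration.
class BatchScanRpcModel {

    /** One modelled RPC: the tserver it targets and the tablets it covers. */
    static final class Rpc {
        final String tserver;
        final List<String> tablets;

        Rpc(String tserver, List<String> tablets) {
            this.tserver = tserver;
            this.tablets = tablets;
        }
    }

    /** Assumption: tablets are spread round-robin across tservers. */
    static Map<String, List<String>> evenLayout(int numTablets, int numTservers) {
        Map<String, List<String>> layout = new LinkedHashMap<>();
        for (int t = 0; t < numTservers; t++) {
            layout.put("tserver" + t, new ArrayList<>());
        }
        for (int i = 0; i < numTablets; i++) {
            layout.get("tserver" + (i % numTservers)).add("tablet" + i);
        }
        return layout;
    }

    /** Plans RPCs for one batch scan; whole tablets are the smallest unit of work. */
    static List<Rpc> planRpcs(Map<String, List<String>> tabletsByTserver, int numThreads) {
        if (numThreads < 1) {
            throw new IllegalArgumentException("numThreads must be at least 1");
        }
        List<Rpc> rpcs = new ArrayList<>();
        int numTservers = tabletsByTserver.size();
        if (numTservers == 0) {
            return rpcs;
        }
        // Assumption: each tserver's tablets are split into at most
        // ceil(threads / tservers) groups. This is 1 when threads <= tservers.
        int rpcsPerTserver = (numThreads + numTservers - 1) / numTservers;
        for (Map.Entry<String, List<String>> e : tabletsByTserver.entrySet()) {
            List<String> tablets = e.getValue();
            if (tablets.isEmpty()) {
                continue;
            }
            int groups = Math.min(rpcsPerTserver, tablets.size());
            int groupSize = (tablets.size() + groups - 1) / groups; // ceiling division
            for (int start = 0; start < tablets.size(); start += groupSize) {
                int end = Math.min(start + groupSize, tablets.size());
                rpcs.add(new Rpc(e.getKey(), new ArrayList<>(tablets.subList(start, end))));
            }
        }
        return rpcs;
    }

    public static void main(String[] args) {
        // Figures from the thread: 194 tablets on a 3-node cluster.
        Map<String, List<String>> layout = evenLayout(194, 3);
        System.out.println("2 threads:  " + planRpcs(layout, 2).size() + " RPCs");
        System.out.println("20 threads: " + planRpcs(layout, 20).size() + " RPCs");
    }
}
```

In this model, adding threads beyond the number of tablets cannot add parallelism, which matches Keith's point that the batch scanner does not parallelize operations within a tablet.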