Thank you Eric. I will surely do the same. Should uneven distribution across the tablets affect querying in Accumulo? In this case, it does. Is this behaviour normal?

On 13-May-2015 10:58 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
> Yes, that's a great way to split the data evenly.
>
> Also, since the data set is so small, turn on data caching for your table:
>
> shell> config -t mytable -s table.cache.block.enable=true
>
> You may want to increase the size of your tserver JVM, and increase the
> size of the cache:
>
> shell> config -s tserver.cache.data.size=1G
>
> This will help with repeated random look-ups.
>
> -Eric
>
> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
> vaibhav.thapliyal...@gmail.com> wrote:
>
>> Thank you Eric.
>>
>> One thing I would like to know. Does pre-splitting the data play a part
>> in querying Accumulo?
>>
>> Because I managed to somewhat decrease the querying time.
>> I did the following steps:
>> My table was around 1.47gb so I explicitly set the split parameter to
>> 256mb instead of the default 1gb.
>>
>> So I had just 8 tablets. Now when I carried out the same query, it
>> finished in 15s.
>>
>> Is it because the split points are more evenly distributed?
>>
>> The previous table on which the query took 50s had entries unevenly
>> distributed across the tablets.
>> Thanks
>> Vaibhav
>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>
>>> This use case is one of the things Accumulo was designed to handle well.
>>> It's the reason there is a BatchScanner.
>>>
>>> I've created:
>>>
>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>
>>> so we can investigate and track down any problems or improvements.
>>>
>>> Feel free to add any other details to the JIRA ticket.
>>>
>>> -Eric
>>>
>>>
>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <elahrvi...@ccri.com>
>>> wrote:
>>>
>>>> It sounds like each of your ranges is an ID, e.g. a single row. I've
>>>> found that scanning lots of non-sequential single-row ranges is pretty slow
>>>> in Accumulo. Your best approach is probably to create an index table on
>>>> whatever you are originally trying to query (assuming those 10000 ids came
>>>> from some other query).
>>>>
>>>> Thanks,
>>>>
>>>> Emilio
>>>>
>>>>
>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>
>>>> The number of rf files per tablet varies between 2 and 5. The number of
>>>> entries returned to me by the BatchScanner is 460000. The approx. average
>>>> data rate is 0.5 MB/s as seen on the Accumulo monitor page.
>>>>
>>>> A simple scan on the table has an average data rate of about 7-8 MB/s.
>>>>
>>>> All the ids exist in the Accumulo table.
>>>>
>>>> On 12 May 2015 at 23:39, Keith Turner <ke...@deenlo.com> wrote:
>>>>
>>>>> Do you know how much data is being brought back (e.g. 100 megabytes)?
>>>>> I am wondering what the data rate is in MB/s. Do you know how many files
>>>>> per tablet you have? Do most of the 10,000 ids you are querying for
>>>>> exist?
>>>>>
>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>
>>>>>> I have 194 tablets. Currently I am using 20 threads to create the
>>>>>> BatchScanner inside the createBatchScanner method.
>>>>>> On 12-May-2015 11:19 pm, "Keith Turner" <ke...@deenlo.com> wrote:
>>>>>>
>>>>>>> How many tablets do you have? The batch scanner does not
>>>>>>> parallelize operations within a tablet.
>>>>>>>
>>>>>>> If you give the batch scanner more threads than there are
>>>>>>> tservers, it will make multiple parallel rpc calls to each tserver if
>>>>>>> the tserver has multiple tablets. Each rpc may include multiple tablets
>>>>>>> and ranges for each tablet.
>>>>>>>
>>>>>>> If the batch scanner has fewer threads than tservers, it will make
>>>>>>> one rpc per tserver per thread. Each rpc call will include all tablets
>>>>>>> and associated ranges for that tserver.
>>>>>>>
>>>>>>> Keith
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am using BatchScanner to scan rows from an Accumulo table.
>>>>>>>> The table has around 187m entries and I am using a 3 node cluster
>>>>>>>> which has Accumulo 1.6.1.
>>>>>>>>
>>>>>>>> I have passed 10000 ids which are stored as row id in my table as
>>>>>>>> a list in the setRanges() method.
>>>>>>>>
>>>>>>>> This whole process takes around 50 secs (from adding the ids in
>>>>>>>> the list to scanning the whole table using the BatchScanner).
>>>>>>>>
>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>
>>>>>>>> Also if anyone could briefly explain how a BatchScanner works,
>>>>>>>> how it does parallel scanning it would help me understand what I am
>>>>>>>> doing better.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Vaibhav
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>>>
>
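
On the question of whether uneven distribution across tablets can matter: Keith notes in the thread that the batch scanner does not parallelize operations within a tablet, so a tablet that receives most of the queried ids may become the slow part of the query. The sketch below is only a toy model of that idea in plain Java, not Accumulo code. The class and method names (TabletLoadSketch, rangesPerTablet, busiestTablet) and the sample split points are hypothetical. It assumes a tablet holds every row up to and including its end row, and that Java String ordering is a good enough stand-in for Accumulo's byte-wise row ordering.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model only: plain Java, no Accumulo classes. All names are hypothetical.
final class TabletLoadSketch {

    // Label for the last tablet, which has no end row.
    static final String LAST_TABLET = "<last>";

    // Counts how many queried row ids land in each tablet. A tablet is keyed
    // by its end row; a row goes to the smallest split point >= that row.
    // Assumption: String ordering stands in for byte-wise row ordering.
    static Map<String, Integer> rangesPerTablet(TreeSet<String> splits, List<String> rowIds) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String split : splits) {
            counts.put(split, 0);
        }
        counts.put(LAST_TABLET, 0);
        for (String row : rowIds) {
            String endRow = splits.ceiling(row);
            counts.merge(endRow == null ? LAST_TABLET : endRow, 1, Integer::sum);
        }
        return counts;
    }

    // Largest number of single-row ranges that any one tablet has to serve.
    static int busiestTablet(Map<String, Integer> counts) {
        return Collections.max(counts.values());
    }

    public static void main(String[] args) {
        TreeSet<String> splits = new TreeSet<>(Arrays.asList("g", "n", "t"));
        List<String> rowIds = Arrays.asList("a", "b", "g", "h", "z");
        Map<String, Integer> counts = rangesPerTablet(splits, rowIds);
        System.out.println(counts);
        System.out.println("busiest tablet serves " + busiestTablet(counts) + " ranges");
    }
}
```

With the sample data, three of the five ids fall into the tablet ending at "g" while the tablet ending at "t" gets none. Running the same count against a table's real split points and the real id list would give a rough picture of how skewed a query is, though it says nothing about file counts or caching, which the thread also raises.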