Dylan could you elaborate on the average query time you had?

Thanks
Vaibhav

On 14-May-2015 11:03 pm, "Dylan Hutchison" <dhutc...@mit.edu> wrote:
> I think this is the same issue I found for ACCUMULO-3710
> <https://issues.apache.org/jira/browse/ACCUMULO-3710>, only in my case
> the tserver ran out of memory. Accumulo doesn't handle large numbers of
> small, disjoint ranges well. I bet there's room for improvement on both
> the client and tablet server.
>
> ~Dylan
>
> On Wed, May 13, 2015 at 3:13 PM, Eric Newton <eric.new...@gmail.com>
> wrote:
>
>> Yes, hot-spotting does affect Accumulo because you have fewer servers
>> and caches handling your request.
>>
>> Let's say your data is spread out, in a normal distribution from
>> "0".."9".
>>
>> What if you have only 1 split? You would want it at "5", to divide the
>> data in half, and you could host the halves on different servers. But
>> if you split at "1", now 10% of your queries go to one tablet, and 90%
>> go to the other.
>>
>> -Eric
>>
>> On Wed, May 13, 2015 at 1:56 PM, vaibhav thapliyal <
>> vaibhav.thapliyal...@gmail.com> wrote:
>>
>>> Thank you Eric. I will surely do the same. Should uneven distribution
>>> across the tablets affect querying in Accumulo? In this case, it does.
>>> Is this behaviour normal?
>>>
>>> On 13-May-2015 10:58 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>>
>>>> Yes, that's a great way to split the data evenly.
>>>>
>>>> Also, since the data set is so small, turn on data caching for your
>>>> table:
>>>>
>>>> shell> config -t mytable -s table.cache.block.enable=true
>>>>
>>>> You may want to increase the size of your tserver JVM, and increase
>>>> the size of the cache:
>>>>
>>>> shell> config -s tserver.cache.data.size=1G
>>>>
>>>> This will help with repeated random look-ups.
>>>>
>>>> -Eric
>>>>
>>>> On Wed, May 13, 2015 at 11:31 AM, vaibhav thapliyal <
>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>
>>>>> Thank you Eric.
>>>>>
>>>>> One thing I would like to know. Does pre-splitting the data play a
>>>>> part in querying Accumulo?
>>>>>
>>>>> Because I managed to somewhat decrease the querying time.
>>>>> I did the following steps:
>>>>> My table was around 1.47 GB so I explicitly set the split parameter
>>>>> to 256 MB instead of the default 1 GB.
>>>>>
>>>>> So I had just 8 tablets. Now when I carried out the same query, it
>>>>> finished in 15s.
>>>>>
>>>>> Is it because the split points are more evenly distributed?
>>>>>
>>>>> The previous table on which the query took 50s had entries unevenly
>>>>> distributed across the tablets.
>>>>>
>>>>> Thanks
>>>>> Vaibhav
>>>>>
>>>>> On 13-May-2015 7:43 pm, "Eric Newton" <eric.new...@gmail.com> wrote:
>>>>>
>>>>>> This use case is one of the things Accumulo was designed to handle
>>>>>> well. It's the reason there is a BatchScanner.
>>>>>>
>>>>>> I've created:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ACCUMULO-3813
>>>>>>
>>>>>> so we can investigate and track down any problems or improvements.
>>>>>>
>>>>>> Feel free to add any other details to the JIRA ticket.
>>>>>>
>>>>>> -Eric
>>>>>>
>>>>>> On Wed, May 13, 2015 at 10:03 AM, Emilio Lahr-Vivaz <
>>>>>> elahrvi...@ccri.com> wrote:
>>>>>>
>>>>>>> It sounds like each of your ranges is an ID, e.g. a single row.
>>>>>>> I've found that scanning lots of non-sequential single-row ranges
>>>>>>> is pretty slow in Accumulo. Your best approach is probably to
>>>>>>> create an index table on whatever you are originally trying to
>>>>>>> query (assuming those 10000 ids came from some other query).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Emilio
>>>>>>>
>>>>>>> On 05/13/2015 09:14 AM, vaibhav thapliyal wrote:
>>>>>>>
>>>>>>> The number of rf files per tablet varies between 2 and 5. The
>>>>>>> number of entries returned to me by the BatchScanner is 460000.
>>>>>>> The approx. average data rate is 0.5 MB/s as seen on the Accumulo
>>>>>>> monitor page.
>>>>>>>
>>>>>>> A simple scan on the table has an average data rate of about 7-8
>>>>>>> MB/s.
>>>>>>>
>>>>>>> All the ids exist in the Accumulo table.
>>>>>>>
>>>>>>> On 12 May 2015 at 23:39, Keith Turner <ke...@deenlo.com> wrote:
>>>>>>>
>>>>>>>> Do you know how much data is being brought back (e.g. 100
>>>>>>>> megabytes)? I am wondering what the data rate is in MB/s. Do you
>>>>>>>> know how many files per tablet you have? Do most of the 10,000
>>>>>>>> ids you are querying for exist?
>>>>>>>>
>>>>>>>> On Tue, May 12, 2015 at 1:58 PM, vaibhav thapliyal <
>>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have 194 tablets. Currently I am using 20 threads to create
>>>>>>>>> the BatchScanner inside the createBatchScanner method.
>>>>>>>>>
>>>>>>>>> On 12-May-2015 11:19 pm, "Keith Turner" <ke...@deenlo.com> wrote:
>>>>>>>>>
>>>>>>>>>> How many tablets do you have? The batch scanner does not
>>>>>>>>>> parallelize operations within a tablet.
>>>>>>>>>>
>>>>>>>>>> If you give the batch scanner more threads than there are
>>>>>>>>>> tservers, it will make multiple parallel rpc calls to each
>>>>>>>>>> tserver if the tserver has multiple tablets. Each rpc may
>>>>>>>>>> include multiple tablets and ranges for each tablet.
>>>>>>>>>>
>>>>>>>>>> If the batch scanner has fewer threads than tservers, it will
>>>>>>>>>> make one rpc per tserver per thread. Each rpc call will include
>>>>>>>>>> all tablets and associated ranges for that tserver.
>>>>>>>>>>
>>>>>>>>>> Keith
>>>>>>>>>>
>>>>>>>>>> On Tue, May 12, 2015 at 1:39 PM, vaibhav thapliyal <
>>>>>>>>>> vaibhav.thapliyal...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am using BatchScanner to scan rows from an Accumulo table.
>>>>>>>>>>> The table has around 187m entries and I am using a 3 node
>>>>>>>>>>> cluster which has Accumulo 1.6.1.
>>>>>>>>>>>
>>>>>>>>>>> I have passed 10000 ids which are stored as row id in my table
>>>>>>>>>>> as a list in the setRanges() method.
>>>>>>>>>>>
>>>>>>>>>>> This whole process takes around 50 secs (from adding the ids
>>>>>>>>>>> in the list to scanning the whole table using the
>>>>>>>>>>> BatchScanner).
>>>>>>>>>>>
>>>>>>>>>>> I tried switching on bloom filters but that didn't work.
>>>>>>>>>>>
>>>>>>>>>>> Also if anyone could briefly explain how a BatchScanner works
>>>>>>>>>>> and how it does parallel scanning, it would help me understand
>>>>>>>>>>> what I am doing better.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Vaibhav
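
Eric's single-split example earlier in the thread (a split at "5" versus a split at "1") can be checked with a few lines of arithmetic. The sketch below is only an illustration and does not use the Accumulo API: the class SplitShare, its two methods, and the data set of zero-padded ids "0000".."9999" are all hypothetical, and it assumes the row ids are spread evenly across the leading digits. It relies on two facts: rows are kept in lexicographic order, and a split point is the inclusive end row of the first tablet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Accumulo: estimates how a single split
// point divides a table whose row ids are evenly spread digit strings.
class SplitShare {

    // Builds the row ids "0000".."9999" (an assumed, evenly spread data set).
    static List<String> evenlySpreadRows() {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            rows.add(String.format(Locale.ROOT, "%04d", i));
        }
        return rows;
    }

    // Fraction of rows that sort at or before the split point, i.e. the
    // share of rows that would land in the first of the two tablets.
    // String.compareTo is lexicographic, which for ASCII digits matches
    // the byte-wise row ordering.
    static double shareAtOrBefore(String split, List<String> rows) {
        long count = 0;
        for (String row : rows) {
            if (row.compareTo(split) <= 0) {
                count++;
            }
        }
        return (double) count / rows.size();
    }

    public static void main(String[] args) {
        List<String> rows = evenlySpreadRows();
        for (String split : new String[] {"5", "1"}) {
            double first = shareAtOrBefore(split, rows);
            System.out.printf(Locale.ROOT, "split at \"%s\": %.0f%% / %.0f%%%n",
                    split, first * 100, (1 - first) * 100);
        }
    }
}
```

Under these assumptions a split at "5" puts half of the rows in each tablet, while a split at "1" puts about 10% in one tablet and 90% in the other, so random look-ups over the ids would mostly hit a single tablet and its server.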