This use case is one of the things Accumulo was designed to handle well.
It's the reason there is a BatchScanner.
I've created:
https://issues.apache.org/jira/browse/ACCUMULO-3813
so we can investigate and track down any problems or improvements.
Feel free to add any other details to the JIRA.
It sounds like each of your ranges is an ID, e.g. a single row. I've
found that scanning lots of non-sequential single-row ranges is pretty
slow in Accumulo. Your best approach is probably to create an index
table on whatever you are originally trying to query (assuming those
IDs came
As long as you're managing your expectations (which it sounds like you've
considered well), there could be some worth.
A concern would be how using a different filesystem implementation
actually impacts the validity of your benchmark though.
e.g. w/ a local FS (which is by default what MAC
Yes, hot-spotting does affect Accumulo because you have fewer servers and
caches handling your request.
Let's say your data is spread out, in a normal distribution from 0..9.
What if you have only 1 split? You would want it at 5, to divide the
data in half, and you could host the halves on
Yes, that's a great way to split the data evenly.
Also, since the data set is so small, turn on data caching for your table:
shell> config -t mytable -s table.cache.block.enable=true
You may want to increase the size of your tserver JVM, and increase the
size of the cache:
shell> config -s
Thank you Eric.
One thing I would like to know: does pre-splitting the data play a part in
querying Accumulo?
I ask because I managed to decrease the query time somewhat.
I did the following steps:
My table was around 1.47 GB, so I explicitly set the split parameter to 256 MB
instead of the default
Thank you Eric. I will surely do the same. Should uneven distribution
across the tablets affect querying in Accumulo? In this case, it does. Is
this behaviour normal?
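
To make the uneven-distribution question concrete, here is a minimal sketch in plain Java. It does not use the Accumulo API; the class name, method names, and sample row IDs are all hypothetical. It assumes a sorted sample of row IDs is available, picks split points at even quantiles of that sample, and counts how many rows land in each resulting tablet, treating each split point as the inclusive end row of a tablet.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of Accumulo.
final class SplitSketch {

    // Pick up to numSplits split points at even quantiles of a sorted sample.
    public static List<String> quantileSplits(List<String> sortedRows, int numSplits) {
        List<String> splits = new ArrayList<>();
        for (int i = 1; i <= numSplits; i++) {
            // Index of the last sampled row that should fall in the i-th tablet.
            int idx = (int) ((long) i * sortedRows.size() / (numSplits + 1)) - 1;
            idx = Math.max(idx, 0);
            String candidate = sortedRows.get(idx);
            // Skip duplicates so the split list stays strictly increasing.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    // Count rows per tablet; tablet i covers rows in (splits[i-1], splits[i]].
    public static int[] rowsPerTablet(List<String> sortedRows, List<String> splits) {
        int[] counts = new int[splits.size() + 1];
        for (String row : sortedRows) {
            int tablet = 0;
            while (tablet < splits.size() && row.compareTo(splits.get(tablet)) > 0) {
                tablet++;
            }
            counts[tablet]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample: most row IDs cluster under the "a" prefix.
        List<String> rows = Arrays.asList(
            "a01", "a02", "a03", "a04", "a05", "a06", "m01", "t01", "z01");

        // Splits spaced evenly over the alphabet leave one tablet with most rows.
        List<String> naive = Arrays.asList("h", "q");
        System.out.println(Arrays.toString(rowsPerTablet(rows, naive)));    // [6, 1, 2]

        // Splits taken from the sample's quantiles balance the tablets.
        List<String> balanced = quantileSplits(rows, 2);
        System.out.println(balanced);                                       // [a03, a06]
        System.out.println(Arrays.toString(rowsPerTablet(rows, balanced))); // [3, 3, 3]
    }
}
```

With the naive splits, two thirds of the sampled rows sit in a single tablet, so one server would handle most of the requests for that data. Row counts are only a proxy here; real tablets also differ in the number and size of entries per row.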
On 13-May-2015 10:58 pm, Eric Newton eric.new...@gmail.com wrote:
Yes, that's a great way to split the data evenly.
Also, since
Hi,
Is it crazy to use a MiniAccumuloCluster to measure the *relative*
performance of two different implementations of iterators?
Obviously it would be better to do it on a real Accumulo cluster, but
that's not possible for several reasons.
The approach would be something like:
- Fire up a Mini