Hi all,

I have started having issues with our Riak cluster. One of our buckets is
getting quite big, between 60M and 80M entries, and we are now seeing huge
latency issues when we try to remove data from it.

Our usual average latency under normal write load is 5000us for gets and
9000us for puts. We have over 1000 workers executing tasks and sending
their results to the Riak cluster. That part works really well.

The problem is that we can't keep all that data forever, so each of our task
results has an expiry field that we query through SOLR to find and delete
expired data. We have 1 thread that issues queries to SOLR to get a batch of
results to delete and 50 worker threads that issue the delete commands to
the Riak cluster. The problem is that as soon as I start issuing search
queries, latency goes through the roof. Just issuing the search queries to
gather the data adds 60 000us of get latency and 50 000us of put latency to
the KV gets and puts. I don't get how searching through SOLR could create
such latency in the gets and puts; it should not have any effect. And if I
actually turn on the deleting workers, it's yet another 10 000us of get
latency and 10 000us of put latency. The deletion workers' latency I
understand, since they are actually changing the data. It's the search
latency that I find a bit odd.
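
For reference, the shape of the cleanup job described above (one searcher
feeding a pool of delete workers) is roughly the following. This is only a
minimal sketch, not our actual code: `batches` and `delete_key` are
hypothetical stand-ins for the SOLR query loop and the Riak delete call, and
the queue bound is an arbitrary assumption.

```python
import queue
import threading

NUM_DELETE_WORKERS = 50  # the 50 delete threads mentioned above
_DONE = object()         # sentinel telling a worker to exit


def run_expiry_sweep(batches, delete_key, num_workers=NUM_DELETE_WORKERS):
    """Feed keys from `batches` to `num_workers` deleter threads.

    `batches` is any iterable of lists of keys (hypothetical stand-in for
    the single thread paging through search results); `delete_key` is a
    hypothetical callable that deletes one key. Returns the delete count.
    """
    # Bounded queue so the searcher cannot run far ahead of the deleters.
    keys = queue.Queue(maxsize=num_workers * 4)
    count = [0]
    lock = threading.Lock()

    def worker():
        while True:
            key = keys.get()
            if key is _DONE:
                return
            delete_key(key)
            with lock:
                count[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    # The single "search" producer: push every key from every batch.
    for batch in batches:
        for key in batch:
            keys.put(key)

    # One sentinel per worker, then wait for all of them to drain the queue.
    for _ in threads:
        keys.put(_DONE)
    for t in threads:
        t.join()
    return count[0]
```

With this layout the search side and the delete side can be switched on
independently, which is how the two latency increases above were measured
separately.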

Even worse, the search index on that bucket now seems extremely slow. Any
query that I run against it takes at least 1800ms. That may be part of the
problem. But at the end of the day the index is not even that big, 15GB on
each node.

To give you more context, this is a breakdown of our Riak cluster:

6x Dell r620
     2x10 cores CPUs with HT = 40 logical cores
     256 GB Ram
     3x300GB 15k RPM SATA in Mirror = OS
     6x900GB 15k RPM SATA in RAID5 = Riak+SOLR
     4x10G Nic (but only one is configured for now)
     Ubuntu 14.04.1 Server
     *I've applied all the OS tweaks suggested in the Riak docs

Riak config changes from the defaults:
leveldb
ring size of 128
solr -Xmx16g and ConcMarkSweepGC
erlang at smp:20:20
*tried setting erlang.async_threads to 256 but it made no difference

The result bucket replication is set to n=2

Load average on each node never goes over 5-6, which I find extremely low
for such a big machine. There are never any reads from disk because all
the leveldb files and the search index fit in the OS cache.

Am I expecting too much from Riak/SOLR? There has to be something that I'm
doing wrong somewhere...

Steve