Hi all, I've started having some issues with our Riak cluster. One of our buckets is getting quite big, between 60M and 80M entries, and we're now seeing huge latency issues when we try to remove data from it.
Our usual get/put latency during normal write operation averages 5000us for gets and 9000us for puts. We have over 1000 workers executing tasks and sending their results to the Riak cluster, and that part works really well. The problem is that we can't keep all that data forever, so each task result has an expiry field that we query through SOLR to find expired data. One thread issues the SOLR queries to gather batches of results to delete, and 50 worker threads issue the delete commands to the Riak cluster.

The problem is that as soon as I start issuing the search queries, latency goes through the roof. Just issuing the search queries to gather the data adds 60 000us of get latency and 50 000us of put latency to the KV gets and puts. I don't understand how searching through SOLR could create such latency on the gets and puts; it shouldn't have any effect. And if I actually turn on the deletion workers, it's yet another 10 000us of get latency and 10 000us of put latency. The deletion workers' latency I can understand, since they actually change the data; it's the search latency that I find a bit odd. Even worse, the search index on that bucket now seems extremely slow: any query I run against it takes at minimum 1800ms. That may be part of the problem, but at the end of the day the index isn't even that big, 15GB on each node.
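For reference, the expire-and-delete pipeline described above (one query thread feeding delete workers through a queue) looks roughly like this. This is a minimal, self-contained sketch: `fetch_expired_keys` and the in-memory `store` dict are stand-ins for the real SOLR range query and Riak bucket, the worker count is reduced for illustration, and the real code would call the Riak client's delete instead of popping from a dict.

```python
import queue
import threading

# In-memory stand-in for the Riak bucket (hypothetical; real code uses the Riak client).
store = {f"task-{i}": {"expiry": i} for i in range(100)}
store_lock = threading.Lock()

def fetch_expired_keys(now):
    # Stand-in for the SOLR range query on the expiry field,
    # e.g. expiry:[* TO <now>] against the bucket's search index.
    with store_lock:
        return [k for k, v in store.items() if v["expiry"] < now]

def delete_worker(q):
    while True:
        key = q.get()
        if key is None:  # sentinel: shut down this worker
            break
        with store_lock:
            store.pop(key, None)  # real code: bucket.delete(key)
        q.task_done()

q = queue.Queue()
workers = [threading.Thread(target=delete_worker, args=(q,)) for _ in range(5)]
for w in workers:
    w.start()

# One query "thread" feeds the delete workers.
for key in fetch_expired_keys(now=50):
    q.put(key)
q.join()  # wait until every queued delete is done

for _ in workers:  # send one sentinel per worker
    q.put(None)
for w in workers:
    w.join()

print(len(store))  # entries with expiry >= 50 remain
```

One design note: batching the SOLR query results through a bounded queue like this lets you throttle the delete workers independently of the query thread, which can help when deletes are adding load to the cluster.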
To give you more context, here is a breakdown of our Riak cluster:

- 6x Dell R620: 2x 10-core CPUs with HT = 40 logical cores, 256 GB RAM
- 3x 300GB 15k RPM SATA in mirror = OS
- 6x 900GB 15k RPM SATA in RAID5 = Riak + SOLR
- 4x 10G NICs (but only one is configured for now)
- Ubuntu 14.04.1 Server
- I've done all the suggested tweaks from the Riak docs on the OS

Current Riak config, changed from default:

- leveldb backend
- ring size of 128
- SOLR -Xmx16g and ConcMarkSweepGC
- Erlang at smp:20:20
- Tried erlang.async_threads at 256, but it didn't do anything
- The result bucket's replication is set to n=2

Load average on each node never goes over 5-6, which I find extremely low for such big machines. There are never any reads on the disks because all the leveldb files and the search index fit in the OS cache. Am I expecting too much from Riak/SOLR? There has to be something I'm doing wrong somewhere...

Steve
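For anyone wanting to compare, the non-default settings above map roughly to the following riak.conf fragment. The key names are from the Riak 2.x riak.conf reference and should be checked against your version; the scheduler setting (smp:20:20) may instead live in vm.args as `+S 20:20` depending on your setup.

```
## Assumed riak.conf fragment matching the settings described above.
storage_backend = leveldb
ring_size = 128
search = on
search.solr.jvm_options = -d64 -Xms1g -Xmx16g -XX:+UseConcMarkSweepGC
erlang.async_threads = 256
```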
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com