Hi Fred, Thanks for taking the time! Yes, I noticed that unbalance yesterday when writing, looked into it after sending and found our config is corrupt with one node ommitted and another in there twice. But, with such low traffic levels and the spikes being on the non-favoured node I'm not currently ranking that as a likely factor.
Another interesting case from last night, this sample was taken at 2017-6-23 06:04:09 Riak node 1 "search_query_throughput_one": 27 "search_query_latency_max": 10417 Riak node 2 "search_query_throughput_one": 49 "search_query_latency_max": 8952 Riak node 3 "search_query_throughput_one": 18 "search_query_throughput_count": 2507 "search_query_latency_min": 1757 "search_query_latency_median": 14775 "search_query_latency_mean": 5628361 "search_query_latency_max": 18298854 "search_query_latency_999": 18298854 "search_query_latency_99": 18298854 "search_query_latency_95": 16539782 Riak node 4 "search_query_throughput_one": 25 "search_query_latency_max": 10217 Brushing up my maths and focussing on node 3, from the 99 & 95% figures we can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in total. And from the request count for the minute & the mean we can tell that in total these 18 requests spent a total of 101,310 ms being processed. >From the median & min we know the 9 quickest took between 18 & 265 ms in total. This leaves in the region of 66 sec for the other 7 requests, enough for all 7 to have timed out. Cross referencing with our application logs I can see: On application node 1 at 2017-06-23 06:03:17 we had 3 search request timeouts to index A with 3 different filters, one field of which, lets call it field X, had the same value. We immediately retried these and at 2017-06-23 06:03:19 2 of those timed out again and were retried again. They all succeeded on this retry, so this suggests that the same requests sent to other riak nodes was fine, but to this riak node at this time was a problem. On application node 2 at: 2017-06-23 06:03:27 2017-06-23 06:03:29 2017-06-23 06:03:31 2017-06-23 06:03:33 we had 4 more timeouts on search requests to index A, these requests had 2 different filters but in both cases field X had the same value as in the previous example. So these application logs show 9 riak timeouts, that must correlate with the riak stats. I can't definitively say that no other search requests went to this riak node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it had a problem for 18 seconds, which is quiet a big window. The index that all these requests were directed at currently has 490K entries with 8 different fields defined in each. The corresponding riak bucket has allow_mult = false, if that's relevant. We see a similar pattern on our test system, I'm going to setup a test to repeatedly do searches and see if I can trigger this consistently. Will let ye know if anything interesting comes out of it. I know it's relatively new to the product, do we know is riak solr used much in production systems? I assume no one else has seen these spikes? //Sean. On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin <f...@dushin.net> wrote: > It's pretty strange that you are seeing no search latency measurements on > node 5. Are you sure your round robining is working? Are you favoring > node 1? > > In general, I don't think which node you hit for query should make a > difference, but I'd have to stare at the code some to be sure. In essence, > all the node that services the query does is convert the query into a > sharded Solr query based on a coverage plan, which changes every minute or > so, and then runs the sharded query on the local Solr node. The Solr node > then distributes the query to the rest of the nodes in the cluster, but > that's all Solr comms -- Riak is out of the picture, by then. > > Now, if you have a lot of sharded queries accumulating on one node, that > might make a difference to Solr. I am not a Solr expert, and I don't even > play one on TV. But maybe the fact that you are not hitting node 5 is > relevant for that reason? > > Can you do more analysis on your client, to make sure you are not favoring > node 1? > > -Fred > > > On Jun 22, 2017, at 10:20 AM, sean mcevoy <sean.mce...@gmail.com> wrote: > > > > Hi List, > > > > We have a standard riak cluster with 5 nodes and at the minute the > traffic levels are fairly low. Each of our application nodes has 25 client > connections, 5 to each riak node which get selected in a round robin. > > > > Our application level requests involve multiple riak requests so our > traffic tends to make requests in small bursts. Everything works fine for > KV gets, puts & deletes but we're seeing timeouts & weird response time > spikes on solr search operations. > > > > In the past 36 hours (the only period I have riak stats for) I see one > response time of 38.8 seconds, 3 hours earlier a response time of 20.8 > seconds, and the third biggest spike is an acceptable 3.5 seconds. > > > > See below all search_query stats for the minute of the 38 sec sample. In > the application request we made 5 riak search requests to the same index in > parallel, which happens for each request of this type and normally doesn't > have an issue. But in this case all 5 timed out, and one timed out again on > retry with the other 4 succeeding. > > > > Anyone ever seen anything like this before? Is there any known deadlock > in solr that I might hit if I make the same request on another connection > before the first has completed? This is what we do when our riak client > times out after 2 seconds and immediately retries. > > > > Any advice or pointers welcomed. > > Thanks, > > //Sean. > > > > > > Riak node 1 > > search_query_throughput_one: 14 > > search_query_throughput_count: 259 > > search_query_latency_min: 2776 > > search_query_latency_median: 69411 > > search_query_latency_mean: 4900973 > > search_query_latency_max: 38887902 > > search_query_latency_999: 38887902 > > search_query_latency_99: 38887902 > > search_query_latency_95: 2046215 > > search_query_fail_one: 0 > > search_query_fail_count: 0 > > > > Riak node 2 > > search_query_throughput_one: 22 > > search_query_throughput_count: 564 > > search_query_latency_min: 4006 > > search_query_latency_median: 8800 > > search_query_latency_mean: 11834 > > search_query_latency_max: 25509 > > search_query_latency_999: 25509 > > search_query_latency_99: 25509 > > search_query_latency_95: 24035 > > search_query_fail_one: 0 > > search_query_fail_count: 0 > > > > Riak node 3 > > search_query_throughput_one: 6 > > search_query_throughput_count: 298 > > search_query_latency_min: 3200 > > search_query_latency_median: 15391 > > search_query_latency_mean: 18062 > > search_query_latency_max: 31759 > > search_query_latency_999: 31759 > > search_query_latency_99: 31759 > > search_query_latency_95: 31759 > > search_query_fail_one: 0 > > search_query_fail_count: 0 > > > > Riak node 4 > > search_query_throughput_one: 8 > > search_query_throughput_count: 334 > > search_query_latency_min: 2404 > > search_query_latency_median: 7230 > > search_query_latency_mean: 10211 > > search_query_latency_max: 22502 > > search_query_latency_999: 22502 > > search_query_latency_99: 22502 > > search_query_latency_95: 22502 > > search_query_fail_one: 0 > > search_query_fail_count: 0 > > > > Riak node 5 > > search_query_throughput_one: 0 > > search_query_throughput_count: 0 > > search_query_latency_min: 0 > > search_query_latency_median: 0 > > search_query_latency_mean: 0 > > search_query_latency_max: 0 > > search_query_latency_999: 0 > > search_query_latency_99: 0 > > search_query_latency_95: 0 > > search_query_fail_one: 0 > > search_query_fail_count: 0 > > > > _______________________________________________ > > riak-users mailing list > > riak-users@lists.basho.com > > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com > > > _______________________________________________ > riak-users mailing list > riak-users@lists.basho.com > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com >
_______________________________________________ riak-users mailing list riak-users@lists.basho.com http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com