Re: Solr search response time spikes

2017-07-03 Thread sean mcevoy
Hi List, Fred,

After a week of going cross-eyed looking at stats & trying to engineer a
test case to make this happen in the test env I think I've made a
breakthrough.

We have a low but steady level of riak traffic but our application level
actions that result in solr reads are actually fairly infrequent. And when
one of these actions occur it results in multiple parallel reads to our
solr indexes.

What I've observed is that our timeouts are most easily reproduced after a
period of inactivity. And once I see a timeout after 2 seconds I kick off
multiple other reads to random keys and observe that some return instantly
while others can take several seconds, but then all return at the same time.

It's almost as if some shards in the java VM have gone to sleep due to
inactivity and we see a cluster of timeouts when we try to read from it.

I'm setting up a "pinger" script in our prod env to keep these awake and
see if our observed timeout rate reduces.

If this is actually our problem are there any JVM config options we can use
to keep the index active all the time?

//Sean.

On Fri, Jun 23, 2017 at 1:48 PM, sean mcevoy  wrote:

> Hi Fred,
>
> Thanks for taking the time!
> Yes, I noticed that unbalance yesterday when writing, looked into it after
> sending and found our config is corrupt with one node ommitted and another
> in there twice.
> But, with such low traffic levels and the spikes being on the non-favoured
> node I'm not currently ranking that as a likely factor.
>
>
> Another interesting case from last night, this sample was taken at
> 2017-6-23 06:04:09
>
> Riak node 1
> "search_query_throughput_one": 27
> "search_query_latency_max": 10417
>
> Riak node 2
> "search_query_throughput_one": 49
> "search_query_latency_max": 8952
>
> Riak node 3
> "search_query_throughput_one": 18
> "search_query_throughput_count": 2507
> "search_query_latency_min": 1757
> "search_query_latency_median": 14775
> "search_query_latency_mean": 5628361
> "search_query_latency_max": 18298854
> "search_query_latency_999": 18298854
> "search_query_latency_99": 18298854
> "search_query_latency_95": 16539782
>
> Riak node 4
> "search_query_throughput_one": 25
> "search_query_latency_max": 10217
>
>
> Brushing up my maths and focussing on node 3, from the 99 & 95% figures we
> can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in
> total.
> And from the request count for the minute & the mean we can tell that in
> total these 18 requests spent a total of 101,310 ms being processed.
> From the median & min we know the 9 quickest took between 18 & 265 ms in
> total.
> This leaves in the region of 66 sec for the other 7 requests, enough for
> all 7 to have timed out.
>
>
> Cross referencing with our application logs I can see:
>
> On application node 1 at 2017-06-23 06:03:17 we had 3 search request
> timeouts to index A with 3 different filters, one field of which, lets call
> it field X, had the same value.
> We immediately retried these and at 2017-06-23 06:03:19 2 of those timed
> out again and were retried again.
> They all succeeded on this retry, so this suggests that the same requests
> sent to other riak nodes was fine, but to this riak node at this time was a
> problem.
>
> On application node 2 at:
> 2017-06-23 06:03:27
> 2017-06-23 06:03:29
> 2017-06-23 06:03:31
> 2017-06-23 06:03:33
>
> we had 4 more timeouts on search requests to index A, these requests had 2
> different filters but in both cases field X had the same value as in the
> previous example.
>
>
> So these application logs show 9 riak timeouts, that must correlate with
> the riak stats.
> I can't definitively say that no other search requests went to this riak
> node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it
> had a problem for 18 seconds, which is quiet a big window.
>
>
> The index that all these requests were directed at currently has 490K
> entries with 8 different fields defined in each. The corresponding riak
> bucket has allow_mult = false, if that's relevant.
>
> We see a similar pattern on our test system, I'm going to setup a test to
> repeatedly do searches and see if I can trigger this consistently. Will let
> ye know if anything interesting comes out of it.
>
> I know it's relatively new to the product, do we know is riak solr used
> much in production systems?
> I assume no one else has seen these spikes?
>
> //Sean.
>
>
> On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin  wrote:
>
>> It's pretty strange that you are seeing no search latency measurements on
>> node 5.  Are you sure your round robining is working?  Are you favoring
>> node 1?
>>
>> In general, I don't think which node you hit for query should make a
>> difference, but I'd have to stare at the code some to be sure.  In essence,
>> all the node that services the query does is convert the query into a
>> sharded Solr query based on a coverage plan, which changes every minute or
>> so, and then 

Re: Solr search response time spikes

2017-06-23 Thread sean mcevoy
Hi Fred,

Thanks for taking the time!
Yes, I noticed that unbalance yesterday when writing, looked into it after
sending and found our config is corrupt with one node ommitted and another
in there twice.
But, with such low traffic levels and the spikes being on the non-favoured
node I'm not currently ranking that as a likely factor.


Another interesting case from last night, this sample was taken at
2017-6-23 06:04:09

Riak node 1
"search_query_throughput_one": 27
"search_query_latency_max": 10417

Riak node 2
"search_query_throughput_one": 49
"search_query_latency_max": 8952

Riak node 3
"search_query_throughput_one": 18
"search_query_throughput_count": 2507
"search_query_latency_min": 1757
"search_query_latency_median": 14775
"search_query_latency_mean": 5628361
"search_query_latency_max": 18298854
"search_query_latency_999": 18298854
"search_query_latency_99": 18298854
"search_query_latency_95": 16539782

Riak node 4
"search_query_throughput_one": 25
"search_query_latency_max": 10217


Brushing up my maths and focussing on node 3, from the 99 & 95% figures we
can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in
total.
And from the request count for the minute & the mean we can tell that in
total these 18 requests spent a total of 101,310 ms being processed.
>From the median & min we know the 9 quickest took between 18 & 265 ms in
total.
This leaves in the region of 66 sec for the other 7 requests, enough for
all 7 to have timed out.


Cross referencing with our application logs I can see:

On application node 1 at 2017-06-23 06:03:17 we had 3 search request
timeouts to index A with 3 different filters, one field of which, lets call
it field X, had the same value.
We immediately retried these and at 2017-06-23 06:03:19 2 of those timed
out again and were retried again.
They all succeeded on this retry, so this suggests that the same requests
sent to other riak nodes was fine, but to this riak node at this time was a
problem.

On application node 2 at:
2017-06-23 06:03:27
2017-06-23 06:03:29
2017-06-23 06:03:31
2017-06-23 06:03:33

we had 4 more timeouts on search requests to index A, these requests had 2
different filters but in both cases field X had the same value as in the
previous example.


So these application logs show 9 riak timeouts, that must correlate with
the riak stats.
I can't definitively say that no other search requests went to this riak
node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it
had a problem for 18 seconds, which is quiet a big window.


The index that all these requests were directed at currently has 490K
entries with 8 different fields defined in each. The corresponding riak
bucket has allow_mult = false, if that's relevant.

We see a similar pattern on our test system, I'm going to setup a test to
repeatedly do searches and see if I can trigger this consistently. Will let
ye know if anything interesting comes out of it.

I know it's relatively new to the product, do we know is riak solr used
much in production systems?
I assume no one else has seen these spikes?

//Sean.


On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin  wrote:

> It's pretty strange that you are seeing no search latency measurements on
> node 5.  Are you sure your round robining is working?  Are you favoring
> node 1?
>
> In general, I don't think which node you hit for query should make a
> difference, but I'd have to stare at the code some to be sure.  In essence,
> all the node that services the query does is convert the query into a
> sharded Solr query based on a coverage plan, which changes every minute or
> so, and then runs the sharded query on the local Solr node.  The Solr node
> then distributes the query to the rest of the nodes in the cluster, but
> that's all Solr comms -- Riak is out of the picture, by then.
>
> Now, if you have a lot of sharded queries accumulating on one node, that
> might make a difference to Solr.  I am not a Solr expert, and I don't even
> play one on TV.  But maybe the fact that you are not hitting node 5 is
> relevant for that reason?
>
> Can you do more analysis on your client, to make sure you are not favoring
> node 1?
>
> -Fred
>
> > On Jun 22, 2017, at 10:20 AM, sean mcevoy  wrote:
> >
> > Hi List,
> >
> > We have a standard riak cluster with 5 nodes and at the minute the
> traffic levels are fairly low. Each of our application nodes has 25 client
> connections, 5 to each riak node which get selected in a round robin.
> >
> > Our application level requests involve multiple riak requests so our
> traffic tends to make requests in small bursts. Everything works fine for
> KV gets, puts & deletes but we're seeing timeouts & weird response time
> spikes on solr search operations.
> >
> > In the past 36 hours (the only period I have riak stats for) I see one
> response time of 38.8 seconds, 3 hours earlier a response time of 20.8
> seconds, and the third biggest spike is an acceptable 

Re: Solr search response time spikes

2017-06-22 Thread Fred Dushin
It's pretty strange that you are seeing no search latency measurements on node 
5.  Are you sure your round robining is working?  Are you favoring node 1?

In general, I don't think which node you hit for query should make a 
difference, but I'd have to stare at the code some to be sure.  In essence, all 
the node that services the query does is convert the query into a sharded Solr 
query based on a coverage plan, which changes every minute or so, and then runs 
the sharded query on the local Solr node.  The Solr node then distributes the 
query to the rest of the nodes in the cluster, but that's all Solr comms -- 
Riak is out of the picture, by then.

Now, if you have a lot of sharded queries accumulating on one node, that might 
make a difference to Solr.  I am not a Solr expert, and I don't even play one 
on TV.  But maybe the fact that you are not hitting node 5 is relevant for that 
reason?

Can you do more analysis on your client, to make sure you are not favoring node 
1?

-Fred

> On Jun 22, 2017, at 10:20 AM, sean mcevoy  wrote:
> 
> Hi List,
> 
> We have a standard riak cluster with 5 nodes and at the minute the traffic 
> levels are fairly low. Each of our application nodes has 25 client 
> connections, 5 to each riak node which get selected in a round robin.
> 
> Our application level requests involve multiple riak requests so our traffic 
> tends to make requests in small bursts. Everything works fine for KV gets, 
> puts & deletes but we're seeing timeouts & weird response time spikes on solr 
> search operations.
> 
> In the past 36 hours (the only period I have riak stats for) I see one 
> response time of 38.8 seconds, 3 hours earlier a response time of 20.8 
> seconds, and the third biggest spike is an acceptable 3.5 seconds.
> 
> See below all search_query stats for the minute of the 38 sec sample. In the 
> application request we made 5 riak search requests to the same index in 
> parallel, which happens for each request of this type and normally doesn't 
> have an issue. But in this case all 5 timed out, and one timed out again on 
> retry with the other 4 succeeding.
> 
> Anyone ever seen anything like this before? Is there any known deadlock in 
> solr that I might hit if I make the same request on another connection before 
> the first has completed? This is what we do when our riak client times out 
> after 2 seconds and immediately retries.
> 
> Any advice or pointers welcomed.
> Thanks,
> //Sean.
> 
> 
> Riak node 1
> search_query_throughput_one: 14
> search_query_throughput_count: 259
> search_query_latency_min: 2776
> search_query_latency_median: 69411
> search_query_latency_mean: 4900973
> search_query_latency_max: 38887902
> search_query_latency_999: 38887902
> search_query_latency_99: 38887902
> search_query_latency_95: 2046215
> search_query_fail_one: 0
> search_query_fail_count: 0
> 
> Riak node 2
> search_query_throughput_one: 22
> search_query_throughput_count: 564
> search_query_latency_min: 4006
> search_query_latency_median: 8800
> search_query_latency_mean: 11834
> search_query_latency_max: 25509
> search_query_latency_999: 25509
> search_query_latency_99: 25509
> search_query_latency_95: 24035
> search_query_fail_one: 0
> search_query_fail_count: 0
> 
> Riak node 3
> search_query_throughput_one: 6
> search_query_throughput_count: 298
> search_query_latency_min: 3200
> search_query_latency_median: 15391
> search_query_latency_mean: 18062
> search_query_latency_max: 31759
> search_query_latency_999: 31759
> search_query_latency_99: 31759
> search_query_latency_95: 31759
> search_query_fail_one: 0
> search_query_fail_count: 0
> 
> Riak node 4
> search_query_throughput_one: 8
> search_query_throughput_count: 334
> search_query_latency_min: 2404
> search_query_latency_median: 7230
> search_query_latency_mean: 10211
> search_query_latency_max: 22502
> search_query_latency_999: 22502
> search_query_latency_99: 22502
> search_query_latency_95: 22502
> search_query_fail_one: 0
> search_query_fail_count: 0
> 
> Riak node 5
> search_query_throughput_one: 0
> search_query_throughput_count: 0
> search_query_latency_min: 0
> search_query_latency_median: 0
> search_query_latency_mean: 0
> search_query_latency_max: 0
> search_query_latency_999: 0
> search_query_latency_99: 0
> search_query_latency_95: 0
> search_query_fail_one: 0
> search_query_fail_count: 0
> 
> ___
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


___
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com