[ 
https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100991#comment-14100991
 ] 

Steve Davids commented on SOLR-5986:
------------------------------------

We came across the issue again and added many more probes to get a grasp on 
what exactly is happening. I believe further tickets may be necessary to 
address the various pieces.

#1) We are setting the "timeout" request parameter, which tells the 
TimeLimitingCollector to throw a TimeExceededException. Yet in our logs we see 
the error messages thrown after about an hour for one of the queries we tried, 
even though the timeout is set to a couple of minutes. This is presumably 
because the query parsing takes about an hour; once the query is finally 
parsed and handed to the collector, the TimeLimitingCollector immediately 
throws an exception. We should have something similar throw the same exception 
during the query-building phase (that way the partial-results warnings will 
continue to just work). The current work looks to be more in the realm of 
solving this issue, which may fix the problems we saw described in #2.
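The idea in #1 can be sketched as a deadline check that query building would 
consult between expansion steps, analogous to the counter check 
TimeLimitingCollector performs during collection. This is a minimal, 
hypothetical sketch: the class and method names below are not Solr APIs, just 
an illustration of surfacing the same kind of exception before collection 
starts.

```java
// Hypothetical sketch: a time budget consulted during query building (e.g.
// between wildcard/term-enumeration steps), so a runaway parse fails the same
// way TimeLimitingCollector fails a runaway collection. Names are illustrative.
public class QueryBuildTimeout {
    private final long deadlineNanos;

    public QueryBuildTimeout(long budgetMillis) {
        this.deadlineNanos = System.nanoTime() + budgetMillis * 1_000_000L;
    }

    /** Call between expansion steps; throws once the budget is exhausted. */
    public void checkBudget() {
        if (System.nanoTime() > deadlineNanos) {
            // In Solr this would ideally surface like TimeExceededException so
            // the existing partial-results handling keeps working.
            throw new RuntimeException("query building exceeded time budget");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        QueryBuildTimeout t = new QueryBuildTimeout(50);
        t.checkBudget();            // within budget: no exception
        Thread.sleep(100);          // simulate a slow expansion step
        boolean threw = false;
        try {
            t.checkBudget();
        } catch (RuntimeException e) {
            threw = true;
        }
        System.out.println(threw);  // prints true
    }
}
```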

#2) We set socket read timeouts on HTTPClient, which causes the same query to 
be sent into the cluster multiple times, giving it a slow, painful death. This 
is even more problematic when using the SolrJ API: SolrJ's LBHttpSolrServer 
will loop through *every* host in the cluster, and if a socket read timeout 
happens it tries the next item in the list. Internally, every single request 
made to the cluster from an outside SolrJ client will try to gather the 
results for all shards in the cluster, and once a socket read timeout happens 
internal to the cluster, the same retry logic will attempt to gather results 
from the next replica in the list. So, if we hypothetically had 10 shards with 
3 replicas and made a request from an outside client, it would make 30 
(external SolrJ calls, one to each host, to request a distributed search) * 30 
(each host will be called at least once for the internal distributed request) 
= 900 overall requests (each individual search host will handle 30 requests). 
This should probably become its own ticket to track, to either a) not retry on 
a socket read timeout, or b) specify a retry timeout of some sort in the 
LBHttpSolrServer (this is something we did internally for simplicity's sake).
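The amplification arithmetic above can be made explicit. This is an 
illustrative worst-case model only (every request times out and is retried on 
every host), not a measured result:

```java
// Worst-case request amplification from retry-on-every-host, assuming the
// scenario described above: an external LBHttpSolrServer call is retried on
// every host, and each host fans the distributed request back out cluster-wide.
public class RetryAmplification {
    static int totalRequests(int shards, int replicas) {
        int hosts = shards * replicas;   // 10 shards * 3 replicas = 30 hosts
        // 30 external attempts, each triggering a cluster-wide distributed
        // request that itself touches every host at least once.
        return hosts * hosts;            // 30 * 30 = 900
    }

    public static void main(String[] args) {
        System.out.println(totalRequests(10, 3)); // prints 900
    }
}
```

Each individual host thus handles 900 / 30 = 30 requests for what was a single 
user query, matching the figure in the comment.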

> Don't allow runaway queries from harming Solr cluster health or search 
> performance
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5986
>                 URL: https://issues.apache.org/jira/browse/SOLR-5986
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Steve Davids
>            Assignee: Anshum Gupta
>            Priority: Critical
>             Fix For: 4.10
>
>         Attachments: SOLR-5986.patch
>
>
> The intent of this ticket is to have all distributed search requests stop 
> wasting CPU cycles on requests that have already timed out or are so 
> complicated that they won't be able to execute. We have come across a case 
> where a nasty wildcard query within a proximity clause was causing the 
> cluster to enumerate terms for hours even though the query timeout was set to 
> minutes. This caused a noticeable slowdown within the system and forced us 
> to restart the replicas that happened to service that one request; in the 
> worst-case scenario, users with a relatively low zk timeout value will have 
> nodes start dropping from the cluster due to long GC pauses.
> [~amccurry] Built a mechanism into Apache Blur to help with the issue in 
> BLUR-142 (see commit comment for code, though look at the latest code on the 
> trunk for newer bug fixes).
> Solr should be able to either prevent these problematic queries from running 
> by some heuristic (possibly estimated size of heap usage) or be able to 
> execute a thread interrupt on all query threads once the time threshold is 
> met. This issue mirrors what others have discussed on the mailing list: 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%3c856ac15f0903272054q2dbdbd19kea3c5ba9e105b...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.2#6252)
