[ 
https://issues.apache.org/jira/browse/SOLR-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988216#comment-16988216
 ] 

Andrzej Bialecki commented on SOLR-13975:
-----------------------------------------

Updated patch:
 * includes a tweaked version of the patch from SOLR-13896.
 * adds a new system property {{solr.cloud.client.stallTime}} that controls the 
maximum stall time. The default is 10,000 ms.
 * this may be controversial: when a stall is detected, an IOException is thrown 
instead of just logging a warning. This doesn't break any tests, but perhaps it 
may cause issues in external applications. Still, I think it's better to report 
this error up front than to hide it in the logs and pretend that 
nothing happened.
 * adds a unit test.
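
To illustrate the behaviour described in the list above, here is a minimal sketch of how a stall check could work. It is not the code from the attached patch: the class {{StallDetector}} and its method names are hypothetical, and only the property name {{solr.cloud.client.stallTime}} and the 10,000 ms default are taken from this comment.

```java
import java.io.IOException;

/** Hypothetical helper for illustration only; not the class from the patch. */
class StallDetector {
    // Property name and default value as stated in the comment above.
    static final String STALL_TIME_PROP = "solr.cloud.client.stallTime";

    private final long stallTimeMs;
    private long lastProcessed = -1;   // last observed progress counter
    private boolean stalled = false;   // true while no progress is being observed
    private long stallStartNanos;      // valid only while 'stalled' is true

    StallDetector() {
        // Long.getLong reads a system property and falls back to the default.
        this(Long.getLong(STALL_TIME_PROP, 10_000L));
    }

    StallDetector(long stallTimeMs) {
        this.stallTimeMs = stallTimeMs;
    }

    long getStallTimeMs() {
        return stallTimeMs;
    }

    /**
     * Intended to be called on every pass of a wait loop.
     *
     * @param queueEmpty true if nothing is waiting to be sent
     * @param processed  running count of items handed to the remote end
     * @param nowNanos   current time, e.g. System.nanoTime()
     * @throws IOException if the queue is non-empty and the counter has not
     *                     moved for longer than the configured stall time
     */
    void check(boolean queueEmpty, long processed, long nowNanos) throws IOException {
        if (queueEmpty || processed != lastProcessed) {
            // Either there is nothing to do or progress was made: reset.
            lastProcessed = processed;
            stalled = false;
            return;
        }
        if (!stalled) {
            // First pass without progress: start the stall timer.
            stalled = true;
            stallStartNanos = nowNanos;
            return;
        }
        long stalledMs = (nowNanos - stallStartNanos) / 1_000_000L;
        if (stalledMs > stallTimeMs) {
            throw new IOException("Request processing has stalled for "
                + stalledMs + " ms with a non-empty queue");
        }
    }
}
```

Under these assumptions, a wait loop would call {{check(queue.isEmpty(), processedCount, System.nanoTime())}} on each pass and let the IOException propagate to the caller, which is the "report up front" behaviour argued for above. The limit can be changed at JVM start-up with {{-Dsolr.cloud.client.stallTime=<ms>}}.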

[~shalin] [~caomanhdat] I would appreciate a review.

> ConcurrentUpdateSolrClient connection stall prevention
> ------------------------------------------------------
>
>                 Key: SOLR-13975
>                 URL: https://issues.apache.org/jira/browse/SOLR-13975
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 8.3, 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.4
>
>         Attachments: SOLR-13975.patch, SOLR-13975.patch
>
>
> When a Solr process that hosts replicas of a collection is suspended - 
> that is, the OS process is suspended using e.g. {{kill -STOP <pid>}} - a long 
> stall may occur in ConcurrentUpdateSolrClient (CUSC) until a socket timeout is 
> reached.
> During this stall, updates from the leader are not forwarded to any replica, 
> even though other replicas are still active and can receive updates.  If the 
> sender uses CUSC (e.g. via {{CloudSolrClient}}) then it becomes stalled too, 
> because the leader stops processing updates.
> This situation is caused by several issues:
> * when a process is suspended its sockets remain open - so there is no 
> immediate disconnect, as there would be if the process died, but the process 
> becomes unresponsive. Eventually a socket timeout 
> ({{distribUpdateSoTimeout}}) is reached, but in the default version of 
> {{solr.xml}} this is set to 10 min. During this time all indexing to that 
> shard is stuck.
> * there are several infinite {{for}} loops in CUSC (e.g. in 
> {{blockUntilFinished}}, {{waitForEmptyQueue}} and even in {{request}}), which 
> rely on either the relatively quick success of the call or an exception being 
> thrown. However, in this situation neither happens quickly - the call is 
> stuck waiting for the remote end until soTimeout expires.
> This issue proposes to add stall-prevention logic that breaks these 
> infinite loops long before the socket timeout occurs, based on the progress of 
> the queue processing.
> This is a follow-up to SOLR-13896.
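
The first cause in the quoted description (a peer whose sockets stay open but which never answers) can be reproduced without Solr. The sketch below is only an illustration and assumes nothing about Solr internals: the class {{SilentPeerDemo}} is hypothetical, and a listening socket that never calls {{accept()}} stands in for the suspended process. The read ends only when the client-side socket timeout expires.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical demo class; a stand-in for a suspended remote process. */
class SilentPeerDemo {

    /** Returns true if the read ended with a socket timeout rather than data or EOF. */
    static boolean readTimesOut(int soTimeoutMs) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The listener never accepts or writes. The OS still completes the TCP
        // handshake from its backlog, so the client sees an open, silent connection.
        try (ServerSocket silent = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silent.getLocalPort())) {
            client.setSoTimeout(soTimeoutMs);
            try {
                client.getInputStream().read(); // blocks: no data and no disconnect
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        boolean timedOut = readTimesOut(200);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("timed out: " + timedOut + " after ~" + elapsedMs + " ms");
    }
}
```

With a 10 minute socket timeout in place of the 200 ms used here, the same blocking read would hold up the caller for the full 10 minutes, which is why a separate progress-based check is proposed.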



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
