[ https://issues.apache.org/jira/browse/SOLR-13975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992932#comment-16992932 ]
ASF subversion and git services commented on SOLR-13975:
--------------------------------------------------------

Commit c4f0c3363828c088eefa2b99783178848c2f1f7a in lucene-solr's branch refs/heads/master from Andrzej Bialecki
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=c4f0c33 ]

SOLR-13975, SOLR-13896: ConcurrentUpdateSolrClient connection stall prevention.

> ConcurrentUpdateSolrClient connection stall prevention
> ------------------------------------------------------
>
>                 Key: SOLR-13975
>                 URL: https://issues.apache.org/jira/browse/SOLR-13975
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public (Default Security Level. Issues are Public)
>    Affects Versions: 8.3, 8.4
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Major
>             Fix For: 8.4
>
>         Attachments: SOLR-13975.patch, SOLR-13975.patch
>
>
> When a Solr process that hosts replicas of a collection is suspended - that
> is, the OS process is suspended using e.g. {{kill -STOP <pid>}} - a long
> stall may occur in ConcurrentUpdateSolrClient (CUSC) until a socket timeout
> is reached.
> During this stall, updates from the leader are not forwarded to any replica,
> even though other replicas are still active and can receive updates. If the
> sender uses CUSC (e.g. via {{CloudSolrClient}}) then it stalls as well,
> because the leader stops processing updates.
> This situation is caused by several issues:
> * when a process is suspended its sockets remain open, so there is no
> immediate disconnect as there would be if the process had died; the process
> simply becomes unresponsive. Eventually a socket timeout is reached
> ({{distribUpdateSoTimeout}}), but in the default version of {{solr.xml}} this
> is set to 10 minutes. During this time all indexing to that shard is stuck.
> * there are several infinite {{for}} loops in CUSC (e.g. in
> {{blockUntilFinished}}, {{waitForEmptyQueue}} and even in {{request}}), which
> rely either on the call succeeding relatively quickly or on an exception
> being thrown. However, in this situation neither happens quickly - the call
> is stuck waiting for the remote end until soTimeout expires.
> This issue proposes to add stall prevention logic that breaks these infinite
> loops long before the socket timeout occurs, based on the progress of the
> queue processing.
> This is a follow-up to SOLR-13896.
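
The stall prevention idea described above (give up waiting once a non-empty queue has made no progress for some time, instead of waiting for the socket timeout) might be sketched roughly as follows. This is an illustration only, not the code from the attached patch: the class name StallGuard, the method isStalled, the maxStallNanos parameter and the injectable clock are all hypothetical names chosen for this example.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (NOT the class from the SOLR-13975 patch): reports a stall
 * when a non-empty queue has shown no progress for longer than maxStallNanos.
 */
final class StallGuard {
    private final long maxStallNanos;      // assumed tuning knob, far below soTimeout
    private final LongSupplier nanoClock;  // injectable clock so tests need not sleep
    private long lastProcessed = -1L;      // processed count seen on the previous check
    private boolean timing = false;        // true while a no-progress period is being timed
    private long stallStart = 0L;          // clock value when that period began

    StallGuard(long maxStallNanos, LongSupplier nanoClock) {
        this.maxStallNanos = maxStallNanos;
        this.nanoClock = nanoClock;
    }

    /** Returns true when the caller should stop waiting and break out of its loop. */
    boolean isStalled(int queueSize, long processedCount) {
        // An empty queue or a changed processed count means there is no stall:
        // remember the new count and stop timing.
        if (queueSize == 0 || processedCount != lastProcessed) {
            lastProcessed = processedCount;
            timing = false;
            return false;
        }
        long now = nanoClock.getAsLong();
        if (!timing) {
            // First check without progress: start the timer, do not fail yet.
            timing = true;
            stallStart = now;
            return false;
        }
        // Still no progress: stalled once the quiet period exceeds the limit.
        return now - stallStart > maxStallNanos;
    }
}
```

Under these assumptions, a waiting loop such as the ones named in the description would call isStalled(queue.size(), processedCounter) on every iteration, with System::nanoTime as the clock, and throw or break as soon as it returns true rather than spinning until soTimeout expires.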