[ https://issues.apache.org/jira/browse/SOLR-7571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erick Erickson resolved SOLR-7571.
----------------------------------
    Resolution: Duplicate

SOLR-7344 is a much better approach; Solr should survive ill-mannered clients.

> Return metrics with update requests to allow clients to self-throttle
> ---------------------------------------------------------------------
>
>                 Key: SOLR-7571
>                 URL: https://issues.apache.org/jira/browse/SOLR-7571
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 4.10.3
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> I've assigned this to myself to keep track of it; anyone who wants to, please
> feel free to take this.
> I've recently seen a setup with 10 shards and 4 replicas. The SolrJ client
> (and post.jar for json files for that matter) firehose updates (150 separate
> threads in total) at Solr. Eventually, replicas (not leaders) go into
> recovery and the state cascades and eventually the entire cluster becomes
> unusable. SOLR-5850 delays the behavior, but it still occurs. There are no
> errors in the follower's logs; this is leader-initiated recovery because of a
> timeout.
> I think the root problem is that the client is just sending too many requests
> to the cluster, and ConcurrentUpdateSolrClient/Server (used by the leader to
> distribute update requests to all the followers) (this was observed in Solr
> 4.10.3+). I see thread counts of 500+ when this happens.
> So assuming that this is the root cause, the obvious "cure" is "don't index
> that fast". This is unsatisfactory since "that fast" is variable; the only
> recourse is to set that threshold low enough that the Solr cluster isn't
> being driven as fast as it can be.
> We should provide some mechanism for having the client throttle itself. The
> number of outstanding update threads is one possibility. The client could
> then slow down sending updates to Solr.
> I'm not sure there's a good way to deal with this on the server.
> Once the timeout is encountered, you don't know whether the doc has actually
> been indexed on the follower (actually, in this case it _is_ indexed, it just
> takes a while). Ideally we'd just manage it all magically, but an alternative
> to let clients dynamically throttle themselves seems do-able.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
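
The self-throttling idea in the description can be sketched roughly as follows. This is only an illustration under stated assumptions: Solr does not return an outstanding-update-thread count with update responses (the issue was resolved as a duplicate of SOLR-7344), so the metric, the `SelfThrottle` class, the `index_batches` helper, and the watermark values (200 and 50 threads) are all hypothetical. The only figure taken from the report is that trouble was seen at 500+ threads.

```python
import time
from typing import Callable, Iterable, List


class SelfThrottle:
    """Adjusts the pause between update batches from a server-reported load figure.

    ASSUMPTION: the server reports its number of outstanding update threads
    with each update response. Solr does not do this; the metric and the
    watermark defaults below are hypothetical.
    """

    def __init__(self, high_water: int = 200, low_water: int = 50,
                 min_delay: float = 0.0, max_delay: float = 5.0,
                 step: float = 0.05) -> None:
        self.high_water = high_water
        self.low_water = low_water
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.step = step
        self.delay = min_delay

    def observe(self, outstanding_threads: int) -> float:
        """Update and return the delay (seconds) to wait before the next batch."""
        if outstanding_threads > self.high_water:
            # Server looks overloaded: back off multiplicatively, capped at max_delay.
            self.delay = min(self.max_delay, max(self.step, self.delay * 2))
        elif outstanding_threads < self.low_water:
            # Server has headroom: shorten the pause one step at a time.
            self.delay = max(self.min_delay, self.delay - self.step)
        # Between the two watermarks the delay is left unchanged.
        return self.delay


def index_batches(batches: Iterable[List[dict]],
                  send: Callable[[List[dict]], int],
                  throttle: SelfThrottle,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Send each batch, then pause according to the load the server reported.

    `send` is a caller-supplied function (hypothetical) that posts one batch
    and returns the outstanding-thread count from the response.
    """
    for batch in batches:
        outstanding = send(batch)
        pause = throttle.observe(outstanding)
        if pause > 0:
            sleep(pause)
```

Backing off multiplicatively and recovering in small steps is one possible policy, chosen here because it reacts quickly to overload; the description itself does not prescribe any particular policy.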