[ https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147685#comment-17147685 ]
Andrzej Bialecki commented on SOLR-14462:
-----------------------------------------

This should be back-ported to 8.6; it's an important bugfix.

> Autoscaling placement wrong with concurrent collection creations
> ----------------------------------------------------------------
>
>                 Key: SOLR-14462
>                 URL: https://issues.apache.org/jira/browse/SOLR-14462
>             Project: Solr
>          Issue Type: Bug
>          Components: AutoScaling
>    Affects Versions: master (9.0)
>            Reporter: Ilan Ginzburg
>            Assignee: Ilan Ginzburg
>            Priority: Major
>         Attachments: PolicyHelperNewLogs.txt, policylogs.txt
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> Under concurrent collection creation, wrong Autoscaling placement decisions can lead to severely unbalanced clusters. Sequential creation of the same collections is handled correctly and the cluster ends up balanced.
>
> *TL;DR:* under high load, the way sessions that cache future changes to Zookeeper are managed causes the placement decisions of multiple concurrent Collection API calls to ignore each other and to be based on an identical "initial" cluster state, possibly leading to identical placement decisions and, as a consequence, cluster imbalance.
>
> *Some context first* for those less familiar with how Autoscaling deals with cluster state change: a PolicyHelper.Session is created with a snapshot of the Zookeeper cluster state and is used to track cluster state changes that have been decided but not yet persisted to Zookeeper, so that Collection API commands can make the right placement decisions.
>
> A Collection API command either uses an existing cached Session (which includes the changes computed by previous commands) or creates a new Session initialized from the Zookeeper cluster state (i.e. containing only the state changes already persisted).
>
> When a Collection API command requires a Session - and one is needed for any cluster state update computation - and one exists but is currently in use, the command can wait up to 10 seconds.
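The acquire-or-wait behaviour just described (reuse the cached Session if it frees up within the timeout, otherwise fall back to a fresh one built from the Zookeeper snapshot) can be sketched roughly as below. This is a hypothetical simplification for illustration only; the class and method names (SessionCache, acquire, release) are invented and do not match the actual PolicyHelper/SessionWrapper code:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the "reuse or wait up to 10 seconds" acquisition
// described above; names are invented for illustration.
class SessionCache {
    static final long WAIT_MS = 10_000;     // the 10-second wait from the description

    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private Session cached;                 // null until first use
    private boolean inUse;                  // true while a command is COMPUTING on it

    Session acquire() throws InterruptedException {
        lock.lock();
        try {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(WAIT_MS);
            while (cached != null && inUse) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    // Timed out: fall back to a brand-new Session built from the
                    // Zookeeper snapshot. When many waiters time out together,
                    // this is the stampede the issue describes.
                    return new Session();
                }
                released.awaitNanos(remaining);
            }
            if (cached == null) {
                cached = new Session();     // first caller: snapshot ZK state
            }
            inUse = true;
            return cached;                  // reuse: sees previous commands' changes
        } finally {
            lock.unlock();
        }
    }

    void release(Session s) {
        lock.lock();
        try {
            inUse = false;
            released.signalAll();
        } finally {
            lock.unlock();
        }
    }

    static class Session { }
}
```

Note that in this sketch a timed-out waiter never publishes its new Session back into the cache, which mirrors why concurrent timed-out commands each end up computing against the same stale snapshot.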
> If the Session becomes available within that time, it is reused; otherwise, a new one is created.
>
> The Session lifecycle is as follows: it is created in the COMPUTING state by a Collection API command and is initialized with a snapshot of cluster state from Zookeeper (this does not require a Zookeeper read; the code runs on the Overseer, which maintains a cache of cluster state). The command has exclusive access to the Session and can change its state. When the command is done changing the Session, the Session is "returned" and its state changes to EXECUTING; the command continues to run, persisting the state to Zookeeper and interacting with the nodes, but no longer interacts with the Session. Another command can then grab a Session in the EXECUTING state and change its state back to COMPUTING to compute new changes that take the previous changes into account. When all commands that have used the Session have completed their work, the Session is "released" and destroyed (at this stage, Zookeeper contains all the state changes that were computed using that Session).
>
> The issue arises when multiple Collection API commands are executed at once. A first Session is created and commands start using it one by one. In a simple 1-shard, 1-replica collection creation test run with 100 parallel Collection API requests (see the PolicyHelper debug logs in the attached policylogs.txt), this Session update phase (Session in COMPUTING status in SessionWrapper) takes about 250-300 ms (MacBook Pro).
>
> This means that about 40 commands can run by using the same Session in turn (45 in the sample run). The commands that have been waiting too long time out after 10 seconds, more or less all at the same time (at the rate at which they were received by the OverseerCollectionMessageHandler, approximately one per 100 ms in the sample run), and most or all of them independently decide to create a new Session.
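The Session lifecycle quoted above (created COMPUTING, "returned" to EXECUTING, grabbed back to COMPUTING by the next command, finally released) can be restated as a minimal state machine. Again a hypothetical sketch, not the real SessionWrapper code; the class and method names are invented:

```java
// Hypothetical restatement of the Session lifecycle described above as a
// tiny state machine; the real PolicyHelper/SessionWrapper code differs.
class SessionLifecycle {
    enum State { COMPUTING, EXECUTING, RELEASED }

    private State state = State.COMPUTING;  // created COMPUTING by the first command
    private int users = 1;                  // commands that have used this session

    // Command finished computing its changes: "return" the session.
    void returned() {
        if (state != State.COMPUTING) throw new IllegalStateException(state.toString());
        state = State.EXECUTING;            // the command now persists to ZK on its own
    }

    // Another command grabs the EXECUTING session to compute on top of
    // the previously computed (but not yet persisted) changes.
    void grab() {
        if (state != State.EXECUTING) throw new IllegalStateException(state.toString());
        state = State.COMPUTING;
        users++;
    }

    // All commands done: session destroyed; ZK now holds every computed change.
    void release() {
        state = State.RELEASED;
    }

    State state() { return state; }
    int users() { return users; }
}
```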
> These new Sessions are based on Zookeeper state; they might or might not include some of the changes from the first 40 commands (depending on whether those commands got their changes written to Zookeeper by the time of the 10-second timeout; a few might have made it, see below).
>
> These new Sessions (54 sessions in addition to the initial one) are based on more or less the same state, so all remaining commands make placement decisions that do not take each other into account (and likely not much of the first 44 placement decisions either).
>
> In the sample run whose relevant logs are attached, the 100 single-shard, single-replica collection creations resulted in 82 collections on the Overseer node, and 5 and 13 collections on the two other nodes of a 3-node cluster. Given that the initial Session was used 45 times (once initially, then reused 44 times), one would have expected at least the first 45 collections to be evenly distributed, i.e. 15 replicas on each node. This was not the case, which is possibly a sign of other issues (other runs even ended up placing 0 of the 100 replicas on one of the nodes).
>
> From the client perspective, HTTP admin collection CREATE requests averaged 19.5 seconds each and lasted between 7 and 28 seconds (100 parallel threads). This is likely an indication that the last 55 collection creations didn't see much of the state updates done by the first 45 creations (the client-observed delay is longer than the actual Overseer command execution time by the HTTP time plus the Collections API Zookeeper queue time).
>
> *A possible fix* is to not observe any delay before creating a new Session when the currently cached Session is busy (i.e. COMPUTING).
> This will be somewhat less optimal in low-load cases (likely not an issue: future creations will compensate for a slight imbalance under otherwise optimal placement), but it will speed up Collection API calls (no waiting) and will prevent multiple waiting commands from all creating new Sessions based on an identical Zookeeper state in cases such as the one described here. For long (minutes or more) autoscaling computations, it will likely not make a big difference.
>
> If we cached (and reused) more than a single Session, fewer ongoing updates would be lost.
>
> Maybe, rather than caching the new updated cluster state after each change, the changes themselves (the deltas) should be tracked. This might allow propagating changes between Sessions, or reconciling the cluster state read from Zookeeper with the stream of changes stored in a Session, by identifying which deltas made it to Zookeeper, which ones are new from Zookeeper (originating from an update in another Session), and which are still pending.
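The delta-tracking idea in the last paragraph could look roughly like the minimal sketch below, assuming a Session keeps a list of placement deltas and drops the ones already visible in a fresh Zookeeper read. DeltaSession, Delta, and reconcile are invented names for illustration, not Solr APIs (the sketch uses Java records, so Java 16+):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the delta-tracking idea: a session records the
// placement changes themselves rather than only the resulting cluster state,
// so a fresh Zookeeper read can be reconciled by discarding the deltas that
// already made it to ZK and keeping the still-pending ones.
class DeltaSession {
    // One replica placement decision (value equality via the record).
    record Delta(String collection, String node) { }

    private final List<Delta> pending = new ArrayList<>();

    // Track a placement decision made in this session.
    void record(Delta d) { pending.add(d); }

    // Given the deltas observed as persisted in a fresh ZK read, return
    // only the deltas this session still needs to see applied.
    List<Delta> reconcile(List<Delta> persistedInZk) {
        List<Delta> stillPending = new ArrayList<>(pending);
        stillPending.removeAll(persistedInZk);
        return stillPending;
    }
}
```

Distinguishing "persisted", "new from another session", and "still pending" in this way is what would let two concurrent sessions see each other's decisions instead of both starting from the same snapshot.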