[ https://issues.apache.org/jira/browse/SOLR-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270875#comment-15270875 ]
Scott Blum commented on SOLR-9030:
----------------------------------

ZkStateWriter is basically a write cache. It should be much simpler than it is.

A few things that bug me in no particular order:

1) Tracking lastStateFormat / lastCollectionName and in general having a maybeFlushBefore / maybeFlushAfter makes no real sense to me. If ZkStateWriter were capable of operating as a perfect write cache, the *content* of what's being written should never force a flush. It should be able to just always keep queuing operations until the desired time delay is hit, or it's flushed from the outside.

2) ZkStateWriter's ClusterState liveNodes should probably be a view on ZkStateReader's ClusterState liveNodes.

3) ZkWriteCallback - the one place this is used is the Overseer stateUpdateQueue handling. I think the way that loop works with ZkStateWriter could be done a little better. Ideally, I would want to peek up to N children at a time from that queue, send them all through ZkStateWriter in succession, flush, then remove those N items from the stateUpdateQueue. If the flush failed for some reason, it could return a count of items committed so we could remove that many items from the stateUpdateQueue. It seems a little nuts to have a second workQueue in operation the way it is today. I get that in some situations we'd end up doing more net cluster state writes, but I think we'd still do fewer net writes to ZK since we do so much queue management.

> The 'downnode' command can trip asserts in ZkStateWriter or cause BadVersionException in Overseer
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9030
>                 URL: https://issues.apache.org/jira/browse/SOLR-9030
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 6.1, master
>
>
> While working on SOLR-9014 I came across a strange test failure.
> {code}
>   [junit4] ERROR   16.9s | AsyncCallRequestStatusResponseTest.testAsyncCallStatusResponse <<<
>   [junit4]    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=46, name=OverseerStateUpdate-95769832112259076-127.0.0.1:51135_z_oeg%2Ft-n_0000000000, state=RUNNABLE, group=Overseer state updater.]
>   [junit4]    >        at __randomizedtesting.SeedInfo.seed([91F68DA7E10807C3:CBF7E84BCF328A1A]:0)
>   [junit4]    > Caused by: java.lang.AssertionError
>   [junit4]    >        at __randomizedtesting.SeedInfo.seed([91F68DA7E10807C3]:0)
>   [junit4]    >        at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:231)
>   [junit4]    >        at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:240)
>   [junit4]    >        at java.lang.Thread.run(Thread.java:745)
> {code}
> The underlying problem can manifest by tripping the above assert or a BadVersionException as well. I found that this was introduced in SOLR-7281 where a new 'downnode' command was added.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
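
For illustration, the loop proposed in point 3 of the comment (peek up to N items, push them through the write cache, flush, then remove only as many items as the flush reports committed) might look roughly like the sketch below. Everything here is an assumption made for the example: `WriteCache`, `InMemoryCache`, `failNextFlushAfter` and `drainBatch` are hypothetical names, not Solr APIs, and a plain in-memory `Deque` stands in for the Overseer stateUpdateQueue.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class BatchDrainSketch {

    /** Hypothetical stand-in for a write cache such as ZkStateWriter. */
    interface WriteCache<T> {
        void enqueue(T op);

        /** Flushes queued operations and returns how many were committed. */
        int flush();
    }

    /** Toy cache whose next flush can be told to stop after n operations. */
    static final class InMemoryCache implements WriteCache<String> {
        final List<String> committed = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();
        private int failAfter = Integer.MAX_VALUE;

        void failNextFlushAfter(int n) {
            failAfter = n;
        }

        @Override
        public void enqueue(String op) {
            pending.add(op);
        }

        @Override
        public int flush() {
            int n = Math.min(pending.size(), failAfter);
            committed.addAll(pending.subList(0, n));
            // Uncommitted operations are dropped here; the caller replays
            // them because they are still sitting in the source queue.
            pending.clear();
            failAfter = Integer.MAX_VALUE;
            return n;
        }
    }

    /**
     * Peeks up to maxBatch items without removing them, sends them through
     * the cache, flushes, then removes only the committed prefix.
     */
    static <T> int drainBatch(Deque<T> queue, WriteCache<T> cache, int maxBatch) {
        int peeked = 0;
        Iterator<T> it = queue.iterator(); // iterates head to tail, no removal
        while (peeked < maxBatch && it.hasNext()) {
            cache.enqueue(it.next());
            peeked++;
        }
        int committedCount = cache.flush();
        for (int i = 0; i < committedCount; i++) {
            queue.pollFirst();
        }
        return committedCount;
    }

    public static void main(String[] args) {
        Deque<String> queue = new ArrayDeque<>(Arrays.asList("a", "b", "c", "d", "e"));
        InMemoryCache cache = new InMemoryCache();

        System.out.println(drainBatch(queue, cache, 3)); // full batch commits
        cache.failNextFlushAfter(1);                     // simulate a partial flush
        System.out.println(drainBatch(queue, cache, 3)); // only the committed prefix is removed
        System.out.println(drainBatch(queue, cache, 3)); // the leftover item is replayed
        System.out.println(cache.committed + " remaining=" + queue);
    }
}
```

Because items leave the queue only after the flush reports them committed, a partial failure just leaves the uncommitted tail in place for the next pass, which is what would make a second work queue unnecessary in this sketch.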