[ 
https://issues.apache.org/jira/browse/SOLR-9030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15270875#comment-15270875
 ] 

Scott Blum commented on SOLR-9030:
----------------------------------

ZkStateWriter is basically a write cache.  It should be much simpler than it 
is.  A few things that bug me in no particular order:

1) Tracking lastStateFormat / lastCollectionName and in general having a 
maybeFlushBefore / maybeFlushAfter makes no real sense to me.  If ZkStateWriter 
were capable of operating as a perfect write cache, the *content* of what's 
being written should never force a flush.  It should be able to just always 
keep queuing operations until the desired time delay is hit, or it's flushed 
from the outside.

2) ZkStateWriter's ClusterState liveNodes should probably be a view on 
ZkStateReader's ClusterState liveNodes.

3) ZkWriteCallback - the one place this is used is the Overseer 
stateUpdateQueue handling.  I think the way that loop works with ZkStateWriter 
could be done a little better.  Ideally, I would want to peek up to N children 
at a time from that queue, send them all through ZkStateWriter in succession, 
flush, then remove those N items from the stateUpdateQueue.   If the flush 
failed for some reason, it could return a count of items committed so we could 
remove that many items from the stateUpdateQueue.  It seems a little nuts to 
have a second workQueue in operation the way it is today.  I get that in some 
situations we'd end up doing more net cluster state writes, but I think we'd 
still do fewer net writes to ZK since we do so much queue management.
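
To make point 1 concrete, here is a minimal sketch of a write cache whose 
flushes depend only on elapsed time or an explicit outside call, never on the 
content being written.  The class name DelayedWriteCache, the injected clock, 
and the sink callback are all hypothetical and are not taken from the actual 
ZkStateWriter code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a content-agnostic write cache; not the real ZkStateWriter. */
class DelayedWriteCache {
    // Pending state keyed by collection name; a later write replaces an earlier one.
    private final Map<String, String> pending = new LinkedHashMap<>();
    private final long maxDelay;                      // same unit as the clock
    private final LongSupplier clock;                 // injected so the sketch is testable
    private final BiConsumer<String, String> sink;    // stands in for the write to ZK
    private long firstEnqueueTime = -1;

    DelayedWriteCache(long maxDelay, LongSupplier clock, BiConsumer<String, String> sink) {
        this.maxDelay = maxDelay;
        this.clock = clock;
        this.sink = sink;
    }

    /** Queues a write. Only the elapsed time since the oldest pending write can trigger a flush. */
    void enqueue(String collection, String state) {
        if (pending.isEmpty()) {
            firstEnqueueTime = clock.getAsLong();
        }
        pending.put(collection, state);
        if (clock.getAsLong() - firstEnqueueTime >= maxDelay) {
            flush();
        }
    }

    /** Flush requested from the outside, or by the delay check in enqueue(). */
    void flush() {
        pending.forEach(sink);
        pending.clear();
        firstEnqueueTime = -1;
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Note that nothing in enqueue() inspects the collection name or state format of 
the previous write, which is the property point 1 asks for.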
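
The loop described in point 3 could look roughly like the sketch below: peek 
up to N items without removing them, push them through the writer, flush, and 
then remove only as many items as the flush reports committed.  The Writer 
interface, drainOnce, and the use of an in-memory Deque in place of the 
ZK-backed stateUpdateQueue are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch of the peek / write / flush / remove loop from point 3. */
class BatchDrain {
    /** Stand-in for the write cache; flush() returns how many queued items were committed. */
    interface Writer<T> {
        void enqueue(T item);
        int flush();
    }

    /** Processes one batch and returns the number of items removed from the queue. */
    static <T> int drainOnce(Deque<T> queue, Writer<T> writer, int maxBatch) {
        // Peek up to maxBatch items; nothing is removed from the queue yet.
        List<T> batch = new ArrayList<>();
        Iterator<T> it = queue.iterator();
        while (it.hasNext() && batch.size() < maxBatch) {
            batch.add(it.next());
        }
        // Send them all through the writer in succession, then flush once.
        for (T item : batch) {
            writer.enqueue(item);
        }
        // Never remove more than was peeked, even if the writer over-reports.
        int committed = Math.min(writer.flush(), batch.size());
        // Remove only the committed prefix; the rest stays queued for the next pass.
        for (int i = 0; i < committed; i++) {
            queue.pollFirst();
        }
        return committed;
    }
}
```

Because uncommitted items are never removed from the queue, a partial flush 
failure leaves them in place for retry, and no second work queue is needed.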

> The 'downnode' command can trip asserts in ZkStateWriter or cause 
> BadVersionException in Overseer
> -------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-9030
>                 URL: https://issues.apache.org/jira/browse/SOLR-9030
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>            Reporter: Shalin Shekhar Mangar
>             Fix For: 6.1, master
>
>
> While working on SOLR-9014 I came across a strange test failure.
> {code}
>    [junit4] ERROR   16.9s | AsyncCallRequestStatusResponseTest.testAsyncCallStatusResponse <<<
>    [junit4]    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=46, name=OverseerStateUpdate-95769832112259076-127.0.0.1:51135_z_oeg%2Ft-n_0000000000, state=RUNNABLE, group=Overseer state updater.]
>    [junit4]    >      at __randomizedtesting.SeedInfo.seed([91F68DA7E10807C3:CBF7E84BCF328A1A]:0)
>    [junit4]    > Caused by: java.lang.AssertionError
>    [junit4]    >      at __randomizedtesting.SeedInfo.seed([91F68DA7E10807C3]:0)
>    [junit4]    >      at org.apache.solr.cloud.overseer.ZkStateWriter.writePendingUpdates(ZkStateWriter.java:231)
>    [junit4]    >      at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:240)
>    [junit4]    >      at java.lang.Thread.run(Thread.java:745)
> {code}
> The underlying problem can manifest by tripping the above assert or a 
> BadVersionException as well. I found that this was introduced in SOLR-7281 
> where a new 'downnode' command was added.


