[ 
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217033#comment-15217033
 ] 

Hoss Man commented on SOLR-8914:
--------------------------------

{quote}
The part I don't understand is why this watcher is getting fired multiple times 
on different threads. I (re)wrote some of this code, and one of my implicit 
assumptions was that any given watcher would not get re-fired until the 
previous watcher invocation had returned. But maybe that was a really bad 
assumption that I carried over from Curator, or perhaps the thread model in ZK 
has changed?
{quote}

I have no idea what the thread model for ZK is, but let's assume your implicit 
assumption was correct...

That would still match the observed behavior, and a hypothetical sequence of 
events very similar to the one I outlined, since the 
{{zkClient.getChildren(...)}} calls are passing the LiveNodeWatcher back to ZK. 
So T1's {{zkClient.getChildren(...)}} call adds the LiveNodeWatcher 
back, but before T1 has a chance to write its local state that watcher fires 
and triggers T2. T2's {{zkClient.getChildren(...)}} call adds the 
LiveNodeWatcher back, but before either T1 or T2 can write to the local state 
that watcher fires and triggers T3. After T3 writes the local state, T1 and 
T2 overwrite it with their stale data.
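
To make that interleaving concrete, here is a minimal single-threaded replay 
in Java. It is only a sketch under stated assumptions: {{FakeZk}}, 
{{Snapshot}}, {{LocalState}} and the version-stamp guard are hypothetical names 
invented for illustration, not the ZkStateReader code and not the attached 
patch. Each "thread" is modeled as a read step (standing in for 
{{zkClient.getChildren(...)}}, which in real ZK would also re-register the 
watcher) followed later by a write step to the local state.

```java
import java.util.Set;
import java.util.TreeSet;

public class StaleLiveNodesSketch {

    /** Hypothetical stand-in for the live_nodes children held by ZK. */
    static final class FakeZk {
        private final Set<String> children = new TreeSet<>();
        private int version = 0;

        /** A node coming online; in real ZK this would fire the child watcher. */
        void addNode(String name) {
            children.add(name);
            version++;
        }

        /** Models getChildren(...): returns a copy of the current children plus a version stamp. */
        Snapshot getChildren() {
            return new Snapshot(new TreeSet<>(children), version);
        }
    }

    /** What one watcher invocation read from ZK. */
    static final class Snapshot {
        final Set<String> nodes;
        final int version;

        Snapshot(Set<String> nodes, int version) {
            this.nodes = nodes;
            this.version = version;
        }
    }

    /** Hypothetical stand-in for the reader's local live-nodes state. */
    static final class LocalState {
        Set<String> liveNodes = new TreeSet<>();
        int lastVersion = -1;

        /** Last writer wins, even if its data is older. */
        void writeUnguarded(Snapshot s) {
            liveNodes = s.nodes;
            lastVersion = s.version;
        }

        /** One possible guard (an assumption): ignore snapshots older than what was already written. */
        void writeGuarded(Snapshot s) {
            if (s.version > lastVersion) {
                liveNodes = s.nodes;
                lastVersion = s.version;
            }
        }
    }

    /** Replays the T1/T2/T3 sequence and returns the final local state. */
    public static Set<String> replay(boolean guarded) {
        FakeZk zk = new FakeZk();
        LocalState local = new LocalState();

        zk.addNode("node1");
        Snapshot t1 = zk.getChildren(); // T1 reads; watcher is re-armed
        zk.addNode("node2");            // watcher fires, triggering T2
        Snapshot t2 = zk.getChildren(); // T2 reads; watcher is re-armed
        zk.addNode("node3");            // watcher fires, triggering T3
        Snapshot t3 = zk.getChildren(); // T3 reads

        // T3 writes first, then T2 and T1 finish late with their older reads.
        Snapshot[] writeOrder = {t3, t2, t1};
        for (Snapshot s : writeOrder) {
            if (guarded) {
                local.writeGuarded(s);
            } else {
                local.writeUnguarded(s);
            }
        }
        return local.liveNodes;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + replay(false));
        System.out.println("guarded: " + replay(true));
    }
}
```

In this replay the unguarded run ends with only {{node1}} in the local state, 
even though ZK holds three nodes, while the guarded run keeps all three. The 
version stamp is just one way to reject stale writes; serializing the read and 
write together would be another.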

Your implicit assumption would also explain the part I was confused about: why 
there are only 4 {{live_nodes}} child events triggered instead of 5.  Two nodes 
must have come online and added themselves to {{live_nodes/}} between the time 
T2's watch event was triggered and when T2 added a new watcher via the 
{{zkClient.getChildren(...)}} call (explaining the jump from "1" to "3").

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, 
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt, 
> live_node_mentions_port56361_with_threadIds.log.txt, 
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the 
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a 
> MiniSolrCloudCluster & several clients are initialized -- before any code 
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in 
> the logs, so I'm filing this jira with my analysis in the hopes that someone 
> else can help make sense of it.



