[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218306#comment-15218306 ]
Hoss Man commented on SOLR-8914:
--------------------------------

Scott: i'm hammering on your updated patch now, so far 60 runs w/o failure or deadlock (3 times as long as i've ever seen it go w/o failure, 10 times longer than your previous patch went w/o deadlocks)

As someone unfamiliar with most of the ZkStateReader code, some of the changes are clearly relevant as far as the current bug goes (the lastFetchedLiveNodes AtomicRef and associated refreshLiveNodesLock) but it's not immediately obvious to me what the other changes (refreshCollectionListLock etc..) in your patch are for .. are these unrelated to the live nodes bug? Is this a similar bug pattern in watching the list of collections? (ie: two COLLECTIONS_ZKNODE watchers may fire in rapid succession and updating the local collection states might happen out of order)

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what i'm (not) seeing here in
> the logs, so i'm filing this jira with my analysis in the hopes that someone
> else can help make sense of it.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
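
For readers following the thread, here is a minimal Java sketch of the general pattern the comment above asks about: two watchers firing in rapid succession, with the risk that the older fetch result is applied after the newer one. It is not the code from the attached SOLR-8914.patch. Only the names lastFetchedLiveNodes and refreshLiveNodesLock come from the comment; the class LiveNodesSketch, the updateLock field, and the Supplier-based fetcher (standing in for the ZooKeeper read) are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for the live-nodes handling of a cluster state reader.
class LiveNodesSketch {
    // Serializes fetches, so the reference below always holds the most recent fetch.
    private final Object refreshLiveNodesLock = new Object();
    // Guards the published state; separate, so no fetch runs while it is held.
    private final Object updateLock = new Object();
    // Newest fetched-but-not-yet-applied result, or null if nothing is pending.
    private final AtomicReference<Set<String>> lastFetchedLiveNodes = new AtomicReference<>();
    private volatile Set<String> liveNodes = Collections.emptySet();

    // fetcher is an assumed placeholder for reading the children of the live-nodes znode.
    void refreshLiveNodes(Supplier<Set<String>> fetcher) {
        synchronized (refreshLiveNodesLock) {
            lastFetchedLiveNodes.set(new HashSet<>(fetcher.get()));
        }
        synchronized (updateLock) {
            // Take whatever is newest, which may be a later fetch than our own.
            Set<String> newest = lastFetchedLiveNodes.getAndSet(null);
            if (newest == null) {
                return; // another thread already applied a result at least as new as ours
            }
            liveNodes = Collections.unmodifiableSet(newest);
        }
    }

    Set<String> getLiveNodes() {
        return liveNodes;
    }

    public static void main(String[] args) {
        LiveNodesSketch reader = new LiveNodesSketch();
        reader.refreshLiveNodes(() -> Collections.singleton("node1"));
        System.out.println(reader.getLiveNodes());
    }
}
```

Because taking the pending result and publishing it happen inside the same updateLock block, a thread that fetched an older list can only ever publish the newest pending list or nothing at all. The same shape would presumably apply to a collection-list watcher, though whether the patch's refreshCollectionListLock works this way is exactly the open question in the comment.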