[
https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15217612#comment-15217612
]
Scott Blum commented on SOLR-8914:
----------------------------------
Alternatively, I could imagine a more Go-channel-style formulation, like
this:
{code}
// We don't get a Stat or track versions on getChildren() calls, so force
// linearization.
private final Object refreshLiveNodesLock = new Object();
private final Queue<Set<String>> newLiveNodesQueue = new ConcurrentLinkedQueue<>();

/**
 * Refresh live_nodes.
 */
private void refreshLiveNodes(Watcher watcher) throws KeeperException, InterruptedException {
  // Fetch and enqueue under one lock so that queue order matches fetch order.
  synchronized (refreshLiveNodesLock) {
    try {
      List<String> nodeList = zkClient.getChildren(LIVE_NODES_ZKNODE, watcher, true);
      newLiveNodesQueue.add(new HashSet<>(nodeList));
    } catch (KeeperException.NoNodeException e) {
      newLiveNodesQueue.add(emptySet());
    }
  }
  Set<String> oldLiveNodes;
  synchronized (getUpdateLock()) {
    // Apply the oldest queued result, which may have been fetched by another thread.
    Set<String> newLiveNodes = newLiveNodesQueue.remove();
    oldLiveNodes = this.liveNodes;
    this.liveNodes = newLiveNodes;
    if (clusterState != null) {
      clusterState.setLiveNodes(newLiveNodes);
    }
    LOG.info("Updated live nodes from ZooKeeper... ({}) -> ({})",
        oldLiveNodes.size(), newLiveNodes.size());
    if (LOG.isDebugEnabled()) {
      LOG.debug("Updated live nodes from ZooKeeper... {} -> {}",
          new TreeSet<>(oldLiveNodes), new TreeSet<>(newLiveNodes));
    }
  }
}
{code}
This avoids the updateLock -> liveNodesLock -> updateLock cycle, but still
linearizes the results of successive invocations.
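For anyone who wants to poke at the ordering argument outside of Solr, here is a minimal standalone sketch of the same two-lock-plus-queue idea. It is not the actual ZkStateReader code: the class name LiveNodesSketch, the Supplier standing in for zkClient.getChildren(...), and the history() helper are all hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical stand-in for the reader; not the real ZkStateReader. */
class LiveNodesSketch {
  private final Object fetchLock = new Object();  // plays the role of refreshLiveNodesLock
  private final Object updateLock = new Object(); // plays the role of getUpdateLock()
  private final Queue<Set<String>> pending = new ConcurrentLinkedQueue<>();
  private final Supplier<Set<String>> source;     // assumed stand-in for the ZooKeeper read
  private final List<Set<String>> applied = new ArrayList<>(); // guarded by updateLock
  private Set<String> liveNodes = Collections.emptySet();      // guarded by updateLock

  LiveNodesSketch(Supplier<Set<String>> source) {
    this.source = source;
  }

  void refresh() {
    // Step 1: fetch and enqueue under one lock, so queue order equals fetch order.
    synchronized (fetchLock) {
      pending.add(new HashSet<>(source.get()));
    }
    // Step 2: apply the oldest pending snapshot; it may have been fetched by
    // a different thread. The two locks are never held at the same time.
    synchronized (updateLock) {
      // The queue is never empty here: every caller enqueues before it dequeues.
      Set<String> next = pending.remove();
      liveNodes = next;
      applied.add(next);
    }
  }

  Set<String> current() {
    synchronized (updateLock) {
      return liveNodes;
    }
  }

  /** Snapshots in the order they were applied (only here to make the order observable). */
  List<Set<String>> history() {
    synchronized (updateLock) {
      return new ArrayList<>(applied);
    }
  }

  public static void main(String[] args) throws InterruptedException {
    // Each "fetch" returns a newer version than the previous one.
    AtomicInteger version = new AtomicInteger();
    LiveNodesSketch sketch = new LiveNodesSketch(
        () -> Collections.singleton("node-" + version.incrementAndGet()));

    Thread[] threads = new Thread[8];
    for (int i = 0; i < threads.length; i++) {
      threads[i] = new Thread(sketch::refresh);
      threads[i].start();
    }
    for (Thread t : threads) {
      t.join();
    }
    // Snapshots are applied in the order they were fetched.
    System.out.println(sketch.history());
  }
}
```

The point of the queue is that a thread does not need to apply its own result, only the oldest one not yet applied. That is why a slow thread can never overwrite a newer snapshot with a stale one, even though no lock is held across both steps.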
> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
> Key: SOLR-8914
> URL: https://issues.apache.org/jira/browse/SOLR-8914
> Project: Solr
> Issue Type: Bug
> Reporter: Hoss Man
> Attachments: SOLR-8914.patch, SOLR-8914.patch,
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the
> weekend....
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what I'm (not) seeing here in
> the logs, so I'm filing this Jira with my analysis in the hope that someone
> else can help make sense of it.