[ https://issues.apache.org/jira/browse/SOLR-14505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17113005#comment-17113005 ]
Colvin Cowie commented on SOLR-14505: ------------------------------------- Oh dear, another race condition: I created https://issues.apache.org/jira/browse/SOLR-14504 a little while ago :) > Intermittent NPE in ZkController.registerLiveNodesListener() > ------------------------------------------------------------ > > Key: SOLR-14505 > URL: https://issues.apache.org/jira/browse/SOLR-14505 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Affects Versions: 8.3.1 > Reporter: Andrzej Bialecki > Assignee: Andrzej Bialecki > Priority: Minor > > Reported by [~cjcowie] in SOLR-13072 & mailing lists: > "Running on Solr 8.3.1 > {code:java} > 2020-05-19 03:44:40.220 INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper > client=XXXXXXXXX:9983/xxxx_cluster > 2020-05-19 03:44:40.238 INFO (main) [ ] o.a.s.c.c.SolrZkClient Using > ZkCredentialsProvider: > xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider > 2020-05-19 03:44:40.241 INFO (main) [ ] o.a.s.c.c.ConnectionManager > Waiting for client to connect to ZooKeeper > 2020-05-19 03:44:40.245 INFO (zkConnectionManagerCallback-11-thread-1) [ ] > o.a.s.c.c.ConnectionManager zkClient has connected > 2020-05-19 03:44:40.245 INFO (main) [ ] o.a.s.c.c.ConnectionManager Client > is connected to ZooKeeper > 2020-05-19 03:44:40.359 INFO (main) [ ] o.a.s.c.c.ConnectionManager > Waiting for client to connect to ZooKeeper > 2020-05-19 03:44:40.361 INFO (zkConnectionManagerCallback-13-thread-1) [ ] > o.a.s.c.c.ConnectionManager zkClient has connected > 2020-05-19 03:44:40.361 INFO (main) [ ] o.a.s.c.c.ConnectionManager Client > is connected to ZooKeeper > 2020-05-19 03:44:40.417 INFO (main) [ ] o.a.s.c.c.ZkStateReader Updated > live nodes from ZooKeeper... (0) -> (1) > 2020-05-19 03:44:56.606 INFO (zkCallback-12-thread-2) [ ] > o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0) > 2020-05-19 03:44:56.614 ERROR (main) [ ] o.a.s.s.SolrDispatchFilter Could > not start Solr. Check solr/home property and the logs > 2020-05-19 03:44:56.639 ERROR (main) [ ] o.a.s.c.SolrCore > null:java.lang.NullPointerException > at > org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020) > at > org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880) > at > org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035) > at org.apache.solr.cloud.ZkController.init(ZkController.java:917) > at org.apache.solr.cloud.ZkController.<init>(ZkController.java:473) > at org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115) > at org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)at > xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71) > {code} > I couldn't find any bug reports in JIRA for the NPE. > > Here's the full log > [https://drive.google.com/open?id=1hQrF25blNgKLXijOMYJ30wn-Lfy6uKVm] > > The NPE is coming from > [https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020] > > _byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", > cloudManager.getTimeSource().getEpochTimeNs()));_ > so I don't know whether it's the cloudManager or the time source that's null. > That bit of the ZkController was added by > https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit if > > _zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);_ > > We have never (knowingly) configured autoscaling and we don't use it, but I > see the autoscaling files are present in ZK. Is the autoscaling.json etc > created by default when it is absent in ZooKeeper? > > The interesting bit of the log above, aside from the NPE, is this: > _2020-05-19 03:44:40.417 INFO (main) [ ] o.a.s.c.c.ZkStateReader Updated > live nodes from ZooKeeper... (0) -> (1)_ > _2020-05-19 03:44:56.606 INFO (zkCallback-12-thread-2) [ ] > o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (1) -> (0)_ > Which suggests to me that there's either a race condition or the problem is > caused by some zookeeper outage during startup. Since there's a 16 second gap > between those messages. > > It's possible that the problem is in some way caused by our own code in > xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the > SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so > should be autoclosed), but that all happens before createCoreContainer is > called." -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org