Hello,

I've only seen this very occasionally on our automated tests so I've never
really dug into it.
Running on Solr 8.3.1



















*2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper
client=XXXXXXXXX:9983/xxxx_cluster2020-05-19 03:44:40.238 INFO  (main) [
] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider:
xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider2020-05-19
03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for
client to connect to ZooKeeper2020-05-19 03:44:40.245 INFO
 (zkConnectionManagerCallback-11-thread-1) [   ]
o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.245
INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to
ZooKeeper2020-05-19 03:44:40.359 INFO  (main) [   ]
o.a.s.c.c.ConnectionManager Waiting for client to connect to
ZooKeeper2020-05-19 03:44:40.361 INFO
 (zkConnectionManagerCallback-13-thread-1) [   ]
o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.361
INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to
ZooKeeper2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader
Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO
 (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes
from ZooKeeper... (1) -> (0)2020-05-19 03:44:56.614 ERROR (main) [   ]
o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property
and the logs2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore
null:java.lang.NullPointerException at
org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
at
org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
at
org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
at org.apache.solr.cloud.ZkController.init(ZkController.java:917) at
org.apache.solr.cloud.ZkController.<init>(ZkController.java:473) at
org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115) at
org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)*
*at
xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)*

I couldn't find any bug reports in JIRA for the NPE.

Here's the full log
https://drive.google.com/open?id=1hQrF25blNgKLXijOMYJ30wn-Lfy6uKVm

The NPE is coming from
https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020


*byte[] json = Utils.toJSON(Collections.singletonMap("timestamp",
cloudManager.getTimeSource().getEpochTimeNs())); *
so I don't know whether it's the cloudManager or the time source that's
null.
That bit of the ZkController was added by
https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit
if
*zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);*

We have never (knowingly) configured autoscaling and we don't use it, but I
see the autoscaling files are present in ZK. Is the autoscaling.json etc
created by default when it is absent in ZooKeeper?


The interesting bit of the log above, aside from the NPE, is this:

*2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader Updated
live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO
 (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes
from ZooKeeper... (1) -> (0)*
Which suggests to me that there's either a race condition or the problem is
caused by some zookeeper outage during startup. Since there's a 16 second
gap between those messages.

It's possible that the problem is in some way caused by our own code in
xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the
SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so
should be autoclosed), but that all happens before createCoreContainer is
called.

Any thoughts?

Thanks,
Colvin

Reply via email to