Hi Colvin,
Off topic: but did you consider contributing your robust SDF changes
upstream? It you need it, others might as well. And if everyone is using
it, the tracking down of bugs in supported code becomes easier as well. 🙂
Regards,
Ishan

On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <colvin.cowie....@gmail.com>
wrote:

> Hello,
>
> I've only seen this very occasionally on our automated tests so I've never
> really dug into it.
> Running on Solr 8.3.1
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer Zookeeper
> client=XXXXXXXXX:9983/xxxx_cluster2020-05-19 03:44:40.238 INFO  (main) [
> ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider:
> xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider2020-05-19
> 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for
> client to connect to ZooKeeper2020-05-19 03:44:40.245 INFO
>  (zkConnectionManagerCallback-11-thread-1) [   ]
> o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.245
> INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to
> ZooKeeper2020-05-19 03:44:40.359 INFO  (main) [   ]
> o.a.s.c.c.ConnectionManager Waiting for client to connect to
> ZooKeeper2020-05-19 03:44:40.361 INFO
>  (zkConnectionManagerCallback-13-thread-1) [   ]
> o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.361
> INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to
> ZooKeeper2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader
> Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO
>  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes
> from ZooKeeper... (1) -> (0)2020-05-19 03:44:56.614 ERROR (main) [   ]
> o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property
> and the logs2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore
> null:java.lang.NullPointerException at
> org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
> at
> org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
> at
> org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
> at org.apache.solr.cloud.ZkController.init(ZkController.java:917) at
> org.apache.solr.cloud.ZkController.<init>(ZkController.java:473) at
> org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115) at
> org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)*
> *at
> xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)*
>
> I couldn't find any bug reports in JIRA for the NPE.
>
> Here's the full log
> https://drive.google.com/open?id=1hQrF25blNgKLXijOMYJ30wn-Lfy6uKVm
>
> The NPE is coming from
> https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020
>
>
> *byte[] json = Utils.toJSON(Collections.singletonMap("timestamp",
> cloudManager.getTimeSource().getEpochTimeNs())); *
> so I don't know whether it's the cloudManager or the time source that's
> null.
> That bit of the ZkController was added by
> https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only hit
> if
>
> *zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);*
>
> We have never (knowingly) configured autoscaling and we don't use it, but
> I see the autoscaling files are present in ZK. Is the autoscaling.json etc
> created by default when it is absent in ZooKeeper?
>
>
> The interesting bit of the log above, aside from the NPE, is this:
>
> *2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader
> Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO
>  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes
> from ZooKeeper... (1) -> (0)*
> Which suggests to me that there's either a race condition or the problem
> is caused by some zookeeper outage during startup. Since there's a 16
> second gap between those messages.
>
> It's possible that the problem is in some way caused by our own code in
> xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the
> SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so
> should be autoclosed), but that all happens before createCoreContainer is
> called.
>
> Any thoughts?
>
> Thanks,
> Colvin
>

Reply via email to