Found the problem in ZkController, I've opened https://issues.apache.org/jira/browse/SOLR-14504 and will add a patch
On Wed, 20 May 2020 at 18:58, Colvin Cowie <[email protected]> wrote: > (Patch for that issue available on > https://issues.apache.org/jira/browse/SOLR-14503) > > On Wed, 20 May 2020 at 17:06, Colvin Cowie <[email protected]> > wrote: > >> Hi Ishan, thanks for prompting me to look at it again. >> The robust SDF was something I added to our product before >> https://issues.apache.org/jira/browse/SOLR-5129 was implemented, which >> should achieve the same effect (unless my understanding of the intention >> behind it is wrong). >> However when I tested it previously it didn't work... I can see why it >> doesn't work though, so I will create an issue for that and can put a patch >> in. >> >> >> On Wed, 20 May 2020 at 16:18, Ishan Chattopadhyaya < >> [email protected]> wrote: >> >>> Hi Colvin, >>> Off topic: but did you consider contributing your robust SDF changes >>> upstream? It you need it, others might as well. And if everyone is using >>> it, the tracking down of bugs in supported code becomes easier as well. 🙂 >>> Regards, >>> Ishan >>> >>> On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <[email protected]> >>> wrote: >>> >>>> Hello, >>>> >>>> I've only seen this very occasionally on our automated tests so I've >>>> never really dug into it. >>>> Running on Solr 8.3.1 >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> *2020-05-19 03:44:40.220 INFO (main) [ ] o.a.s.c.ZkContainer >>>> Zookeeper client=XXXXXXXXX:9983/xxxx_cluster2020-05-19 03:44:40.238 INFO >>>> (main) [ ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider: >>>> xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider2020-05-19 >>>> 03:44:40.241 INFO (main) [ ] o.a.s.c.c.ConnectionManager Waiting for >>>> client to connect to ZooKeeper2020-05-19 03:44:40.245 INFO >>>> (zkConnectionManagerCallback-11-thread-1) [ ] >>>> o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.245 >>>> INFO (main) [ ] o.a.s.c.c.ConnectionManager Client is connected to >>>> ZooKeeper2020-05-19 03:44:40.359 INFO (main) [ ] >>>> o.a.s.c.c.ConnectionManager Waiting for client to connect to >>>> ZooKeeper2020-05-19 03:44:40.361 INFO >>>> (zkConnectionManagerCallback-13-thread-1) [ ] >>>> o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.361 >>>> INFO (main) [ ] o.a.s.c.c.ConnectionManager Client is connected to >>>> ZooKeeper2020-05-19 03:44:40.417 INFO (main) [ ] o.a.s.c.c.ZkStateReader >>>> Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO >>>> (zkCallback-12-thread-2) [ ] o.a.s.c.c.ZkStateReader Updated live nodes >>>> from ZooKeeper... (1) -> (0)2020-05-19 03:44:56.614 ERROR (main) [ ] >>>> o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property >>>> and the logs2020-05-19 03:44:56.639 ERROR (main) [ ] o.a.s.c.SolrCore >>>> null:java.lang.NullPointerException at >>>> org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020) >>>> at >>>> org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880) >>>> at >>>> org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035) >>>> at org.apache.solr.cloud.ZkController.init(ZkController.java:917) at >>>> org.apache.solr.cloud.ZkController.<init>(ZkController.java:473) at >>>> org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115) at >>>> org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)* >>>> *at >>>> xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)* >>>> >>>> I couldn't find any bug reports in JIRA for the NPE. >>>> >>>> Here's the full log >>>> https://drive.google.com/open?id=1hQrF25blNgKLXijOMYJ30wn-Lfy6uKVm >>>> >>>> The NPE is coming from >>>> https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020 >>>> >>>> >>>> *byte[] json = Utils.toJSON(Collections.singletonMap("timestamp", >>>> cloudManager.getTimeSource().getEpochTimeNs())); * >>>> so I don't know whether it's the cloudManager or the time source that's >>>> null. >>>> That bit of the ZkController was added by >>>> https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only >>>> hit if >>>> >>>> *zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);* >>>> >>>> We have never (knowingly) configured autoscaling and we don't use it, >>>> but I see the autoscaling files are present in ZK. Is the autoscaling.json >>>> etc created by default when it is absent in ZooKeeper? >>>> >>>> >>>> The interesting bit of the log above, aside from the NPE, is this: >>>> >>>> *2020-05-19 03:44:40.417 INFO (main) [ ] o.a.s.c.c.ZkStateReader >>>> Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO >>>> (zkCallback-12-thread-2) [ ] o.a.s.c.c.ZkStateReader Updated live nodes >>>> from ZooKeeper... (1) -> (0)* >>>> Which suggests to me that there's either a race condition or the >>>> problem is caused by some zookeeper outage during startup. Since there's a >>>> 16 second gap between those messages. >>>> >>>> It's possible that the problem is in some way caused by our own code in >>>> xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the >>>> SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so >>>> should be autoclosed), but that all happens before createCoreContainer is >>>> called. >>>> >>>> Any thoughts? >>>> >>>> Thanks, >>>> Colvin >>>> >>>
