Found the problem in ZkController, I've opened
https://issues.apache.org/jira/browse/SOLR-14504 and will add a patch

On Wed, 20 May 2020 at 18:58, Colvin Cowie <[email protected]>
wrote:

> (Patch for that issue available on
> https://issues.apache.org/jira/browse/SOLR-14503)
>
> On Wed, 20 May 2020 at 17:06, Colvin Cowie <[email protected]>
> wrote:
>
>> Hi Ishan, thanks for prompting me to look at it again.
>> The robust SDF was something I added to our product before
>> https://issues.apache.org/jira/browse/SOLR-5129 was implemented, which
>> should achieve the same effect (unless my understanding of the intention
>> behind it is wrong).
>> However when I tested it previously it didn't work... I can see why it
>> doesn't work though, so I will create an issue for that and can put a patch
>> in.
>>
>>
>> On Wed, 20 May 2020 at 16:18, Ishan Chattopadhyaya <
>> [email protected]> wrote:
>>
>>> Hi Colvin,
>>> Off topic: but did you consider contributing your robust SDF changes
>>> upstream? It you need it, others might as well. And if everyone is using
>>> it, the tracking down of bugs in supported code becomes easier as well. 🙂
>>> Regards,
>>> Ishan
>>>
>>> On Wed, 20 May, 2020, 7:39 pm Colvin Cowie, <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I've only seen this very occasionally on our automated tests so I've
>>>> never really dug into it.
>>>> Running on Solr 8.3.1
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *2020-05-19 03:44:40.220 INFO  (main) [   ] o.a.s.c.ZkContainer
>>>> Zookeeper client=XXXXXXXXX:9983/xxxx_cluster2020-05-19 03:44:40.238 INFO
>>>>  (main) [   ] o.a.s.c.c.SolrZkClient Using ZkCredentialsProvider:
>>>> xxxxxxx.zookeeper.auth.internal.EncodedZkCredentialsProvider2020-05-19
>>>> 03:44:40.241 INFO  (main) [   ] o.a.s.c.c.ConnectionManager Waiting for
>>>> client to connect to ZooKeeper2020-05-19 03:44:40.245 INFO
>>>>  (zkConnectionManagerCallback-11-thread-1) [   ]
>>>> o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.245
>>>> INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to
>>>> ZooKeeper2020-05-19 03:44:40.359 INFO  (main) [   ]
>>>> o.a.s.c.c.ConnectionManager Waiting for client to connect to
>>>> ZooKeeper2020-05-19 03:44:40.361 INFO
>>>>  (zkConnectionManagerCallback-13-thread-1) [   ]
>>>> o.a.s.c.c.ConnectionManager zkClient has connected2020-05-19 03:44:40.361
>>>> INFO  (main) [   ] o.a.s.c.c.ConnectionManager Client is connected to
>>>> ZooKeeper2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader
>>>> Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO
>>>>  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes
>>>> from ZooKeeper... (1) -> (0)2020-05-19 03:44:56.614 ERROR (main) [   ]
>>>> o.a.s.s.SolrDispatchFilter Could not start Solr. Check solr/home property
>>>> and the logs2020-05-19 03:44:56.639 ERROR (main) [   ] o.a.s.c.SolrCore
>>>> null:java.lang.NullPointerException at
>>>> org.apache.solr.cloud.ZkController.lambda$registerLiveNodesListener$10(ZkController.java:1020)
>>>> at
>>>> org.apache.solr.common.cloud.ZkStateReader.registerLiveNodesListener(ZkStateReader.java:880)
>>>> at
>>>> org.apache.solr.cloud.ZkController.registerLiveNodesListener(ZkController.java:1035)
>>>> at org.apache.solr.cloud.ZkController.init(ZkController.java:917) at
>>>> org.apache.solr.cloud.ZkController.<init>(ZkController.java:473) at
>>>> org.apache.solr.core.ZkContainer.initZooKeeper(ZkContainer.java:115) at
>>>> org.apache.solr.core.CoreContainer.load(CoreContainer.java:631)*
>>>> *at
>>>> xxxxxxx.solr.servlet.RobustSolrDispatchFilter.createCoreContainer(RobustSolrDispatchFilter.java:71)*
>>>>
>>>> I couldn't find any bug reports in JIRA for the NPE.
>>>>
>>>> Here's the full log
>>>> https://drive.google.com/open?id=1hQrF25blNgKLXijOMYJ30wn-Lfy6uKVm
>>>>
>>>> The NPE is coming from
>>>> https://github.com/apache/lucene-solr/blob/c18666ad05afc02979c150aacd4810cff02e43f3/solr/core/src/java/org/apache/solr/cloud/ZkController.java#L1020
>>>>
>>>>
>>>> *byte[] json = Utils.toJSON(Collections.singletonMap("timestamp",
>>>> cloudManager.getTimeSource().getEpochTimeNs())); *
>>>> so I don't know whether it's the cloudManager or the time source that's
>>>> null.
>>>> That bit of the ZkController was added by
>>>> https://issues.apache.org/jira/browse/SOLR-13072 and I see it is only
>>>> hit if
>>>>
>>>> *zkStateReader.getAutoScalingConfig().hasTriggerForEvents(TriggerEventType.NODELOST);*
>>>>
>>>> We have never (knowingly) configured autoscaling and we don't use it,
>>>> but I see the autoscaling files are present in ZK. Is the autoscaling.json
>>>> etc created by default when it is absent in ZooKeeper?
>>>>
>>>>
>>>> The interesting bit of the log above, aside from the NPE, is this:
>>>>
>>>> *2020-05-19 03:44:40.417 INFO  (main) [   ] o.a.s.c.c.ZkStateReader
>>>> Updated live nodes from ZooKeeper... (0) -> (1)2020-05-19 03:44:56.606 INFO
>>>>  (zkCallback-12-thread-2) [   ] o.a.s.c.c.ZkStateReader Updated live nodes
>>>> from ZooKeeper... (1) -> (0)*
>>>> Which suggests to me that there's either a race condition or the
>>>> problem is caused by some zookeeper outage during startup. Since there's a
>>>> 16 second gap between those messages.
>>>>
>>>> It's possible that the problem is in some way caused by our own code in
>>>> xxxxxxx.solr.servlet.RobustSolrDispatchFilter which wraps the
>>>> SolrDispatchFIlter, and creates a SolrZkClient in a try/with resources (so
>>>> should be autoclosed), but that all happens before createCoreContainer is
>>>> called.
>>>>
>>>> Any thoughts?
>>>>
>>>> Thanks,
>>>> Colvin
>>>>
>>>

Reply via email to