Hi AB,

Please accept my apologies for the heated discussion. That was not the objective at all.
I saw the real damage that was caused to our client. It was devastating. We were a little worried about the same thing happening to another user who might upgrade, so I suggested a revert. Whatever happened after that was due to the stress the situation has put us under. We asked our client to upgrade from 8.4 to 8.8 and the cluster had a meltdown.

Please let's forget about what has transpired in the JIRA and get back to saving the next user from such a meltdown:

1) Warn our users against upgrading from 8.4 (if they have not already done so)
2) Revert this change and do a break-fix release
3) Fix the actual bug that caused the null node_name in the first place

Regards,
Noble Paul

On Tue, May 18, 2021 at 10:22 PM Andrzej Białecki <[email protected]> wrote:
>
> Ishan, as I pointed out in Jira I don’t care for you implying that I have
> evil intentions, I resent also your implication that I’m behaving
> irrationally or don’t care for the users. Those of you who are interested may
> read the comments in Jira and judge for themselves.
>
> You conveniently don’t mention that I WITHDREW my objection, and instead
> proposed a lenient validation (but validation nonetheless!). It’s easy to
> scream “revert! revert!” but it actually takes some consideration to properly
> address the original purpose of this change - that is, detecting and avoiding
> the corruption of replica state. Let’s focus on this and not on pointing
> fingers.
>
> As for the production outage - I’m sorry this happened to you. As I hope you
> and Noble and others are sorry for other inadvertently introduced bugs, which
> I’m sure brought down many clusters at inconvenient hours...
>
>
> On 18 May 2021, at 13:26, Ishan Chattopadhyaya <[email protected]>
> wrote:
>
> https://issues.apache.org/jira/browse/SOLR-14245
>
> There was a production outage at odd hours at my (and Noble's) client, due to
> this above change in Solr 8.5 onwards by Andrzej Bialecki.
>
> In short, there is some bug in Solr where a replica gets "null" as the
> node_name (upon invocation of a collection API command). On the rare
> occasions where we encountered such situations in the past, the replica would
> be unavailable and the system would work fine overall. However, this change
> (which introduces strict validation of errors while *reading* Replica
> objects) now means that if such a situation arises (where some Solr's APIs
> itself results in node_name being null in a state.json), all SolrJ clients
> and all Solr nodes will go for a toss (possibly crash, and not start back up).
>
> This change was rushed in, without any discussions or review, without
> extensive testing for the failures it will cause on existing systems where
> cluster state is messed up but system is running, and without any
> consideration for the impact on users.
>
> Noble and I are of the opinion that this change should be reverted
> immediately, considering the impact to users. However, there is strong
> disagreement on Andrzej's part.
>
> Mistakes happen, but doubling down on them irrationally [1] will destroy the
> reputation of the project, let alone the peace of mind of those who are
> running Solr in production.
>
> Does someone have any thoughts or opinions?
>
> [1] -
> https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758
>
>

--
-----------------------------------------------------
Noble Paul
