Re: Doubling down on our mistakes?

Noble Paul Tue, 18 May 2021 05:46:40 -0700

Hi AB,
Please accept my apologies for the heated discussion. That was not
the objective at all.


I saw the real damage that was caused to our client. It was
devastating. We were a little worried about the same happening to
another user who might upgrade.

So, I suggested a revert.

Whatever happened after that was due to the stress the situation has
put us in. We asked our client to upgrade from 8.4 to 8.8 and the
cluster had a meltdown.

Please let's forget about what has transpired in the JIRA and let's
get back to saving the next user from such a meltdown:

1) Warn our users against upgrading from 8.4 (if they have not already done it)
2) Revert this change and do a break-fix release
3) Fix the actual bug that caused the null node_name in the first place

regards
Noble Paul


On Tue, May 18, 2021 at 10:22 PM Andrzej Białecki <[email protected]> wrote:
>
> Ishan, as I pointed out in Jira I don’t care for you implying that I have 
> evil intentions, I resent also your implication that I’m behaving 
> irrationally or don’t care for the users. Those of you who are interested may 
> read the comments in Jira and judge for themselves.
>
> You conveniently don’t mention that I WITHDREW my objection, and instead 
> proposed a lenient validation (but validation nonetheless!). It’s easy to 
> scream “revert! revert!” but it actually takes some consideration to properly 
> address the original purpose of this change - that is, detecting and avoiding 
> the corruption of replica state. Let’s focus on this and not on pointing 
> fingers.
>
> As for the production outage - I’m sorry this happened to you. As I hope you 
> and Noble and others are sorry for other inadvertently introduced bugs, which 
> I’m sure brought down many clusters at inconvenient hours...
>
>
> On 18 May 2021, at 13:26, Ishan Chattopadhyaya <[email protected]> 
> wrote:
>
> https://issues.apache.org/jira/browse/SOLR-14245
>
> There was a production outage at odd hours at my (and Noble's) client, due to 
> this above change in Solr 8.5 onwards by Andrzej Bialecki.
>
> In short, there is some bug in Solr where a replica gets "null" as the 
> node_name (upon invocation of a collection API command). On the rare 
> occasions where we encountered such situations in the past, the replica would 
> be unavailable and the system would work fine overall. However, this change 
> (which introduces strict validation of errors while *reading* Replica 
> objects) now means that if such a situation arises (where one of Solr's own 
> APIs results in node_name being null in a state.json), all SolrJ clients 
> and all Solr nodes will go for a toss (possibly crash, and not start back up).
>
> This change was rushed in, without any discussions or review, without 
> extensive testing for the failures it will cause on existing systems where 
> cluster state is messed up but the system is running, and without any 
> consideration for the impact on users.
>
> Noble and I are of the opinion that this change should be reverted 
> immediately, considering the impact to users. However, there is strong 
> disagreement on Andrzej's part.
>
> Mistakes happen, but doubling down on them irrationally [1] will destroy the 
> reputation of the project, let alone the peace of mind of those who are 
> running Solr in production.
>
> Does someone have any thoughts or opinions?
>
> [1] - 
> https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758
>
>


-- 
-----------------------------------------------------
Noble Paul

