I removed [email protected] from my response here. Please everyone do the same and don't email both Lucene & Solr at the same time. I recall that's an old best practice / rule in general -- never address an email to more than one list.
I agree 100% with Erick. It's shameful, it reflects badly on our community, and it's just so not necessary. It's a clear code-of-conduct violation. I hope Andrzej is "okay" emotionally; I'd be a mess in his shoes. At least the apologies are very reasonable to me; I was expecting Ishan/Noble to dig their heels in (as I witnessed some months ago) and I'm relieved not to see that.

The internal complexity of Solr (esp. SolrCloud) is very high; it's difficult to make changes and not have some worry that maybe a change has some ill effect. Yet we can't simply not touch it. The irony here is that the change in question was targeted directly at improving the quality of Solr; I love those types of changes, honestly.

Perhaps Solr getting its own Docker images as part of the project may lead to automated Solr-upgrade testing to catch compatibility bugs? Maybe that might be done at the K8S Solr Operator level integration tests since I'm guessing the Operator facilitates upgrades already?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, May 18, 2021 at 8:54 AM Ishan Chattopadhyaya <[email protected]> wrote:

> I apologize for the harsh words, and personally to Andrzej for hurting your feelings. I had no such intentions.
>
> > You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!).
> Yes, let me mention that you agreed in principle to reduce the impact of the change (even though not completely revert it). I welcome that and thank you for that. By the time you replied on JIRA, I had already sent this mail.
>
> > I see no urgency at all in this matter. This can be handled as day-to-day bug fixing as usual.
> I think this requires an immediate notification to all users to be aware of this situation before upgrading. Also, an immediate breakfix should be helpful for them.
>
> > My feelings are hurt, and I'm greatly disappointed in your words, quick attacking off the cuff regularly rude (IMO) because you happened to have a bad day.
> I apologize.
>
> How I saw things is that we have a commitment to our users to give them good quality software that they can rely on. My intention was not to attack Andrzej personally, but to bring about collective awareness regarding this problem: that we, as a community, don't care enough for our users. We need to get better at testing, get better at reviews, better at benchmarks, etc. Individually, we all have the best of intentions, and obviously so does Andrzej. However, we need to get better, and I wanted this to be a starting point in that conversation. Clearly, I got carried away and I apologize for that.
>
> On Tue, May 18, 2021 at 5:52 PM Andrzej Białecki <[email protected]> wrote:
>
>> Ishan, as I pointed out in Jira I don’t care for you implying that I have evil intentions, I resent also your implication that I’m behaving irrationally or don’t care for the users. Those of you who are interested may read the comments in Jira and judge for themselves.
>>
>> You conveniently don’t mention that I WITHDREW my objection, and instead proposed a lenient validation (but validation nonetheless!). It’s easy to scream “revert! revert!” but it actually takes some consideration to properly address the original purpose of this change - that is, detecting and avoiding the corruption of replica state. Let’s focus on this and not on pointing fingers.
>>
>> As for the production outage - I’m sorry this happened to you. As I hope you and Noble and others are sorry for other inadvertently introduced bugs, which I’m sure brought down many clusters at inconvenient hours...
>>
>>
>> On 18 May 2021, at 13:26, Ishan Chattopadhyaya <[email protected]> wrote:
>>
>> https://issues.apache.org/jira/browse/SOLR-14245
>>
>> There was a *production outage* at *odd hours* at my (and Noble's) client, due to this above change in Solr 8.5 onwards by *Andrzej Bialecki*.
>>
>> In short, there is some bug in Solr where a replica gets "null" as the node_name (upon invocation of a collection API command). On the rare occasions where we encountered such situations in the past, the replica would be unavailable and the system would work fine overall. However, this change (which introduces strict validation of errors while *reading* Replica objects) now means that if such a situation arises (where one of Solr's own APIs results in node_name being null in a state.json), all SolrJ clients and all Solr nodes will go for a toss (possibly crash, and not start back up).
>>
>> This change was rushed in, *without any discussions or review*, without extensive testing for the failures it will cause on existing systems where cluster state is messed up but system is running, and *without any consideration for the impact on users*.
>>
>> Noble and I are of the opinion that this change should be *reverted immediately*, considering the impact to users. However, there is *strong disagreement on Andrzej's part*.
>>
>> *Mistakes* happen, but *doubling down on them irrationally* [1] will destroy the reputation of the project, let alone the peace of mind of those who are running Solr in production.
>>
>> Does someone have any thoughts or opinions?
>>
>> [1] - https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758
