Re: Doubling down on our mistakes?

David Smiley Thu, 20 May 2021 20:53:12 -0700

I removed [email protected] from my response here.  Please everyone do
the same and don't email both Lucene & Solr at the same time.  I recall
that's an old best practice / rule in general -- never address an email to
more than one list.


I agree 100% with Erick.  It's shameful, reflects badly on our community, and
is just so unnecessary.  It's a clear code-of-conduct violation.  I
hope Andrzej is "okay" emotionally; I'd be a mess in his shoes.  At least
the apologies are very reasonable to me; I was expecting Ishan/Noble to dig
their heels in (as I witnessed some months ago) and I'm relieved not to see
that.

The internal complexity of Solr (esp. SolrCloud) is very high; it's
difficult to make changes and not have some worry that maybe a change has
some ill effect.  Yet we can't simply not touch it.  The irony here is that
the change in question was targeted directly at improving the quality of
Solr; I love those types of changes, honestly.

Perhaps Solr getting its own Docker images as part of the project may lead
to automated Solr-upgrade testing to catch compatibility bugs?  Maybe that
could be done in the integration tests at the K8S Solr Operator level, since
I'm guessing the Operator facilitates upgrades already?

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Tue, May 18, 2021 at 8:54 AM Ishan Chattopadhyaya <
[email protected]> wrote:

> I apologize for the harsh words, and personally to Andrzej for hurting
> your feelings. I had no such intentions.
>
> > You conveniently don’t mention that I WITHDREW my objection, and instead
> proposed a lenient validation (but validation nonetheless!).
> Yes, let me mention that you agreed in principle to reduce the impact of
> the change (even though not completely revert it). I welcome that and thank
> you for that. By the time you replied on JIRA, I had already sent this mail.
>
> > I see no urgency at all in this matter. This can be handled as
> day-to-day bug fixing as usual.
> I think this requires an immediate notification to all users to be aware
> of this situation before upgrading. Also, an immediate breakfix should be
> helpful for them.
>
> > My feelings are hurt, and I'm greatly disappointed in your words, quick
> attacking off the cuff regularly rude (IMO) because you happened to have a
> bad day.
> I apologize.
>
> How I saw things is that we have a commitment to our users to give them
> good quality software that they can rely on. My intention was not to attack
> Andrzej personally, but to bring about collective awareness regarding this
> problem: that we, as a community, don't care enough for our users. We need
> to get better at testing, get better at reviews, better at benchmarks, etc.
> Individually, we all have the best of intentions, and obviously so does
> Andrzej. However, we need to get better, and I wanted this to be a starting
> point in that conversation. Clearly, I got carried away and I apologize for
> that.
>
> On Tue, May 18, 2021 at 5:52 PM Andrzej Białecki <[email protected]> wrote:
>
>> Ishan, as I pointed out in Jira I don’t care for you implying that I have
>> evil intentions, I resent also your implication that I’m behaving
>> irrationally or don’t care for the users. Those of you who are interested
>> may read the comments in Jira and judge for themselves.
>>
>> You conveniently don’t mention that I WITHDREW my objection, and instead
>> proposed a lenient validation (but validation nonetheless!). It’s easy to
>> scream “revert! revert!” but it actually takes some consideration to
>> properly address the original purpose of this change - that is, detecting
>> and avoiding the corruption of replica state. Let’s focus on this and not
>> on pointing fingers.
>>
>> As for the production outage - I’m sorry this happened to you. As I hope
>> you and Noble and others are sorry for other inadvertently introduced bugs,
>> which I’m sure brought down many clusters at inconvenient hours...
>>
>>
>> On 18 May 2021, at 13:26, Ishan Chattopadhyaya <[email protected]>
>> wrote:
>>
>> https://issues.apache.org/jira/browse/SOLR-14245
>>
>> There was a *production outage* at *odd hours* at my (and Noble's)
>> client, due to this above change in Solr 8.5 onwards by *Andrzej
>> Bialecki*.
>>
>> In short, there is some bug in Solr where a replica gets "null" as the
>> node_name (upon invocation of a collection API command). On the rare
>> occasions where we encountered such situations in the past, the replica
>> would be unavailable and the system would work fine overall. However, this
>> change (which introduces strict validation of errors while *reading*
>> Replica objects) now means that if such a situation arises (where one of
>> Solr's own APIs results in node_name being null in a state.json), all
>> SolrJ clients and all Solr nodes will go for a toss (possibly crash, and
>> not start back up).
>>
>> This change was rushed in, *without any discussions or review*, without
>> extensive testing for the failures it will cause on existing systems where
>> cluster state is messed up but system is running, and *without any
>> consideration for the impact on users*.
>>
>> Noble and I are of the opinion that this change should be *reverted
>> immediately*, considering the impact to users. However, there is *strong
>> disagreement on Andrzej's part*.
>>
>> *Mistakes* happen, but *doubling down on them irrationally* [1] will
>> destroy the reputation of the project, let alone the peace of mind of those
>> who are running Solr in production.
>>
>> Does someone have any thoughts or opinions?
>>
>> [1] -
>> https://issues.apache.org/jira/browse/SOLR-14245?focusedCommentId=17346758&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17346758
>>
>>
>>
