I hate to say it, but I was disappointed that this email thread was started
only after the TCM work had already been committed, especially since we even
have an epic with patches spread around the codebase that were waiting for
TCM to be committed first so as not to disturb its rebase. The latest commits
in trunk were relatively isolated. We could have committed the last
outstanding tickets that are part of the push for the next release and still
made an exception for the TCM merge after that; that would have been a
constructive approach with visibility for everyone.

On a positive note, thank you to everyone who has worked and keeps working on
TCM! This is a tremendous effort!! I'm super excited!

"Can we get these tests temporarily annotated as skipped while all the
subtickets to 19055 are being worked on"

I am against that. Given that Sam, Alex, and Marcus are working around the
clock on the test-related tickets, I do not think this is necessary. In
general, ignoring tests also carries the risk of something being forgotten,
of a ticket being closed by mistake, and of new kinds of failures going
unnoticed.
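
Just so we are all picturing the same thing: "temporarily annotated as
skipped" would in practice mean something like the JUnit sketch below (the
class name and the annotation message are purely illustrative, not real
code in the repo). Once that annotation is in place, CI stays green no
matter what, which is exactly the "forgotten ticket" risk I am describing:

    import org.junit.Ignore;
    import org.junit.Test;

    public class SomeTcmAffectedTest
    {
        // Skipped test: nothing in CI will remind us to re-enable it; the
        // only reminder that this coverage is missing lives in JIRA.
        @Ignore("Flaky since the TCM merge, tracked under a CASSANDRA-19055 subticket")
        @Test
        public void testExample()
        {
        }
    }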

"it makes sense to do a repeated run of the new tests."
I do agree with Jacek here.

"I also had to go the route of extracting a blend of what's in circle and
what's in ASF CI (in terms of test suites, filtering, etc) since neither
represented a complete view of our CI ecosystem;"
There is a single test suite in CircleCI, and the packaging needs to be
included.

"From a cursory inspection it looks like most of the breakages being
tracked on the ticket Sam linked for TCM are likely to be circle env
specific (new *nix optimized deletion having a race, OOM's, etc). The TCM
merge is actually a great forcing function for us to surface anything env
specific in terms of timing and resourcing up-front;"
I spent my afternoon triaging the tickets and doing repeated runs on the
failures reported under CASSANDRA-19055, as I do not believe we can or should
blindly blame all new failures and flakies on TCM. Unfortunately, all the
tickets I triaged were accurately assigned to TCM, and I don't think there
were more than one or two OOM tickets.
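
For anyone who wants to help with that triage: you do not need full CI to
tell a genuine regression from an environmental flake; a crude local repeat
loop is often enough. A minimal sketch of what I mean, assuming JUnit 4 and
the test classpath are already set up (the helper class below is
hypothetical, not how our CI does repeated runs):

    import org.junit.runner.JUnitCore;
    import org.junit.runner.Result;

    public class RepeatRunner
    {
        public static void main(String[] args) throws ClassNotFoundException
        {
            // args[0] is the fully qualified name of the suspected flaky test class.
            Class<?> testClass = Class.forName(args[0]);
            for (int i = 1; i <= 50; i++)
            {
                Result result = JUnitCore.runClasses(testClass);
                System.out.printf("run %d: %d failure(s)%n", i, result.getFailureCount());
                if (!result.wasSuccessful())
                {
                    // Print the first stack trace and stop; a single failure in a
                    // clean local environment is already worth a closer look.
                    System.out.println(result.getFailures().get(0).getTrace());
                    break;
                }
            }
        }
    }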

"My gut tells me it's basically impossible to have a merge of this size
that doesn't disrupt what it's merging into"
I agree with this. There is no way that everything will go smoothly and
perfectly with work of this size. But we should all recognize the excellent
work that Sam, Alex, and Marcus have been doing over the past two days to
identify and fix the outstanding issues. Thank you! I am sure everyone
appreciates that! I would also like to appeal to community members to keep
triaging and bisecting any new test flakies and failures as we have been
doing, rather than blindly assigning everything to the TCM follow-up epic.
We should be constructive.

"for the record, I don't think we should hold off on merging things just
because some folks are on holiday. :)"

I do not believe anyone is advocating for that. I personally even commit
disruptive patches on the weekend quite often; it gives me time to deal with
any fallout and reduces the potential disruption to people's work during
office hours. My main concern is the one Benjamin and Jacek already raised:
the precedent this sets after all the discussions that have happened.


On Mon, 27 Nov 2023 at 16:34, Josh McKenzie <jmcken...@apache.org> wrote:

> on our internal CI system
>
> Some more context:
>
> This environment adheres to the requirements we laid out in pre-commit CI
> on Cassandra
> <https://docs.google.com/document/d/1TaYMvE5ryOYX03cxzY6XzuUS651fktVER02JHmZR5FU/edit>
>  with
> a couple required differences. We don't yet include the resource
> restriction detail in the test report; it's on my backlog of things to do
> but I can confirm that less CPU and <= equivalent ASFCI memory is being
> allocated for each test suite. I also had to go the route of extracting a
> blend of what's in circle and what's in ASF CI (in terms of test suites,
> filtering, etc) since neither represented a complete view of our CI
> ecosystem; there are currently things executed in either environment not
> executed in the other.
>
> I've been tracking the upstreaming of that declarative combination in
> CASSANDRA-18731 but have had some other priorities take front-seat (i.e.
> getting a new CI system based on that working since neither upstream ASF CI
> nor circle are re-usable in their current form) and will be upstreaming
> that ASAP. https://issues.apache.org/jira/browse/CASSANDRA-18731
>
> I've left a pretty long comment on CASSANDRA-18731 about the structure of
> things and where my opinion falls; *I think we need a separate DISCUSS
> thread on the ML about CI and what we require for pre-commit smoke*
> suites:
> https://issues.apache.org/jira/browse/CASSANDRA-18731?focusedCommentId=17790270&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17790270
>
> The TL;DR:
>
> With an *incredibly large* patch in the form of TCM (88k+ LoC, 900+ files
> touched), we have less than a .002% test failure injection rate using the
> above restricted smoke heuristic, and many of them look to be circle ci env
> specific and not asf ci.
>
>
> From a cursory inspection it looks like most of the breakages being
> tracked on the ticket Sam linked for TCM are likely to be circle env
> specific (new *nix optimized deletion having a race, OOM's, etc). The TCM
> merge is actually a great forcing function for us to surface anything env
> specific in terms of timing and resourcing up-front; I'm glad we have this
> opportunity but it's unfortunate that it's been interpreted as merging
> w/out passing CI as opposed to having some env-difference specific kinks to
> work out.
>
> *This was an incredibly huge merge.* For comparison, I just did a --stat
> on the merge for CASSANDRA-8099:
>
> 645 files changed, 49381 insertions(+), 42227 deletions(-)
>
>
> TCM from the C* repo:
>
>  934 files changed, 66185 insertions(+), 21669 deletions(-)
>
> My gut tells me it's basically impossible to have a merge of this size
> that doesn't disrupt what it's merging into, or the authors just end up
> slowly dying in rebase hell. Or both. This was a massive undertaking and
> compared to our past on this project, has had an incredibly low impact on
> the target it was merged into and the authors are rapidly burning down
> failures.
>
> To the authors - great work, and thanks for being so diligent on following
> up on any disruptions this body of work has caused to other contributors'
> environments.
>
> To the folks who were disrupted - I get it. This is deeply frustrating,
> green CI has long been many of our white whales, and having something
> merge over a US holiday week with an incredibly active project where we
> don't all have time to keep up with everything can make things like this
> feel like a huge surprise. It's incredibly unfortunate that the timing on
> us transitioning to this new CI system and working out the kinks is when
> this behemoth of a merge needed to come through, but silver-lining.
>
> We're making great strides. Let's not lose sight of our growth because of
> the pain in the moment of it.
>
> ~Josh
>
> p.s. - for the record, I don't think we should hold off on merging things
> just because some folks are on holiday. :)
>
> On Mon, Nov 27, 2023, at 3:38 PM, Sam Tunnicliffe wrote:
>
> I ought to clarify, we did actually have green CI modulo 3 flaky tests on
> our internal CI system. I've attached the test artefacts to CASSANDRA-18330
> now[1][2]: 2 of the 3 failures are upgrade dtests, with 1 other python
> dtest failure noted. None of these were reproducible in a dev setup, so we
> suspected them to be environmental and intended to merge before returning
> to confirm that. The "known" failures that we mentioned in the email that
> started this thread were ones observed by Mick running the cep-21-tcm
> branch through Circle before merging.
>
> As the CEP-21 changeset was approaching 88k LoC touching over 900 files,
> permanently rebasing as we tried to eradicate every flaky test was simply
> unrealistic, especially as other significant patches continued to land in
> trunk. With that in mind, we took the decision to merge so that we could
> focus on actually removing any remaining instability.
>
> [1]
> https://issues.apache.org/jira/secure/attachment/13064727/ci_summary.html
> [2]
> https://issues.apache.org/jira/secure/attachment/13064728/result_details.tar.gz
>
>
> On 27 Nov 2023, at 10:28, Berenguer Blasi <berenguerbl...@gmail.com>
> wrote:
>
> Hi,
>
> I have written this email like 10 times before sending it, and I can't
> manage to avoid giving it a negative spin. So pardon my English or poor
> choice of words in advance, and try to read it in a positive way.
>
> It is really demotivating to me to see things getting merged without green
> CI. I had to go through a herculean effort and pain (at least to me) to
> keep rebasing the TTL patch continuously (a huge one imo) when it would
> have been altogether much easier to merge first and then fix and add
> downgradability afterwards, along the lines of the TCM merge.
>
> If this merge-then-fix approach is a thing, I would like it clarified so we
> can all benefit from it and avoid the big-patch rebase pain.
>
> Regards
> On 27/11/23 10:38, Jacek Lewandowski wrote:
>
> Hi,
>
> I'm happy to hear that the feature got merged. Though, I share Benjamin's
> worries about that being a bad precedent.
>
> I don't think it makes sense to do repeated runs in this particular case.
> Detecting flaky tests would not prove anything; they can be caused by this
> patch, but we would not know that for sure. We would have to have a similar
> build with the same tests repeated to compare. It would take time and
> resources, and in the end, we will have to fix those flaky tests regardless
> of whether they were caused by this change. IMO, it makes sense to do a
> repeated run of the new tests, though. Aside from that, we can also
> consider making it easier and more automated for the developer to determine
> whether a particular flakiness comes from a feature branch one wants to
> merge.
> thanks,
> Jacek
>
>
> On Mon, 27 Nov 2023 at 10:15, Benjamin Lerer <ble...@apache.org> wrote:
>
> Hi,
>
> I must admit that I have been surprised by this merge and the email that
> followed it. We had lengthy discussions recently, and the final agreement
> was that the requirement for a merge was a green CI.
> I can understand that for some reason, as a community, we might wish to
> make an exception. In the present case, there was no official discussion
> asking for one.
> I believe that this merge creates a bad precedent where anybody can feel
> entitled to merge without a green CI and disregard any previous community
> agreement.
>
> On Sat, 25 Nov 2023 at 09:22, Mick Semb Wever <m...@apache.org> wrote:
>
>
> Great work Sam, Alex & Marcus !
>
>
> There are about 15-20 flaky or failing tests in total, spread over several
> test jobs[2] (i.e. single digit failures in a few of these). We have filed
> JIRAs for the failures and are working on getting those fixed as a top
> priority. CASSANDRA-19055[3] is the umbrella ticket for this follow up work.
>
> There are also a number of improvements we will work on in the coming
> weeks, we will file JIRAs for those early next week and add them as
> subtasks to CASSANDRA-19055.
>
>
>
> Can we get these tests temporarily annotated as skipped while all the
> subtickets to 19055 are being worked on ?
>
> As we have seen from CASSANDRA-18166 and CASSANDRA-19034 there's a lot of
> overhead now on 5.0 tickets having to navigate around these failures in
> trunk CI runs.
>
> Also, we're still trying to figure out how to do repeated runs for a patch
> so big… (the list of touched tests was too long for circleci, I need to
> figure out what the limit is and chunk it into separate circleci configs) …
> and it probably makes sense to wait until most of 19055 is done (or tests
> are temporarily annotated as skipped).
>
>
>
>
