Hello fellow jamers!

The Jenkinsfile in the PR works up until the test suite fails. The test failures come from seemingly "unstable" tests that fail because of timing issues.

Benoit fixed the first one in https://github.com/apache/james-project/pull/267 by disabling read repairs during consistency checks (I have no idea what it means but it sounds awesome :) ). I fixed the second one in https://github.com/apache/james-project/pull/269, where the event bus sender and receivers were sometimes closed out of order on shutdown, leading to events being sent to a closed receiver.

After some cleanup, Matthieu recreated a buildable PR, which led to yet another unstable test in https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/1/tests

I started investigating the issue and ended up roping in Matthieu, since the symptoms left me completely puzzled. Matthieu managed to pinpoint the root cause: an NPE sometimes thrown from within org.apache.james.server.core.MimeMessageCopyOnWriteProxy, which in turn triggered further NullPointerExceptions in the mailet pipeline error handling code. We finally confirmed a concurrency issue in the refcounting management of the proxy which, if I understand correctly, can lead to unrecoverable data loss. We wrote a test to trigger it [1] in an almost deterministic manner.

Once we had a test to reproduce the race condition, we tried to fix the issue, only to realize that it led to even more concurrency issues. The rather depressing conclusion we reached yesterday was that the whole implementation is currently unsound with regard to concurrency. I am unable to estimate the resolution effort at this point; Matthieu has some ideas and will work on it (as will I) when time allows.

Which leads me to my current questions: I feel that fixing such long-standing issues in the test suite is not actually part of configuring the Apache CI, but I am unsure how to proceed.

Here is what I would like to do at this stage:
- isolate the unstable tests under an unstable tag (akin to "feature tags")
- exclude these tests from the default surefire execution profile
- add a parallel pipeline step for these tests where the step failure doesn't fail the pipeline [2]
- ensure that the build is green
- merge so the project finally has a working public CI

I intend to start working on this quickly so we can all enjoy a functional public CI.

Alternatives:
- Merge the Jenkinsfile after the whole pipeline has been tested in the PR branch, which may not happen in the short to medium term...
- Merging as is means that many builds on PRs will end up failing, and the last steps (snapshot publish) might fail even if the test suite succeeds, since it has never run.
- Something I haven't thought of?

Another issue I want to raise is the availability of the CI builds. As you have seen from my experiments, the CI trigger configuration will only build commits from:
- all branches of the main repository
- all PRs opened from the main repository
- all PRs opened by someone with write access to the main repository

This means that PRs from external contributors will not be built at all. I tried adding the issueCommentTrigger to the Jenkinsfile, but neither my comments nor those of someone with commit access were able to trigger the build.

I think that one of the project members should revise the current settings to make it possible to build external contributors' PRs one way or another (only project members have, or can have, access to the Jenkins project configuration). Here are two options:
- the easiest and quickest modification is to let the CI build each and every PR; there are relatively few PRs on James, so the burden on the CI platform shouldn't be too bad.
- alternatively, it may be possible to configure Jenkins to require a comment from someone with write access to trigger a build. Unfortunately I am not certain how to set this up; maybe INFRA can help.

I know this was a long piece, I look forward to reading your opinions!
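
As an aside on the refcounting race mentioned earlier: the following is a minimal Java sketch of the general pattern where a shared resource is retained and released by several threads and must be disposed exactly once. It is not the MimeMessageCopyOnWriteProxy code and not the fix being worked on; the class names RefCounted and RefCountDemo and all method names are made up for illustration. The point it tries to show is that "check the count, then change it" has to be a single atomic step, which is done here with compareAndSet.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: NOT the James MimeMessageCopyOnWriteProxy code.
final class RefCounted {
    // The creator holds the first reference.
    private final AtomicInteger refs = new AtomicInteger(1);
    // Counts how often the underlying resource was disposed (should end at 1).
    private final AtomicInteger disposals = new AtomicInteger();

    /** Takes a reference; returns false if the resource is already disposed. */
    boolean retain() {
        while (true) {
            int current = refs.get();
            if (current == 0) {
                return false; // too late, never resurrect a disposed resource
            }
            if (refs.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Drops a reference; the caller that drops the last one disposes. */
    void release() {
        while (true) {
            int current = refs.get();
            if (current == 0) {
                throw new IllegalStateException("released more often than retained");
            }
            if (refs.compareAndSet(current, current - 1)) {
                if (current == 1) {
                    disposals.incrementAndGet(); // stands in for the real cleanup
                }
                return;
            }
        }
    }

    int disposals() { return disposals.get(); }

    int references() { return refs.get(); }
}

final class RefCountDemo {
    /** Hammers one resource from several threads, then drops the creator's reference. */
    static RefCounted hammer(int threads, int iterations) {
        RefCounted resource = new RefCounted();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                start.await(); // line up all workers to maximise contention
                for (int i = 0; i < iterations; i++) {
                    if (resource.retain()) {
                        resource.release();
                    }
                }
                return null; // makes the lambda a Callable, so await() may throw
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        resource.release(); // the creator lets go last
        return resource;
    }

    public static void main(String[] args) {
        RefCounted resource = hammer(8, 10_000);
        System.out.println("disposals=" + resource.disposals()
                + " references=" + resource.references());
    }
}
```

With a plain int counter and separate "if (count == 0)" and "count--" steps, two threads can both observe the same value and either dispose twice or use an already disposed resource, which is the general shape of bug this sketch avoids. Whether the proxy can be repaired along these lines is exactly what remains to be determined.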

Jean

[1] see https://github.com/jeantil/james-project/tree/james-3225-concurrency-bug-mimemessagecow
[2] see https://stackoverflow.com/questions/44022775/jenkins-ignore-failure-in-pipeline-build-step

On Thu, Nov 26, 2020 at 11:22 AM Jean Helou <jean.he...@gmail.com> wrote:

> The good news is that docker does indeed work, the bad news is that the
> tests fail with an issue that's too involved for me :/
>
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]
> CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails:433
> Multiple Failures (1 failure)
>
> Expecting:
> <[]>
> to contain exactly (and in same order):
> <[#private:user:INBOX]>
> but could not find the following elements:
> <[#private:user:INBOX]>
>
> at
> CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.lambda$deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails$8(CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.java:440)
>
> so unless the build for
>
> * 6fab99364a - JAMES-3448 Rewrite links to http://james.apache.org/server/3/
> (Mon Nov 23 15:10:36 2020 +0700) <Benoit Tellier> N
>
> is broken which sounds unlikely, I'm going to need help
>
> jean
>
> On Thu, Nov 26, 2020 at 10:53 AM Jean Helou <jean.he...@gmail.com> wrote:
>
>> on a loosely related note : the test suite logs are scary to look at:
>> piles upon piles of stack traces and error logs but the tests actually pass
>> ...
>>
>> On Thu, Nov 26, 2020 at 10:50 AM Jean Helou <jean.he...@gmail.com> wrote:
>>
>>> Thanks benoit,
>>>
>>> Matthieu pointed me to numerous apache projects with jenkinsfiles which
>>> mention docker in
>>> https://github.com/search?q=org%3Aapache++filename%3AJenkinsfile+docker&type=Code
>>> so I'm trying out things based on that
>>>
>>> the logs seem promising so far :
>>> ```
>>>
>>> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed:
>>> 0.697 s - in
>>> org.apache.james.backends.rabbitmq.RabbitMQConnectionFactoryTest
>>> ℹ︎ Checking the system...
>>> ✔ Docker version should be at least 1.6.0
>>> ✔ Docker environment should have more than 2GB free disk space
>>> [INFO] Running org.apache.james.backends.rabbitmq.RabbitMQTest
>>> ```
>>>
>>>
>>> On Thu, Nov 26, 2020 at 10:40 AM Tellier Benoit <btell...@apache.org>
>>> wrote:
>>>
>>>> Done
>>>>
>>>> On 26/11/2020 at 16:25, Jean Helou wrote:
>>>> > hi all,
>>>> >
>>>> > As you know I started a PR to setup jenkins CI, the latest iteration sees
>>>> > the compilation of the project complete in 5 minutes ( thanks to T1C) but
>>>> > the tests fail to initialize docker containers with the disastrous
>>>> > consequences you can imagine :D
>>>> >
>>>> > I opened https://issues.apache.org/jira/browse/INFRA-21144 to ask if it is
>>>> > possible to have the docker service enabled on some nodes, since I am not
>>>> > official member of the project I think it may be useful if you chimed in on
>>>> > the ticket to confirm that this is a legitimate request.
>>>> >
>>>> > Best regards,
>>>> > Jean
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
>>>> For additional commands, e-mail: server-dev-h...@james.apache.org
>>>>
>>>>