Hello fellow jamers !

The Jenkinsfile in the PR works up to the point where the test suite fails.
The failures come from seemingly "unstable" tests that break because of
timing issues. Benoit fixed the first one in
https://github.com/apache/james-project/pull/267 by disabling read repairs
during consistency checks (I have no idea what it means but it sounds
awesome :) ). I fixed the second one in
https://github.com/apache/james-project/pull/269, where the event bus sender
and receivers were sometimes closed out of order on shutdown, so that
events could be sent to an already closed receiver.
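
To make the ordering problem concrete, here is a minimal sketch. It is not
the code from the PR: Sender, Receiver and ShutdownOrderDemo are
hypothetical stand-ins, and the only point is that the producing side has
to be closed before the consuming side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a receiver: rejects events once closed.
final class Receiver implements AutoCloseable {
    private final List<String> received = new ArrayList<>();
    private boolean closed = false;

    synchronized void onEvent(String event) {
        if (closed) {
            throw new IllegalStateException("event sent to a closed receiver: " + event);
        }
        received.add(event);
    }

    synchronized List<String> received() {
        return new ArrayList<>(received);
    }

    @Override
    public synchronized void close() {
        closed = true;
    }
}

// Hypothetical stand-in for a sender: stops dispatching once closed.
final class Sender implements AutoCloseable {
    private final Receiver receiver;
    private boolean closed = false;

    Sender(Receiver receiver) {
        this.receiver = receiver;
    }

    /** Returns false instead of dispatching when shutdown has started. */
    synchronized boolean send(String event) {
        if (closed) {
            return false;
        }
        receiver.onEvent(event);
        return true;
    }

    @Override
    public synchronized void close() {
        closed = true;
    }
}

final class ShutdownOrderDemo {
    /** Closes the sender first, so nothing can reach a closed receiver. */
    static void shutdown(Sender sender, Receiver receiver) {
        sender.close();
        receiver.close();
    }

    public static void main(String[] args) {
        Receiver receiver = new Receiver();
        Sender sender = new Sender(receiver);
        sender.send("before-shutdown");
        shutdown(sender, receiver);
        // The late event is refused by the sender and never reaches the receiver.
        System.out.println("late event accepted: " + sender.send("after-shutdown"));
        System.out.println("received: " + receiver.received());
    }
}
```

With AutoCloseable resources, try-with-resources gives the same guarantee
for free: resources are closed in the reverse order of their declaration,
so declaring the receiver first and the sender second closes the sender
first.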

After some cleanup, Matthieu recreated a buildable PR, which led to yet
another unstable test in
https://builds.apache.org/blue/organizations/jenkins/james%2FApacheJames/detail/PR-268/1/tests

I started investigating the issue and ended up roping in Matthieu since the
symptoms left me completely puzzled. Matthieu managed to pinpoint the root
cause: an NPE sometimes thrown from within
org.apache.james.server.core.MimeMessageCopyOnWriteProxy, which in turn
triggered further NullPointerExceptions in the mailet pipeline error
handling code.
We finally confirmed a concurrency issue in the reference counting of the
proxy which, if I understand correctly, can lead to unrecoverable data
loss. We wrote a test that triggers it [1] almost deterministically.
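
For readers less familiar with this class of bug, here is a minimal sketch
of a reference counter that stays consistent under concurrent use. It is
neither the proxy's code nor a proposed fix: RefCountedResource and
RefCountDemo are hypothetical names, and the sketch only illustrates the
invariant a refcount has to keep (no lost updates, no resurrection after
the count reaches zero).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, NOT the James implementation: the count lives in an
// AtomicInteger so concurrent acquire/release calls cannot lose updates.
final class RefCountedResource {
    private final AtomicInteger refCount = new AtomicInteger(1); // creator holds one reference
    private volatile boolean disposed = false;

    /** Takes a new reference; returns false if the resource is already disposed. */
    boolean tryAcquire() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                return false; // never resurrect a disposed resource
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Drops a reference; the resource is disposed when the last one goes away. */
    void release() {
        int remaining = refCount.decrementAndGet();
        if (remaining < 0) {
            throw new IllegalStateException("released more often than acquired");
        }
        if (remaining == 0) {
            disposed = true;
        }
    }

    int count() {
        return refCount.get();
    }

    boolean isDisposed() {
        return disposed;
    }
}

final class RefCountDemo {
    /** Runs acquire/release pairs from several threads against one resource. */
    static void hammer(RefCountedResource resource, int threads, int iterations) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) {
                    if (resource.tryAcquire()) {
                        resource.release();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
    }

    public static void main(String[] args) {
        RefCountedResource resource = new RefCountedResource();
        hammer(resource, 8, 100_000);
        System.out.println("references left: " + resource.count());
        resource.release(); // the creator drops its own reference
        System.out.println("disposed: " + resource.isDisposed());
    }
}
```

If the counter were a plain int incremented and decremented without
synchronization, the same run could lose updates and dispose the resource
while another thread still uses it, which is the general shape of the
problem described above.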

Once we had a test to reproduce the race condition, we tried to fix the
issue, only to realize that the fix led to even more concurrency issues.
The rather depressing conclusion we reached yesterday is that the whole
implementation is currently unsound with regard to concurrency. I am unable
to estimate the resolution effort at this point. Matthieu has some ideas
and will work on it (as will I) when time allows.

Which leads me to my current questions: I feel that fixing such
long-standing issues in the test suite is not actually part of configuring
the Apache CI, but I am unsure how to proceed.

Here is what I would like to do at this stage:
- isolate the unstable tests with an "unstable" tag (akin to "feature
tags"),
- exclude these tests from the default surefire execution profile,
- add a parallel pipeline step for these tests where the step failure
doesn't fail the pipeline [2],
- ensure that the build is green,
- merge so the project finally has a working public CI.

I intend to start working on this quickly so we can all enjoy a functional
public CI.

Alternatives:
- Merge the Jenkinsfile only after the whole pipeline has been tested in
the PR branch, which may not happen in the short to medium term...
- Merge as is, which means that many PR builds will end up failing, and
that the last steps (snapshot publish) might fail even once the test suite
succeeds, since they have never run.
- Something I haven't thought of ?

Another issue I want to raise is the availability of the CI builds. As you
have seen from my experiments, the current CI trigger configuration only
builds commits from:
- all branches of the main repository
- all PRs opened from the main repository
- all PRs opened by someone with write access to the main repository

This means that PRs from external contributors will not be built at all.

I tried adding the issueCommentTrigger to the Jenkinsfile, but neither my
comments nor those of someone with commit access were able to trigger the
build.

I think that one of the project members should revise the current settings
to make it possible to build external contributors' PRs one way or another
(only project members have, or can get, access to the Jenkins project
configuration).
Here are two options:
- the easiest and quickest change is to let the CI build every PR. There
are relatively few PRs on James, so the burden on the CI platform
shouldn't be too bad.
- alternatively, it may be possible to configure Jenkins to require a
comment from someone with write access to trigger a build. Unfortunately I
am not certain how to set this up; maybe INFRA can help.

I know this was a long one. I look forward to reading your opinions!
Jean

[1] see
https://github.com/jeantil/james-project/tree/james-3225-concurrency-bug-mimemessagecow
[2] see
https://stackoverflow.com/questions/44022775/jenkins-ignore-failure-in-pipeline-build-step

On Thu, Nov 26, 2020 at 11:22 AM Jean Helou <jean.he...@gmail.com> wrote:

> The good news is that docker does indeed work, the bad news is that the
> tests fail with an issue that's too involved for me :/
>
> [INFO]
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   
> CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails:433
>  Multiple Failures (1 failure)
>       
> Expecting:
>   <[]>
> to contain exactly (and in same order):
>   <[#private:user:INBOX]>
> but could not find the following elements:
>   <[#private:user:INBOX]>
>
> at 
> CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.lambda$deleteMailboxByPathShouldBeConsistentWhenMailboxPathDaoFails$8(CassandraMailboxManagerConsistencyTest$FailuresOnDeletion$DeleteOnce.java:440)
>
> so unless the build for
>
> * 6fab99364a - JAMES-3448 Rewrite links to http://james.apache.org/server/3/ 
> (Mon Nov 23 15:10:36 2020 +0700) <Benoit Tellier> N
>
> is broken which sounds unlikely, I'm going to need help
>
> jean
>
> On Thu, Nov 26, 2020 at 10:53 AM Jean Helou <jean.he...@gmail.com> wrote:
>
>> on a loosely related note : the test suite logs are scary to look at:
>> piles upon piles of stack traces and error logs but the tests actually pass
>> ...
>>
>> On Thu, Nov 26, 2020 at 10:50 AM Jean Helou <jean.he...@gmail.com> wrote:
>>
>>> Thanks benoit,
>>>
>>> Matthieu pointed me to numerous apache projects with jenkinsfiles which
>>> mention docker in
>>> https://github.com/search?q=org%3Aapache++filename%3AJenkinsfile+docker&type=Code
>>> so I'm trying out things based on that
>>>
>>> the logs seem promising so far :
>>> ```
>>>
>>> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
>>> 0.697 s - in 
>>> org.apache.james.backends.rabbitmq.RabbitMQConnectionFactoryTest
>>>         ℹ︎ Checking the system...
>>>         ✔ Docker version should be at least 1.6.0
>>>         ✔ Docker environment should have more than 2GB free disk space
>>> [INFO] Running org.apache.james.backends.rabbitmq.RabbitMQTest
>>> ```
>>>
>>>
>>> On Thu, Nov 26, 2020 at 10:40 AM Tellier Benoit <btell...@apache.org>
>>> wrote:
>>>
>>>> Done
>>>>
>>>> Le 26/11/2020 à 16:25, Jean Helou a écrit :
>>>> > hi all,
>>>> >
>>>> > As you know I started a PR to setup jenkins CI, the latest iteration
>>>> sees
>>>> > the compilation of the project complete in 5 minutes ( thanks to T1C)
>>>> but
>>>> > the tests fail to initialize docker containers with the disastrous
>>>> > consequences you can imagine :D
>>>> >
>>>> > I opened https://issues.apache.org/jira/browse/INFRA-21144 to ask if
>>>> it is
>>>> > possible to have the docker service enabled on some nodes, since I am
>>>> not
>>>> > official member of the project I think it may be useful if you chimed
>>>> in on
>>>> > the ticket to confirm that this is a legitimate request.
>>>> >
>>>> > Best regards,
>>>> > Jean
>>>> >
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: server-dev-unsubscr...@james.apache.org
>>>> For additional commands, e-mail: server-dev-h...@james.apache.org
>>>>
>>>>
