Hello Jean,
Mistake on my part indeed, it was not my intention to make this private,
so I'm pushing it back to the ML as public. Thanks :)
Thanks for taking a look.
Interesting what diagnoses we can get out of Develocity!
Regards,
Rene.
On 2/23/24 15:37, Jean Helou wrote:
Hello Rene,
I'm not sure whether you answered privately on purpose, so I'm
answering in the same manner, but I do think this conversation could
benefit from being public :D
I looked into the testcontainers issue you mentioned and damn, what a
shitty design. There were mentions of changing it when introducing the
ImagePullPolicy, but I reviewed the MR and it wasn't changed then and
still isn't. The effort to work around it is quite significant (but
doable); I might PoC it after my vacation.
However, going through the details of the errors using Develocity,
the Cassandra tests fail far more often because of
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException:
Query timed out after PT2S
than they fail with a Docker init error.
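For what it's worth, PT2S matches the DataStax Java driver 4.x default
request timeout (basic.request.timeout = 2 seconds). If these really are
load issues on the CI runners, one possible mitigation (just a sketch,
assuming the tests pick up an application.conf override; the 10-second
value is an arbitrary illustration, not a tested recommendation) would be
to raise that timeout:

```hocon
# Sketch of an application.conf override for the DataStax Java driver 4.x.
# The 10s value is an illustrative assumption.
datastax-java-driver {
  basic.request.timeout = 10 seconds
}
```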
Looking around I also found that Cassandra, OpenSearch, S3 and LDAP are
the subsystems that show errors; RabbitMQ and Pulsar, for instance, seem
unaffected (no failures over 3 weeks of builds). There may be fewer
integration tests for those two, but it does raise questions about the
volume of integration testing done for the others :p
For now the volume of test data is a bit low and I need to include all
PR builds to get some trends. Over time we will be able to target master
only to get a better view of what's going on.
jean
Le jeu. 22 févr. 2024 à 04:12, Rene Cordier <rcord...@apache.org> a
écrit :
Hi Jean,
Thanks for the heads up on your work on this, very interesting.
Regarding the test failures, to add a bit more insight: I never really
took the time to fully look into it, but yes, the issue with test
containers not starting because of a "could not initialize some class"
error has been going on for a while and is plaguing our builds.
As I think Quan noticed a while ago, this issue opened on the
testcontainers java backlog seems related:
https://github.com/testcontainers/testcontainers-java/issues/1872
It seems this error is misleading and hides the real one. It could be
related to infra, or to something really tricky in our build. We might
need to take the time to dig deeper at some point, but it looks like a
tricky one to me :)
Hope this bit of info helps somehow!
Regards,
Rene.
On 2/21/24 18:01, Jean Helou wrote:
> Hi fellows,
>
> The Develocity <https://ge.apache.org> integration of the
> james-project build has now been running for long enough that we can
> start seeing interesting things beyond the raw build scans.
> First we can take a short look at the build trends
> <https://ge.apache.org/scans/trends?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.tags=master&search.tasks=clean%2Cinstall&search.timeZoneId=Europe%2FParis#>
> for the `clean install` command on master (the build stage). It's
> hard to derive actual trends since
> - we only have 15 days of history
> - we run on runners with varying CPU power (build scans show T1C go
>   from 16 to 24)
> but it can be used to establish a baseline we can compare against
> later on.
> On average this stage took 9min10s.
> The week from Feb 5 to Feb 11 had a lot more spread: the 5th-95th
> percentile range was 4m54s to 11m26s.
> Last week was a bit more homogeneous, with the same percentiles at
> 7m54s - 10m18s.
> This week so far looks more similar to last week than to the
> previous one: 7m12s - 9m4s.
>
> Adding the local build cache added about 14s of overhead and provided
> no benefit on CI (completely expected; maybe I should disable local
> caching on CI since it spawns in a fresh env every time), but this
> will be interesting to compare with the remote build cache.
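If disabling the local cache on CI turns out to be worthwhile, my
understanding (an assumption based on the Gradle Enterprise Maven
extension's configuration format, not verified against our setup) is that
it can be toggled in .mvn/gradle-enterprise.xml:

```xml
<!-- Sketch: disabling the local build cache for the Gradle Enterprise
     (Develocity) Maven extension. Element names are my assumption from
     the extension's documented configuration format. -->
<gradleEnterprise>
  <buildCache>
    <local>
      <enabled>false</enabled>
    </local>
  </buildCache>
</gradleEnterprise>
```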
>
> The tests monitoring screen
> <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.sortField=FLAKY>
> is also interesting. It monitors two things:
> - which tests fail the most often
> - which tests are flaky
> To measure this I enabled retries for surefire in this commit
> <https://github.com/apache/james-project/commit/39a50194> so that a
> failing test is retried once.
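For reference, the standard surefire knob for this is
rerunFailingTestsCount; a sketch of what such a configuration typically
looks like (I have not checked that the commit uses exactly this form):

```xml
<!-- Sketch: retry each failing test once with maven-surefire-plugin.
     The actual commit may configure this differently. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <rerunFailingTestsCount>1</rerunFailingTestsCount>
  </configuration>
</plugin>
```

The same effect is available from the command line via
-Dsurefire.rerunFailingTestsCount=1.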
> Failed means both tries failed.
> Flakiness is derived from the retries (one failure followed by one
> success).
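To make that classification rule explicit, here is a small illustrative
sketch (my own model of the rule as stated above, not Develocity's actual
implementation):

```java
import java.util.List;

// Illustrative model of the failed/flaky classification described above:
// a failing test is retried once, so each run has one or two attempts.
public class RetryClassifier {
    enum Outcome { PASSED, FAILED, FLAKY }

    // attempts: results of up to two attempts of the same test, in order
    static Outcome classify(List<Boolean> attempts) {
        if (attempts.get(0)) {
            return Outcome.PASSED;  // first try passed, no retry needed
        }
        if (attempts.size() > 1 && attempts.get(1)) {
            return Outcome.FLAKY;   // one failure followed by one success
        }
        return Outcome.FAILED;      // both tries failed
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of(true)));         // PASSED
        System.out.println(classify(List.of(false, true)));  // FLAKY
        System.out.println(classify(List.of(false, false))); // FAILED
    }
}
```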
>
> The results probably won't surprise you too much:
>
> The test suite with the most failures (11% of runs) is
> org.apache.james.blob.cassandra.CassandraBlobStoreTest
> <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.container=org.apache.james.blob.cassandra.CassandraBlobStoreTest>
> The test suite with the most flakiness (6% of runs) is
> org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest
> <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.container=org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest&tests.sortField=FLAKY>
> The longest-running test suite is
> org.apache.james.backends.rabbitmq.RabbitMQTest
> <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.container=org.apache.james.backends.rabbitmq.RabbitMQTest&tests.sortField=MEAN_DURATION>
> at 4min27s
>
> Top ten failing test suites
> org.apache.james.blob.cassandra.CassandraBlobStoreTest
> org.apache.james.blob.cassandra.cache.CachedBlobStoreTest
> org.apache.james.blob.cassandra.CassandraBucketDAOTest
> org.apache.james.blob.cassandra.CassandraDefaultBucketDAOTest
> org.apache.james.blob.cassandra.CassandraPassTroughBlobStoreTest
> org.apache.james.blob.cassandra.CassandraBlobStoreClOneTest
> org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest
> org.apache.james.blob.cassandra.cache.CassandraBlobStoreCacheTest
> org.apache.james.blob.objectstorage.aws.S3MinioTest
> org.apache.james.blob.objectstorage.aws.S3BlobStoreDAOTest
>
> The top ten flaky suites don't completely overlap:
> org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryWithLocalPartAsLoginNameTest$WhenEnableVirtualHosting$LocalPartLogin
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryWithLocalPartAsLoginNameTest$WhenEnableVirtualHosting$OneAppPasswordLogin
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryTest$WhenDisableVirtualHosting
> org.apache.james.blob.cassandra.CassandraBlobStoreTest
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryWithLocalPartAsLoginNameTest$WhenEnableVirtualHosting
> org.apache.james.blob.cassandra.CassandraPassTroughBlobStoreTest
> org.apache.james.user.ldap.LdapHealthCheckTest
> org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryTest$WhenEnableVirtualHosting
> org.apache.james.webadmin.integration.rabbitmq.vault.RabbitMQDeletedMessageVaultIntegrationTest
>
> You can follow all the details on the Apache Develocity instance :D
>
> Cheers
> jean
>