Hello Jean,

Mistake on my part indeed; I did not intend to make this private, so I'm pushing it back to the public ML. Thanks :)

Thanks for taking a look.

Interesting, the diagnoses we seem to be able to get with Develocity!

Regards,

Rene.

On 2/23/24 15:37, Jean Helou wrote:
hello rene,

I'm not sure whether you answered privately on purpose, so I'm answering in the same manner, but I do think this conversation could benefit from being public :D

I looked into the testcontainers issue you mention and damn, what a shitty design. There were mentions of changing it when introducing the ImagePullPolicy, but I reviewed the MR and it wasn't changed, and it still isn't. The effort to work around it is quite significant (but doable); I might PoC it after my vacation.

However, going through the details of the errors using Develocity, the cassandra tests fail far more often because of

Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S

than because of a docker init error.
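
To make that kind of comparison concrete, here is a minimal sketch (plain Java, nothing Develocity-specific) of tallying failure messages by the exception class named in their first "Caused by:" line. The class name FailureCauses, the helper countByCause and the sample messages are made up for illustration; the last sample is only a placeholder for a docker init error, not a real log line.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FailureCauses {
    // Matches "Caused by: <fully.qualified.ExceptionClass>" and captures the class name.
    private static final Pattern CAUSED_BY = Pattern.compile("Caused by: ([\\w.$]+)");

    // Counts failure messages per exception class found in their first "Caused by:" entry.
    // Messages without any "Caused by:" entry are counted under "unknown".
    static Map<String, Integer> countByCause(List<String> failureMessages) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String message : failureMessages) {
            Matcher matcher = CAUSED_BY.matcher(message);
            String cause = matcher.find() ? matcher.group(1) : "unknown";
            counts.merge(cause, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical input: the timeout line is the one quoted above,
        // the last line is a stand-in for a docker init error.
        List<String> messages = List.of(
            "Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S",
            "Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT2S",
            "Caused by: java.lang.NoClassDefFoundError: Could not initialize class (placeholder)");
        System.out.println(countByCause(messages));
    }
}
```

As a side note, PT2S is an ISO-8601 duration, i.e. 2 seconds.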

Looking around, I also found that:
- cassandra, opensearch, s3 and ldap are the subsystems that show errors
- rabbit and pulsar, for instance, seem unaffected (no failures over 3 weeks of builds). There may be fewer integration tests for these 2, but this raises questions about the volume of integration tests done for the others :p


For now the volume of tests is a bit low, so I need to include all PR builds to get some trends. Over time we will be able to target master only to get a better view of what's going on.

jean

On Thu, Feb 22, 2024 at 04:12, Rene Cordier <rcord...@apache.org> wrote:

    Hi Jean,

    Thanks for the heads up on your work on this, very interesting.

    Regarding the test failures, to add a bit more insight: I never really
    took the time to fully look into it, but yes, the issue of test
    containers not starting with a "could not initialize some class" error
    in the log has been going on for a while, and is plaguing our builds.

    As I think Quan noticed a while ago, this issue opened on the
    testcontainers java backlog seems related:
    https://github.com/testcontainers/testcontainers-java/issues/1872

    It seems this error is misleading and hides the real one. It could be
    related to infra, or maybe something really tricky in our build. We might
    need to take the time to dig deeper at some point, but it looks like a
    tricky one to me :)

    Hope this bit of info helps somehow!

    Regards,

    Rene.

    On 2/21/24 18:01, Jean Helou wrote:
    > Hi fellows,
    >
    > The Develocity <https://ge.apache.org> integration of james-project build
    > has now been running for long enough that we can start seeing interesting
    > things beyond the raw build scans.
    > First we can take a quick look at build trends
    > <https://ge.apache.org/scans/trends?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.tags=master&search.tasks=clean%2Cinstall&search.timeZoneId=Europe%2FParis#>
    > for the `clean install` command on master (the build stage). It's hard to
    > derive actual trends since
    > - we only have 15 days of history
    > - we run on runners with varying cpu power (build scans show T1C going from
    > 16 to 24)
    > but it can be used to establish a baseline so we can compare later on.
    > On average this stage took 9min10s.
    > The week from Feb 5 to Feb 11 had a lot more spread: the 5th-95th
    > percentile range was 4m54s to 11m26s.
    > Last week was a bit more homogeneous, with the same percentiles being
    > 7m54 - 10m18.
    > This week so far looks more similar to last week than to the previous one:
    > 7m12 - 9m4.
    >
    > Adding the local build cache added about 14s of overhead and provided no
    > benefits on CI (completely expected, maybe I should disable local caching
    > on CI since it spawns in a fresh env every time) but this will be
    > interesting to compare with the remote build cache.
    >
    > The tests
    > <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.sortField=FLAKY>
    > monitoring is also interesting. This screen monitors 2 things:
    > - which tests fail the most often
    > - which tests are flaky
    > To measure this I enabled retries for surefire in this commit
    > <https://github.com/apache/james-project/commit/39a50194> so that a failing
    > test is retried once.
    > Failed means both tries failed.
    > Flakiness is derived by looking at retries (1 fail followed by 1 success).
    >
    > The results probably won't surprise you too much:
    >
    > The test suite with most failures (11% of the runs) is
    > org.apache.james.blob.cassandra.CassandraBlobStoreTest
    > <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.container=org.apache.james.blob.cassandra.CassandraBlobStoreTest>
    > The test suite with most flakiness (6% of the runs) is
    > org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest
    > <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.container=org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest&tests.sortField=FLAKY>
    > The longest running test suite is
    > org.apache.james.backends.rabbitmq.RabbitMQTest
    > <https://ge.apache.org/scans/tests?search.relativeStartTime=P28D&search.rootProjectNames=*james*&search.timeZoneId=Europe%2FParis&tests.container=org.apache.james.backends.rabbitmq.RabbitMQTest&tests.sortField=MEAN_DURATION>
    > at 4min27s
    >
    > Top ten failing test suites
    > org.apache.james.blob.cassandra.CassandraBlobStoreTest
    > org.apache.james.blob.cassandra.cache.CachedBlobStoreTest
    > org.apache.james.blob.cassandra.CassandraBucketDAOTest
    > org.apache.james.blob.cassandra.CassandraDefaultBucketDAOTest
    > org.apache.james.blob.cassandra.CassandraPassTroughBlobStoreTest
    > org.apache.james.blob.cassandra.CassandraBlobStoreClOneTest
    > org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest
    > org.apache.james.blob.cassandra.cache.CassandraBlobStoreCacheTest
    > org.apache.james.blob.objectstorage.aws.S3MinioTest
    > org.apache.james.blob.objectstorage.aws.S3BlobStoreDAOTest
    >
    > Top ten flaky doesn't completely overlap
    > org.apache.james.blob.cassandra.CassandraBlobStoreDAOTest
    > org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryWithLocalPartAsLoginNameTest$WhenEnableVirtualHosting$LocalPartLogin
    > org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryWithLocalPartAsLoginNameTest$WhenEnableVirtualHosting$OneAppPasswordLogin
    > org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryTest$WhenDisableVirtualHosting
    > org.apache.james.blob.cassandra.CassandraBlobStoreTest
    > org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryWithLocalPartAsLoginNameTest$WhenEnableVirtualHosting
    > org.apache.james.blob.cassandra.CassandraPassTroughBlobStoreTest
    > org.apache.james.user.ldap.LdapHealthCheckTest
    > org.apache.james.user.ldap.ReadOnlyUsersLDAPRepositoryTest$WhenEnableVirtualHosting
    > org.apache.james.webadmin.integration.rabbitmq.vault.RabbitMQDeletedMessageVaultIntegrationTest
    >
    > you can follow all the details on the apache develocity :D
    >
    > Cheers
    > jean
    >
