[jira] [Commented] (ARTEMIS-3435) ActiveMQTestBase.checkLibaio waits can cause hours to be added to failing test suite

2021-08-20 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/ARTEMIS-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402276#comment-17402276
 ] 

ASF subversion and git services commented on ARTEMIS-3435:
--

Commit 02a829682130e7899f58845baca245b0264ebfcb in activemq-artemis's branch 
refs/heads/main from Robbie Gemmell
[ https://gitbox.apache.org/repos/asf?p=activemq-artemis.git;h=02a8296 ]

ARTEMIS-3435: fail-fast if AIO check fouls, rather than burning hours and still 
failing, and report/log detals to aid analysis


> ActiveMQTestBase.checkLibaio waits can cause hours to be added to failing 
> test suite
> 
>
> Key: ARTEMIS-3435
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3435
> Project: ActiveMQ Artemis
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.18.0
>Reporter: Robbie Gemmell
>Priority: Major
>
> I recently observered a 'full test' run take over an extra 3 hours, 
> essentially doubling the time to 6hrs 40mins (repeated, it took 3hrs 30mins), 
> with nearly 450 failures in the integration-tests module mostly all due to 
> the same 'test did not close all its files 4096' reason.
> Investigating, I found that all the failures appeared linked and were likely 
> caused by a single prior test having done something awry, with the effect 
> then continuing to fail all subsequent tests classes (after they had run 
> their tests, successfully or otherwise) using the same ActiveMQTestBase 
> parent class for the same reason. Before invoking the fail it waits 30 
> seconds for the ActiveMQTestBase.checkLibaio check to assert a value hits 
> 0after the test class, which it repeatedly didnt. This added 400+ 30sec waits 
> to the test run, failing those 400+ test classes.
> I wasnt able to get all the classes in the output summary, the continued test 
> output filled the buffer, but I got a couple snippets:
> {noformat}
> [ERROR] Failures: 
> [ERROR] 
> QuorumFailOverTest.testQuorumVotingLiveNotDead:143->ClusterTestBase.verifyReceiveRoundRobinInSomeOrder:1167->Assert.assertNotNull:713->Assert.assertTrue:42->Assert.fail:89
>  msg must exist
> [ERROR] 
> PluggableQuorumReplicaTimeoutTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedDistributionTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedLargeMessageFailoverTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedLargeMessageWithDelayFailoverTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedPagingFailoverTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> FailoverWithSharedStoreTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] 
> MultiThreadRandomReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> NettyMultiThreadRandomReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] OrderReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] RandomReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> ...lots missing...
> {noformat}
> {noformat}
> ...lots missing...
> [ERROR] 
> FederationBrokerPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] JvmMetricsTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 test 
> did not close all its files 4096
> [ERROR] MetricsPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] MqttPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 test 
> did not close all its files 4096
> [ERROR] OpenwirePluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] 
> ResourceBrokerPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] StompPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] XmlConfigPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] 
> ActiveMQActivationTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 test 
> did not close all its files 4096
> [ERROR] 
> ActiveMQClusteredTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 test 
> did not 

[jira] [Commented] (ARTEMIS-3435) ActiveMQTestBase.checkLibaio waits can cause hours to be added to failing test suite

2021-08-20 Thread Robbie Gemmell (Jira)


[ 
https://issues.apache.org/jira/browse/ARTEMIS-3435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402275#comment-17402275
 ] 

Robbie Gemmell commented on ARTEMIS-3435:
-

The check is operating on a static valid that is incremented and decremented as 
the LibaioContext instances are created and closed. If a single instance fails 
to be cleaned up correct, all subsequent test classes using ActiveMQTestBase 
fail, after 30 seconds of waiting in the cleanup.

The fact it is static means it isnt easy to account around a prior failure as 
change may still happen after a delayed time, and it is hard to identify 
whether it happened in the detecting class or a previous one that perhaps 
doesnt use ActiveMQTestBase, so I think the simplest thing for now is to make 
the detail available the clearest it can be, and fail things as fast as 
possible instead of taking hours extra, helping reduce the vast amount of 
output and narrowing down a window when the issue occurred that could then be 
looked at more closely.

I have prepared a patch so the after class check will now record the failure if 
it is >0 after 1 wait, then future classes will fail immediately without 
running, reporting the class that detected the issue so its logs can be 
checked. Prior to this they will also log if the value was 0 before the class 
started (in case it is a non-ActiveMQTestBase class causing it).

> ActiveMQTestBase.checkLibaio waits can cause hours to be added to failing 
> test suite
> 
>
> Key: ARTEMIS-3435
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3435
> Project: ActiveMQ Artemis
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.18.0
>Reporter: Robbie Gemmell
>Priority: Major
>
> I recently observered a 'full test' run take over an extra 3 hours, 
> essentially doubling the time to 6hrs 40mins (repeated, it took 3hrs 30mins), 
> with nearly 450 failures in the integration-tests module mostly all due to 
> the same 'test did not close all its files 4096' reason.
> Investigating, I found that all the failures appeared linked and were likely 
> caused by a single prior test having done something awry, with the effect 
> then continuing to fail all subsequent tests classes (after they had run 
> their tests, successfully or otherwise) using the same ActiveMQTestBase 
> parent class for the same reason. Before invoking the fail it waits 30 
> seconds for the ActiveMQTestBase.checkLibaio check to assert a value hits 
> 0after the test class, which it repeatedly didnt. This added 400+ 30sec waits 
> to the test run, failing those 400+ test classes.
> I wasnt able to get all the classes in the output summary, the continued test 
> output filled the buffer, but I got a couple snippets:
> {noformat}
> [ERROR] Failures: 
> [ERROR] 
> QuorumFailOverTest.testQuorumVotingLiveNotDead:143->ClusterTestBase.verifyReceiveRoundRobinInSomeOrder:1167->Assert.assertNotNull:713->Assert.assertTrue:42->Assert.fail:89
>  msg must exist
> [ERROR] 
> PluggableQuorumReplicaTimeoutTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedDistributionTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedLargeMessageFailoverTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedLargeMessageWithDelayFailoverTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> PluggableQuorumReplicatedPagingFailoverTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> FailoverWithSharedStoreTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] 
> MultiThreadRandomReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] 
> NettyMultiThreadRandomReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89
>  test did not close all its files 4096
> [ERROR] OrderReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] RandomReattachTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> ...lots missing...
> {noformat}
> {noformat}
> ...lots missing...
> [ERROR] 
> FederationBrokerPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
> test did not close all its files 4096
> [ERROR] JvmMetricsTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 test 
> did not close all its files 4096
> [ERROR] MetricsPluginTest>ActiveMQTestBase.checkLibaio:2061->Assert.fail:89 
>