[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently
[ https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775162#comment-17775162 ]

Ayush Saxena commented on HDFS-17224:
-------------------------------------

HDFS has a parallel-tests profile, which is used in Jenkins to speed up tests. AFAIK that uses Maven parallel execution:
[https://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html#parallel-test-execution]

That doc says:
{noformat}
The important thing to remember with the parallel option is: the concurrency happens within the same JVM process. That is efficient in terms of memory and execution time, but you may be more vulnerable towards race conditions or other unexpected and hard to reproduce behavior.
{noformat}

There is more below that as well, which I didn't read in full. Some tests in HDFS, like TestDatanodeMetrics, are annotated with {{NotThreadSafe}} as well; I will try to find the reasons.

I may be overthinking the idea of a test running in parallel. If that is what is happening, I believe it should be quite rare. It could just be some test's poor cleanup, and maybe just a cleanup can fix things.

> TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently
> ----------------------------------------------------------------------------
>
>                 Key: HDFS-17224
>                 URL: https://issues.apache.org/jira/browse/HDFS-17224
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: dfsadmin, test
>    Affects Versions: 3.4.0
>            Reporter: Steve Loughran
>            Priority: Major
>
> TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing because the
> static mbean isn't null. This is inevitably related to the fact that in test
> runs, the jvm is reused and so the mbean may be present from a previous test
> -maybe one which didn't clean up.
> it does not fail standalone

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently
[ https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775028#comment-17775028 ]

Steve Loughran commented on HDFS-17224:
---------------------------------------

good analysis btw
[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently
[ https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775027#comment-17775027 ]

Steve Loughran commented on HDFS-17224:
---------------------------------------

Is HDFS running its tests in the same process? The AWS module's parallel tests have a pool of JVMs but execute each JUnit suite sequentially within the pool, so contamination is generally limited to cached FS instances or, again, some other static state.

Doing pre-emptive cleanup is good. I had to add one to the latest AWS uploads test, which didn't clean up pending uploads from the previous run if that run stopped partway through, and that then broke the next run.
[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently
[ https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775005#comment-17775005 ]

Ayush Saxena commented on HDFS-17224:
-------------------------------------

Well, two tests failed in the same class:
[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4996/21/testReport/org.apache.hadoop.hdfs/TestRollingUpgrade/]

The first one failed here:
[https://github.com/apache/hadoop/blob/85af6c3a2850ffa0d3216bb62c19c55ab6e4dba3/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestRollingUpgrade.java#L134]

This is a kind of precheck: rolling upgrade had never been kicked off at that point. It was the first command, issued with an illegal argument, and the line above confirms that it failed on the CLI (as expected). So this MBean is coming from somewhere else.

Checking both the tests which failed: both failed with the MBean not being null. The first one didn't have a GenericTestUtils.waitFor; the other had one, since HDFS-16336 added a wait, so there the same exception shows up a bit further down. That wait was added for the same exception as in this ticket, but it looks like it wasn't just some latency.

An interesting thing to observe: the two tests that failed each use their own MiniDfsCluster.
[From the first one|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4996/21/testReport/org.apache.hadoop.hdfs/TestRollingUpgrade/testDFSAdminRollingUpgradeCommands/]
{noformat}
(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long,contents={blockPoolId=BP-1679863569-172.17.0.2-1696910973814, createdRollbackImages=true, finalizeTime=0, startTime=1696910977372})>
{noformat}

[From the second one|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4996/21/testReport/org.apache.hadoop.hdfs/TestRollingUpgrade/testRollback/]
{noformat}
(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long,contents={blockPoolId=BP-1679863569-172.17.0.2-1696910973814, createdRollbackImages=true, finalizeTime=0, startTime=1696910977372})>
{noformat}

Both these tests have their own MiniDfsCluster, *yet the exception carries the same {{blockPoolId}} and {{startTime}}.*

So, as [~ste...@apache.org] mentioned, this is either some other test's poor cleanup (finding which one would be a bit time-consuming or tough, IMO), or some test is running in parallel and messing things up :(

I haven't played with these MBeans much, but maybe before starting the test we could check whether the MBean is registered and, if so, unregister it. That may solve the problem if it is poor cleanup by some test, though it would be tough to confirm whether it does or not... But if two tests are running in parallel and each does a rollingUpgrade, then it won't help...
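
The pre-check idea above (see whether the MBean is already registered before the test starts, and unregister it if so) could be sketched roughly as follows. This is only an illustration against the JVM-wide platform MBeanServer, not code from the HDFS tests: the class {{StaleMBeanCleanup}}, the {{DemoStatus}} interface and the {{Example:...}} ObjectName are hypothetical stand-ins, and the real bean name would have to be taken from TestRollingUpgrade.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

/** Sketch only: every class, interface and bean name here is hypothetical. */
final class StaleMBeanCleanup {

  /** Stand-in management interface; not the real NameNode bean. */
  public interface DemoStatus {
    long getStartTime();
  }

  /** Plays the role of an earlier test that registers a bean and never unregisters it. */
  static void simulateLeakyTest(String name) {
    try {
      DemoStatus status = new DemoStatus() {
        @Override
        public long getStartTime() {
          return System.currentTimeMillis();
        }
      };
      // The platform MBeanServer is shared by everything in this JVM,
      // so this registration outlives the "test" that created it.
      ManagementFactory.getPlatformMBeanServer()
          .registerMBean(new StandardMBean(status, DemoStatus.class), new ObjectName(name));
    } catch (JMException e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Pre-emptive cleanup: unregister the named bean if something left it behind.
   * Returns true if a stale registration was found and removed.
   */
  static boolean unregisterIfPresent(String name) {
    try {
      MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
      ObjectName objectName = new ObjectName(name);
      if (!mbs.isRegistered(objectName)) {
        return false;
      }
      mbs.unregisterMBean(objectName);
      return true;
    } catch (JMException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String name = "Example:service=Demo,name=Leftover"; // placeholder ObjectName
    simulateLeakyTest(name);
    System.out.println("stale bean removed: " + unregisterIfPresent(name));
    System.out.println("second pass found one: " + unregisterIfPresent(name));
  }
}
```

In a test class a call like {{unregisterIfPresent}} would presumably sit in a setup method. The check and the unregister are two separate calls, so this only guards against leftovers from an earlier test in a reused JVM; it does nothing for two tests doing rolling upgrade concurrently.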

I think there is an annotation like {{@NotThreadSafe}}; a test annotated with it should run alone in a single thread, which may help, if I read this doc right:
[https://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html#parallel-test-execution-and-single-thread-execution]