[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently

2023-10-14 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775162#comment-17775162
 ] 

Ayush Saxena commented on HDFS-17224:
-

HDFS has a parallel-tests profile, which is used in Jenkins to speed up tests. 
AFAIK that uses Maven parallel execution: 

[https://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html#parallel-test-execution]

In this doc it is mentioned:
{noformat}
The important thing to remember with the parallel option is: the concurrency 
happens within the same JVM process. That is efficient in terms of memory and 
execution time, but you may be more vulnerable towards race conditions or other 
unexpected and hard to reproduce behavior.
{noformat}
There is more below that as well, which I didn't read in full. Some tests in 
HDFS, like TestDatanodeMetrics, are annotated with {{NotThreadSafe}} as well; 
I will try to find out why.

I may be overthinking the possibility of a test running in parallel; if that 
is happening, I believe it should be quite rare. It could just be poor cleanup 
in some test, and maybe a cleanup alone can fix things.

> TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently
> 
>
> Key: HDFS-17224
> URL: https://issues.apache.org/jira/browse/HDFS-17224
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: dfsadmin, test
>Affects Versions: 3.4.0
>Reporter: Steve Loughran
>Priority: Major
>
> TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing because the 
> static mbean isn't null. This is inevitably related to the fact that in test 
> runs, the jvm is reused and so the mbean may be present from a previous test 
> -maybe one which didn't clean up.
> it does not fail standalone



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently

2023-10-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775028#comment-17775028
 ] 

Steve Loughran commented on HDFS-17224:
---

good analysis btw




[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently

2023-10-13 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775027#comment-17775027
 ] 

Steve Loughran commented on HDFS-17224:
---

is hdfs doing its tests in the same process? aws module parallel tests have a 
pool of jvms but then execute each junit suite sequentially within the pool, so 
contamination is generally limited to cached fs instances or again, some other 
static state.

doing pre-emptive cleanup is good. I had to add one to the latest aws uploads 
test: it didn't clean up pending uploads from a previous run that had stopped 
partway through, which then broke the next run.




[jira] [Commented] (HDFS-17224) TestRollingUpgrade.testDFSAdminRollingUpgradeCommands failing intermittently

2023-10-13 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17775005#comment-17775005
 ] 

Ayush Saxena commented on HDFS-17224:
-

Well two tests failed in the same class:
[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4996/21/testReport/org.apache.hadoop.hdfs/TestRollingUpgrade/]

The first one failed here:
[https://github.com/apache/hadoop/blob/85af6c3a2850ffa0d3216bb62c19c55ab6e4dba3/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestRollingUpgrade.java#L134]

 

This is a kind of precheck: rolling upgrade had never been started at that 
point. It was the first invocation, with an illegal argument, and the line 
above confirmed that the CLI call failed (as expected).

So, this MBean is coming from somewhere else.

I checked both of the tests that failed. Both failed because the MBean was not 
null. The first one didn't have a GenericTestUtils.waitFor; the other did 
(HDFS-16336 added a wait), so there the same exception shows up a bit further 
down. That wait was added for the same exception as in this ticket, but it 
looks like the cause wasn't just some latency.
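
To illustrate why a wait only helps with latency, here is a minimal sketch. It is a self-contained stand-in, not the GenericTestUtils code: the class name, the {{becomesTrueWithin}} helper and all timings are hypothetical.

```java
import java.util.function.Supplier;

public class WaitForSketch {
  /**
   * Polls the check until it returns true or the timeout elapses.
   * Returns true if the condition was met, false on timeout or interrupt.
   * Hypothetical helper, written only for this illustration.
   */
  public static boolean becomesTrueWithin(Supplier<Boolean> check,
      long checkEveryMillis, long timeoutMillis) {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!check.get()) {
      if (System.nanoTime() - deadline >= 0) {
        return false; // timed out: the condition never became true
      }
      try {
        Thread.sleep(checkEveryMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt flag
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    // Latency: the state settles on its own after about 50 ms, so waiting helps.
    boolean late = becomesTrueWithin(
        () -> System.nanoTime() - start > 50_000_000L, 10, 2000);
    // Stale state: nothing will ever clear it, so waiting only delays the failure.
    boolean stale = becomesTrueWithin(() -> false, 10, 200);
    System.out.println("latency case satisfied: " + late);
    System.out.println("stale case satisfied: " + stale);
  }
}
```

If the MBean is leftover state from elsewhere rather than a slow update, the second case applies and a longer timeout would not change the outcome.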

An interesting thing to observe: the two tests that failed each use their own 
MiniDFSCluster.
[From the first one|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4996/21/testReport/org.apache.hadoop.hdfs/TestRollingUpgrade/testDFSAdminRollingUpgradeCommands/]
{noformat}
(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long,contents={blockPoolId=BP-1679863569-172.17.0.2-1696910973814,
 createdRollbackImages=true, finalizeTime=0, startTime=1696910977372})>

{noformat}
[From the second one|https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4996/21/testReport/org.apache.hadoop.hdfs/TestRollingUpgrade/testRollback/]
{noformat}
(itemName=startTime,itemType=javax.management.openmbean.SimpleType(name=java.lang.Long,contents={blockPoolId=BP-1679863569-172.17.0.2-1696910973814,
 createdRollbackImages=true, finalizeTime=0, startTime=1696910977372})>
{noformat}
Both of these tests have their own MiniDFSCluster, *yet the exception shows 
the same {{blockPoolId}} and {{startTime}}.*
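
One way identical values could show up in two tests is that the platform MBean server is JVM-wide, so a bean registered by one test stays visible to any later test in the same JVM. The sketch below only illustrates that mechanism under that assumption; the bean, its name and the classes are hypothetical stand-ins, not the HDFS ones.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class SharedMBeanServerDemo {
  /** Hypothetical management interface standing in for an upgrade-status bean. */
  public interface FakeStatusMBean {
    long getStartTime();
  }

  /** Hypothetical bean; not a real Hadoop class. */
  public static class FakeStatus implements FakeStatusMBean {
    private final long startTime;
    public FakeStatus(long startTime) { this.startTime = startTime; }
    @Override public long getStartTime() { return startTime; }
  }

  // Illustrative object name only; not the name HDFS registers.
  public static final String BEAN_NAME = "Example:service=FakeNameNode,name=FakeStatus";

  /** Reads StartTime from the JVM-wide server, or returns null if no bean is registered. */
  public static Long readStartTime() {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    try {
      ObjectName name = new ObjectName(BEAN_NAME);
      return mbs.isRegistered(name) ? (Long) mbs.getAttribute(name, "StartTime") : null;
    } catch (JMException e) {
      throw new IllegalStateException("could not read " + BEAN_NAME, e);
    }
  }

  /** "Test one": registers a bean and never unregisters it. */
  public static void leakyTest(long startTime) {
    try {
      ManagementFactory.getPlatformMBeanServer()
          .registerMBean(new FakeStatus(startTime), new ObjectName(BEAN_NAME));
    } catch (JMException e) {
      throw new IllegalStateException("could not register " + BEAN_NAME, e);
    }
  }

  public static void main(String[] args) {
    leakyTest(1696910977372L);
    // "Test two" registers nothing itself, yet still finds the first test's value.
    System.out.println("startTime seen by second test: " + readStartTime());
  }
}
```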

 

So, as [~ste...@apache.org] mentioned, it is either poor cleanup in some other 
test (finding which one would be time-consuming or tough, IMO), or there is 
some test running in parallel and messing things up :( 

I haven't played with these MBeans much, but maybe before starting the test we 
could check whether the MBean is registered and, if so, unregister it. That may 
solve the problem if the cause is poor cleanup in some test, though it would be 
tough to confirm whether it does...
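
A minimal sketch of such a pre-test check, using only the standard javax.management API. The class, the helper names and the object name are hypothetical; the real test would have to use whatever name the NameNode bean is registered under, and whether unregistering it is safe there is untested.

```java
import java.lang.management.ManagementFactory;
import javax.management.InstanceNotFoundException;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class StaleMBeanCleanup {
  /** Hypothetical management interface, used only to have something to register. */
  public interface LeftoverMBean {
    int getValue();
  }

  /** Hypothetical bean standing in for state left behind by an earlier test. */
  public static class Leftover implements LeftoverMBean {
    @Override public int getValue() { return 1; }
  }

  // Illustrative object name only; not the name HDFS registers.
  public static final String BEAN_NAME = "Example:service=Demo,name=Leftover";

  /**
   * Unregisters the named bean if it is present on the platform MBean server.
   * Returns true if a stale bean was removed, false if there was nothing to remove.
   */
  public static boolean unregisterIfPresent(String beanName) {
    MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
    try {
      ObjectName name = new ObjectName(beanName);
      if (!mbs.isRegistered(name)) {
        return false;
      }
      mbs.unregisterMBean(name);
      return true;
    } catch (InstanceNotFoundException e) {
      return false; // removed by someone else between the check and the call
    } catch (JMException e) {
      throw new IllegalStateException("could not unregister " + beanName, e);
    }
  }

  /** Simulates an earlier test that forgot to clean up. */
  public static void registerLeftover() {
    try {
      ManagementFactory.getPlatformMBeanServer()
          .registerMBean(new Leftover(), new ObjectName(BEAN_NAME));
    } catch (JMException e) {
      throw new IllegalStateException("could not register " + BEAN_NAME, e);
    }
  }

  public static void main(String[] args) {
    registerLeftover();
    System.out.println("removed stale bean: " + unregisterIfPresent(BEAN_NAME));
    System.out.println("removed again: " + unregisterIfPresent(BEAN_NAME));
  }
}
```

In a JUnit test this kind of call would go in the setup method, before the cluster is started.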

But if two tests are running in parallel & each does rollingUpgrade then it 
won't help...

I think there is an annotation like {{@NotThreadSafe}}; a test annotated with 
it should run alone in a single thread. Maybe that can help, if I read this 
doc right:

[https://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html#parallel-test-execution-and-single-thread-execution]
