[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-757950935 > Thanks for the update, @amahussein. +1. Thanks for the review @iwasakims and @goiri This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-756533357 I did an experiment on the original `TestDistributedShell` if you run ```bash mvn test -Dtest=TestDistributedShell#testDSShellWithoutDomainV1_5; mvn test -Dtest=TestDistributedShell#testDSShellWithoutDomainV2 ``` Then the testDSShellWithoutDomainV2 will fail with the same exception. It seems that there was an old bug and not introduced by my changes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-756531850 @iwasakims and @goiri Do you have insights what could the problem be ? When you run TestDSTimelineV10 -> then TestDSTimelineV20. things will be successful. Only TestDSTimelineV20 fails to initialize the RM if it runs immediately after TestDSTimelineV15 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-756530407 > The latest commit seemed to introduce errors. > > ``` > $ mvn test -Dtest='TestDS*' > ... > [ERROR] Errors: > [ERROR] TestDSTimelineV20>DistributedShellBaseTest.setup:159->DistributedShellBaseTest.setupInternal:476 » YarnRuntime > [ERROR] TestDSTimelineV20>DistributedShellBaseTest.setup:159->DistributedShellBaseTest.setupInternal:476 » YarnRuntime > [ERROR] TestDSTimelineV20>DistributedShellBaseTest.setup:159->DistributedShellBaseTest.setupInternal:476 » YarnRuntime > [ERROR] TestDSTimelineV20>DistributedShellBaseTest.setup:159->DistributedShellBaseTest.setupInternal:476 » YarnRuntime > [ERROR] TestDSTimelineV20>DistributedShellBaseTest.setup:159->DistributedShellBaseTest.setupInternal:476 » YarnRuntime > [ERROR] TestDSTimelineV20>DistributedShellBaseTest.setup:159->DistributedShellBaseTest.setupInternal:476 » YarnRuntime > [INFO] > [ERROR] Tests run: 40, Failures: 0, Errors: 6, Skipped: 0 > ``` Hey @iwasakims! I was looking into this failure. TestDSTimelineV20 fails when it runs after TestDSTimelineV15. The exception is thrown as follows ``` at org.apache.hadoop.yarn.server.timelineservice.storage.FileSystemTimelineWriterImpl.mkdirsWithRetries(FileSystemTimelineWriterImpl.java:214) at org.apache.hadoop.yarn.server.timelineservice.storage.FileSystemTimelineWriterImpl.serviceStart(FileSystemTimelineWriterImpl.java:188) ``` This could be something that missing in the configurations of the unitTest or a bug in either timelineserviceV1.5 or timelineserviceV2.0 I have verified that all the configurations are maintained when the code is split. The reason this failure appears only now could be due to the fact that having all the tests in one file did not let V1.5 tests run immediately before V2.0 tests. I haven't figured it yet, but probably this would need filing another jira. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-756136730 > Thank for working on this, @amahussein. LGTM overall pending some nits. > > While it is too late here, it is hard to follow which part of the TestDistributedShell is modified if splitting the class and refactoring are mixed up in a single commit. Doing one thing in one PR makes reviewing and cherry-picking easier. Thanks for the review @iwasakims and @goiri ! Just quick point. Can you please use "start review" feature instead of sending separate comments? single comments do not indicate whether the reviewer is done with his reviews or not. Therefore, I could start a new commit before receiving all the feedback. Thanks again guys. I understand that the PR was not straightforward to review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-756134654 > > * The implementation has lots of code redundancy. > > * It is inefficient in the setup and tearing down. The large percentage of time execution is exhausted by starting cluster and stopping the services. > > I think there is a gap between the PR and the above description of [YARN-10553](https://issues.apache.org/jira/browse/YARN-10553). Just splitting the TestDistributedShell does not reduce code nor minicluster ramp-up time. It would be nice to update the JIRA description reflecting what is addressed. I will update the Jira description. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-756132738 > Thank for working on this, @amahussein. LGTM overall pending some nits. > > While it is too late here, it is hard to follow which part of the TestDistributedShell is modified if splitting the class and refactoring are mixed up in a single commit. Doing one thing in one PR makes reviewing and cherry-picking easier. That's good point @iwasakims . Sorry that I made it hard. After my first commit, I received feedback from Inigio to transform busy-waiting loops into lambda. It is my bad I haven't managed to separate between refactoring and splitting. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-755797255 > > @goiri the code changes fix {{testDSShellWithEnforceExecutionType}}. > > The problem with the test was that it launches two containers executing cmd `date`. Apparently the two containers would exit fast. The unit test will stay blocked waiting for the containers to be exactly "2". > > This does not take into consideration that the containers count is 3 including the AMContainer. > > The fix was to get rid of the equality in the check, and change the application command to `ls` > > That makes sense, does it make sense to make the assert more general than an equality instead of getting rid of the equality? In `DistributedShellBaseTest.waitForContainersLaunch()`, I used inequality `if (containers.size() < nContainers) { return false; }` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-755767916 CC: @iwasakims . This includes a fix to `testDSShellWithEnforceExecutionType` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org
[GitHub] [hadoop] amahussein commented on pull request #2581: YARN-10553. Refactor TestDistributedShell
amahussein commented on pull request #2581: URL: https://github.com/apache/hadoop/pull/2581#issuecomment-755765315 @goiri the code changes fix {{testDSShellWithOpportunisticContainers}}. The problem with the test was that it launches two containers executing cmd `date`. Apparently the two containers would exit fast. The unit test will stay blocked waiting for the containers to be exactly "2". This does not take into consideration that the containers count is 3 including the AMContainer. The fix was to get rid of the equality in the check, and change the application command to `ls` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: common-issues-h...@hadoop.apache.org