[ https://issues.apache.org/jira/browse/HADOOP-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631029#comment-17631029 ]
Szilard Nemeth commented on HADOOP-15327:
-----------------------------------------

Hi,

CC: [~gandras], [~shuzirra], [~weichiu]

Let me summarize the testing I performed to make sure this change won't cause any regression.
The project that helped me very much with the testing is called [Hades|https://github.com/9uapaw/hades].
Kudos to [~gandras] for the initial work on the Hades project.

h1. TL;DR

*Hades was the framework I used to run my testcases.*
*All testcases passed both with the trunk version of Hadoop (this is not surprising at all) and with the deployed Hadoop version that includes my Netty upgrade patch.*
*See the attached test logs for details.*
*Also see the details below about what Hades is, how I tested, why I chose certain configurations for the testcases, and more.*
*Now I'm pretty confident that this patch won't break anything, so I'm waiting for reviewers.*

----

h1. HADES IN GENERAL

h2. What is Hades?

Hades is a CLI tool that shares a common interface between various Hadoop distributions. It is a collection of commands most frequently used by developers of Hadoop components.
Hades supports [Hadock|https://github.com/9uapaw/docker-hadoop-dev], [Cloudera Data Platform|https://www.cloudera.com/products/cloudera-data-platform.html] and the standard upstream distribution.

h2. Basic features of Hades

- Discover cluster: Stores where individual YARN / HDFS daemons are running.
- Distribute files on certain nodes
- Get config: Prints configuration of selected roles
- Read logs of Hadoop roles
- Restart: Restarting of certain roles
- Run an application on the defined cluster
- Status: Prints the status of the cluster
- Update config: Update properties on a config file for selected roles
- YARN specific commands
- Run script: Runs user-defined custom scripts against the cluster.

h1. CLUSTER + HADES SETUP

h2. Run Hades with the Netty testing script against a cluster

First of all, I created a standard cluster and deployed Hadoop to the cluster.
Side note: Later on, the whole installation that deploys Hadoop on the cluster could be part of Hades as well.

It's worth mentioning that I have a [PR with netty-related changes|https://github.com/9uapaw/hades/pull/6] against the Hades repo.
The branch of this PR is [this|https://github.com/szilard-nemeth/hades/tree/netty4-finish].
[Here are the instructions|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/README.md#set-up-hades-on-a-cluster-and-run-the-netty-script] for how to set up and run Hades with the Netty testing script.

h1. THE NETTY TESTING SCRIPT

The Netty testing script [lives here|https://github.com/szilard-nemeth/hades/blob/netty4-finish/script/netty4.py].
As you can see in the code, quite a lot of work has been done to make sure the Netty 4 upgrade won't break anything and won't cause any regression, as it is a crucial part of MapReduce.

h2. CONCEPTS

h3. Test context

Class: Netty4TestContext

The test context provides a way to encapsulate a base branch and a patch file (if any) applied on top of the base branch.
The context can enable or disable Maven compilation.
The context can also define ways to ensure that the compilation and the deployment of new jars were successful on the cluster. Currently, it can verify that certain log lines appear in the daemon logs, making sure the deployment was okay.
The main purpose of a context is to compare its results with the results of other contexts.
For the Netty testing, it was evident that I needed to make sure the trunk version and my version with the patch applied on top of trunk work the same, i.e. there's no regression.
For this, I created the context.

h3. Testcase

Class: Netty4Testcase

In general, a testcase can have a name, a simple name, some config changes (dictionary of string keys, string values) and one MR application.

h3. Test config: Config options for running the tests

Class: Netty4TestConfig

These are the main config options for the Netty testing.
I won't go into too much detail as I defined a ton of options along the way.
You can check all the config options [here|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L655-L687]

h3. Compiler

As mentioned above, Hades can compile Hadoop with Maven and replace the changed jars / Maven modules on the cluster.
This is particularly useful for the Netty testing: I was interested in whether the patch causes any issues, so I had to compile Hadoop with my Netty patch, deploy the jars on the cluster, run all the tests and see all of them passing.

h2. TESTCASES

The testcases are defined with the help of the Netty4TestcasesBuilder.
You can find all the testcases [here|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L739-L766]

Let's take "keepalive" as an example: [CODE|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L761-L765]
This builder sets the config property 'mapreduce.shuffle.connection-keep-alive.enable' to 'true'.
The builder also sets the config property 'mapreduce.shuffle.connection-keep-alive.timeout' to either 15 or 25.
One other "dimension" of the TC builder is the set of applications added for the testcases.
Currently, all the builders use the default apps: the sleep and the loadgen jobs.
So, back to the "keepalive" builder, it generates 4 testcases:

1. mapreduce.shuffle.connection-keep-alive.enable: true
   mapreduce.shuffle.connection-keep-alive.timeout: 15
   job: sleep
2. mapreduce.shuffle.connection-keep-alive.enable: true
   mapreduce.shuffle.connection-keep-alive.timeout: 15
   job: loadgen
3. mapreduce.shuffle.connection-keep-alive.enable: true
   mapreduce.shuffle.connection-keep-alive.timeout: 25
   job: sleep
4. mapreduce.shuffle.connection-keep-alive.enable: true
   mapreduce.shuffle.connection-keep-alive.timeout: 25
   job: loadgen

As more configuration options are provided for the builder, the number of combinations of config values and apps grows, so more testcases are generated.

For reference, here are the full commands for the loadgen and sleep jobs:
{code:java}
sudo -u systest /opt/hadoop/bin/yarn jar /opt/hadoop/share/hadoop/mapreduce/*hadoop-mapreduce-client-jobclient-*-tests.jar loadgen -m 4 -r 3 -outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text
sudo -u systest /opt/hadoop/bin/yarn jar /opt/hadoop/share/hadoop/mapreduce/*hadoop-mapreduce-client-jobclient-*-tests.jar sleep -m 1 -r 1 -mt 10 -rt 10
{code}

h2. CONFIGURATIONS

When I finished the coding phase of the Netty 4 upgrade patch, I carefully checked all shuffle-related configs in the codebase.
Here's my own note for the related configurations: [https://www.evernote.com/l/ADRNp3Ls3glMepLVxq9ihZMg7KXG4PdXsjc]
Based on my patch, I tried to change those configs in the tests that could be sensitive to, and possibly affected by, my change.

Note: From the YARN codebase, I had to copy all the shuffle-related configs with their default values.
Those are defined [here|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L57-L77].
Each testcase just changes some of these values from the defaults.

h2. TESTING WORKFLOW

The class called 'Netty4RegressionTestDriver' executes the whole testing workflow.
The [class itself|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L1583-L1639] is not so huge and I tried to keep it short for clarity.
We won't go into too many details here; if you are interested in more, feel free to check the code.
The driver itself is really just a driver: all the steps are encapsulated in methods of the class called 'Netty4RegressionTestSteps'.
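To make the testcase generation of the "keepalive" builder from the TESTCASES section more concrete, here is a minimal sketch of how config values and apps could be combined. It is not the actual Netty4TestcasesBuilder code: the names SketchTestcase and build_testcases, and the "<simple name>_<index>_<app>" naming, are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List

ENABLE_KEY = "mapreduce.shuffle.connection-keep-alive.enable"
TIMEOUT_KEY = "mapreduce.shuffle.connection-keep-alive.timeout"


@dataclass
class SketchTestcase:
    """Hypothetical stand-in for a testcase: name, config changes, one app."""
    name: str
    config_changes: Dict[str, str]
    app: str


def build_testcases(simple_name: str,
                    fixed: Dict[str, str],
                    variable: Dict[str, List[str]],
                    apps: List[str]) -> List[SketchTestcase]:
    """Create one testcase per (combination of config values, app) pair."""
    keys = list(variable)
    testcases: List[SketchTestcase] = []
    # Cartesian product over all variable config values ...
    for values in product(*(variable[k] for k in keys)):
        # ... and, for each combination, one testcase per application.
        for app in apps:
            config = dict(fixed)
            config.update(zip(keys, values))
            index = len(testcases) + 1
            # Naming scheme is an assumption for this sketch.
            testcases.append(
                SketchTestcase(f"{simple_name}_{index}_{app}", config, app))
    return testcases


keepalive = build_testcases(
    "keepalive",
    fixed={ENABLE_KEY: "true"},
    variable={TIMEOUT_KEY: ["15", "25"]},
    apps=["sleep", "loadgen"],
)
for tc in keepalive:
    print(tc.name, tc.config_changes[TIMEOUT_KEY], tc.app)
```

With two timeout values and two apps this yields 2 x 2 = 4 testcases, in the same order as the list in the TESTCASES section. Adding one more variable config with three values would already give 12.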
The basic workflow of the test suite is the following:
{code:java}
1. The driver iterates over each test context.
  1.2. Initialization of the current context is performed.
  1.3. If the context requires branch / patch setup, it is performed.
  1.4. If the context requires compilation with Maven, it is executed.
  1.5. The driver then iterates over all testcases of the current context.
    1.5.1. Initialization of the current testcase is performed.
    1.5.2. SSL setup is performed if required, according to settings.
    1.5.3. Restart all YARN services / daemons
    1.5.4. All the default configs are loaded (yarn-site.xml, mapred-site.xml, core-site.xml, etc.) and saved to the local machine
    1.5.5. The config changes defined by the current testcase are applied and deployed to the cluster hosts. Note: For netty, it was enough to manipulate NodeManager configs.
    1.5.6. Set log levels: For our case, the ShuffleHandler's log level is set to DEBUG.
    1.5.7. Restart all YARN services again, with the updated configs. Separate restart logs are saved to local machine for NodeManagers.
    1.5.8. Verify log levels: Check that step 1.5.6. was performed correctly and the right log levels are set.
    1.5.9. Verify configs: Some NM configs are checked to make sure that the changes from step 1.5.5. are in effect.
    1.5.10. Start collecting YARN daemon logs
    1.5.11. Run the MR application defined by the testcase and collect test results: app logs, container logs, application final status, etc.
    1.5.12. Ensure that the Hadoop version is correct: This is done by grepping for certain logs, if the testcase defined this check
    1.5.13. Write result files to local disk: YARN logs, testcase config files, etc.
    1.5.14. Make sure that the YARN daemon logs are not empty, fail otherwise
    1.5.15. Finalize testcase data files on local disk: Whether to decompress them, etc.
    1.5.16. Finalize testcase
  1.6. Finalize context: Print all generated files, final report of testcases (TC name along with their final status)
2. Compare results: Comparing the results of the 2 defined contexts. If any difference is detected, it is considered as a failure.
{code}

The possible app states are: PASSED, FAILED, TIMED OUT.
Of course, only PASSED is accepted.
What the driver does when a FAILED or TIMED OUT testcase is detected is configurable.

----

h1. GENERATED FILES / HADES SESSION

Each test run produces a session directory in the Hades working directory that is specified in the config file of Hades.

h2. Structure of the Hades session bundle

Inside the Hades session dir, there are separate directories for each Hades context. For example: ctx_with_netty_patch_based_on_trunk
For each testcase, there's also a separate directory. For example: tc22_keepalive_4_loadgen
The testcase number is just a consecutive number; the directory name also contains the short name (keepalive) and the app type (loadgen).

h3. What's inside the testcase directory?

1.
YARN daemon logs for all hosts:
Example:
{code:java}
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/SecurityAuth-systest.audit
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-historyserver-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.1
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-namenode-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-secondarynamenode-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.5
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.2
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-namenode-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-historyserver-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.3
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.4
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-secondarynamenode-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-1.snemeth-netty.root.hwx.site.out
...
...
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.3
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.log
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.4
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.5
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.2
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.log
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/SecurityAuth-systest.audit
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.out.2
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.out
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.1
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.out.1
{code}
To save some space, the YARN daemon logs are only kept for the last testcase, as each testcase would otherwise separately contain a subset of the latest daemon logs.

2.
Initial configs for all hosts (before testcase):
{code:java}
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_core-site.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_log4j.properties
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_mapred-site.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-client.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-server.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_yarn-site.xml
...
...
{code}

3. Testcase configs for all hosts:
{code:java}
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_core-site.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_log4j.properties
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_mapred-site.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-client.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-server.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_yarn-site.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_core-site.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_log4j.properties
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_mapred-site.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_ssl-client.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_ssl-server.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_yarn-site.xml
...
...
{code}

4. Application logs for all hosts, for all containers:
Note: The example below just contains one set of logs for one particular container on one host.
{code:java}
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/prelaunch.err
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/prelaunch.out
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/stderr
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/stdout
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/syslog
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000042
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000042/directory.info
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000042/launch_container.sh
{code}

5. Hades session log:
This is the execution log of the Hades Netty testing script, the log level is set to DEBUG so everything can be tracked easily.
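When browsing a session bundle, it can help to split testcase directory names such as tc22_keepalive_4_loadgen back into their parts. The sketch below is only an illustration and not part of Hades: the function name parse_testcase_dir is hypothetical, and interpreting the number before the app type ("4") as the index of the testcase within its builder is my assumption based on the example name alone.

```python
import re
from typing import NamedTuple


class TestcaseDir(NamedTuple):
    number: int      # consecutive testcase number ("22")
    short_name: str  # short name of the testcase ("keepalive")
    index: int       # ASSUMPTION: index of the testcase within its builder ("4")
    app: str         # app type ("loadgen")


# Pattern derived only from the single example name in the text above.
_TC_DIR = re.compile(
    r"^tc(?P<number>\d+)_(?P<short_name>.+)_(?P<index>\d+)_(?P<app>[^_]+)$")


def parse_testcase_dir(dirname: str) -> TestcaseDir:
    """Split a testcase directory name into its components."""
    match = _TC_DIR.match(dirname)
    if match is None:
        raise ValueError(f"Unexpected testcase directory name: {dirname!r}")
    return TestcaseDir(
        number=int(match["number"]),
        short_name=match["short_name"],
        index=int(match["index"]),
        app=match["app"],
    )


parsed = parse_testcase_dir("tc22_keepalive_4_loadgen")
print(parsed)
```

A helper like this makes it easy to group the directories of two contexts by short name and app type before comparing their results.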
> Upgrade MR ShuffleHandler to use Netty4
> ---------------------------------------
>
> Key: HADOOP-15327
> URL: https://issues.apache.org/jira/browse/HADOOP-15327
> Project: Hadoop Common
> Issue Type: Sub-task
> Reporter: Xiaoyu Yao
> Assignee: Szilard Nemeth
> Priority: Major
> Labels: pull-request-available
> Attachments: HADOOP-15327.001.patch, HADOOP-15327.002.patch,
> HADOOP-15327.003.patch, HADOOP-15327.004.patch, HADOOP-15327.005.patch,
> HADOOP-15327.005.patch,
> getMapOutputInfo_BlockingOperationException_awaitUninterruptibly.log,
> hades-results-20221108.zip, testfailure-testMapFileAccess-emptyresponse.zip,
> testfailure-testReduceFromPartialMem.zip
>
> Time Spent: 11.5h
> Remaining Estimate: 0h
>
> This way, we can remove the dependencies on the netty3 (jboss.netty)