[ 
https://issues.apache.org/jira/browse/HADOOP-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631029#comment-17631029
 ] 

Szilard Nemeth commented on HADOOP-15327:
-----------------------------------------

Hi,
CC: [~gandras], [~shuzirra], [~weichiu]

Let me summarize what kind of testing I performed to make sure this change 
won't cause any regression.
The project that helped me the most with the testing is called 
[Hades|https://github.com/9uapaw/hades].
Kudos to [~gandras] for the initial work on the Hades project.
h1. TL;DR

*Hades was the framework I used to run my testcases.*
*All testcases passed both with the trunk version of Hadoop (which is not 
surprising at all) and with the deployed Hadoop version that includes my Netty 
upgrade patch.*
*See the attached test logs for details.*
*Also see the details below about what Hades is, how I tested, why I chose 
certain configurations for the testcases, and more.*
*Now I'm fairly confident that this patch won't break anything, so I'm waiting 
for reviewers.*
----
h1. HADES IN GENERAL
h2. What is Hades?

Hades is a CLI tool that provides a common interface across various Hadoop 
distributions. It is a collection of the commands most frequently used by 
developers of Hadoop components.

Hades supports [Hadock|https://github.com/9uapaw/docker-hadoop-dev], [Cloudera 
Data Platform|https://www.cloudera.com/products/cloudera-data-platform.html] 
and the standard upstream distribution.
h2. Basic features of Hades
 - Discover cluster: Stores where individual YARN / HDFS daemons are running.
 - Distribute files on certain nodes
 - Get config: Prints configuration of selected roles
 - Read logs of Hadoop roles
 - Restart: Restarts certain roles
 - Run an application on the defined cluster
 - Status: Prints the status of the cluster
 - Update config: Updates properties in a config file for selected roles
 - YARN specific commands
 - Run script: Runs user-defined custom scripts against the cluster.

h1. CLUSTER + HADES SETUP
h2. Run Hades with the Netty testing script against a cluster

First of all, I created a standard cluster and deployed Hadoop to it.
Side note: later on, the installation steps that deploy Hadoop on the cluster 
could become part of Hades as well.

It's worth mentioning that I have a [PR with netty-related 
changes|https://github.com/9uapaw/hades/pull/6] against the Hades repo.
The branch of this PR is 
[this|https://github.com/szilard-nemeth/hades/tree/netty4-finish].

[Here are the 
instructions|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/README.md#set-up-hades-on-a-cluster-and-run-the-netty-script]
 for how to set up and run Hades with the Netty testing script.
h1. THE NETTY TESTING SCRIPT

The Netty testing script [lives 
here|https://github.com/szilard-nemeth/hades/blob/netty4-finish/script/netty4.py].
As you can see from the code, quite a lot of work has been done to make sure 
the Netty 4 upgrade won't break anything and won't cause any regression, since 
the ShuffleHandler is a crucial part of MapReduce.
h2. CONCEPTS
h3. Test context

Class: Netty4TestContext

The test context provides a way to encapsulate a base branch and a patch file 
(if any) applied on top of the base branch.
The context can enable or disable Maven compilation.
The context can also define checks to ensure that the compilation and the 
deployment of the new jars were successful on the cluster.
Currently, it can verify that certain lines appear in the daemon logs, making 
sure the deployment was okay.
The main purpose of a context is to compare its results with the results of 
other contexts.
For the Netty testing, it was evident that I needed to make sure the trunk 
version and my version (with the patch applied on top of trunk) work the same, 
i.e. there's no regression.
This is what the context concept was created for.
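To give a feel for what a context holds, here is a minimal Python sketch 
(illustrative only; the field names are my assumptions, the real 
Netty4TestContext in netty4.py may differ):
{code:python}
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative sketch only -- not the actual Netty4TestContext class.
@dataclass
class TestContextSketch:
    name: str                           # e.g. "trunk" or "trunk + netty4 patch"
    base_branch: str                    # branch the build is based on
    patch_file: Optional[str] = None    # patch applied on top of the base branch, if any
    compile_with_maven: bool = True     # whether Hades should rebuild Hadoop for this context
    expected_log_lines: List[str] = field(default_factory=list)  # lines that must appear in the
                                                                 # daemon logs to prove the new
                                                                 # jars were actually deployed

# Two contexts are compared against each other: plain trunk vs. trunk + patch.
trunk_ctx = TestContextSketch(name="trunk", base_branch="trunk")
patch_ctx = TestContextSketch(name="trunk + netty4 patch", base_branch="trunk",
                              patch_file="netty4-upgrade.patch")  # placeholder file name
{code}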
h3. Testcase

Class: Netty4Testcase

In general, a testcase has a name, a simple (short) name, some config changes 
(a dictionary of string keys and values) and one MR application.
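Roughly, it can be pictured like this (again just a sketch with assumed field 
names, not the real Netty4Testcase class):
{code:python}
from dataclasses import dataclass
from typing import Dict

# Illustrative sketch only -- field names are assumptions.
@dataclass
class TestcaseSketch:
    name: str                       # full, unique testcase name
    simple_name: str                # short name, e.g. "keepalive"
    config_changes: Dict[str, str]  # config property -> value, applied on top of the defaults
    app: str                        # the single MR application to run, e.g. "sleep" or "loadgen"
{code}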
h3. Test config: Config options for running the tests

Class: Netty4TestConfig

These are the main config options for the Netty testing.
I won't go into too much detail, as I defined a ton of options along the way.
You can check all the config options 
[here|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L655-L687]
h3. Compiler

As mentioned above, Hades can compile Hadoop with Maven and replace the changed 
jars / Maven modules on the cluster.
This is particularly useful for the Netty testing: since I was interested in 
whether the patch causes any issues, I had to compile Hadoop with my Netty 
patch, deploy the jars on the cluster, run all the tests and see all of them 
pass.
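Conceptually, this build-and-deploy step boils down to something like the 
sketch below (hedged illustration; the jar locations, hosts and module globbing 
are my placeholders, the real logic lives in the Hades compiler code):
{code:python}
import glob
import subprocess
from typing import List

# Sketch: rebuild the Hadoop source tree and push the rebuilt MapReduce jars
# to every cluster host. Paths and hosts are placeholders, not Hades internals.
def build_and_deploy(hadoop_src: str, hosts: List[str]) -> None:
    # Standard Maven build of Hadoop, skipping the unit tests to save time.
    subprocess.run(["mvn", "clean", "install", "-DskipTests"], cwd=hadoop_src, check=True)

    # Collect the rebuilt MapReduce jars and copy them to each host over scp.
    jars = glob.glob(f"{hadoop_src}/hadoop-mapreduce-project/**/target/*.jar", recursive=True)
    for host in hosts:
        subprocess.run(["scp", *jars, f"{host}:/opt/hadoop/share/hadoop/mapreduce/"], check=True)
{code}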
h2. TESTCASES

The testcases are defined with the help of the Netty4TestcasesBuilder. You can 
find all the testcases 
[here|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L739-L766]

Let's take the "keepalive" builder as an example 
([CODE|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L761-L765]).
This builder sets the config property 
'mapreduce.shuffle.connection-keep-alive.enable' to 'true'.
The builder also sets the config property 
'mapreduce.shuffle.connection-keep-alive.timeout' to either 15 or 25.
One other "dimension" of the TC builder is the applications added for the 
testcases.
Now, all the builders are using the default apps: the sleep and the loadgen 
jobs.
So back to the "keepalive" builder, it generates 4 testcases:

1. 
mapreduce.shuffle.connection-keep-alive.enable: true
mapreduce.shuffle.connection-keep-alive.timeout: 15
job: sleep

2. 
mapreduce.shuffle.connection-keep-alive.enable: true
mapreduce.shuffle.connection-keep-alive.timeout: 15
job: loadgen

3. 
mapreduce.shuffle.connection-keep-alive.enable: true
mapreduce.shuffle.connection-keep-alive.timeout: 25
job: sleep

4. 
mapreduce.shuffle.connection-keep-alive.enable: true
mapreduce.shuffle.connection-keep-alive.timeout: 25
job: loadgen

As more configuration options are provided to the builder, the number of 
combinations of config values and apps grows, so more testcases are generated.
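The underlying idea is a plain cartesian product of the configured property 
values and the default apps; here is a minimal sketch of what the "keepalive" 
builder expands to (the property names are the real ones, everything else is 
illustrative):
{code:python}
from itertools import product

# Sketch: how the "keepalive" builder expands into 4 testcases.
keep_alive_enable = ["true"]
keep_alive_timeout = ["15", "25"]
default_apps = ["sleep", "loadgen"]

testcases = []
for enable, timeout, app in product(keep_alive_enable, keep_alive_timeout, default_apps):
    testcases.append({
        "config_changes": {
            "mapreduce.shuffle.connection-keep-alive.enable": enable,
            "mapreduce.shuffle.connection-keep-alive.timeout": timeout,
        },
        "app": app,
    })

print(len(testcases))  # 4 testcases: {15, 25} x {sleep, loadgen}
{code}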

For reference, here are the full commands for the loadgen and sleep jobs:
{code:java}
sudo -u systest /opt/hadoop/bin/yarn jar 
/opt/hadoop/share/hadoop/mapreduce/*hadoop-mapreduce-client-jobclient-*-tests.jar
 loadgen  -m 4 -r 3 -outKey org.apache.hadoop.io.Text -outValue 
org.apache.hadoop.io.Text

sudo -u systest /opt/hadoop/bin/yarn jar 
/opt/hadoop/share/hadoop/mapreduce/*hadoop-mapreduce-client-jobclient-*-tests.jar
 sleep  -m 1 -r 1 -mt 10 -rt 10
{code}
h2. CONFIGURATIONS

When I finished the coding phase of the Netty 4 upgrade patch, I carefully 
checked all shuffle-related configs in the codebase.
Here's my own note for the related configurations: 
[https://www.evernote.com/l/ADRNp3Ls3glMepLVxq9ihZMg7KXG4PdXsjc]
Based on my patch, in the testcases I tried to change those configs that could 
be affected by my change and might therefore be vulnerable to regressions.

Note: From the YARN codebase, I had to copy all the shuffle-related configs 
with their default values. Those are defined 
[here|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L57-L77].

Each testcase just changes some of these values from the defaults.
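In other words, the effective config of a testcase is the set of copied 
defaults with the testcase's overrides merged on top. A small sketch of that 
merge (only two example properties are shown, with what I believe are their 
defaults; the full list is in the script):
{code:python}
from typing import Dict

# Sketch: defaults copied from the codebase, overridden per testcase.
DEFAULT_SHUFFLE_CONFIGS: Dict[str, str] = {
    "mapreduce.shuffle.connection-keep-alive.enable": "false",
    "mapreduce.shuffle.max.connections": "0",
}

def effective_config(testcase_changes: Dict[str, str]) -> Dict[str, str]:
    merged = dict(DEFAULT_SHUFFLE_CONFIGS)  # start from the defaults
    merged.update(testcase_changes)         # apply only what the testcase overrides
    return merged
{code}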
h2. TESTING WORKFLOW

The class called 'Netty4RegressionTestDriver' executes the whole testing 
workflow.
The [class 
itself|https://github.com/szilard-nemeth/hades/blob/c16e95393ecf3e787e125c58d88ec2dc6a44b9e0/script/netty4.py#L1583-L1639]
 is not huge; I tried to keep it short for clarity.
I won't go into too many details here; if you are interested, feel free to 
check the code.

The driver itself is really just a driver, all the steps are encapsulated in 
methods of the class called 'Netty4RegressionTestSteps'.

The basic workflow of the test suite is the following:
{code:java}
1. The driver iterates over each test context.
        1.2. Initialization of the current context is performed.
        1.3. If the context requires branch / patch setup, it is performed.
        1.4. If the context requires compilation with Maven, it is executed.
        1.5. The driver then iterates over all testcases of the current context.
                1.5.1. Initialization of the current testcase is performed.
                1.5.2. SSL setup is performed if required, according to 
settings.
                1.5.3. Restart all YARN services / daemons
                1.5.4. All the default configs are loaded (yarn-site.xml, 
mapred-site.xml, core-site.xml, etc.) and saved to the local machine
                1.5.5. The config changes defined by the current testcase are 
applied and deployed to the cluster hosts. Note: For netty, it was enough to 
manipulate NodeManager configs.
                1.5.6. Set log levels: For our case, the ShuffleHandler's log 
level is set to DEBUG.
                1.5.7. Restart all YARN services again, with the updated 
configs. Separate restart logs are saved to local machine for NodeManagers.
                1.5.8. Verify log levels: Check whether step 1.5.6. was 
performed correctly and the right log levels are set.
                1.5.9. Verify configs: Some NM configs are checked to make sure 
that the changes from step 1.5.5. are in effect.
                1.5.10. Starting to collect YARN daemon logs
                1.5.11. Run the MR application defined by the testcase and 
collect test results: app logs, container logs, application final status, etc.
                1.5.12. Ensure the Hadoop version is correct: This is done by 
grepping for certain log lines, if the testcase defined this check
                1.5.13. Write result files to local disk: YARN logs, testcase 
config files, etc.
                1.5.14. Make sure that the YARN daemon logs are not empty, fail 
otherwise
                1.5.15. Finalize testcase data files on local disk: Whether to 
decompress them, etc.
                1.5.16. Finalize testcase
        1.6 Finalize context: Print all generated files, final report of 
testcases (TC name along with their final status)
2. Compare results: Compare the results of the 2 defined contexts. If any 
difference is detected, it is considered a failure.
{code}
The possible app states are PASSED, FAILED and TIMED OUT; of course, only 
PASSED is accepted.
It is configurable what the driver does when a FAILED or TIMED OUT testcase is 
detected.
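A rough sketch of that decision logic plus the final context comparison (the 
names are my assumptions, not the actual Netty4RegressionTestDriver code):
{code:python}
from enum import Enum
from typing import Dict

class AppStatus(Enum):
    PASSED = "PASSED"
    FAILED = "FAILED"
    TIMED_OUT = "TIMED OUT"

# Sketch: configurable behaviour when a testcase did not pass.
def handle_result(tc_name: str, status: AppStatus, halt_on_failure: bool) -> None:
    if status is AppStatus.PASSED:
        return
    message = f"Testcase {tc_name} finished with status {status.value}"
    if halt_on_failure:
        raise RuntimeError(message)  # abort the whole test run
    print(f"WARNING: {message}, continuing with the next testcase")

# Sketch of step 2 above: any difference between the per-testcase results of the
# two contexts (trunk vs. trunk + patch) is treated as a regression.
def contexts_match(results_trunk: Dict[str, AppStatus],
                   results_patch: Dict[str, AppStatus]) -> bool:
    return results_trunk == results_patch
{code}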

----
h1. GENERATED FILES / HADES SESSION

Each test run produces a session directory in the Hades working directory that 
is specified in the config file of Hades.
h2. Structure of the Hades session bundle

Inside the Hades session dir, there are separate directories for each Hades 
context. For example: ctx_with_netty_patch_based_on_trunk
For each testcase, there's also a separate directory. For example: 
tc22_keepalive_4_loadgen
The testcase number is just a consecutive number; the directory name also 
contains the short testcase name (keepalive) and the app type (loadgen).
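For illustration, the testcase directory name can be thought of as being 
assembled like this (sketch only; I interpret the "4" in the example as the 
index of the config/app combination within the builder, which is an 
assumption):
{code:python}
# Sketch: composing a directory name like "tc22_keepalive_4_loadgen".
# tc_number: consecutive number of the testcase within the run
# simple_name: short testcase name, e.g. "keepalive"
# variant: assumed index of the config/app combination within the builder
# app: the MR application type, e.g. "loadgen"
def testcase_dir_name(tc_number: int, simple_name: str, variant: int, app: str) -> str:
    return f"tc{tc_number}_{simple_name}_{variant}_{app}"

print(testcase_dir_name(22, "keepalive", 4, "loadgen"))  # tc22_keepalive_4_loadgen
{code}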
h3. What's inside the testcase directory?

1. YARN daemon logs for all hosts: 
Example:
{code:java}
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/SecurityAuth-systest.audit
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-historyserver-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.1
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-namenode-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-secondarynamenode-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.5
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.2
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-namenode-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-historyserver-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.3
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out.4
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-secondarynamenode-ccycloud-1.snemeth-netty.root.hwx.site.log
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-resourcemanager-ccycloud-1.snemeth-netty.root.hwx.site.out
./RM_daemonlogs_ccycloud-1.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-1.snemeth-netty.root.hwx.site.out

...
...
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.3
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.log
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.4
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.5
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.2
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.log
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/SecurityAuth-systest.audit
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.out.2
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.out
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out.1
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-nodemanager-ccycloud-2.snemeth-netty.root.hwx.site.out
./NM_daemonlogs_ccycloud-2.snemeth-netty.root.hwx.site/hadoop-systest-datanode-ccycloud-2.snemeth-netty.root.hwx.site.out.1
{code}
The YARN daemon logs are only kept for the last testcase to save some space, as 
each testcase's daemon logs would just be a subset of the latest daemon logs.

2. Initial configs for all hosts (before testcase):
{code:java}
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_core-site.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_log4j.properties
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_mapred-site.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-client.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-server.xml
./initial_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_yarn-site.xml
...
...
{code}
3. Testcase configs for all hosts:
{code:java}
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_core-site.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_log4j.properties
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_mapred-site.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-client.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_ssl-server.xml
./testcase_config/ccycloud-2.snemeth-netty.root.hwx.site:35380_yarn-site.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_core-site.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_log4j.properties
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_mapred-site.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_ssl-client.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_ssl-server.xml
./testcase_config/ccycloud-3.snemeth-netty.root.hwx.site:33809_yarn-site.xml
...
...
{code}
4. Application logs for all hosts, for all containers:

Note: The example below just contains one set of logs for one particular 
container on one host.
{code:java}
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/prelaunch.err
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/prelaunch.out
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/stderr
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/stdout
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000041/syslog
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000042
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000042/directory.info
./application_1667933094368_0001_ccycloud-3.snemeth-netty.root.hwx.site/application_1667933094368_0001/container_1667933094368_0001_01_000042/launch_container.sh
{code}
5. Hades session log: 
This is the execution log of the Hades Netty testing script; the log level is 
set to DEBUG so everything can be tracked easily.

> Upgrade MR ShuffleHandler to use Netty4
> ---------------------------------------
>
>                 Key: HADOOP-15327
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15327
>             Project: Hadoop Common
>          Issue Type: Sub-task
>            Reporter: Xiaoyu Yao
>            Assignee: Szilard Nemeth
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HADOOP-15327.001.patch, HADOOP-15327.002.patch, 
> HADOOP-15327.003.patch, HADOOP-15327.004.patch, HADOOP-15327.005.patch, 
> HADOOP-15327.005.patch, 
> getMapOutputInfo_BlockingOperationException_awaitUninterruptibly.log, 
> hades-results-20221108.zip, testfailure-testMapFileAccess-emptyresponse.zip, 
> testfailure-testReduceFromPartialMem.zip
>
>          Time Spent: 11.5h
>  Remaining Estimate: 0h
>
> This way, we can remove the dependencies on the netty3 (jboss.netty)


