[ 
https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849266#comment-16849266
 ] 

Eric Yang commented on HDDS-1554:
---------------------------------

1 {quote}The new tests are missing from the distribution tar file 
(hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/tests/). We agreed to support 
the execution of all the new tests from the final tar.{quote}

Yes, I remember that conversation, and I am not discounting that agreement.  The 
code would need to be rewritten in Python and moved so that it is built before the 
distribution project in order to achieve what we agreed on.  What we would lose in 
the process:
* The ability to accurately pinpoint where an exception occurs, because the Java 
stack trace may not be captured by the Python tests.
* It works against the Maven lifecycle.  Integration tests are supposed to run 
after the package phase has completed, yet we would be shipping extra test 
binaries in the release tarball that are irrelevant in production.
* Time wasted packaging integration test binaries into the release tarball.

2 {quote} I am not sure why we need the normal read/write test. All of the 
smoketests and integration-tests are testing this scenario{quote}

The only difference between this version and the smoke test is that the client is 
not running on the same network as the docker containers.  This has actually 
helped us catch a few bugs, such as SCMCLI client retries and a protobuf 
versioning problem.  It also helps us test the case where the client JDK is 
different from the cluster JDK.  It provides a better testbed for showing what 
data injection into a containerized cluster looks like from external clients.

3 {quote}With the Read/Only test: I don't think that we need to support 
read-only disks. The only question is if the right exception is thrown. I think 
it also can be tested from MiniOzoneCluster / real unit tests in a more 
lightweight way.{quote}

The read-only test prevents disk writes to simulate a misconfigured data 
directory, or a disk that is incorrectly mounted as read-only.  It injects faults 
into the normal workflow by changing a few docker parameters, and it is easy to 
clean up without leaving read-only debris in the build directory.  This area 
needs more expansion.  We can add test cases that focus on making the metadata 
disk or the datanode disk read-only, then measure whether the strained process 
has negative side effects on the cluster and check that replication proceeds 
correctly.
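
As a rough illustration, the read-only variation could be expressed as something 
like the sketch below.  The compose file names, the service name, and the /data 
mount point are assumptions made for illustration only, not the actual layout of 
the fault injection compose files:

{code:bash}
# Sketch only: bring up the cluster with the datanode data volume remounted
# read-only.  The override file name, service name, and /data path are
# hypothetical placeholders.
docker-compose -f docker-compose.yaml \
               -f docker-compose.read-only.yaml up -d

# where docker-compose.read-only.yaml would change nothing but the volume flag:
#   services:
#     datanode:
#       volumes:
#         - ./data:/data:ro
{code}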

4 {quote}Anu Engineer suggested multiple times to do the disk failure injection 
on the java code level where more sophisticated tests can be added (eg. 
generate corrupt read with low probability with using specific 
Input/OutputStream). Can you please explain the design consideration to use 
docker images? Why is it better than the suggested solution?{quote}

We have already done that with AspectJ in HDFS-435.  The work was not fruitful 
and was [proposed for 
removal|https://issues.apache.org/jira/browse/HDFS-6819?focusedCommentId=15235595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15235595].
  The key point of fault injection is to catch exceptions that may not be handled 
correctly.  By randomly adding junk to data files or changing files to read-only, 
the tests exercise the normal routines and generate exceptions that may not have 
been tested as thoroughly.  By using Docker-mounted volumes, we can generate the 
faults outside the normal Java code path, which provides a better opportunity to 
create errors asynchronously.
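
As an example, a corruption fault could be injected from outside the Java code 
path along these lines.  The volume path and chunk file name are placeholders, 
not the actual on-disk layout:

{code:bash}
# Sketch only: flip a few bytes in a container chunk file on the host-mounted
# volume while the cluster is stopped, then restart and read the key back.
# The paths below are placeholders; the real chunk file layout will differ.
docker-compose down
dd if=/dev/urandom of=./data/hdds/chunk-placeholder.bin \
   bs=1 count=16 seek=4096 conv=notrunc
docker-compose up -d
# A subsequent read of the affected key should surface a checksum or I/O error.
{code}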

> Create disk tests for fault injection test
> ------------------------------------------
>
>                 Key: HDDS-1554
>                 URL: https://issues.apache.org/jira/browse/HDDS-1554
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: build
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>         Attachments: HDDS-1554.001.patch
>
>
> The current plan for the fault injection disk tests is:
>  # Scenario 1 - Read/Write test
>  ## Run docker-compose to bring up a cluster
>  ## Initialize scm and om
>  ## Upload data to Ozone cluster
>  ## Verify data is correct
>  ## Shutdown cluster
>  # Scenario 2 - Read/Only test
>  ## Repeat Scenario 1
>  ## Mount data disk as read only
>  ## Try to write data to Ozone cluster
>  ## Validate error message is correct
>  ## Shutdown cluster
>  # Scenario 3 - Corruption test
>  ## Repeat Scenario 2
>  ## Shutdown cluster
>  ## Modify data disk data
>  ## Restart cluster
>  ## Validate error message for read from corrupted data
>  ## Validate error message for write to corrupted volume


