[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16849266#comment-16849266 ]
Eric Yang commented on HDDS-1554:
---------------------------------

1. {quote}The new tests are missing from the distribution tar file (hadoop-ozone/dist/target/ozone-0.5.0-SNAPSHOT/tests/). We agreed to support the execution of all the new tests from the final tar.{quote}

Yes, I remember that conversation, and I am not discounting that agreement. To achieve what we agreed on, the code would need to be rewritten in Python and moved so that it is built before the distribution project. What we would lose in the process:
* The ability to accurately pinpoint where an exception occurs, because the Java stack trace may not be captured by Python tests.
* Alignment with the Maven lifecycle. Integration tests are supposed to run after the package phase; instead we would be shipping more test binaries in the release tarball that are irrelevant in production.
* Time spent packaging integration test binaries into the release tarball.

2. {quote}I am not sure why we need the normal read/write test. All of the smoketests and integration-tests are testing this scenario{quote}

The only difference between this version and the smoke test is that the client is not running on the same network as the docker containers. This has actually helped us catch a few bugs, such as SCMCLI client retries and a protobuf versioning problem. It also lets us test a client JDK that differs from the cluster JDK, and it provides a better testbed for showing what data injection into a containerized cluster looks like from external clients.

3. {quote}With the Read/Only test: I don't think that we need to support read-only disks. The only question is if the right exception is thrown. I think it also can be tested from MiniOzoneCluster / real unit tests in a more lightweight way.{quote}

The read-only test prevents disk writes in order to simulate a misconfigured data directory, or a disk that is incorrectly mounted read-only.
This injects faults into the normal workflow by changing a few docker parameters, and it is easy to clean up without leaving read-only debris in the build directory. This area needs more expansion: we can add test cases that make the metadata disk or the datanode disk read-only, then measure whether the stressed processes have negative side effects on the cluster, and check that replication proceeds correctly.

4. {quote}Anu Engineer suggested multiple times to do the disk failure injection on the java code level where more sophisticated tests can be added (eg. generate corrupt read with low probability with using specific Input/OutputStream). Can you please explain the design consideration to use docker images? Why is it better than the suggested solution?{quote}

We have already done that with AspectJ in HDFS-435. The work was not fruitful and was [proposed for removal|https://issues.apache.org/jira/browse/HDFS-6819?focusedCommentId=15235595&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-15235595]. The key point of fault injection is to catch exceptions that may not have been handled correctly. By randomly adding junk to a data file, or changing files to read-only, the tests exercise the normal routines to generate exceptions that may not have been tested as fully. By using Docker mounted volumes, we can generate the faults outside the normal Java code path, which provides a better opportunity to create errors asynchronously.
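As a concrete illustration of the "adding junk to a data file" fault, here is a minimal shell sketch on an ordinary local file (the file name and path are hypothetical; in the docker-based tests the target would be a chunk file on the mounted data volume):

```shell
# Create a temporary stand-in for a datanode chunk file (hypothetical name).
workdir=$(mktemp -d)
echo "ozone chunk payload" > "$workdir/chunk.dat"
before=$(sha256sum "$workdir/chunk.dat" | cut -d' ' -f1)

# Overwrite 4 bytes at offset 4 without truncating the file, producing
# silent corruption that checksum validation on read should detect.
printf 'XXXX' | dd of="$workdir/chunk.dat" bs=1 seek=4 conv=notrunc 2>/dev/null

after=$(sha256sum "$workdir/chunk.dat" | cut -d' ' -f1)
# The checksum changed, so a subsequent read through the normal code
# path should surface a corruption error.
[ "$before" != "$after" ] && echo "corruption injected"
```

Because the fault is applied from outside the JVM, the Java code under test is untouched, which is the asynchronous error-creation property described above.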
> Create disk tests for fault injection test
> ------------------------------------------
>
>                 Key: HDDS-1554
>                 URL: https://issues.apache.org/jira/browse/HDDS-1554
>             Project: Hadoop Distributed Data Store
>          Issue Type: Improvement
>          Components: build
>            Reporter: Eric Yang
>            Assignee: Eric Yang
>            Priority: Major
>         Attachments: HDDS-1554.001.patch
>
>
> The current plan for fault injection disk tests are:
> # Scenario 1 - Read/Write test
> ## Run docker-compose to bring up a cluster
> ## Initialize scm and om
> ## Upload data to Ozone cluster
> ## Verify data is correct
> ## Shutdown cluster
> # Scenario 2 - Read/Only test
> ## Repeat Scenario 1
> ## Mount data disk as read only
> ## Try to write data to Ozone cluster
> ## Validate error message is correct
> ## Shutdown cluster
> # Scenario 3 - Corruption test
> ## Repeat Scenario 2
> ## Shutdown cluster
> ## Modify data disk data
> ## Restart cluster
> ## Validate error message for read from corrupted data
> ## Validate error message for write to corrupted volume

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
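Scenario 2's "mount data disk as read only" step could be expressed as a docker-compose override file. A minimal sketch, assuming a service named `datanode` and a host data path (both hypothetical, not taken from the actual compose files):

```yaml
# docker-compose.ro.yaml -- hypothetical override for the read-only scenario.
# The ':ro' suffix remounts the data volume read-only, so every write to
# the data directory fails at the filesystem level without code changes.
version: "3"
services:
  datanode:                        # assumed service name
    volumes:
      - ./data/datanode:/data:ro   # ':ro' makes the mount read-only
```

Removing the override and restarting the cluster restores the writable mount, which is what makes cleanup trivial compared with changing permissions inside the build directory.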