[ https://issues.apache.org/jira/browse/HDDS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882358#comment-16882358 ]
Eric Yang commented on HDDS-1554:
---------------------------------

[~elek] Thank you for the thorough review. Here are my answers:

{quote}There are no clear separation between the two areas and it makes it impossible to run the same tests in any other environment. As an example: existing smoketests can be easily executed in kubernetes.{quote}

The StartCluster/StopCluster methods only use docker-compose to start a distributed cluster at this time. They can be extended later to launch an Ozone cluster in other environments, such as a YARN service, to keep the scope of the current feature set in check.

The smoke test only runs in a closed environment, with no exposure to an outside network. The Ozone CLI client works inside the docker container, which prevents the tests from interacting with the Ozone cluster from an external network. Embedding the Robot Framework in the docker container also poses the risk of inseparable test artifacts that cannot be removed from the container later. Test artifacts make up 400+ MB of the current docker image, which is 100% more bloated than necessary. Even if the test artifacts are separated out later, we have no way to be sure the docker image functions properly without them, because the tests only work inside docker containers.

The fault injection tests map the ports manually to the external network, so today they can provide better coverage for testing the docker network against an external environment. The smoke test could be modified to work with an external network, but that would effectively double the installation of the test tool chain on the host system and double the shell script configuration templating needed to run the Robot Framework on the host. Given that maven is already a dev tool chain set in stone for Hadoop, and with a minimalist mindset of making the best use of that tool chain, I chose to stay in the maven environment and let the existing tool chain do the style and error checking for these tests.
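To illustrate the StartCluster/StopCluster idea described above, here is a minimal sketch of wrapping the docker-compose lifecycle in plain Java so the maven tool chain can drive it. The class and method names are hypothetical, not the actual patch code, and it assumes docker-compose is on the PATH with a compose file in the given directory:

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical docker-compose based cluster lifecycle helper (sketch only). */
public class ClusterLifecycle {
  private final File composeDir;

  public ClusterLifecycle(File composeDir) {
    this.composeDir = composeDir;
  }

  /** Builds the docker-compose command line for a lifecycle action. */
  static List<String> composeCommand(String action) {
    if (action.equals("up")) {
      // Detached start so the test method regains control immediately.
      return Arrays.asList("docker-compose", "up", "-d");
    }
    return Arrays.asList("docker-compose", action);
  }

  private int run(List<String> cmd) throws IOException, InterruptedException {
    ProcessBuilder pb = new ProcessBuilder(cmd);
    pb.directory(composeDir);
    pb.inheritIO();  // surface compose output in the test log
    return pb.start().waitFor();
  }

  public void startCluster() throws IOException, InterruptedException {
    if (run(composeCommand("up")) != 0) {
      throw new IOException("docker-compose up failed in " + composeDir);
    }
  }

  public void stopCluster() throws IOException, InterruptedException {
    run(composeCommand("down"));  // best-effort teardown
  }
}
```

Because the compose command construction is separated from process execution, the same helper could later grow a different backend (for example a YARN service launcher) without changing the test methods that call it.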
I respect your approach of writing the smoke test in shell + Robot Framework, but the fault injection tests can perform more efficiently with the help of the Java tool chain, imho.

{quote}Using `@FixMethodOrder(MethodSorters.NAME_ASCENDING)` is especially error-prone and shows that we use unit test in a way which is different from the normal usage (usually the @Test methods should be independent){quote}

There is nothing wrong with using this feature; it was implemented based on popular request. It is developers who do not know how to use the feature correctly that create problems for themselves. Since you would rather not use this annotation, I will group the entire flow into one test case.

{quote}B.) The current tests uses the approach to use external mounts to reuse the /data directory. It introduces new problems (see the workarounds with the id/UID). All of them can be avoided with using internal docker volumes which makes it possible to achieve the same without additional workarounds (ITDiskReadOnly doesn't require direct access to the volume ITDiskCorruption.addCorruption can be replaced with simple shell executions `echo "asd" > ../db`){quote}

There are two possible cases where data becomes read-only: B.1) the disk is mounted as read-only, or B.2) the data file is read-only. It would be nice to have distinct error messages that tell the system administrator either to adjust how the data disk is mounted, or that an error was made when copying data while servicing the server. The ITReadOnly test focuses on case B.1 and can be expanded to case B.2 if necessary. Using an internal path cannot clearly test case B.1.

{quote}C.) The unit test files contains a lot of code duplication inside and outside the cluster. For example ITDiskReadOnly.startCluster and ITDiskReadOnly.stopCluster are almost the same. The logic of waiting for the safe mode is duplicated in all of the subprojects.{quote}

Good point, I will clean this up in the next patch.

{quote}D.) I would use JUnit assertion everywhere.
For me combining java and junit assertions are confusing (ITDiskReadOnly.startCluster/stopCluster) E.) I would use Autoclosable try-catch blocks (or at least finally blocks) to close opened resources. It would help to define the boundaries where the resources are used.{quote}

Setup procedures are not tests. It is common to throw IOException and InterruptedException from JUnit setup methods to handle exceptions; this is why they are written that way. You are correct that a try-with-resources block is nicer for closing file system resources, and I will make the change.

{quote}F.) We don't need to catch `Exceptions` and `fail` as it's already handled by JUnit framework. Just remove try and catch block.{quote}

I like to be able to identify the starting point of the error in the log; this is why I use try/catch with additional string logging of the stack trace. It helps to jump to the starting point of the error rather than reading the entire stack trace to locate the line number of the test case. Sometimes the stack trace is so long that the actual line number of the test case becomes hidden. I think it is more efficient for problem determination, but I will remove the try/catch if it looks silly to you.

{quote}G.) You don't need to extend TestCase with JUnit4.{quote}

Ok, it was faster to have the IDE autogenerate the setup/tearDown methods by extending TestCase, but this inconsequential detail can be removed.

{quote}H.) There are two kind of configuration handling: the environment variable based (usually with the help docker-config) and the file based (where real xml files are mounted). The two of them are mixed which are confusing.{quote}

The envtoconf.py script does not generate core-site.xml when there is no change to core-site.xml. We now require some key/value pairs in core-site.xml, so the reason to mount this resource externally may have faded away. I will clean this up.

{quote}I.)
We don't need to duplicate the same configuration keys multiple times (docker-config-dn1 docker-config-dn2, docker-config-dn3). the common keys can be loaded from one docker-config file and the special keys (container ipc and ratis ipc ports) cam be adjusted directly from the docker-compose (Similar to the environment variable ENSURE_OM_INITIALIZED which is defined only for the OM){quote}

This is the same point [~arp] raised. I will clean this up.

{quote}G.) There are RAT violations in some of the ozone-site.xml files.{quote}

I will fix this.

{quote}How can we be sure that the first cluster is started properly before shutting it down? Or do we need a real start or just the initialization of the om / scm?{quote}

This is for initialization only. The logic can be smarter; I will revise it for the next patch.

{quote}I.) It's not clear for me what is the goal for these tests.{quote}

ITDiskReadWrite: see HDDS-1771. ITDiskCorruption: we can define a grace period within which disk sync should happen, to ensure only a minimal amount of data can be lost. The test case can be improved to validate that if corruption persists over n seconds, error detection should occur.

{quote}ITDiskReadOnly: it checks the use case when the disk is read only, which is fine. But I think it's more interesting to test random disk failures / random disk slowness. I know that other followup jira-s are opened, I think they can be more realistic.{quote}

This ticket does not intend to cover every test in one patch. It is a broad starting point that structures the test area; it may require patience for the detailed tests to be developed.

{quote}J.) ITDiskCorruption.listFiles can be simplified:{quote}

This is a placeholder. There is probably a specific set of files to go after when refining the tests; I think the data blocks and rocksdb are good starting points. Any suggestions other than listing all files?

{quote}K.) ITDiskCorruption.addCorruption Are you sure that File.listFiles are recursive?
It may return with an empty list of files are the real files are in the `metadata/scm-container.db` not in the `scm-container.db` directory. Can you please add an assertion if you have at least one file which are corrupted?{quote}

It currently only corrupts omMetrics. I need to develop a better filter for J.

{quote}M.) @Before/@After can help to start/stop cluster before/after each tests in ITDiskReadOnly.{quote}

I will make this change where it applies.

{quote}Here we have no information about the IOException which is checked. It could be an exception from the out.write or an exception from the fs.exists(dest). Unfortunately it's not clear what is the expected behavior.{quote}

This does not seem to be in place yet in the application code for proper fault detection; therefore, it is a generic IOException at this time. It will be revised when I know exactly which IOException to catch.

Thanks again for the review. I will clean up accordingly.

> Create disk tests for fault injection test
> ------------------------------------------
>
>         Key: HDDS-1554
>         URL: https://issues.apache.org/jira/browse/HDDS-1554
>     Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>    Reporter: Eric Yang
>    Assignee: Eric Yang
>    Priority: Major
>      Labels: pull-request-available
> Attachments: HDDS-1554.001.patch, HDDS-1554.002.patch, HDDS-1554.003.patch, HDDS-1554.004.patch, HDDS-1554.005.patch, HDDS-1554.006.patch, HDDS-1554.007.patch, HDDS-1554.008.patch, HDDS-1554.009.patch, HDDS-1554.010.patch, HDDS-1554.011.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The current plan for the fault injection disk tests is:
> # Scenario 1 - Read/Write test
> ## Run docker-compose to bring up a cluster
> ## Initialize scm and om
> ## Upload data to Ozone cluster
> ## Verify data is correct
> ## Shutdown cluster
> # Scenario 2 - Read/Only test
> ## Repeat Scenario 1
> ## Mount data disk as read only
> ## Try to write data to Ozone cluster
> ## Validate error message is correct
> ## Shutdown cluster
> # Scenario 3 - Corruption test
> ## Repeat Scenario 2
> ## Shutdown cluster
> ## Modify data disk data
> ## Restart cluster
> ## Validate error message for read from corrupted data
> ## Validate error message for write to corrupted volume

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)