András Bokor created HDFS-17920:
-----------------------------------
Summary: TestDiskError.testShutdown can run into an infinite loop
Key: HDFS-17920
URL: https://issues.apache.org/jira/browse/HDFS-17920
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode
Reporter: András Bokor
We found that, when running the JUnit tests, TestDiskError.testShutdown runs for
a long time without finishing and consumes all available storage space. The log
file was around 11 GB in our run, and it grows further if the container is
given more space.
Since the log file is huge and the test can apparently run indefinitely, I
suspect there is an infinite loop somewhere in the test.
I checked which loops exist [in the test
file|https://github.com/apache/hadoop/blob/734dd8a67cd6df56b59ff75aa43de57834a0d248/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/datanode/TestDiskError.java#L121].
There aren't many and, with one exception, they all run only a few iterations:
{code:java}
DataNode dn = cluster.getDataNodes().get(dnIndex);
for (int i=0; dn.isDatanodeUp(); i++) {
  Path fileName = new Path("/test.txt"+i);
  DFSTestUtil.createFile(fs, fileName, 1024, (short)2, 1L);
  DFSTestUtil.waitReplication(fs, fileName, (short)2);
  fs.delete(fileName, true);
}
{code}
Here, we keep creating and deleting new files until the DataNode (DN) dies. I
don't know how long the replication takes, but based on the file size and the
replication factor of 2, it should happen quickly. This section is suspicious
because if the test doesn't finish quickly (meaning the "bad" DN doesn't shut
itself down), the loop can perform a vast number of file operations, each of
which adds to the log.
I ran a grep on the log file to see how many iterations are executed, and I
found a line like this:
{code:java}
BLOCK* allocate blk_1073970157_229333, replicas=127.0.0.1:34219,
127.0.0.1:39923 for /test.txt114166{code}
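The same check can be scripted. The following is only a sketch of one way to do it: the class name IterationCounter and its method are hypothetical and not part of Hadoop, and it assumes each log line carries the file name in the form /test.txtN, as in the line above. Since the loop counter starts at 0, the highest index found is one less than the number of iterations started.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Hypothetical helper (not part of Hadoop): finds the highest /test.txtN index in a log. */
public final class IterationCounter {
  // Matches the file names produced by the test loop, e.g. "/test.txt114166".
  private static final Pattern TEST_FILE = Pattern.compile("/test\\.txt(\\d+)");

  private IterationCounter() {}

  /** Returns the highest N seen in "/test.txtN" across all lines, or -1 if none matches. */
  public static long highestIndex(Stream<String> lines) {
    return lines.mapToLong(line -> {
      Matcher m = TEST_FILE.matcher(line);
      long max = -1;
      while (m.find()) {
        max = Math.max(max, Long.parseLong(m.group(1)));
      }
      return max;
    }).max().orElse(-1);
  }
}
```

For a real log file, the lines could be supplied with java.nio.file.Files.lines(path), which streams the file instead of loading all 11 GB into memory.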
The file index in this line indicates that this single unit test case generates
over a hundred thousand file operations on its own. Based on the log I examined,
which covers a half-hour window, the loop is running about 60 times per second;
I'm not sure that rate is even plausible.
Introducing a pause between iterations plus a timeout would likely help. As the
test is written now, if the feature under test fails you don't get an assertion
error, you get an infinite loop.
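To illustrate the idea of an interval plus a timeout, here is a minimal sketch of a bounded loop. It is not the proposed patch: BoundedWait and runUntil are hypothetical names, the interval and timeout values would need to be chosen for the test, and the loop body is modeled as a Runnable even though the real body throws checked exceptions.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper (not part of Hadoop): repeats a step until a condition holds or a deadline passes. */
public final class BoundedWait {
  private BoundedWait() {}

  /**
   * Runs {@code step} until {@code done} returns true, sleeping {@code intervalMs}
   * after each iteration. Throws TimeoutException if {@code timeoutMs} elapses
   * first, so a failing feature surfaces as a test failure instead of an endless loop.
   *
   * @return the number of iterations executed
   */
  public static int runUntil(BooleanSupplier done, Runnable step,
      long intervalMs, long timeoutMs)
      throws InterruptedException, TimeoutException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    int iterations = 0;
    while (!done.getAsBoolean()) {
      // Overflow-safe comparison of nanoTime values.
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException("Condition not met after " + timeoutMs
            + " ms and " + iterations + " iterations");
      }
      step.run();
      iterations++;
      Thread.sleep(intervalMs);
    }
    return iterations;
  }
}
```

In the test, the condition would be something like () -> !dn.isDatanodeUp() and the step would create, replicate, and delete one file. With a pause of, say, one second, the number of iterations (and log lines) stays proportional to the timeout rather than to how long the CI job is allowed to run.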
*Please note that* in our internal release, this unit test fails because the
faulty DataNode does not shut down. In this ticket, {*}we are not addressing
the root cause of the shutdown failure{*}; instead, we are targeting the
resulting infinite loop and the unnecessarily large log file.
Also, I have set the priority to Critical (even though a unit test failure would
not normally warrant that) because this issue can block the CI process.