[ https://issues.apache.org/jira/browse/HDFS-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225103#comment-15225103 ]
Kihwal Lee commented on HDFS-10178:
-----------------------------------

{{TestHFlush}}: HDFS-2043. Will review the patch. The JDK8 failures don't have logs, so they are hard to debug.

{{TestDFSClientRetries}}: timed out. Tried to restart the namenode, but it timed out. Without seeing the log, it's hard to know what went wrong.

{{TestReplication}}: timed out. Datanode shutdown hung at netty shutdown.
{noformat}
java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.interrupt(Native Method)
        at sun.nio.ch.EPollArrayWrapper.interrupt(EPollArrayWrapper.java:317)
        at sun.nio.ch.EPollSelectorImpl.wakeup(EPollSelectorImpl.java:207)
        at io.netty.channel.nio.NioEventLoop.wakeup(NioEventLoop.java:590)
        at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:503)
        at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:160)
        at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:70)
        at org.apache.hadoop.hdfs.server.datanode.web.DatanodeHttpServer.close(DatanodeHttpServer.java:249)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.shutdown(DataNode.java:1863)
{noformat}

{{TestBlockTokenWithDFS}}: the restarted datanode got a bind exception because the old port was still taken.

The test failures are not related to this patch; they pass when run on my machine.
> Permanent write failures can happen if pipeline recoveries occur for the first packet
> -------------------------------------------------------------------------------------
>
>                 Key: HDFS-10178
>                 URL: https://issues.apache.org/jira/browse/HDFS-10178
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>            Priority: Critical
>         Attachments: HDFS-10178.patch, HDFS-10178.v2.patch, HDFS-10178.v3.patch, HDFS-10178.v4.patch, HDFS-10178.v5.patch
>
>
> We have observed that a write fails permanently if the first packet doesn't go through properly and pipeline recovery happens. If the write op creates a pipeline, but the actual data packet does not reach one or more datanodes in time, the pipeline recovery will be done against the 0-byte partial block. If additional datanodes are added, the block is transferred to the new nodes. After the transfer, each node will have a meta file containing the header and a 0-length data block file. The pipeline recovery seems to work correctly up to this point, but the write fails when the actual data packet is resent.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
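The recovery step in the description can be sketched with a toy model. This is plain Java, not Hadoop code; the class and method names are hypothetical, and it only illustrates why recovering against replicas that never received the first packet leaves every replica at length 0:

```java
// Toy model of the scenario above (NOT Hadoop code; names are hypothetical).
public class PipelineRecoverySketch {

    // During pipeline recovery the replicas are trimmed to a common length.
    // If the first packet never reached some datanodes, that common length
    // is 0, even though one node already had data on disk.
    static long recover(long[] bytesOnDisk) {
        long min = Long.MAX_VALUE;
        for (long b : bytesOnDisk) {
            min = Math.min(min, b);
        }
        return min; // every replica becomes a 0-byte partial block
    }

    public static void main(String[] args) {
        // A pipeline of three datanodes: the first data packet (say 64 KB)
        // reached only the first node in time.
        long[] pipeline = {65536, 0, 0};
        long recovered = recover(pipeline);
        System.out.println("recovered block length = " + recovered); // prints 0
        // After transfer to any added nodes, each replica is just a meta-file
        // header plus a 0-length block file; per this report, resending the
        // first data packet against that state failed permanently.
    }
}
```

The sketch stops where the report does: it shows how the 0-byte state arises, not why the resend then fails, which is what the attached patches address.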