[ https://issues.apache.org/jira/browse/HDFS-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745868#comment-13745868 ]
Vinay commented on HDFS-4504:
-----------------------------

{quote}A DFSOutputStream is a zombie for one of two reasons: 1. The client can't contact the NameNode (perhaps because of a network problem) 2. The client asked the NameNode to complete the file and it refused, because the NN does not (yet?) have a record that all of the file's blocks are present and complete.{quote}

You cannot treat a DataStreamer failure as a non-zombie case. It is the most likely cause, and it can happen frequently too. As already described in the test above, trying to close the file after a DataStreamer failure can lead to a real problem; the forced complete() call fails as well.

{quote}As I said before, the current code doesn't do anything special in the case of a data streamer failure in DFSOutputStream#close. It just throws up its hands and says "oh well, guess that data's gone!" After the hard-lease period expires, we will complete the file anyway. So it's exactly the same behavior with this patch as without it-- only the timeout is different.{quote}

Consider a DataStreamer failure caused by a pipeline failure:
* Without the patch, complete() is never called, so the block state on the NN side is unchanged and lease recovery of the file will succeed. *No data will be lost.*
* With the patch, the complete() call moves the last block to the COMMITTED state, which blocks recovery (and any forced complete) until the block is reported by a DN -- which will never happen, since the pipeline is gone. *So data is lost.* (A toy model of this COMMITTED-block behavior is sketched at the end of this message.)

{quote} This might be a good idea, but we should do it in a future JIRA. This patch is big enough, and changes enough things already.{quote}

Yes, that's correct. To avoid too many changes in this patch itself, we suggest just trying to report the zombie stream to the NN instead of forcing complete() (see the second sketch at the end of this message). This covers all the cases you mentioned.

> DFSOutputStream#close doesn't always release resources (such as leases)
> -----------------------------------------------------------------------
>
> Key: HDFS-4504
> URL: https://issues.apache.org/jira/browse/HDFS-4504
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Attachments: HDFS-4504.001.patch, HDFS-4504.002.patch,
> HDFS-4504.007.patch, HDFS-4504.008.patch, HDFS-4504.009.patch,
> HDFS-4504.010.patch, HDFS-4504.011.patch, HDFS-4504.014.patch,
> HDFS-4504.015.patch, HDFS-4504.016.patch
>
>
> {{DFSOutputStream#close}} can throw an {{IOException}} in some cases. One
> example is if there is a pipeline error and then pipeline recovery fails.
> Unfortunately, in this case, some of the resources used by the
> {{DFSOutputStream}} are leaked. One particularly important resource is file
> leases.
> So it's possible for a long-lived HDFS client, such as Flume, to write many
> blocks to a file, but then fail to close it. Unfortunately, the
> {{LeaseRenewerThread}} inside the client will continue to renew the lease for
> the "undead" file. Future attempts to close the file will just rethrow the
> previous exception, and no progress can be made by the client.
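To make the data-loss argument concrete, here is a minimal, self-contained toy model (plain Java, not HDFS code; ToyBlock, clientComplete and tryRecoverLease are made-up names) of the block-state difference between the two cases:

{code:java}
// Toy model of the NN-side block-state machine discussed above.
// It only illustrates why a COMMITTED last block with no DataNode
// replica report stalls recovery, while an UNDER_CONSTRUCTION block
// can still be recovered from the pipeline DNs.
public class CommittedBlockModel {
    enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

    static class ToyBlock {
        BlockState state = BlockState.UNDER_CONSTRUCTION;
        boolean reportedByDataNode = false;
    }

    // Roughly what a client complete() does to the last block: it becomes
    // COMMITTED, not COMPLETE, until a DN reports the finalized replica.
    static void clientComplete(ToyBlock lastBlock) {
        lastBlock.state = BlockState.COMMITTED;
    }

    // Lease recovery in this model: an UNDER_CONSTRUCTION block can be
    // recovered, but a COMMITTED block must wait for a block report that,
    // after a pipeline failure, never arrives.
    static boolean tryRecoverLease(ToyBlock lastBlock) {
        switch (lastBlock.state) {
            case UNDER_CONSTRUCTION:
                return true;                         // recovery succeeds, no data lost
            case COMMITTED:
                return lastBlock.reportedByDataNode; // stuck: no DN will ever report it
            default:
                return true;
        }
    }

    public static void main(String[] args) {
        ToyBlock withoutPatch = new ToyBlock();      // close() gave up, no complete()
        ToyBlock withPatch = new ToyBlock();
        clientComplete(withPatch);                   // patch forces complete()

        System.out.println("without patch, recovery succeeds: "
                + tryRecoverLease(withoutPatch));    // true
        System.out.println("with patch, recovery succeeds: "
                + tryRecoverLease(withPatch));       // false -> file stuck
    }
}
{code}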
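And a hedged sketch of the suggested alternative: on a dead DataStreamer, close() would release client-side state and report the zombie file to the NN instead of forcing complete(). Everything here (ZombieAwareOutputStream, NameNodeClient, reportZombieFile) is hypothetical and is not the HDFS-4504 patch; it only shows the shape of the idea.

{code:java}
import java.io.IOException;

public class ZombieAwareOutputStream {

    interface NameNodeClient {
        boolean complete(String src) throws IOException;       // normal close path
        void reportZombieFile(String src) throws IOException;  // hypothetical RPC
    }

    private final String src;
    private final NameNodeClient namenode;
    private volatile boolean streamerDead;  // set when the pipeline fails for good
    private boolean closed;

    ZombieAwareOutputStream(String src, NameNodeClient namenode) {
        this.src = src;
        this.namenode = namenode;
    }

    public void close() throws IOException {
        if (closed) {
            return;             // close() must stay idempotent
        }
        closed = true;          // release client-side state even on failure
        stopLeaseRenewal();     // never keep renewing an undead file's lease
        if (streamerDead) {
            // Do NOT call complete(): that would leave the last block
            // COMMITTED on the NN with no DN replica left to report it.
            namenode.reportZombieFile(src);  // let the NN run lease recovery
            return;
        }
        namenode.complete(src);
    }

    private void stopLeaseRenewal() {
        // In the real client this would remove the file from the
        // LeaseRenewer thread's list; omitted in this sketch.
    }
}
{code}

The point of the sketch is the design choice: because the client stops renewing and tells the NN right away, recovery can start immediately instead of waiting for the hard-lease period, and the last block is never pushed into a COMMITTED state that nothing can complete.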