[ https://issues.apache.org/jira/browse/HDFS-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13744704#comment-13744704 ]

Vinay commented on HDFS-4504:
-----------------------------

{quote}In a lot of cases when close fails, the NameNode is not reachable. The 
behavior I implemented in the patch is designed to let long-running clients 
handle these transient problems gracefully. Currently, uncloseable files get 
created, and a client restart is needed to get rid of them. For example, you 
might have to restart your Flume daemon, your NFS gateway, etc. The client 
wants to be able to tell when the file is actually closed in case it needs to 
reopen or move it. Currently, there is no way to do this. With this patch, it 
can do this by continuing to call close until it no longer throws an 
exception.{quote}
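(For reference, the retry pattern described in the quote would look roughly like the following; this is just a sketch, and the helper name and retry interval are invented for illustration:)

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;

// Sketch of the retry-close pattern described above: keep calling
// close() until it stops throwing; at that point the file is known
// to be closed. Helper name and 1s retry interval are made up.
static void closeWithRetries(FSDataOutputStream out)
    throws InterruptedException {
  while (true) {
    try {
      out.close();
      return;             // close succeeded; file is actually closed
    } catch (IOException e) {
      Thread.sleep(1000); // NN unreachable or other transient failure
    }
  }
}
{code}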
Anyway, even for a graceful close of the file, the NameNode must be reachable 
after the file is considered a zombie. So Uma and I discussed the following 
possible scenarios and came to the conclusion mentioned in the comment above.

 In case of a pipeline failure, not all datanodes may have the complete block, 
and those that don't will not report it to the NameNode. So even a graceful 
close of the file may fail in such cases. The last block must first be 
recovered on all of those datanodes; only then can the file be closed properly. 
One way to do this is to call {{recoverLease}}, but as todd pointed out, that 
may lead to other problems.
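(For reference, the {{recoverLease}} workaround would look roughly like this; {{DistributedFileSystem.recoverLease}} is a real API, but the polling loop and sleep interval here are just a sketch:)

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Sketch: force recovery of an abandoned file's lease and last block.
// recoverLease() returns true once recovery is complete and the file
// is closed; the polling interval here is arbitrary.
static void forceLeaseRecovery(DistributedFileSystem dfs, Path file)
    throws Exception {
  while (!dfs.recoverLease(file)) {
    Thread.sleep(4000); // wait for block recovery of the last block
  }
}
{code}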

 As you said, for long-living clients such as Flume, recovery of these files 
happens only when the client restarts.

So, instead of the client attempting a graceful close of the file in 
{{ZombieStreamManager}} or {{DFSOS.close()}}, which may fail every time, what 
we propose is as follows:

# In {{ZombieStreamManager}}, periodically check {{DFSClient.filesBeingWritten}} 
for streams which are marked as closed but have not yet been removed from the 
map. These are obviously zombie streams. The check period can be something 
reasonable.
# While handling a zombie stream, {{ZombieStreamManager}} can report to the 
NameNode, via some new RPC, that the stream is a zombie. The NN can then check 
whether the client still holds the lease on that file and, if so, re-assign 
the lease. The target lease holder can be anything; after the hard limit 
expires, the NN will automatically recover the file.
# Once reporting the zombie stream to the NN succeeds, the DFSOS can be 
removed from the {{DFSClient.filesBeingWritten}} map.

So this way, we can avoid more changes in the patch.
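A rough sketch of what the periodic check could look like ({{reportZombieFile}} is the proposed new RPC and does not exist today; {{filesBeingWritten}}, {{namenode}}, and the {{DFSOutputStream}} accessors are simplified stand-ins for {{DFSClient}} internals):

{code}
import java.io.IOException;
import java.util.Iterator;

// Hypothetical sketch of steps 1-3 above, run periodically by
// ZombieStreamManager. reportZombieFile() is the proposed new RPC;
// filesBeingWritten / namenode are stand-ins for DFSClient internals.
void checkForZombieStreams() {
  synchronized (filesBeingWritten) {
    Iterator<DFSOutputStream> it = filesBeingWritten.values().iterator();
    while (it.hasNext()) {
      DFSOutputStream out = it.next();
      if (!out.isClosed()) {
        continue;                           // still live, not a zombie
      }
      try {
        // Step 2: tell the NN the stream is a zombie; the NN re-assigns
        // the lease so that hard-limit expiry recovers the file.
        namenode.reportZombieFile(out.getSrc());
        it.remove();                        // step 3: stop tracking it
      } catch (IOException e) {
        // NN unreachable; keep the stream and retry on the next run.
      }
    }
  }
}
{code}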

{quote}It seems like the case you are concerned about is the case where we fail 
to get the last block, because of a streamer failure. This is already a 
problem, and I don't think this patch makes it worse (although it doesn't make 
it better, either){quote}
I agree, but before your patch the {{close()}} call was never allowed until we 
got the NPE, so there was no issue.
                
> DFSOutputStream#close doesn't always release resources (such as leases)
> -----------------------------------------------------------------------
>
>                 Key: HDFS-4504
>                 URL: https://issues.apache.org/jira/browse/HDFS-4504
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-4504.001.patch, HDFS-4504.002.patch, 
> HDFS-4504.007.patch, HDFS-4504.008.patch, HDFS-4504.009.patch, 
> HDFS-4504.010.patch, HDFS-4504.011.patch, HDFS-4504.014.patch, 
> HDFS-4504.015.patch
>
>
> {{DFSOutputStream#close}} can throw an {{IOException}} in some cases.  One 
> example is if there is a pipeline error and then pipeline recovery fails.  
> Unfortunately, in this case, some of the resources used by the 
> {{DFSOutputStream}} are leaked.  One particularly important resource is file 
> leases.
> So it's possible for a long-lived HDFS client, such as Flume, to write many 
> blocks to a file, but then fail to close it.  Unfortunately, the 
> {{LeaseRenewerThread}} inside the client will continue to renew the lease for 
> the "undead" file.  Future attempts to close the file will just rethrow the 
> previous exception, and no progress can be made by the client.
