[ https://issues.apache.org/jira/browse/HDFS-8344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697817#comment-14697817 ]

Ravi Prakash commented on HDFS-8344:
------------------------------------

bq. If you take down the cluster and bring it back up. All writing pipeline 
will fail and should fail.
That is correct. This JIRA is for the case where data loss has already 
occurred, i.e. the client died and the DNs it wrote to also died. We are 
trying to recover the lease in this JIRA. My argument was that after the 
client and DNs have died, if there is only a timeout, I could take down the 
cluster. When I bring the cluster back up after the timeout has elapsed, the 
lease will be recovered without ever trying all the DNs.
bq. This is internal implementation details and I'm very reluctant to make it 
configurable 
Perhaps I should have said "internal hard-coded" configuration? Similar to 
{{recoveryAttemptsBeforeMarkingBlockMissing}} in version 8 of the patch.

bq.  Having only one concept for detecting failures (i.e., time out) is simpler 
than two (i.e., time out and number of retries).
Even if it's simpler, there's a chance that recovery is never attempted, and 
that is not acceptable IMHO.
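The difference between the two failure-detection policies can be sketched in a few lines (a simplified, self-contained illustration, not the actual patch; the counter name mirrors {{recoveryAttemptsBeforeMarkingBlockMissing}} from version 8, everything else here is assumed for illustration):

```java
// Sketch: why a retry counter guarantees real recovery attempts while a
// pure timeout does not. Illustrative only; this is not HDFS code.
public class LeaseRecoverySketch {
    // Mirrors recoveryAttemptsBeforeMarkingBlockMissing from patch v08.
    static final int RECOVERY_ATTEMPTS_BEFORE_MARKING_BLOCK_MISSING = 3;

    int attempts = 0;

    // Timeout-only policy: if the whole cluster was down while the clock
    // ran, the deadline can pass with zero recovery attempts ever made.
    boolean timeoutPolicySaysMarkMissing(long nowMs, long deadlineMs) {
        return nowMs >= deadlineMs; // attempts are never consulted
    }

    // Counter policy: the block is only marked missing after the NameNode
    // has actually tried (and failed) recovery the configured number of times.
    boolean counterPolicySaysMarkMissing() {
        return attempts >= RECOVERY_ATTEMPTS_BEFORE_MARKING_BLOCK_MISSING;
    }

    void recordFailedRecoveryAttempt() {
        attempts++;
    }

    public static void main(String[] args) {
        LeaseRecoverySketch s = new LeaseRecoverySketch();
        // Cluster was down for the whole window: no attempts were possible.
        System.out.println(s.timeoutPolicySaysMarkMissing(100, 50)); // true, with 0 attempts
        System.out.println(s.counterPolicySaysMarkMissing());        // false, no attempts yet
        for (int i = 0; i < 3; i++) s.recordFailedRecoveryAttempt();
        System.out.println(s.counterPolicySaysMarkMissing());        // true, after 3 real attempts
    }
}
```

The point of the sketch: under the timeout-only policy the first check returns true even though `attempts == 0`, which is exactly the "recovery is never attempted" case above.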


> NameNode doesn't recover lease for files with missing blocks
> ------------------------------------------------------------
>
>                 Key: HDFS-8344
>                 URL: https://issues.apache.org/jira/browse/HDFS-8344
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.7.0
>            Reporter: Ravi Prakash
>            Assignee: Ravi Prakash
>             Fix For: 2.8.0
>
>         Attachments: HDFS-8344.01.patch, HDFS-8344.02.patch, 
> HDFS-8344.03.patch, HDFS-8344.04.patch, HDFS-8344.05.patch, 
> HDFS-8344.06.patch, HDFS-8344.07.patch, HDFS-8344.08.patch
>
>
> I found another(?) instance in which the lease is not recovered. This is 
> easily reproducible on a pseudo-distributed single-node cluster.
> # Before you start, it helps to lower the following limits. This is not 
> necessary, but it reduces how long you have to wait
> {code}
>       public static final long LEASE_SOFTLIMIT_PERIOD = 30 * 1000;
>       public static final long LEASE_HARDLIMIT_PERIOD = 2 * 
> LEASE_SOFTLIMIT_PERIOD;
> {code}
> # Client starts to write a file. (It could be less than 1 block, but it was 
> hflushed, so some of the data has landed on the datanodes.) (I'm copying the 
> client code I am using. I generate a jar and run it using $ hadoop jar 
> TestHadoop.jar)
> # Client crashes. (I simulate this by kill -9 on the $(hadoop jar 
> TestHadoop.jar) process after it has printed "Wrote to the bufferedWriter".)
> # Shoot the datanode. (Since I ran on a pseudo-distributed cluster, there was 
> only 1)
> I believe the lease should be recovered and the block should be marked 
> missing. However this is not happening. The lease is never recovered.
> The effect of this bug for us was that nodes could not be decommissioned 
> cleanly. Although we knew that the client had crashed, the Namenode never 
> released the leases (even after restarting the Namenode, even months 
> afterwards). There are actually several other cases too where we don't 
> consider what happens if ALL the datanodes die while the file is being 
> written, but I am going to punt on that for another time.
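The client code referenced in step 2 is not included in this excerpt; a rough reconstruction of its shape (hypothetical — the class name TestHadoop, the output path, and the exact data written are assumptions; only the hflush-then-hang pattern and the printed message come from the description above; it needs a running HDFS cluster to execute) might look like:

```java
// Hypothetical reconstruction of the repro client; not the original code.
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TestHadoop {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Path is illustrative; less than one block of data is written.
        FSDataOutputStream out = fs.create(new Path("/tmp/testhadoop.txt"));
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out));
        writer.write("some data, less than one block");
        writer.flush();   // push buffered bytes into the stream
        out.hflush();     // ensure the bytes have reached the datanodes
        System.out.println("Wrote to the bufferedWriter");
        // Hold the file open so this client keeps the lease;
        // kill -9 at this point simulates the crash in step 3.
        Thread.sleep(Long.MAX_VALUE);
    }
}
```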



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
