[ 
https://issues.apache.org/jira/browse/HDFS-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893345#action_12893345
 ] 

Todd Lipcon commented on HDFS-1262:
-----------------------------------

Hey Sam, finally got a chance to look at this. A few notes:

- why does abandonFile return boolean? looks like right now it can only return 
true or throw, may as well make it void, no?
- in the log message in FSN.abandonFile it looks like there's a missing '+ src 
+' in the second log message
- in the log messages, also log the "holder" argument perhaps
- in previous append-branch patches we've been trying to keep RPC compatibility 
with unpatched 0.20 - ie you can run an updated client against an old NN, with 
the provision that it might not fix all the bugs. Given that, maybe we should 
catch the exception we get if we call abandonFile() and get back an exception 
indicating the method doesn't exist? Check out what we did for HDFS-630 
backport for example.
- looks like there are some other patches that got conflated into this one - eg 
testSimultaneousRecoveries is part of another patch on the append branch.
- missing Apache license on new test file
- typo: Excection instead of Exception
- "(PermissionStatus) anyObject()," might generated an unchecked cast warning - 
I think you can do Matchers.<PermissionStatus>anyObject() or some such to avoid 
the unchecked cast
- given the complexity of the unit test, would be good to add some comments for 
the general flow of what all the mocks/spys are achieving. I found myself a bit 
lost in the abstractions

> Failed pipeline creation during append leaves lease hanging on NN
> -----------------------------------------------------------------
>
>                 Key: HDFS-1262
>                 URL: https://issues.apache.org/jira/browse/HDFS-1262
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client, name-node
>    Affects Versions: 0.20-append
>            Reporter: Todd Lipcon
>            Assignee: sam rash
>            Priority: Critical
>             Fix For: 0.20-append
>
>         Attachments: hdfs-1262-1.txt, hdfs-1262-2.txt, hdfs-1262-3.txt, 
> hdfs-1262-4.txt
>
>
> Ryan Rawson came upon this nasty bug in HBase cluster testing. What happened 
> was the following:
> 1) File's original writer died
> 2) Recovery client tried to open file for append - looped for a minute or so 
> until soft lease expired, then append call initiated recovery
> 3) Recovery completed successfully
> 4) Recovery client calls append again, which succeeds on the NN
> 5) For some reason, the block recovery that happens at the start of append 
> pipeline creation failed on all datanodes 6 times, causing the append() call 
> to throw an exception back to HBase master. HBase assumed the file wasn't 
> open and put it back on a queue to try later
> 6) Some time later, it tried append again, but the lease was still assigned 
> to the same DFS client, so it wasn't able to recover.
> The recovery failure in step 5 is a separate issue, but the problem for this 
> JIRA is that the NN can think it failed to open a file for append when the NN 
> thinks the writer holds a lease. Since the writer keeps renewing its lease, 
> recovery never happens, and no one can open or recover the file until the DFS 
> client shuts down.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to