[ https://issues.apache.org/jira/browse/HDFS-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868782#action_12868782 ]

Todd Lipcon commented on HDFS-1142:
-----------------------------------

Hey Konstantin,

I agree that this shouldn't be marked as a blocker while the discussion is ongoing.

Let me better explain the context with regard to HBase. HBase already uses ZK 
to determine regionserver liveness. If a regionserver dies, it loses its ZK 
session, and thus an ephemeral znode disappears. The master notices this, 
initiates commit-log recovery for that server, and eventually reassigns the 
regions elsewhere. To provide proper database-like semantics, we need to ensure 
that once log recovery commences, the regionserver cannot write anything more 
to that log (otherwise writes might be lost forever).
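
For anyone not familiar with the HBase side, the liveness mechanism is roughly 
the following (a minimal sketch against the plain ZooKeeper Java API; the znode 
path and session timeout here are illustrative, not HBase's actual values):

    import org.apache.zookeeper.*;

    public class RegionServerLiveness implements Watcher {
      // Hypothetical znode path; HBase's actual layout differs.
      private static final String ZNODE = "/hbase/rs/server-1";

      public void process(WatchedEvent event) {
        // session/watch events are delivered here (see below)
      }

      public static void main(String[] args) throws Exception {
        RegionServerLiveness w = new RegionServerLiveness();
        // 30s session timeout, for illustration only.
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, w);

        // EPHEMERAL: the znode lives exactly as long as the session.
        // If the process dies (or the session times out), ZK deletes
        // it and any watch the master set on it fires.
        zk.create(ZNODE, new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Master side, elsewhere:
        //   zk.exists(ZNODE, true);  // NodeDeleted => begin recovery
      }
    }

The key property is that the znode's lifetime is tied to the session, not the 
process, which is exactly where the trouble below comes from.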

Of course this all works fine if the regionserver has truly died. A big issue 
we face, though, is long garbage collection pauses (sound familiar?). In some 
cases the pauses can last longer than the ZK session timeout, so the HBase 
master decides that the server has died and does log splitting, region 
reassignment, etc. Unfortunately, in this scenario the regionserver then comes 
back to life and flushes a few more writes to the log file, and those writes 
are lost forever even though the client thinks they're committed. The 
regionserver eventually "notices" that it lost its ZK session and shuts itself 
down, but in practice it often has time to push out some last edits before 
doing so.
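
The crux is that the regionserver only learns of the expiration asynchronously, 
after the pause ends. Continuing the sketch above (the shutdown hook is 
hypothetical):

    // Inside the Watcher from the sketch above:
    public void process(WatchedEvent event) {
      if (event.getState() == Watcher.Event.KeeperState.Expired) {
        // By the time this fires, the GC pause is already over, and
        // edits queued before or during the pause may already have
        // been flushed to the log. That window is the whole problem.
        abortRegionServer();  // hypothetical shutdown hook
      }
    }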

Clearly, using locks in ZK is subject to the same issue - the root problem is 
that our ZK coordination is not synchronous with our storage access.

There are two solutions I can think of here: (a) the "STONITH" technique 
( http://en.wikipedia.org/wiki/STONITH ) - we could run the regionservers in a 
container service that allows us to kill -9 a regionserver when we think it 
should be dead, but this is obviously more complicated with regard to 
deployment, additional RPCs, etc; and (b) file access revocation - this is what 
we're trying to do with lease recovery, and what you're suggesting should not 
be possible.
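
For concreteness, here is roughly what (b) looks like from the master's side 
before it splits a dead server's log (a sketch only - the recoverLease() call 
on DistributedFileSystem stands in for whatever revocation API we settle on, 
and its exact semantics are what's under discussion):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class LogFencing {
      // Revoke the dead writer's lease before splitting its log, so
      // that any post-pause flush from that writer must fail.
      static void fence(DistributedFileSystem fs, Path log)
          throws Exception {
        // Assumed API: returns true once the file is closed and its
        // last block is finalized; poll until then.
        while (!fs.recoverLease(log)) {
          Thread.sleep(1000);
        }
        // Only now is it safe to start splitting the log.
      }
    }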

Here's a question - as you described it, the original lease holder and the 
recovering lease holder race to recover the lease. If the original holder wins 
the recovery, are we guaranteed that no interceding appends have occurred? 
E.g., what happens if the recovering process wins, opens the file for append, 
and immediately closes it? Are we then guaranteed that another flush() call 
from the original client would definitely fail, or can it transparently regain 
the lease on the now-closed file?
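
To make the interleaving concrete (a pseudo-test, collapsed into one method 
for brevity - in reality A and B are separate client JVMs, and the outcome at 
step 3 is exactly the open question):

    import org.apache.hadoop.fs.*;

    public class LeaseRaceSketch {
      static void race(FileSystem fsA, FileSystem fsB, Path path)
          throws Exception {
        // 1. Writer A creates the file and writes, then stalls (GC).
        FSDataOutputStream a = fsA.create(path);
        a.write(new byte[]{1, 2, 3});
        // ... A pauses; its soft lease expires ...

        // 2. Recoverer B triggers recovery via append(), then
        //    immediately closes, finalizing the file.
        FSDataOutputStream b = fsB.append(path);
        b.close();

        // 3. A wakes up and flushes. Guaranteed to throw (e.g. a
        //    LeaseExpiredException), or can A transparently regain
        //    the lease on the now-closed file?
        a.hflush();  // hflush() on trunk; sync() in older versions
      }
    }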

> Lease recovery doesn't reassign lease when triggered by append()
> ----------------------------------------------------------------
>
>                 Key: HDFS-1142
>                 URL: https://issues.apache.org/jira/browse/HDFS-1142
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: name-node
>    Affects Versions: 0.21.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hdfs-1142.txt, hdfs-1142.txt
>
>
> If a soft lease has expired and another writer calls append(), it triggers 
> lease recovery but doesn't reassign the lease to a new owner. Therefore, the 
> old writer can continue to allocate new blocks, try to steal back the lease, 
> etc. This is for the testRecoveryOnBlockBoundary case of HDFS-1139.
