[ https://issues.apache.org/jira/browse/HDFS-1142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868782#action_12868782 ]
Todd Lipcon commented on HDFS-1142:
-----------------------------------

Hey Konstantin, I agree that this shouldn't be marked blocker while discussion is going on. Let me better explain the context with regard to HBase.

HBase already uses ZK to determine regionserver liveness. If a region server dies, it loses its ZK session, and thus an ephemeral znode disappears. The master notices this, initiates commit-log recovery for that server, and eventually reassigns the regions elsewhere. To provide proper database-like semantics, we need to ensure that once log recovery commences, the regionserver cannot write any more to that log (otherwise writes might be lost forever).

Of course, this all works fine if the regionserver has truly died. A big issue we face, though, is long garbage collection pauses (sound familiar?). In some cases the pauses last longer than the ZK session timeout, so the HBase master decides the server has died and does log splitting, region reassignment, etc. Unfortunately, in this scenario the region server then comes back to life and flushes a few more writes to the log file, and those writes are lost forever even though the client thinks they're committed. The regionserver eventually "notices" that it lost its ZK session and shuts itself down, but in practice it often has time to get off some last edits before doing so.

Clearly, using locks in ZK is subject to the same issue - our ZK coordination is not synchronous with our storage access. There are two solutions I can think of here:

(a) the "STONITH" technique ( http://en.wikipedia.org/wiki/STONITH ) - we could run the regionservers in a container service that allows us to kill -9 the regionserver when we think it should be dead. But this is obviously more complicated with regard to deployment, additional RPCs, etc.

(b) file access revocation - this is what we're trying to do with lease recovery, and what you're suggesting should not be possible.
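To make the failure mode concrete, here's a minimal, self-contained Java sketch of the fencing idea that lease reassignment is meant to provide. All names here (FencedLog, recoverLease, the generation counter) are illustrative, not HDFS or HBase internals: the point is only that if recovery atomically revokes the old writer's lease, a zombie writer waking from a GC pause gets a hard failure instead of silently appending lost writes.

```java
import java.util.ArrayList;
import java.util.List;

public class LeaseFencingDemo {
    /** A log that only accepts appends carrying the current lease generation. */
    static class FencedLog {
        private long currentGeneration = 1;
        private final List<String> entries = new ArrayList<>();

        /** Returns false (fences the writer) if its lease generation is stale. */
        synchronized boolean append(long writerGeneration, String entry) {
            if (writerGeneration != currentGeneration) {
                return false; // stale writer is fenced off
            }
            entries.add(entry);
            return true;
        }

        /** Recovery revokes the old lease by bumping the generation. */
        synchronized long recoverLease() {
            return ++currentGeneration;
        }

        synchronized List<String> entries() {
            return new ArrayList<>(entries);
        }
    }

    public static void main(String[] args) {
        FencedLog log = new FencedLog();
        long zombieGen = 1;                      // region server's original lease
        log.append(zombieGen, "edit-1");         // accepted

        // Master decides the server is dead and recovers the lease.
        long newGen = log.recoverLease();
        log.append(newGen, "recovery-marker");   // new holder writes fine

        // Zombie wakes from its GC pause and tries one last flush.
        boolean accepted = log.append(zombieGen, "edit-2-after-pause");
        System.out.println("zombie flush accepted: " + accepted); // false
        System.out.println("log: " + log.entries());
    }
}
```

Without the generation check (i.e., the behavior this issue describes, where recovery does not reassign the lease), the zombie's final append would be accepted and the lost-write scenario above plays out.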
Here's a question - as you described it, the original lease holder and the recovering lease holder race to recover the lease. If the original holder wins the recovery, are we guaranteed that no interceding appends have occurred? E.g., what happens if the recovering process wins, opens the file for append, and immediately closes it? Are we then guaranteed that another flush() call from the original client would definitely fail, or can it transparently regain the lease on the now-closed file?

> Lease recovery doesn't reassign lease when triggered by append()
> ----------------------------------------------------------------
>
> Key: HDFS-1142
> URL: https://issues.apache.org/jira/browse/HDFS-1142
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.21.0
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Attachments: hdfs-1142.txt, hdfs-1142.txt
>
> If a soft lease has expired and another writer calls append(), it triggers lease recovery but doesn't reassign the lease to a new owner. Therefore, the old writer can continue to allocate new blocks, try to steal back the lease, etc. This is for the testRecoveryOnBlockBoundary case of HDFS-1139.

-- This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.