[jira] [Created] (HBASE-6435) Reading WAL files after a recovery leads to time lost in HDFS timeouts when using dead datanodes

nkeywal (JIRA) Fri, 20 Jul 2012 10:07:38 -0700

nkeywal created HBASE-6435:
------------------------------

             Summary: Reading WAL files after a recovery leads to time lost in 
HDFS timeouts when using dead datanodes
                 Key: HBASE-6435
                 URL: https://issues.apache.org/jira/browse/HBASE-6435
             Project: HBase
          Issue Type: Improvement
          Components: master, regionserver
    Affects Versions: 0.96.0
            Reporter: nkeywal
            Assignee: nkeywal



HBase writes a Write-Ahead-Log (WAL) to recover from hardware failure.
This log is written with 'append' on HDFS.
Through ZooKeeper, HBase is informed, usually within 30s, that it should start 
the recovery process. 
This means reading the Write-Ahead-Log to replay the edits on the other servers.

In standard deployments, the HBase processes (regionservers) are deployed on 
the same box as the datanodes.

It means that when the box stops, we've actually lost one of the replicas of 
the edits, as we lost both the regionserver and the datanode.

As HDFS marks a node as dead only after ~10 minutes, the datanode still appears 
available when we try to read the blocks to recover. As a result, the recovery 
process is delayed by 60 seconds, as the read will usually fail with a socket 
timeout. If the file is still open for writing, this adds an extra 20s, plus a 
risk of losing edits if we connect over IPC to the dead DN.


Possible solutions are:
- faster detection of dead datanodes by the NN. Requires a NN code change.
- better management of dead datanodes in DFSClient. Requires a DFS code change.
- NN customisation to write the WAL files on another DN instead of the local 
one.
- reordering the block locations returned by the NN on the client side, so that 
the replicas on the same box as the dead RS go to the end of the priority 
queue. Requires a DFS code change or a kind of workaround.
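
To make the last option concrete, here is a minimal sketch of the reordering 
step alone. It is not the proposed patch: replica locations are modelled as 
plain hostname strings rather than HDFS types, the class and method names are 
hypothetical, and it assumes the caller already knows the hostname of the dead 
regionserver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WalReplicaReorder {

    /**
     * Returns a copy of replicaHosts in which every entry matching deadHost
     * is moved to the end. The relative order of the other entries is kept,
     * so the NN's own preference order still applies to the live candidates.
     */
    static List<String> deprioritize(List<String> replicaHosts, String deadHost) {
        List<String> preferred = new ArrayList<>();
        List<String> suspect = new ArrayList<>();
        for (String host : replicaHosts) {
            if (host.equalsIgnoreCase(deadHost)) {
                suspect.add(host);   // probably dead: try it last
            } else {
                preferred.add(host); // presumed alive: keep original order
            }
        }
        preferred.addAll(suspect);
        return preferred;
    }

    public static void main(String[] args) {
        // Hypothetical hostnames; the first replica is on the box that died.
        List<String> replicas =
            Arrays.asList("rs1.example.com", "dn7.example.com", "dn9.example.com");
        System.out.println(deprioritize(replicas, "rs1.example.com"));
        // prints [dn7.example.com, dn9.example.com, rs1.example.com]
    }
}
```

The suspect replica is moved rather than dropped, so it is still tried if every 
other replica fails.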

The solution retained is the last one. Compared to what was discussed on the 
mailing list, the proposed patch does not modify the HDFS source code but adds 
a proxy, for two reasons:
- Some HDFS functions managing block order are static 
(MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would require 
either implementing the fix only partially, changing the DFS interface to make 
this function non-static, or making the hook static. None of these solutions is 
very clean. 
- Adding a proxy keeps all the code in HBase, simplifying dependency 
management.
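
The proxy idea can be illustrated with a JDK dynamic proxy. This is only a 
sketch of the general technique, not the patch itself: BlockLocator is a 
hypothetical stand-in for the interface the client uses to ask the NN for 
block locations, and the locations are again plain hostname strings.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReorderingProxyDemo {

    /** Hypothetical stand-in for the client-to-NN interface; not an HDFS type. */
    public interface BlockLocator {
        List<String> getReplicaHosts(String path);
    }

    /** Wraps delegate so that deadHost is moved to the end of every answer. */
    static BlockLocator wrap(BlockLocator delegate, String deadHost) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result = method.invoke(delegate, args);
            if (!"getReplicaHosts".equals(method.getName())) {
                return result; // toString, hashCode, etc. pass through untouched
            }
            @SuppressWarnings("unchecked")
            List<String> hosts = (List<String>) result;
            List<String> reordered = new ArrayList<>();
            List<String> suspect = new ArrayList<>();
            for (String host : hosts) {
                (host.equals(deadHost) ? suspect : reordered).add(host);
            }
            reordered.addAll(suspect);
            return reordered;
        };
        return (BlockLocator) Proxy.newProxyInstance(
            BlockLocator.class.getClassLoader(),
            new Class<?>[] { BlockLocator.class },
            handler);
    }

    public static void main(String[] args) {
        // The "real" locator is faked here; it lists the dead box first.
        BlockLocator raw = path -> Arrays.asList("rs1", "dn2", "dn3");
        BlockLocator proxied = wrap(raw, "rs1");
        System.out.println(proxied.getReplicaHosts("/hbase/.logs/some-wal"));
        // prints [dn2, dn3, rs1]
    }
}
```

The calling code keeps using the same interface and the wrapped implementation 
is left unmodified, which is why the hook can live entirely outside the wrapped 
code base.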

Nevertheless, it would be better to have this in HDFS. But this solution makes 
it possible to target only the latest version, which could allow minimal 
interface changes such as non-static methods.

Moreover, writing the blocks to a non-local DN would be an even better 
long-term solution.






--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
