[ https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419646#comment-13419646 ]
nkeywal commented on HBASE-6435:
--------------------------------

I would like to keep the existing interface if possible. Today, when you open a file, there is a call to a datanode if the file is also opened for writing somewhere. In HBase, we want the priorities to be taken into account during this opening, as we have a guess that one of these datanodes may be dead. So either I register a callback that the DFSClient will call before using its list, or I change the 'open' interface to add the possibility to provide the list of replicas. The same applies to chooseDataNode, called from blockSeekTo: even if we have a list at the beginning, this list is recreated during a read as part of the retry process (in case the NN discovered new replicas on new datanodes).

With a callback, we would offer this service:

{noformat}
class ReplicaSet {
  public List<Replica> getAvailableReplica(long pos); // return the list of available replicas at the given file offset, in priority order
  public void prioritizeReplica(Replica r);           // move the given replica to the front of the list
  public void blacklistReplica(Replica r);            // move the given replica to the back of the list
}
{noformat}

The client would need to implement this interface:

{noformat}
// Implement this interface and provide it to the DFSClient during its
// construction to manage the replica ordering.
interface OrganizeReplicaSet {
  void organize(String fileName, ReplicaSet rs);
}
{noformat}

And the DFSClient code would become:

{noformat}
LocatedBlocks callGetBlockLocations(ClientProtocol namenode, String src,
    long start, long length) throws IOException {
  LocatedBlocks lbs = namenode.getBlockLocations(src, start, length);
  if (organizeReplicaSet != null) {
    ReplicaSet rs = lbs.getAsReplicaSet();
    try {
      organizeReplicaSet.organize(src, rs);
    } catch (Throwable t) {
      throw new IOException("ClientBlockReorderer failed, class="
          + organizeReplicaSet.getClass(), t);
    }
    return new LocatedBlocks(rs);
  }
  return lbs;
}
{noformat}

This is called from the DFSInputStream constructor, in openInfo today. In real life I would try to use the class ReplicaSet as an interface on the internal LocatedBlock(s) to limit the number of objects created. The callback could also be given as a parameter to the DFSInputStream constructor if there is a specific rule to apply...

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when
> using dead datanodes
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6435
>                 URL: https://issues.apache.org/jira/browse/HBASE-6435
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 6435.unfinished.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure.
> This log is written with 'append' on hdfs.
> Through ZooKeeper, HBase usually gets informed within 30s that it should
> start the recovery process.
> This means reading the Write-Ahead-Log to replay the edits on the other
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the
> same boxes as the datanodes.
> It means that when a box stops, we've actually lost one of the replicas of
> the edits, as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears as
> available when we try to read the blocks to recover. As such, we are
> delaying the recovery process by 60 seconds as the read will usually fail
> with a socket timeout. If the file is still opened for writing, it adds an
> extra 20s + a risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead datanode detection by the NN. Requires a NN code change.
> - better dead datanode management in the DFSClient. Requires a DFS code
> change.
> - NN customisation to write the WAL files on another DN instead of the local
> one.
> - reordering the blocks returned by the NN on the client side to put the
> blocks on the same DN as the dead RS at the end of the priority queue.
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the
> mailing list, the proposed patch will not modify the HDFS source code but
> adds a proxy. This is for two reasons:
> - Some HDFS functions managing block ordering are static
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would
> require partially implementing the fix, changing the DFS interface to make
> this function non-static, or making the hook static. None of these
> solutions is very clean.
> - Adding a proxy allows putting all the code in HBase, simplifying
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution
> allows targeting only the latest version, and that could allow minimal
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better
> long-term solution.
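To make the retained solution concrete, here is a minimal Java sketch of what the HBase-side callback could look like: a list-backed stand-in for the proposed ReplicaSet, plus an organize step that pushes every replica hosted on the dead regionserver's box to the back of the priority list. The class names, the List-backed implementation, and the host-matching rule (WAL paths embed the server name) are illustrative assumptions, not the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: deprioritize replicas that live on the same
// box as the dead regionserver, so recovery reads try live DNs first.
public class ReorderWALBlocks {

  static class Replica {
    final String host;
    Replica(String host) { this.host = host; }
  }

  // List-backed stand-in for the ReplicaSet interface proposed above.
  static class ListReplicaSet {
    private final List<Replica> replicas;
    ListReplicaSet(List<Replica> replicas) {
      this.replicas = new ArrayList<>(replicas);
    }
    List<Replica> getAvailableReplica(long pos) { return replicas; }
    // Move the given replica to the front of the priority list.
    void prioritizeReplica(Replica r) { replicas.remove(r); replicas.add(0, r); }
    // Move the given replica to the back of the priority list.
    void blacklistReplica(Replica r) { replicas.remove(r); replicas.add(r); }
  }

  // The callback body HBase could register: if the WAL file belongs to
  // the dead server, move its local replicas to the end of the list.
  static void organize(String fileName, String deadHost, ListReplicaSet rs) {
    if (!fileName.contains(deadHost)) {
      return; // not a WAL of the dead server; keep the NN's ordering
    }
    for (Replica r : new ArrayList<>(rs.getAvailableReplica(0))) {
      if (r.host.equals(deadHost)) {
        rs.blacklistReplica(r);
      }
    }
  }

  public static void main(String[] args) {
    ListReplicaSet rs = new ListReplicaSet(Arrays.asList(
        new Replica("dead-rs.example.com"),
        new Replica("dn2.example.com"),
        new Replica("dn3.example.com")));
    organize("/hbase/.logs/dead-rs.example.com,60020,1/wal.1",
        "dead-rs.example.com", rs);
    for (Replica r : rs.getAvailableReplica(0)) {
      System.out.println(r.host); // the dead server's replica ends up last
    }
  }
}
```

In the real patch this ordering would be applied to the LocatedBlocks returned by the NN before the DFSClient picks a datanode, which is why wrapping the internal objects (rather than copying them) is suggested in the comment above.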