[ https://issues.apache.org/jira/browse/HBASE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13419646#comment-13419646 ]
nkeywal commented on HBASE-6435:
--------------------------------

I would like to keep the existing interface if possible. Today, when you open a file, there is a call to a datanode if the file is also opened for writing somewhere. In HBase, we want the priorities to be taken into account during this opening, as we have a guess that one of these datanodes may be dead. So either I register a callback that the DFSClient will call before using its list, or I change the 'open' interface to add the possibility to provide the list of replicas. The same applies to chooseDataNode, called from blockSeekTo: even if we have a list at the beginning, this list is recreated during a read as part of the retry process (in case the NN discovered new replicas on new datanodes).

With a callback, we would offer this service:

{noformat}
class ReplicaSet {
  public List<Replica> getAvailableReplica(long pos); // return the list of available replicas at the given file offset, in priority order
  public void prioritizeReplica(Replica r);           // move the given replica to the front of the list
  public void blacklistReplica(Replica r);            // move the given replica to the back of the list
}
{noformat}

The client would need to implement this interface:

{noformat}
// Implement this interface and provide it to the DFSClient during its
// construction to manage the replica ordering.
interface OrganizeReplicaSet {
  void organize(String fileName, ReplicaSet rs);
}
{noformat}

And the DFSClient code would become:

{noformat}
LocatedBlocks callGetBlockLocations(ClientProtocol namenode, String src,
    long start, long length) throws IOException {
  LocatedBlocks lbs = namenode.getBlockLocations(src, start, length);
  if (organizeReplicaSet != null) {
    ReplicaSet rs = lbs.getAsReplicaSet();
    try {
      organizeReplicaSet.organize(src, rs);
    } catch (Throwable t) {
      throw new IOException("ClientBlockReorderer failed, class="
          + organizeReplicaSet.getClass(), t);
    }
    return new LocatedBlocks(rs);
  }
  return lbs;
}
{noformat}

This is called from the DFSInputStream constructor, in openInfo today. In real life I would try to use the class ReplicaSet as an interface on the internal LocatedBlock(s) to limit the number of objects created. The callback could also be given as a parameter to the DFSInputStream constructor if there is a specific rule to apply...

> Reading WAL files after a recovery leads to time lost in HDFS timeouts when
> using dead datanodes
> ------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6435
>                 URL: https://issues.apache.org/jira/browse/HBASE-6435
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
>         Attachments: 6435.unfinished.patch
>
>
> HBase writes a Write-Ahead-Log to recover from hardware failure.
> This log is written with 'append' on hdfs.
> Through ZooKeeper, HBase usually gets informed within 30s that it should
> start the recovery process.
> This means reading the Write-Ahead-Log to replay the edits on the other
> servers.
> In standard deployments, HBase processes (regionservers) are deployed on the
> same boxes as the datanodes.
> It means that when a box stops, we've actually lost one of the replicas of
> the edits, as we lost both the regionserver and the datanode.
> As HDFS marks a node as dead only after ~10 minutes, it still appears as
> available when we try to read the blocks to recover. As such, we are
> delaying the recovery process by 60 seconds as the read will usually fail
> with a socket timeout. If the file is still opened for writing, it adds an
> extra 20s + a risk of losing edits if we connect with ipc to the dead DN.
> Possible solutions are:
> - shorter dead datanode detection by the NN. Requires a NN code change.
> - better dead datanode management in the DFSClient. Requires a DFS code
> change.
> - NN customisation to write the WAL files on another DN instead of the local
> one.
> - reordering the blocks returned by the NN on the client side to put the
> blocks on the same DN as the dead RS at the end of the priority queue.
> Requires a DFS code change or a kind of workaround.
> The solution retained is the last one. Compared to what was discussed on the
> mailing list, the proposed patch will not modify the HDFS source code but
> adds a proxy. This is for two reasons:
> - Some HDFS functions managing block ordering are static
> (MD5MD5CRC32FileChecksum). Implementing the hook in the DFSClient would
> require partially implementing the fix, changing the DFS interface to make
> this function non-static, or making the hook static. None of these
> solutions is very clean.
> - Adding a proxy allows putting all the code in HBase, simplifying
> dependency management.
> Nevertheless, it would be better to have this in HDFS. But this solution
> allows targeting only the latest version, and that could allow minimal
> interface changes such as non-static methods.
> Moreover, writing the blocks to a non-local DN would be an even better
> long-term solution.
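To make the retained solution concrete, here is a minimal Java sketch of what the HBase-side callback could look like: a list-backed stand-in for the proposed ReplicaSet, plus an organize step that pushes every replica hosted on the dead regionserver's box to the back of the priority list. The class names, the List-backed implementation, and the host-matching rule (WAL paths embed the server name) are illustrative assumptions, not the actual patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: deprioritize replicas that live on the same
// box as the dead regionserver, so recovery reads try live DNs first.
public class ReorderWALBlocks {

  static class Replica {
    final String host;
    Replica(String host) { this.host = host; }
  }

  // List-backed stand-in for the ReplicaSet interface proposed above.
  static class ListReplicaSet {
    private final List<Replica> replicas;
    ListReplicaSet(List<Replica> replicas) {
      this.replicas = new ArrayList<>(replicas);
    }
    List<Replica> getAvailableReplica(long pos) { return replicas; }
    // Move the given replica to the front of the priority list.
    void prioritizeReplica(Replica r) { replicas.remove(r); replicas.add(0, r); }
    // Move the given replica to the back of the priority list.
    void blacklistReplica(Replica r) { replicas.remove(r); replicas.add(r); }
  }

  // The callback body HBase could register: if the WAL file belongs to
  // the dead server, move its local replicas to the end of the list.
  static void organize(String fileName, String deadHost, ListReplicaSet rs) {
    if (!fileName.contains(deadHost)) {
      return; // not a WAL of the dead server; keep the NN's ordering
    }
    for (Replica r : new ArrayList<>(rs.getAvailableReplica(0))) {
      if (r.host.equals(deadHost)) {
        rs.blacklistReplica(r);
      }
    }
  }

  public static void main(String[] args) {
    ListReplicaSet rs = new ListReplicaSet(Arrays.asList(
        new Replica("dead-rs.example.com"),
        new Replica("dn2.example.com"),
        new Replica("dn3.example.com")));
    organize("/hbase/.logs/dead-rs.example.com,60020,1/wal.1",
        "dead-rs.example.com", rs);
    for (Replica r : rs.getAvailableReplica(0)) {
      System.out.println(r.host); // the dead server's replica ends up last
    }
  }
}
```

In the real patch this ordering would be applied to the LocatedBlocks returned by the NN before the DFSClient picks a datanode, which is why wrapping the internal objects (rather than copying them) is suggested in the comment above.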