[jira] Commented: (HADOOP-3232) Datanodes time out

Doug Cutting (JIRA) Fri, 09 May 2008 09:39:17 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-3232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12595663#action_12595663 ]


Doug Cutting commented on HADOOP-3232:
--------------------------------------

> Also not sure why it needs to change Shell stuff. [ ... ]

I think this was because Johan wanted to make DF implement Runnable, and there 
was a conflict, since Shell already has a method named 'run'.  But this changes 
public APIs incompatibly, and is thus not a good approach.

Johan, perhaps instead we could define a nested class in DF.java that extends 
Thread and overrides run() there, or implements Runnable, if you prefer.
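To illustrate the nested-class idea, here is a minimal sketch. ShellSketch and 
DfSketch are hypothetical stand-ins, not the real Shell and DF classes, and the 
signature of run() in the stand-in is an assumption. The point is only that the 
nested class carries the Runnable contract, so the outer class keeps its 
inherited run() untouched.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical stand-in for Shell: it already owns a method named run().
abstract class ShellSketch {
    protected void run() throws IOException {
        execCommand();
    }

    protected abstract void execCommand() throws IOException;
}

// Hypothetical stand-in for DF; the real class is not reproduced here.
class DfSketch extends ShellSketch {
    private volatile long available = -1;

    @Override
    protected void execCommand() throws IOException {
        // Stand-in for running 'df' and parsing its output.
        available = new File(".").getUsableSpace();
    }

    long getAvailable() {
        return available;
    }

    // The nested class implements Runnable, so the outer class's
    // inherited run() keeps its existing signature and meaning.
    class Refresher implements Runnable {
        @Override
        public void run() {
            try {
                DfSketch.this.run(); // delegates to the inherited ShellSketch.run()
            } catch (IOException e) {
                // Keep the previously cached value if the command fails.
            }
        }
    }
}
```

A caller would start it with something like new Thread(df.new Refresher()).start(), 
and the public surface of the outer class stays as it was.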

Also the default interval is dfs.blockreport.intervalMsec, which seems rather 
long.  It should really be related to the heartbeat interval, no?  Moreover, we 
shouldn't use a DFS parameter in a generic FS class.  So the default interval 
should be something safe, perhaps hardwired to 10 minutes or somesuch, and DFS 
should override that when it constructs a DF, if it needs to.  Does that make 
sense?

And perhaps the thread should run 'df' first, then sleep, so that values are 
available to clients sooner?
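
Roughly, the interval and ordering suggestions could look like the sketch 
below. PeriodicDf, its probe argument (a stand-in for actually running 'df'), 
and the constant name are all hypothetical. The generic default is hardwired to 
10 minutes, a caller such as DFS can pass its own interval at construction, and 
the loop measures before it sleeps.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

class PeriodicDf {
    // Assumed generic default (10 minutes); not tied to any DFS parameter.
    static final long DEFAULT_INTERVAL_MS = 10 * 60 * 1000L;

    private final long intervalMs;
    private final LongSupplier probe; // stand-in for running 'df'
    private final CountDownLatch firstValue = new CountDownLatch(1);
    private final Thread worker;
    private volatile long available = -1;

    PeriodicDf(LongSupplier probe) {
        this(probe, DEFAULT_INTERVAL_MS);
    }

    // A caller that needs a different interval overrides it here.
    PeriodicDf(LongSupplier probe, long intervalMs) {
        this.probe = probe;
        this.intervalMs = intervalMs;
        this.worker = new Thread(this::loop, "df-refresher");
        this.worker.setDaemon(true);
    }

    void start() { worker.start(); }

    void stop() { worker.interrupt(); }

    long getIntervalMs() { return intervalMs; }

    long getAvailable() { return available; }

    // Waits until the first measurement has been published.
    boolean awaitFirstValue(long timeoutMs) {
        try {
            return firstValue.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    private void loop() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                available = probe.getAsLong(); // measure first ...
                firstValue.countDown();
                Thread.sleep(intervalMs);      // ... then sleep
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit quietly on stop()
        }
    }
}
```

With this ordering a value is available right after start(), rather than one 
full interval later.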

Finally, and most important, what evidence do you have that DU is in fact 
causing problems?  It is run very infrequently, not strictly synchronized with 
block reports but rather triggered by heartbeats.  A slow DU would thus result 
in a delayed heartbeat.  None of the stack traces above indicate that DU is 
blocking other activities of the datanode.

> Datanodes time out
> ------------------
>
>                 Key: HADOOP-3232
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3232
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.16.2
>         Environment: 10 node cluster + 1 namenode
>            Reporter: Johan Oskarsson
>            Priority: Critical
>             Fix For: 0.18.0
>
>         Attachments: du-nonblocking-v1.patch, du-nonblocking-v2-trunk.patch, 
> hadoop-hadoop-datanode-new.log, hadoop-hadoop-datanode-new.out, 
> hadoop-hadoop-datanode.out, hadoop-hadoop-namenode-master2.out
>
>
> I recently upgraded to 0.16.2 from 0.15.2 on our 10 node cluster.
> Unfortunately we're seeing datanode timeout issues. In previous versions 
> we've often seen in the nn webui that one or two datanodes "last contact" 
> goes from the usual 0-3 sec to ~200-300 before it drops down to 0 again.
> This causes mild discomfort but the big problems appear when all nodes do 
> this at once, as happened a few times after the upgrade.
> It was suggested that this could be due to namenode garbage collection, but 
> looking at the gc log output it doesn't seem to be the case.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
