To complete the picture: not only was our network swamped, I realized tonight that the NameNode/JobTracker was running on a 99% full disk (it hit 100% full about thirty minutes ago). That poor JobTracker was fighting against a lot of odds. As soon as we upgrade to a bigger disk and switch it back on, I'll apply the supplied patch to the cluster.
Thank you for looking into this!
- Aaron

On Thu, Oct 30, 2008 at 3:42 PM, Raghu Angadi <[EMAIL PROTECTED]> wrote:

> Raghu Angadi wrote:
>
>> Devaraj fwded the stacks that Aaron sent. As he suspected there is a
>> deadlock in RPC server. I will file a blocker for 0.18 and above. This
>> deadlock is more likely on a busy network.
>>
>
> Aaron,
>
> Could you try the patch attached to
> https://issues.apache.org/jira/browse/HADOOP-4552 ?
>
> Thanks,
> Raghu.