Devaraj Das wrote:
I wrote a patch to address the NPE in JobTracker.killJob() and compiled
it against TRUNK. I've put this on the cluster and it has been holding
steady for the last hour or so. That, plus whatever other differences
there are between 18.1 and TRUNK, may have fixed things. (I'll submit the
patch to the JIRA as soon as it finishes running against the JUnit tests.)


Aaron, I don't think this is a solution to the problem you are seeing. The
IPC handlers are tolerant of exceptions. In particular, they must not die in
the event of an exception during RPC processing. Could you please get a
stack trace of the JobTracker threads (without your patch) when the TTs are
unable to talk to it? Access the URL http://<jt-host>:<jt-info-port>/stacks.
That will tell us what the handlers are up to.
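
If it is easier to capture the dump from a script than from a browser, something along these lines might do. This is only a rough sketch: the class name StacksFetcher, the method names and the timeout values are made up for illustration and are not part of Hadoop. It simply builds the /stacks URL from the host and info port and prints whatever the page returns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Hadoop.
class StacksFetcher {

    // Builds the /stacks URL from the JobTracker host and its info (web UI) port.
    static String stacksUrl(String jtHost, int jtInfoPort) {
        return "http://" + jtHost + ":" + jtInfoPort + "/stacks";
    }

    // Fetches the page body as text. Timeouts are arbitrary example values.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(30000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java StacksFetcher <jt-host> <jt-info-port>");
            return;
        }
        // Print the dump so it can be redirected to a file and attached to the JIRA.
        System.out.print(fetch(stacksUrl(args[0], Integer.parseInt(args[1]))));
    }
}
```

Run it with the JobTracker host and info port as arguments while the TTs are failing to connect, and redirect the output to a file.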

Devaraj forwarded the stacks that Aaron sent. As he suspected, there is a deadlock in the RPC server. I will file a blocker for 0.18 and above. This deadlock is more likely on a busy network.

Raghu.
