Jobs start running on the nodes, but there is one particular set of nodes
on which jobs constantly fail because of Node Failure.

Some of the jobs exit within 3 minutes of their start and some run longer.
There is no consistency among the jobs, just the node failures.

I have topology set up using topology/tree, and I am not sure whether this
is causing any problems.


This is what I see in the logs for all of the nodes that go DOWN and
kill off jobs:

[2013-12-04T10:38:30.152] agent/is_node_resp: node:n0268.mako0 rpc:1011 : Communication connection failure
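
To check whether it really is the same set of nodes each time, a small
script along these lines could tally the messages per node. This is only a
sketch: the regular expression is modelled on the single log line quoted
above, so other Slurm versions may need a different pattern, and the
function name count_comm_failures and the log path passed on the command
line are made up for illustration.

```python
import re
import sys
from collections import Counter

# Pattern modelled on the one log line quoted above (assumption: other
# slurmctld versions may word or lay out this message differently).
FAILURE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+agent/is_node_resp:\s+node:(?P<node>\S+)\s+"
    r"rpc:(?P<rpc>\d+)\s*:\s*Communication connection failure"
)


def count_comm_failures(lines):
    """Return a Counter mapping node name -> number of failure messages."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            counts[match.group("node")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # The log location is site-specific; pass it as the first argument.
    with open(sys.argv[1], errors="replace") as log:
        for node, n in count_comm_failures(log).most_common():
            print(f"{node}\t{n}")
```

Comparing the resulting node list against the switch groupings in the
topology configuration might show whether the failures cluster under one
switch.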

Here are my timing settings:

BatchStartTimeout       = 10 sec
BOOT_TIME               = 2013-12-04T10:23:25
EpilogMsgTime           = 2000 usec
GetEnvTimeout           = 2 sec
GroupUpdateTime         = 600 sec
KeepAliveTime           = SYSTEM_DEFAULT
MessageTimeout          = 60 sec
OverTimeLimit           = 0 min
ResumeTimeout           = 60 sec
SchedulerTimeSlice      = 30 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 300 sec
SuspendTime             = NONE
SuspendTimeout          = 30 sec
UnkillableStepTimeout   = 60 sec
WaitTime                = 0 sec


Please let me know if something is configured incorrectly or what I should
check.


Thanks


Jackie
