Jobs are running on the nodes, but there is one set of nodes that constantly shows jobs failing because of Node Failure. Some of the jobs exit within 3 minutes of their start and some run longer. There is no consistency in the jobs, just in the node failures.

I have topology set up using topology/tree and I am not sure whether this is causing any problems. This is what I see in the logs for all of the nodes that go DOWN and kill off jobs:

[2013-12-04T10:38:30.152] agent/is_node_resp: node:n0268.mako0 rpc:1011 : Communication connection failure

Here are my timing settings:

BatchStartTimeout       = 10 sec
BOOT_TIME               = 2013-12-04T10:23:25
EpilogMsgTime           = 2000 usec
GetEnvTimeout           = 2 sec
GroupUpdateTime         = 600 sec
KeepAliveTime           = SYSTEM_DEFAULT
MessageTimeout          = 60 sec
OverTimeLimit           = 0 min
ResumeTimeout           = 60 sec
SchedulerTimeSlice      = 30 sec
SlurmctldTimeout        = 300 sec
SlurmdTimeout           = 300 sec
SuspendTime             = NONE
SuspendTimeout          = 30 sec
UnkillableStepTimeout   = 60 sec
WaitTime                = 0 sec

Please let me know if there is something configured wrong or what I should check.

Thanks
Jackie
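
P.S. In case it helps narrow things down, here is a rough sketch of how the affected nodes could be tallied from the slurmctld log. It assumes every failure entry has the same shape as the one quoted above. The function name count_comm_failures and the extra sample lines are made up for illustration; only the first sample line comes from my log.

```python
import re
from collections import Counter

# Matches entries shaped like the slurmctld line quoted in this message.
# Assumption: all communication failures are logged in this same format.
FAILURE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+agent/is_node_resp:\s+node:(?P<node>\S+)\s+"
    r"rpc:(?P<rpc>\d+)\s*:\s*Communication connection failure"
)


def count_comm_failures(lines):
    """Return a Counter mapping node name -> number of failure entries."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match:
            counts[match.group("node")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        # Real entry from the log quoted above.
        "[2013-12-04T10:38:30.152] agent/is_node_resp: node:n0268.mako0 "
        "rpc:1011 : Communication connection failure",
        # Hypothetical entries, for illustration only.
        "[2013-12-04T10:39:30.152] agent/is_node_resp: node:n0268.mako0 "
        "rpc:1011 : Communication connection failure",
        "[2013-12-04T10:40:00.000] some unrelated log entry",
    ]
    # Print nodes with the most failures first.
    for node, failures in count_comm_failures(sample).most_common():
        print(node, failures)
```

If the same handful of nodes dominates the tally, that would point at those nodes or their network path rather than at the jobs themselves.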
