On Sat, 27 Oct 2007, Jan Ploski wrote:

Kevin Hildebrand wrote:

Hello, at some point today, my Maui/Torque installation stopped running jobs. It appears that Maui is able to select an available set of nodes, but then can't seem to start the job. I'm not getting any errors on the Torque side; in fact, I'm not seeing any Torque log entries showing that the job is being started at all. Here's what I'm seeing in the Maui logs:

10/26 16:43:25 INFO: tasks located for job 21542: 2 of 2 required (36 feasible)
10/26 16:43:25 INFO: allocated MNode[000]x2 'compute-2-1.deepthought.umd.edu' to 21542:0
10/26 16:43:25 MJobStart(21542)
10/26 16:43:25 MJobDistributeTasks(21542,DEEPTHOUGHT.UMD.EDU,NodeList,TaskMap)
10/26 16:43:25 INFO:     1 node(s)/2 task(s) added to 21542:0
10/26 16:43:25 INFO: MNode[000] 'compute-2-1.deepthought.umd.edu'(x2) added to job '21542'
[020] compute-2-1.deepthought.umd.edu: (P:4,S:5405,M:3946,D:1) [Idle][linux][[NONE]]<0.020000> C:[debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4][DEFAULT] [noib][prod][dell1950] [debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4]
10/26 16:43:25 INFO:     end of list reached.  1 nodes found
10/26 16:43:25 INFO:     tasks distributed: 2 (Round-Robin)
10/26 16:43:25 MAMAllocJReserve(21542,RIndex,ErrMsg)
10/26 16:43:25 MRMJobStart(21542,Msg,SC)
10/26 16:43:25 INFO: cannot start job 21542 (cannot start job - fail iteration)
10/26 16:43:25 WARNING:  cannot start job '21542' through resource manager
10/26 16:43:25 ERROR:    MBFFirstFit:  cannot start job 21542.0

Anybody have a clue as to what's going on? (I've tried restarting both Torque and Maui, and the problem continues.)

What does checkjob -v 21542 tell you?

Regards,
Jan Ploski


Well, I walked away from it for a few hours and came back, and all of the stuck jobs are running. checkjob wasn't showing anything interesting: it said "job can run in partition DEFAULT", and there were no "Rejection Reasons". Numerous nodes were available that met the job selection criteria.
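
Since checkjob gave no hint here, one way to see how long and how often jobs were failing to start is to scan the Maui log for the "cannot start job ... through resource manager" warning quoted above. The following is only a minimal sketch in Python: the function names are made up for illustration, and the regular expression assumes the log lines look exactly like the excerpt in this thread (other Maui versions or log levels may format them differently).

```python
import re
from collections import Counter

# Matches lines shaped like the one quoted in this thread, e.g.:
#   10/26 16:43:25 WARNING:  cannot start job '21542' through resource manager
# The exact wording is an assumption taken from the excerpt above.
FAILED_START = re.compile(
    r"^(?P<stamp>\d{2}/\d{2} \d{2}:\d{2}:\d{2}) WARNING:\s+"
    r"cannot start job '(?P<job>[^']+)' through resource manager"
)


def failed_starts(lines):
    """Yield (timestamp, job_id) for every failed start attempt found."""
    for line in lines:
        match = FAILED_START.match(line.strip())
        if match:
            yield match.group("stamp"), match.group("job")


def summarize(lines):
    """Return a Counter mapping job id to number of failed start attempts."""
    return Counter(job for _, job in failed_starts(lines))
```

Passing an open log file to summarize() (the log path is site-specific) gives a per-job count of failed start attempts, and failed_starts() gives the timestamps, which can then be compared against the Torque server log for the same period.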

Kevin
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
