On Sat, 27 Oct 2007, Jan Ploski wrote:
Kevin Hildebrand wrote:
Hello, at some point today, my Maui/Torque installation stopped running
jobs. It appears that Maui is able to select an available set of nodes,
but then can't start the job. I'm not getting any errors on the Torque
side; in fact, the Torque logs show no sign that the job is even being
started. Here's what I'm seeing in the Maui logs:
10/26 16:43:25 INFO: tasks located for job 21542: 2 of 2 required (36 feasible)
10/26 16:43:25 INFO: allocated MNode[000]x2 'compute-2-1.deepthought.umd.edu' to 21542:0
10/26 16:43:25 MJobStart(21542)
10/26 16:43:25 MJobDistributeTasks(21542,DEEPTHOUGHT.UMD.EDU,NodeList,TaskMap)
10/26 16:43:25 INFO: 1 node(s)/2 task(s) added to 21542:0
10/26 16:43:25 INFO: MNode[000] 'compute-2-1.deepthought.umd.edu'(x2) added to job '21542'
[020] compute-2-1.deepthought.umd.edu: (P:4,S:5405,M:3946,D:1) [Idle][linux][[NONE]]<0.020000> C:[debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4][DEFAULT] [noib][prod][dell1950] [debug 4:4][narrow-med 4:4][narrow-long 4:4][narrow-extended 4:4][med-extended 4:4][wide-debug 4:4][wide-short 4:4][wide-med 4:4][serial 4:4][grid 4:4][dev 4:4]
10/26 16:43:25 INFO: end of list reached. 1 nodes found
10/26 16:43:25 INFO: tasks distributed: 2 (Round-Robin)
10/26 16:43:25 MAMAllocJReserve(21542,RIndex,ErrMsg)
10/26 16:43:25 MRMJobStart(21542,Msg,SC)
10/26 16:43:25 INFO: cannot start job 21542 (cannot start job - fail iteration)
10/26 16:43:25 WARNING: cannot start job '21542' through resource manager
10/26 16:43:25 ERROR: MBFFirstFit: cannot start job 21542.0
Does anybody have a clue as to what's going on? (I've tried restarting both
Torque and Maui, and the problem continues.)
What does checkjob -v 21542 tell you?
Regards,
Jan Ploski
Well, I walked away from it for a few hours and came back, and all of the
stuck jobs are running. checkjob wasn't showing anything interesting: it
said "job can run in partition DEFAULT", and there were no
"Rejection Reasons". Numerous nodes were available that met the job
selection criteria.
Kevin
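
For anyone hitting the same symptom, one quick way to see which jobs Maui
keeps failing to start is to tally the "cannot start job" lines in the Maui
log. The following is only a minimal Python sketch, not part of Maui or
Torque. The function name count_failed_starts is made up for illustration,
and the pattern assumes that each log entry sits on a single line in the log
file and uses the message wording shown in the excerpt in this thread.

```python
import re
from collections import Counter

# Matches the "cannot start job <id>" messages seen in this thread's excerpt.
# The job ID may be wrapped in single quotes or followed by a ".0" suffix.
# Assumption: other Maui versions may word these messages differently.
FAILED_START = re.compile(r"cannot start job '?(\d+)")


def count_failed_starts(lines):
    """Return a Counter mapping job ID to the number of failed-start lines."""
    counts = Counter()
    for line in lines:
        match = FAILED_START.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the log excerpt in this thread (one entry per line).
sample = [
    "10/26 16:43:25 MRMJobStart(21542,Msg,SC)",
    "10/26 16:43:25 INFO: cannot start job 21542 (cannot start job - fail iteration)",
    "10/26 16:43:25 WARNING: cannot start job '21542' through resource manager",
    "10/26 16:43:25 ERROR: MBFFirstFit: cannot start job 21542.0",
]

for job_id, hits in count_failed_starts(sample).most_common():
    print(job_id, hits)
```

To run it against a real log, pass an open file object instead of the sample
list. The location of the Maui log depends on the installation, so it is left
out here.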