Itay M wrote:
We've tried the new configuration (unset resources_default.ncpus and unset resources_max.ncpus, at both the queue and server levels) in the last few days and here are the results:

I suppose you did check with qstat -f that 'ncpus' is not mentioned anywhere any longer?
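One way to run that check is sketched below, assuming the Torque client tools are on the PATH. find_ncpus is a hypothetical helper name, not part of Torque; it just greps whatever text it is given. The qstat calls are skipped when qstat is not installed.

```shell
#!/bin/sh
# Hypothetical helper: print every line on stdin that mentions "ncpus"
# (with its line number), or a short note if there is none.
find_ncpus() {
    grep -n 'ncpus' || echo "no ncpus settings found"
}

# Assumption: qstat is the Torque/PBS client. -Q -f prints the full
# attributes of all queues, -B -f the full attributes of the server.
if command -v qstat >/dev/null 2>&1; then
    echo "== queues ==";  qstat -Q -f | find_ncpus
    echo "== server ==";  qstat -B -f | find_ncpus
fi
```

If either section still lists a resources_default.ncpus or resources_max.ncpus line, the unset did not take effect at that level.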

* For the first time we were able to see jobs being backfilled! It never happened before, and this is a major improvement. Though we saw it only in one of our queues (named 'b_que'), it might have happened in other queues as well (we couldn't verify it yet).
* But the 'insufficient idle procs available' problem is still there. For example, at the moment showq shows that there are plenty of non-busy processors ('65 of 84 Processors Active'), but checkjob says for queued jobs:

checking job 228665
State: Idle
Creds:  user:b group:b   class:b_que  qos:hi
WallTime: 00:00:00 of 00:05:00
SubmitTime: Tue Jan 29 19:47:04
  (Time Queued  Total: 00:07:49  Eligible: 00:07:16)
Total Tasks: 1
Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
Dedicated Resources Per Task: PROCS: 1  MEM: 512M

IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE
PE:  1.00  StartPriority:  1007
job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs:  12  feasible procs:   0
>
> :(
> What should I check next?

Maybe it has something to do with the MEM requirement (just a wild guess... but try removing it). What does diagnose -n say for a node which is incorrectly rejecting the job? Does it have enough free "tokens" (not sure if this is what they are called officially) to run the job in this b_que class?

Regards,
Jan Ploski
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
