Hi all,

We're using:
maui-3.2.6p19_20.snap.1182974819-4.slc3
maui-server-3.2.6p19_20.snap.1182974819-4.slc3
maui-client-3.2.6p19_20.snap.1182974819-4.slc3
torque-client-2.1.9-4cri.slc3
torque-server-2.1.9-4cri.slc3
torque-2.1.9-4cri.slc3

Last week we noticed some strange behaviour in Maui, and we are now able to
reproduce it:

1.-) We submit a job requesting a special resource, for example a node
with Scientific Linux 3:

$ qsub -q slc3 job.sh
3445312.pbs01.pic.es

Our slc3 queue asks for that resource by default:

# qmgr -c "p s"|grep slc3
[...]
set queue slc3 resources_default.neednodes = slc3
[...]
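
As far as we understand the matching, neednodes is checked against the
node "properties" list, so the submission above should be equivalent to
requesting the property explicitly:

$ qsub -q slc3 -l nodes=1:slc3 job.sh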

2.-) We close the only node with that resource, so no WN matches the
job.

# pbsnodes td248.pic.es
td248.pic.es
     state = offline
     np = 10
     properties = slc3
     ntype = cluster
[...]
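
We close the node with pbsnodes -o (that is what sets state = offline
above; step 5 clears it again with -c):

# pbsnodes -o td248.pic.es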

3.-) Our job moves to the first position in our queue, and Maui sees that
it cannot find a WN:

[...]
3440629              nsidro    Running     1  3:00:00:00  Thu Dec 13 12:54:10

   196 Active Jobs     196 of  249 Processors Active (78.71%)
                        57 of   62 Nodes Active      (91.94%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

3445312            arnaubria       Idle     1  3:00:00:00  Thu Dec 13 12:30:43
3440631              nsidro       Idle     1  3:00:00:00  Wed Dec 12 20:58:15
[...]


# checkjob 3445312


checking job 3445312

State: Idle
Creds:  user:arnaubria  group:grid  class:slc3  qos:DEFAULT
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Thu Dec 13 12:30:43
  (Time Queued  Total: 00:25:29  Eligible: 00:24:28)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc3]


IWD: [NONE]  Executable:  [NONE]
Bypass: 10  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  1.00  StartPriority:  1000001000  SystemPriority:  1000

job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
idle procs: 151  feasible procs:   0

Rejection Reasons: [Features     :   64][State        :    1]




4.-) Maui does not schedule any other jobs, so the farm drains.

checkjob output for the second job in the queue:

# checkjob 3440631


checking job 3440631

State: Idle
Creds:  user:nsidro  group:magic  class:long  qos:lhmagic
WallTime: 00:00:00 of 3:00:00:00
SubmitTime: Wed Dec 12 20:58:15
  (Time Queued  Total: 15:57:57  Eligible: 15:51:32)

Total Tasks: 1

Req[0]  TaskCount: 1  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [slc4]


IWD: [NONE]  Executable:  [NONE]
Bypass: 10  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE

PE:  1.00  StartPriority:  89
job can run in partition DEFAULT (32 procs available.  1 procs required)


Progress of jobs in our farm (running / queued):

224  2358
222  2359
215  2355
213  2365
194  2371
...


5.-) We reopen the WN, and everything works fine again:

# pbsnodes -c td248.pic.es

Immediately after that, jobs start again:

222  2337


Something similar happened when requesting hosts with "slc3 && slc4":
no node satisfied that condition and Maui hung (example request below).
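
We requested the ANDed properties with the usual colon syntax, something
like:

$ qsub -l nodes=1:slc3:slc4 job.sh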

So, is this a bug? Is anyone else seeing the same problem? Is there any
workaround?
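
In case it helps, this is the maui.cfg direction we plan to test as a
workaround. It is only an untested sketch, and we are not sure it covers
this case; the parameter names are the ones documented for Maui 3.2:

# maui.cfg - untested sketch
BACKFILLPOLICY        FIRSTFIT        # let lower-priority jobs backfill around the blocked job
RESERVATIONPOLICY     CURRENTHIGHEST  # reserve nodes only for the current top-priority job
RESERVATIONDEPTH[0]   1               # keep at most one priority reservation
DEFERTIME             1:00:00         # sideline a job that repeatedly fails to start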


Cheers,
Arnau