A workaround has been found.  Apparently our problem is due to a bug that
is still present in the latest Maui snapshot, maui-3.2.6p20-snap.1182974819.

After much experimentation it turned out that Maui refuses to
create a Standing Reservation if the nodes in question are
in the "offline" state in Torque.  The workaround is thus:

1) Stop Maui.
2) Clear the offline status with "pbsnodes -c nodelist".
3) Restart Maui with the Standing Reservations configuration in place.
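
The workaround could be scripted roughly as below.  This is only a sketch:
the node names q001 to q084 are inferred from the log excerpt and node count
quoted further down, the helper names are made up for illustration, and how
Maui is stopped and started is site-specific, so those two steps are left as
comments.  Only "pbsnodes -c" comes from the procedure itself.

```python
def node_names(prefix="q", first=1, last=84, width=3):
    """Return names such as q001 ... q084 (assumed naming scheme)."""
    return [f"{prefix}{i:0{width}d}" for i in range(first, last + 1)]


def clear_offline_command(nodes):
    """Build the Torque command that clears the offline flag on the nodes."""
    return ["pbsnodes", "-c", *nodes]


# Step 1: stop Maui (site-specific, not shown here).
# Step 2: clear the offline status in Torque.
command = clear_offline_command(node_names())
print(" ".join(command))
# To actually run it: subprocess.run(command, check=True)
# Step 3: start Maui again with the SRCFG lines present in maui.cfg.
```

The script only prints the command, so it can be reviewed before anything is
changed on the cluster.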

The reason we offlined our new nodes is of course that we didn't
want any jobs to start on those nodes when they initially came up.
We hoped to control access to the nodes using Standing Reservations.

IMHO, the fix needed in Maui is to honor Standing Reservations even
for nodes that are in an "offline" or even "down" state.

Maui developers:  Is this a feasible and desirable modification?

Thanks,
Ole

Ole Holm Nielsen wrote:
We're trying to set up a new Standing Reservation (SR) in the Maui maui.cfg
file so that a set of newly installed nodes is reserved for a small group of
test users.

We have an old SR that works perfectly, but the new SR named "switch5" cannot
be created as shown in the maui.log:

...
09/04 14:30:43 INFO:     MNode[083] 'q083' added to regex list
09/04 14:30:43 INFO:     MNode[084] 'q084' added to regex list
09/04 14:30:43 MSRSetRes(switch5,1,0)
09/04 14:30:43 MJobSetCreds(switch5.0,[ALL],[ALL],[ALL])
09/04 14:30:43 MSRGetAttributes(switch5,0,Start,Duration)
09/04 14:30:43 INFO: attempting standing reservation of 336 procs in -INFINITY for INFINITY
09/04 14:30:43 MSRSelectNodeList(switch5.0,switch5,DstNL,NodeCount,00:00:00,ReqNL,12)
09/04 14:30:43 INFO: 0 feasible tasks found for job switch5.0:0 in partition DEFAULT (1 Needed)
09/04 14:30:43 ALERT: cannot select 336 procs in partition '[ALL]' for SR 'switch5'
09/04 14:30:43 MSRSetRes(switch5,1,1)
09/04 14:30:43 MJobSetCreds(switch5.1,[ALL],[ALL],[ALL])
09/04 14:30:43 MSRGetAttributes(switch5,1,Start,Duration)
09/04 14:30:43 INFO:     reservation not required for specified period
09/04 14:30:43 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
...
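
In a large maui.log, failures like this can be picked out by filtering for
ALERT lines that mention the reservation.  A minimal sketch, using lines from
the excerpt above as sample input; the function name is made up for
illustration, and the "for SR '<name>'" wording is assumed from this one
excerpt rather than from any Maui documentation.

```python
def find_sr_alerts(log_lines, sr_name):
    """Return ALERT lines that mention the given standing reservation."""
    needle = f"SR '{sr_name}'"
    return [line for line in log_lines if "ALERT:" in line and needle in line]


sample = [
    "09/04 14:30:43 INFO: 0 feasible tasks found for job switch5.0:0 "
    "in partition DEFAULT (1 Needed)",
    "09/04 14:30:43 ALERT: cannot select 336 procs in partition '[ALL]' "
    "for SR 'switch5'",
    "09/04 14:30:43 MSRSetRes(switch5,1,1)",
]

alerts = find_sr_alerts(sample, "switch5")
for line in alerts:
    print(line)
```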

Apparently the 84 nodes (4 CPUs each) are located correctly, but the reason
for the above ALERT message is incomprehensible!  The net result is that
the configured SR isn't working, and the new nodes run production jobs
that shouldn't land on these nodes.  This is a big problem for us :-(
I looked into the code in src/moab/MJob.c without gaining any understanding
of the problem (my fault, of course :-).

Question: Can anyone point to what's wrong with our SRs or with Maui itself?

FYI, we run Torque 2.1.8 and Maui 3.2.6p20. This is an excerpt from our maui.cfg:

NODESETPOLICY           ONEOF
NODESETATTRIBUTE        FEATURE
NODESETDELAY            1
NODESETLIST             switch1 switch2 switch3 switch4 switch5 infiniband
NODESETPRIORITYTYPE     BESTFIT
# Reservation of the nodes p0XX with Infiniband
SRCFG[infiniband]       HOSTLIST=p0[012][0-9]
SRCFG[infiniband]       USERLIST=jensj,bligaard,ohnielse,moses,efernand,studt,ibensig,dc
SRCFG[infiniband]       PERIOD=INFINITY
SRCFG[infiniband]       NODEFEATURES=infiniband
# Testing of new nodes q0XX
SRCFG[switch5]       HOSTLIST=q0[0-9][0-9]
SRCFG[switch5]       USERLIST=jensj,dulak,ohnielse
SRCFG[switch5]       PERIOD=INFINITY
SRCFG[switch5]       NODEFEATURES=switch5
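
As a sanity check on the numbers, the switch5 HOSTLIST pattern above can be
compared with the "336 procs" in the log.  This sketch assumes the pattern
behaves like an ordinary regular expression matched against the whole node
name (an assumption about Maui, not something confirmed here) and that the
nodes are named q001 to q084.

```python
import re

HOSTLIST = r"q0[0-9][0-9]"   # pattern from SRCFG[switch5]
CPUS_PER_NODE = 4            # "84 nodes (4 CPUs each)"

# Assumed node names q001 ... q084.
nodes = [f"q{i:03d}" for i in range(1, 85)]

matched = [name for name in nodes if re.fullmatch(HOSTLIST, name)]
procs = len(matched) * CPUS_PER_NODE

print(len(matched), procs)   # prints: 84 336
```

Under those assumptions the pattern selects all 84 nodes, and 84 x 4 = 336
agrees with the processor count Maui reports, so the host list itself looks
consistent with the log.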

--
Ole Holm Nielsen
Department of Physics, Technical University of Denmark
_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
