Hello all,

I am setting up a cluster that has been moved from one university to
another, and I think I'm having Maui issues associated with the new
hostname. The cluster runs SuSE Linux 10.2 with Torque 2.1.8 and Maui
3.2.6p13. I have made a number of necessary changes and searched
exhaustively through both the Maui and Torque forums, to no avail.
I've also looked over the Maui and Torque manuals, and am now slowly
losing sanity. Any advice is appreciated. Here is what I did:

1) updated maui.cfg SERVERHOST
2) updated torque server_name and mom_priv/config
3) updated /etc/hosts on master
4) updated network configurations
5) reconfigured pbs_server with new acl_hosts = new_hostname
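
Steps 1, 2 and 5 all have to name the same host, so a leftover old
hostname in one of them is worth ruling out. The following is only a
minimal sketch in Python of such a cross-check. It assumes that
maui.cfg names the server through SERVERHOST and through SERVER=
inside SCHEDCFG[...], and the hostnames in the sample are hypothetical
placeholders, not the real ones from this cluster.

```python
import re

def maui_server_hosts(cfg_text):
    """Collect every hostname that maui.cfg text assigns to the server."""
    hosts = set()
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        m = re.match(r"SERVERHOST\s+(\S+)", line)
        if m:
            hosts.add(m.group(1))
        # Assumed form: SCHEDCFG[name] ... SERVER=host[:port]
        m = re.search(r"\bSERVER=(\S+)", line)
        if m:
            hosts.add(m.group(1).split(":")[0])
    return hosts

def mismatches(cfg_text, torque_server_name):
    """Return Maui server hostnames that differ from Torque's server_name."""
    expected = torque_server_name.strip()
    return sorted(h for h in maui_server_hosts(cfg_text) if h != expected)

if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    sample_cfg = (
        "SERVERHOST   newmaster.example.edu\n"
        "SCHEDCFG[base] MODE=NORMAL SERVER=oldmaster.example.edu:42559\n"
    )
    print(mismatches(sample_cfg, "newmaster.example.edu\n"))
```

In practice the two arguments would be the contents of maui.cfg and of
Torque's server_name file. An empty list means the names agree.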

At this point, I can ping the nodes from the master. I logged into one
node and changed its /etc/hosts so that I can now ping the master and
the other nodes from it (this is the node I am working with and
submitting jobs to, before I make any changes on the other nodes). The
MOM logs also indicate good communication. My problem occurs when I
submit a job: it hangs in the queue, although I can start it with
sudo qrun without a problem. At first, checkjob indicated the reason was:

job is deferred.  Reason:  RMFailure  (job cannot be started - cannot
set hostlist)

I tried a few things that did not work:
- releasejob
- Running Maui in simulation mode, which returned:
  ERROR:    cannot open user interface socket on port 42559
- Manually setting the host list in maui.cfg via 'SRHOSTLIST[27]
  node2 node3 ...', keeping in mind that the default is ALL. This gives a
  different checkjob error:
PE:  1.00  StartPriority:  1
job cannot run in partition DEFAULT (idle procs do not meet
requirements : 0 of 1 procs found)
idle procs:  36  feasible procs:   0

Rejection Reasons: [State        :   17][ReserveTime  :    9]

Detailed Node Availability Information:

node2                    rejected : State
...
node9                    rejected : State
node10                   rejected : ReserveTime
node11                   rejected : ReserveTime
node12                   rejected : ReserveTime
node13                   rejected : ReserveTime
node14                   rejected : State
node15                   rejected : State
node16                   rejected : State
node17                   rejected : ReserveTime
node18                   rejected : State
node19                   rejected : ReserveTime
node20                   rejected : ReserveTime
node21                   rejected : ReserveTime
node22                   rejected : ReserveTime
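
To see at a glance which nodes fall under which rejection reason, the
per-node lines above can be grouped. This is a minimal sketch in
Python that assumes each line has the form "<node> rejected : <reason>"
as in the output above. The function name is made up for illustration,
and the sample is only a subset of the lines from this post.

```python
import re

# One line of the "Detailed Node Availability Information" block.
REJECT_RE = re.compile(r"^\s*(\S+)\s+rejected\s*:\s*(\S+)\s*$")

def tally_rejections(checkjob_text):
    """Map each rejection reason to the list of nodes reporting it."""
    by_reason = {}
    for line in checkjob_text.splitlines():
        m = REJECT_RE.match(line)
        if m:
            node, reason = m.groups()
            by_reason.setdefault(reason, []).append(node)
    return by_reason

if __name__ == "__main__":
    sample = (
        "node9                    rejected : State\n"
        "node10                   rejected : ReserveTime\n"
        "node11                   rejected : ReserveTime\n"
        "node14                   rejected : State\n"
    )
    for reason, nodes in tally_rejections(sample).items():
        print(f"{reason}: {len(nodes)} node(s): {' '.join(nodes)}")
```
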

I then ran checknode on the node I submitted to:
checking node node22

State:      Idle  (in current state for 00:07:46)
Configured Resources: PROCS: 4  MEM: 7864M  SWAP: 9803M  DISK: 1M
Utilized   Resources: [NONE]
Dedicated  Resources: [NONE]
Opsys:       DEFAULT  Arch:      [NONE]
Speed:      1.00  Load:       0.000
Network:    [DEFAULT]
Features:   [general]
Attributes: [Batch]
Classes:    [q1 4:4][batch 4:4]

Total Time: 00:07:19  Up: 00:07:19 (100.00%)  Active: 00:00:00 (0.00%)

Reservations:
  User '27.0.0'(x1)  -00:07:46 -> 13:55:21 (14:03:07)
    Blocked Resources@-00:07:46   Procs: 4/4 (100.00%)
  User '27.1.0'(x1)  13:55:21 -> 1:13:55:21 (1:00:00:00)
    Blocked Resources@13:55:21    Procs: 4/4 (100.00%)
  User 'normal.0.0'(x1)  -00:07:46 -> 13:55:21 (14:03:07)
    Blocked Resources@-00:07:46   Procs: 4/4 (100.00%)
  User 'normal.1.0'(x1)  13:55:21 -> 1:13:55:21 (1:00:00:00)
    Blocked Resources@13:55:21    Procs: 4/4 (100.00%)
ALERT:  node is overcommitted at time -00:07:46 (P: -4)
ALERT:  node is overcommitted at time 13:55:21 (P: -4)
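
One way to read the "P: -4" in the alerts: two reservations start at
the same time and each blocks all 4 procs, so 4 configured minus 8
blocked leaves -4. This is my interpretation of the output, not
something taken from the Maui documentation. A minimal sketch in
Python that redoes the arithmetic from checknode text (the function
name is made up for illustration):

```python
import re
from collections import defaultdict

PROCS_RE = re.compile(r"Configured Resources:.*?PROCS:\s*(\d+)")
BLOCKED_RE = re.compile(r"Blocked Resources@(\S+)\s+Procs:\s*(\d+)/(\d+)")

def proc_balance(checknode_text):
    """Return {start_time: configured procs - procs blocked by reservations}."""
    m = PROCS_RE.search(checknode_text)
    if m is None:
        raise ValueError("no 'Configured Resources' line found")
    configured = int(m.group(1))
    blocked = defaultdict(int)
    # Sum the blocked procs of all reservations sharing a start time.
    for start, used, _total in BLOCKED_RE.findall(checknode_text):
        blocked[start] += int(used)
    return {start: configured - n for start, n in blocked.items()}

if __name__ == "__main__":
    # Lines copied from the checknode output above.
    sample = (
        "Configured Resources: PROCS: 4  MEM: 7864M  SWAP: 9803M  DISK: 1M\n"
        "  User '27.0.0'(x1)  -00:07:46 -> 13:55:21 (14:03:07)\n"
        "    Blocked Resources@-00:07:46   Procs: 4/4 (100.00%)\n"
        "  User '27.1.0'(x1)  13:55:21 -> 1:13:55:21 (1:00:00:00)\n"
        "    Blocked Resources@13:55:21    Procs: 4/4 (100.00%)\n"
        "  User 'normal.0.0'(x1)  -00:07:46 -> 13:55:21 (14:03:07)\n"
        "    Blocked Resources@-00:07:46   Procs: 4/4 (100.00%)\n"
        "  User 'normal.1.0'(x1)  13:55:21 -> 1:13:55:21 (1:00:00:00)\n"
        "    Blocked Resources@13:55:21    Procs: 4/4 (100.00%)\n"
    )
    print(proc_balance(sample))
```

Run on the lines above, this prints {'-00:07:46': -4, '13:55:21': -4},
which matches both ALERT lines.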

If I remove the 'SRHOSTLIST[27] node2 node3 ...' line from maui.cfg, I
get the previous "RMFailure  (job cannot be started - cannot set
hostlist)" error. Thus, for now, I am keeping this line active in
maui.cfg so that I can at least see the job failure reasons.

Can anyone tell me why Maui thinks all my nodes are overcommitted even
though I can force jobs to run on them with PBS (via qrun)?

Thanks in advance,
Enoch

p.s. Here's some config info that may be of use:

Torque config:
# Create queues and set their attributes.
#
#
# Create and define queue q1
#
create queue q1
set queue q1 queue_type = Execution
set queue q1 acl_users = ***
***
set queue q1 resources_default.nodes = 1
set queue q1 resources_default.walltime = 100:00:00
set queue q1 enabled = True
set queue q1 started = True
#
# Create and define queue batch (where *** indicates I have changed the output)
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_users = ***
***
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 100:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = ***
set server default_queue = q1
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.nodes = 1
set server resources_default.walltime = 100:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.8
--------------------------------------------------------------------------------
MAUI config:
# Maui version 3.2.6p13 (PID: 18804)
# global policies

REJECTNEGPRIOJOBS[0]              FALSE
ENABLENEGJOBPRIORITY[0]           FALSE
ENABLEMULTINODEJOBS[0]            TRUE
ENABLEMULTIREQJOBS[0]             FALSE
BFPRIORITYPOLICY[0]               [NONE]
JOBPRIOACCRUALPOLICY            QUEUEPOLICY
NODELOADPOLICY                  ADJUSTSTATE
USEMACHINESPEED                 FALSE
USESYSTEMQUEUETIME              TRUE
USELOCALMACHINEPRIORITY         FALSE
NODEUNTRACKEDLOADFACTOR         1.2
JOBNODEMATCHPOLICY[0]             EXACTNODE

JOBMAXSTARTTIME[0]                  INFINITY

METAMAXTASKS[0]                   0
NODESETPOLICY[0]                  [NONE]
NODESETATTRIBUTE[0]               [NONE]
NODESETLIST[0]
NODESETDELAY[0]                   00:00:00
NODESETPRIORITYTYPE[0]            MINLOSS
NODESETTOLERANCE[0]                 0.00

BACKFILLPOLICY[0]                 FIRSTFIT
BACKFILLDEPTH[0]                  0
BACKFILLPROCFACTOR[0]             0
BACKFILLMAXSCHEDULES[0]           10000
BACKFILLMETRIC[0]                 PROCS

BFCHUNKDURATION[0]                00:00:00
BFCHUNKSIZE[0]                    0
PREEMPTPOLICY[0]                  REQUEUE
MINADMINSTIME[0]                  00:00:00
RESOURCELIMITPOLICY[0]
NODEAVAILABILITYPOLICY[0]         COMBINED:[DEFAULT]
NODEALLOCATIONPOLICY[0]           CPULOAD
TASKDISTRIBUTIONPOLICY[0]         DEFAULT
RESERVATIONPOLICY[0]              CURRENTHIGHEST
RESERVATIONRETRYTIME[0]           00:00:00
RESERVATIONTHRESHOLDTYPE[0]       NONE
RESERVATIONTHRESHOLDVALUE[0]      0
FSPOLICY                        [NONE]
FSINTERVAL                      12:00:00
FSDEPTH                         8
FSDECAY                         1.00

# Priority Weights
SERVICEWEIGHT[0]                  1
TARGETWEIGHT[0]                   1
CREDWEIGHT[0]                     1
ATTRWEIGHT[0]                     1
FSWEIGHT[0]                       1
RESWEIGHT[0]                      1
USAGEWEIGHT[0]                    1
QUEUETIMEWEIGHT[0]                1
XFACTORWEIGHT[0]                  0
SPVIOLATIONWEIGHT[0]              0
BYPASSWEIGHT[0]                   0
TARGETQUEUETIMEWEIGHT[0]          0
TARGETXFACTORWEIGHT[0]            0
USERWEIGHT[0]                     0
GROUPWEIGHT[0]                    0
ACCOUNTWEIGHT[0]                  0
QOSWEIGHT[0]                      0
CLASSWEIGHT[0]                    0
FSUSERWEIGHT[0]                   0
FSGROUPWEIGHT[0]                  0
FSACCOUNTWEIGHT[0]                0
FSQOSWEIGHT[0]                    0
FSCLASSWEIGHT[0]                  0
ATTRATTRWEIGHT[0]                 0
ATTRSTATEWEIGHT[0]                0
NODEWEIGHT[0]                     0
PROCWEIGHT[0]                     0
MEMWEIGHT[0]                      0
SWAPWEIGHT[0]                     0
DISKWEIGHT[0]                     0
PSWEIGHT[0]                       0
PEWEIGHT[0]                       0
WALLTIMEWEIGHT[0]                 0
UPROCWEIGHT[0]                    0
UJOBWEIGHT[0]                     0
CONSUMEDWEIGHT[0]                 0
REMAININGWEIGHT[0]                0
PERCENTWEIGHT[0]                  0
XFMINWCLIMIT[0]                   00:02:00


# partition DEFAULT policies

REJECTNEGPRIOJOBS[1]              FALSE
ENABLENEGJOBPRIORITY[1]           FALSE
ENABLEMULTINODEJOBS[1]            TRUE
ENABLEMULTIREQJOBS[1]             FALSE
BFPRIORITYPOLICY[1]               [NONE]
JOBPRIOACCRUALPOLICY            QUEUEPOLICY
NODELOADPOLICY                  ADJUSTSTATE
JOBNODEMATCHPOLICY[1]

JOBMAXSTARTTIME[1]                  INFINITY

METAMAXTASKS[1]                   0
NODESETPOLICY[1]                  [NONE]
NODESETATTRIBUTE[1]               [NONE]
NODESETLIST[1]
NODESETDELAY[1]                   00:00:00
NODESETPRIORITYTYPE[1]            MINLOSS
NODESETTOLERANCE[1]                 0.00

# Priority Weights

XFMINWCLIMIT[1]                   00:00:00

SRTASKCOUNT[0]                    0
SRTPN[0]                          0
SRRESOURCES[0]                    PROCS=-1;MEM=0;DISK=0;SWAP=0
SRDEPTH[0]                        2
SRSTARTTIME[0]                    00:00:00
SRENDTIME[0]                      00:00:00
SRWSTARTTIME[0]                   00:00:00
SRWENDTIME[0]                     00:00:00
SRDAYS[0]                         ALL
SRHOSTLIST[0]                     node2 node3 node4 node5 node6 node7
node8 node9 node10 node11 node12 node13 node14 node15 node16 node17
node18 node19 node20 node21 node22 node23 node24 node25 node26 node27
SRCHARGEACCOUNT[0]
SRCFG[27]                         HOSTLIST=node2 node3 node4 node5
node6 node7 node8 node9 node10 node11 node12 node13 node14 node15
node16 node17 node18 node19 node20 node21 node22 node23 node24 node25
node26 node27

RMAUTHTYPE[0]                     CHECKSUM

CLASSCFG[[NONE]]  DEFAULT.FEATURES=[NONE]
CLASSCFG[[ALL]]  DEFAULT.FEATURES=[NONE]
CLASSCFG[q1]  DEFAULT.FEATURES=[NONE]
CLASSCFG[batch]  DEFAULT.FEATURES=[NONE]
****skip node specific info****
# SERVER MODULES:  MX
SERVERMODE                      NORMAL
SERVERNAME
SERVERHOST                      ***
SERVERPORT                      42559
LOGFILE                         maui.log
LOGFILEMAXSIZE                  10000000
LOGFILEROLLDEPTH                1
LOGLEVEL                        4
LOGFACILITY                     fALL
SERVERHOMEDIR                   /usr/local/maui/
TOOLSDIR                        /usr/local/maui/tools/
LOGDIR                          /usr/local/maui/log/
STATDIR                         /usr/local/maui/stats/
LOCKFILE                        /usr/local/maui/maui.pid
SERVERCONFIGFILE                /usr/local/maui/maui.cfg
CHECKPOINTFILE                  /usr/local/maui/maui.ck
CHECKPOINTINTERVAL              00:05:00
CHECKPOINTEXPIRATIONTIME        3:11:20:00
TRAPJOB
TRAPNODE
TRAPFUNCTION
RESDEPTH                        24

RMPOLLINTERVAL                  00:00:30
NODEACCESSPOLICY                SHARED
ALLOCLOCALITYPOLICY             [NONE]
SIMTIMEPOLICY                   [NONE]
ADMIN1                          admin1 ***
ADMINHOSTS                      ALL
NODEPOLLFREQUENCY               0
DISPLAYFLAGS
DEFAULTDOMAIN
DEFAULTCLASSLIST                [DEFAULT:1]
FEATURENODETYPEHEADER
FEATUREPROCSPEEDHEADER
FEATUREPARTITIONHEADER
DEFERTIME                       1:00:00
DEFERCOUNT                      24
DEFERSTARTCOUNT                 1
JOBPURGETIME                    0
NODEPURGETIME                   2140000000
APIFAILURETHRESHHOLD            6
NODESYNCTIME                    600
JOBSYNCTIME                     600
JOBMAXOVERRUN                   00:10:00
NODEMAXLOAD                     0.0

PLOTMINTIME                     120
PLOTMAXTIME                     245760
PLOTTIMESCALE                   11
PLOTMINPROC                     1
PLOTMAXPROC                     512
PLOTPROCSCALE                   9
SCHEDCFG[]                        MODE=NORMAL SERVER=***
# RM MODULES: PBS SSS WIKI NATIVE
RMCFG[***] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:00:09 TYPE=PBS
SIMWORKLOADTRACEFILE            workload
SIMRESOURCETRACEFILE            resource
SIMAUTOSHUTDOWN                 OFF
SIMSTARTTIME                    0
SIMSCALEJOBRUNTIME              FALSE
SIMFLAGS
SIMJOBSUBMISSIONPOLICY          CONSTANTJOBDEPTH
SIMINITIALQUEUEDEPTH            16
SIMWCACCURACY                   0.00
SIMWCACCURACYCHANGE             0.00
SIMNODECOUNT                    0
SIMNODECONFIGURATION            NORMAL
SIMWCSCALINGPERCENT             100
SIMCOMRATE                      0.10
SIMCOMTYPE                      ROUNDROBIN
COMINTRAFRAMECOST               0.30
COMINTERFRAMECOST               0.30
SIMSTOPITERATION                -1
SIMEXITITERATION                -1