Hello all,

I am setting up a cluster that has been moved from one university to another. I think I'm having Maui issues associated with the new hostname. The cluster runs SuSE Linux 10.2, with Torque 2.1.8 and Maui 3.2.6p13. I have made a number of necessary changes and searched exhaustively through both the Maui and Torque forums, to no avail. I've also looked over the Maui and Torque manuals, and am now slowly losing sanity. Any advice is appreciated.

Here is what I did:
1) updated maui.cfg SERVERHOST
2) updated torque server_name and mom_priv/config
3) updated /etc/hosts on master
4) updated network configurations
5) reconfigured pbs_server with new acl_hosts = new_hostname

At this point, I can ping the nodes from the master. I logged into one node and changed its /etc/hosts so that I can now ping the master and the other nodes from this node (this is the node I am working with and submitting jobs to, before I make any changes on other nodes). The mom logs also indicate good communication.

My problem occurs when I submit a job: it hangs in the queue. I can sudo qrun it with no problem. At first, checkjob indicated the reason was:

    job is deferred. Reason: RMFailure (job cannot be started - cannot set hostlist)

I tried a few things that did not work:

- releasejob
- running maui in simulation mode, which returned:
      ERROR: cannot open user interface socket on port 42559
- manually setting the hostlist in maui.cfg via 'SRHOSTLIST[27] node2 node3 ...', keeping in mind that the default is ALL. This gives a different checkjob error:

    PE: 1.00 StartPriority: 1
    job cannot run in partition DEFAULT (idle procs do not meet requirements : 0 of 1 procs found)
    idle procs: 36 feasible procs: 0

    Rejection Reasons: [State : 17][ReserveTime : 9]

    Detailed Node Availability Information:

    node2 rejected : State
    ...
    node9 rejected : State
    node10 rejected : ReserveTime
    node11 rejected : ReserveTime
    node12 rejected : ReserveTime
    node13 rejected : ReserveTime
    node14 rejected : State
    node15 rejected : State
    node16 rejected : State
    node17 rejected : ReserveTime
    node18 rejected : State
    node19 rejected : ReserveTime
    node20 rejected : ReserveTime
    node21 rejected : ReserveTime
    node22 rejected : ReserveTime

I ran checknode on the node I submitted to:

    checking node node22

    State: Idle (in current state for 00:07:46)
    Configured Resources: PROCS: 4 MEM: 7864M SWAP: 9803M DISK: 1M
    Utilized Resources: [NONE]
    Dedicated Resources: [NONE]
    Opsys: DEFAULT Arch: [NONE]
    Speed: 1.00 Load: 0.000
    Network: [DEFAULT]
    Features: [general]
    Attributes: [Batch]
    Classes: [q1 4:4][batch 4:4]

    Total Time: 00:07:19 Up: 00:07:19 (100.00%) Active: 00:00:00 (0.00%)

    Reservations:
      User '27.0.0'(x1) -00:07:46 -> 13:55:21 (14:03:07)
        Blocked Resources@-00:07:46 Procs: 4/4 (100.00%)
      User '27.1.0'(x1) 13:55:21 -> 1:13:55:21 (1:00:00:00)
        Blocked Resources@13:55:21 Procs: 4/4 (100.00%)
      User 'normal.0.0'(x1) -00:07:46 -> 13:55:21 (14:03:07)
        Blocked Resources@-00:07:46 Procs: 4/4 (100.00%)
      User 'normal.1.0'(x1) 13:55:21 -> 1:13:55:21 (1:00:00:00)
        Blocked Resources@13:55:21 Procs: 4/4 (100.00%)
    ALERT: node is overcommitted at time -00:07:46 (P: -4)
    ALERT: node is overcommitted at time 13:55:21 (P: -4)

If I get rid of 'SRHOSTLIST[27] node2 node3 ...' in maui.cfg, I get the previous "RMFailure (job cannot be started - cannot set hostlist)". Thus, for now, I am keeping this line active in maui.cfg so that I can at least see the job failure 'reasons'.

Can anyone tell me why maui thinks all my nodes are overcommitted even though I can force them to run with pbs?

Thanks in advance,
Enoch

p.s. Here's some config info that may be of use:

Torque:

# Create queues and set their attributes.
#
#
# Create and define queue q1
#
create queue q1
set queue q1 queue_type = Execution
set queue q1 acl_users = *** ***
set queue q1 resources_default.nodes = 1
set queue q1 resources_default.walltime = 100:00:00
set queue q1 enabled = True
set queue q1 started = True
#
# Create and define queue batch (where *** indicates I have changed the output)
#
create queue batch
set queue batch queue_type = Execution
set queue batch acl_users = *** ***
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 100:00:00
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = ***
set server default_queue = q1
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.nodes = 1
set server resources_default.walltime = 100:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server pbs_version = 2.1.8

--------------------------------------------------------------------------------

MAUI config:

# Maui version 3.2.6p13 (PID: 18804)
# global policies
REJECTNEGPRIOJOBS[0] FALSE
ENABLENEGJOBPRIORITY[0] FALSE
ENABLEMULTINODEJOBS[0] TRUE
ENABLEMULTIREQJOBS[0] FALSE
BFPRIORITYPOLICY[0] [NONE]
JOBPRIOACCRUALPOLICY QUEUEPOLICY
NODELOADPOLICY ADJUSTSTATE
USEMACHINESPEED FALSE
USESYSTEMQUEUETIME TRUE
USELOCALMACHINEPRIORITY FALSE
NODEUNTRACKEDLOADFACTOR 1.2
JOBNODEMATCHPOLICY[0] EXACTNODE
JOBMAXSTARTTIME[0] INFINITY
METAMAXTASKS[0] 0
NODESETPOLICY[0] [NONE]
NODESETATTRIBUTE[0] [NONE]
NODESETLIST[0]
NODESETDELAY[0] 00:00:00
NODESETPRIORITYTYPE[0] MINLOSS
NODESETTOLERANCE[0] 0.00
BACKFILLPOLICY[0] FIRSTFIT
BACKFILLDEPTH[0] 0
BACKFILLPROCFACTOR[0] 0
BACKFILLMAXSCHEDULES[0] 10000
BACKFILLMETRIC[0] PROCS
BFCHUNKDURATION[0] 00:00:00
BFCHUNKSIZE[0] 0
PREEMPTPOLICY[0] REQUEUE
MINADMINSTIME[0] 00:00:00
RESOURCELIMITPOLICY[0]
NODEAVAILABILITYPOLICY[0] COMBINED:[DEFAULT]
NODEALLOCATIONPOLICY[0] CPULOAD
TASKDISTRIBUTIONPOLICY[0] DEFAULT
RESERVATIONPOLICY[0] CURRENTHIGHEST
RESERVATIONRETRYTIME[0] 00:00:00
RESERVATIONTHRESHOLDTYPE[0] NONE
RESERVATIONTHRESHOLDVALUE[0] 0
FSPOLICY [NONE]
FSINTERVAL 12:00:00
FSDEPTH 8
FSDECAY 1.00

# Priority Weights
SERVICEWEIGHT[0] 1
TARGETWEIGHT[0] 1
CREDWEIGHT[0] 1
ATTRWEIGHT[0] 1
FSWEIGHT[0] 1
RESWEIGHT[0] 1
USAGEWEIGHT[0] 1
QUEUETIMEWEIGHT[0] 1
XFACTORWEIGHT[0] 0
SPVIOLATIONWEIGHT[0] 0
BYPASSWEIGHT[0] 0
TARGETQUEUETIMEWEIGHT[0] 0
TARGETXFACTORWEIGHT[0] 0
USERWEIGHT[0] 0
GROUPWEIGHT[0] 0
ACCOUNTWEIGHT[0] 0
QOSWEIGHT[0] 0
CLASSWEIGHT[0] 0
FSUSERWEIGHT[0] 0
FSGROUPWEIGHT[0] 0
FSACCOUNTWEIGHT[0] 0
FSQOSWEIGHT[0] 0
FSCLASSWEIGHT[0] 0
ATTRATTRWEIGHT[0] 0
ATTRSTATEWEIGHT[0] 0
NODEWEIGHT[0] 0
PROCWEIGHT[0] 0
MEMWEIGHT[0] 0
SWAPWEIGHT[0] 0
DISKWEIGHT[0] 0
PSWEIGHT[0] 0
PEWEIGHT[0] 0
WALLTIMEWEIGHT[0] 0
UPROCWEIGHT[0] 0
UJOBWEIGHT[0] 0
CONSUMEDWEIGHT[0] 0
REMAININGWEIGHT[0] 0
PERCENTWEIGHT[0] 0
XFMINWCLIMIT[0] 00:02:00

# partition DEFAULT policies
REJECTNEGPRIOJOBS[1] FALSE
ENABLENEGJOBPRIORITY[1] FALSE
ENABLEMULTINODEJOBS[1] TRUE
ENABLEMULTIREQJOBS[1] FALSE
BFPRIORITYPOLICY[1] [NONE]
JOBPRIOACCRUALPOLICY QUEUEPOLICY
NODELOADPOLICY ADJUSTSTATE
JOBNODEMATCHPOLICY[1]
JOBMAXSTARTTIME[1] INFINITY
METAMAXTASKS[1] 0
NODESETPOLICY[1] [NONE]
NODESETATTRIBUTE[1] [NONE]
NODESETLIST[1]
NODESETDELAY[1] 00:00:00
NODESETPRIORITYTYPE[1] MINLOSS
NODESETTOLERANCE[1] 0.00

# Priority Weights
XFMINWCLIMIT[1] 00:00:00

SRTASKCOUNT[0] 0
SRTPN[0] 0
SRRESOURCES[0] PROCS=-1;MEM=0;DISK=0;SWAP=0
SRDEPTH[0] 2
SRSTARTTIME[0] 00:00:00
SRENDTIME[0] 00:00:00
SRWSTARTTIME[0] 00:00:00
SRWENDTIME[0] 00:00:00
SRDAYS[0] ALL
SRHOSTLIST[0] node2 node3 node4 node5 node6 node7 node8 node9 node10 node11 node12 node13 node14 node15 node16 node17 node18 node19 node20 node21 node22 node23 node24 node25 node26 node27
SRCHARGEACCOUNT[0]
SRCFG[27] HOSTLIST=node2 node3 node4 node5 node6 node7 node8 node9 node10 node11 node12 node13 node14 node15 node16 node17 node 18 node19 node20 node21 node22 node23 node24 node25 node26 node27

RMAUTHTYPE[0] CHECKSUM

CLASSCFG[[NONE]] DEFAULT.FEATURES=[NONE]
CLASSCFG[[ALL]] DEFAULT.FEATURES=[NONE]
CLASSCFG[q1] DEFAULT.FEATURES=[NONE]
CLASSCFG[batch] DEFAULT.FEATURES=[NONE]

****skip node specific info****

# SERVER MODULES: MX
SERVERMODE NORMAL
SERVERNAME
SERVERHOST ***
SERVERPORT 42559
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGFILEROLLDEPTH 1
LOGLEVEL 4
LOGFACILITY fALL
SERVERHOMEDIR /usr/local/maui/
TOOLSDIR /usr/local/maui/tools/
LOGDIR /usr/local/maui/log/
STATDIR /usr/local/maui/stats/
LOCKFILE /usr/local/maui/maui.pid
SERVERCONFIGFILE /usr/local/maui/maui.cfg
CHECKPOINTFILE /usr/local/maui/maui.ck
CHECKPOINTINTERVAL 00:05:00
CHECKPOINTEXPIRATIONTIME 3:11:20:00
TRAPJOB
TRAPNODE
TRAPFUNCTION
RESDEPTH 24

RMPOLLINTERVAL 00:00:30
NODEACCESSPOLICY SHARED
ALLOCLOCALITYPOLICY [NONE]
SIMTIMEPOLICY [NONE]
ADMIN1 admin1 ***
ADMINHOSTS ALL
NODEPOLLFREQUENCY 0
DISPLAYFLAGS
DEFAULTDOMAIN
DEFAULTCLASSLIST [DEFAULT:1]
FEATURENODETYPEHEADER
FEATUREPROCSPEEDHEADER
FEATUREPARTITIONHEADER
DEFERTIME 1:00:00
DEFERCOUNT 24
DEFERSTARTCOUNT 1
JOBPURGETIME 0
NODEPURGETIME 2140000000
APIFAILURETHRESHHOLD 6
NODESYNCTIME 600
JOBSYNCTIME 600
JOBMAXOVERRUN 00:10:00
NODEMAXLOAD 0.0

PLOTMINTIME 120
PLOTMAXTIME 245760
PLOTTIMESCALE 11
PLOTMINPROC 1
PLOTMAXPROC 512
PLOTPROCSCALE 9

SCHEDCFG[] MODE=NORMAL SERVER=***

# RM MODULES: PBS SSS WIKI NATIVE
RMCFG[***] AUTHTYPE=CHECKSUM EPORT=15004 TIMEOUT=00:00:09 TYPE=PBS

SIMWORKLOADTRACEFILE workload
SIMRESOURCETRACEFILE resource
SIMAUTOSHUTDOWN OFF
SIMSTARTTIME 0
SIMSCALEJOBRUNTIME FALSE
SIMFLAGS
SIMJOBSUBMISSIONPOLICY CONSTANTJOBDEPTH
SIMINITIALQUEUEDEPTH 16
SIMWCACCURACY 0.00
SIMWCACCURACYCHANGE 0.00
SIMNODECOUNT 0
SIMNODECONFIGURATION NORMAL
SIMWCSCALINGPERCENT 100
SIMCOMRATE 0.10
SIMCOMTYPE ROUNDROBIN
COMINTRAFRAMECOST 0.30
COMINTERFRAMECOST 0.30
SIMSTOPITERATION -1
SIMEXITITERATION -1

_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers