Yeah, I checked the logs. PBS keeps crashing:

.. Connection refused (111) in contact_sched, Could not contact Scheduler ..

and Maui keeps crashing and refusing to restart, e.g.:
  [EMAIL PROTECTED]  bin]# ./showstats
  ERROR:    lost connection to scheduler
  02/17 11:21:53 ERROR:    cannot request service (status)

Even when PBS is working, Maui keeps messing it up.
/etc/init.d/maui restart won't work either.
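
To see how often the server fails to reach the scheduler, the "Connection refused (111) in contact_sched" message quoted above can be counted in a server log. This is a minimal sketch: the function name is made up for illustration, and the log path in the usage comment is an assumption that depends on where PBS is installed.

```shell
#!/bin/sh
# Count log lines reporting that pbs_server could not reach the scheduler.
# Reads the log text from stdin, so it works on any file you point it at.
count_sched_failures() {
    # -F: match the message as a fixed string; -c: print the number of matching lines
    grep -F -c 'Connection refused (111) in contact_sched'
}

# Usage (the log path is an assumption; adjust it to your PBS home):
#   count_sched_failures < /var/spool/pbs/server_logs/20060217
```

A count that grows while Maui is supposedly running would suggest that the scheduler is not listening when the server tries to contact it.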

I am thinking of upgrading to CentOS 3.6 and OSCAR 4.2. Would that be a good idea?
--
regards.


From: "Bernard Li" <[EMAIL PROTECTED]>
To: "X Y" <[EMAIL PROTECTED]>
CC: <[email protected]>
Subject: RE: [Oscar-users] PVM jobs need to be forced with qrun to run !
Date: Thu, 16 Feb 2006 22:14:30 -0800

Have you checked your TORQUE/MAUI logs for more information?

It is difficult to troubleshoot any further without any additional info.

Cheers,

Bernard

________________________________

From: X Y [mailto:[EMAIL PROTECTED]]
Sent: Thu 16/02/2006 06:33
To: Bernard Li
Cc: [email protected]
Subject: RE: [Oscar-users] PVM jobs need to be forced with qrun to run !



Hi Bernard,

Now even MPI jobs just sit there, along with the PVM jobs. Both have to be
qrun as root.
What's going on? Can TORQUE/Maui change behaviour on their own over time?
My server is reasonably secure, so I highly doubt a security breach or
anything like that.

Btw, what would be a quick short-term solution, other than me sitting at the
terminal qrun'ing users' jobs? Can qmgr be useful here? Can you suggest a
quick fix (syntax etc.)?
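
As a stopgap for the manual qrun'ing described above, a small filter over qstat output could generate the qrun commands. This is only a sketch: it assumes the default qstat column layout (job state in the fifth column), the function names are made up for illustration, and it does nothing about why the scheduler is not starting jobs in the first place.

```shell
#!/bin/sh
# Print the IDs of jobs in state Q, given default `qstat` output on stdin.
# Assumes the default columns: Job id, Name, User, Time Use, S, Queue,
# preceded by two header lines.
queued_job_ids() {
    awk 'NR > 2 && $5 == "Q" { print $1 }'
}

# Emit one qrun command per queued job instead of running it directly,
# so the list can be reviewed first.
qrun_commands() {
    queued_job_ids | while read -r id; do
        echo "qrun $id"
    done
}

# Usage (as root):
#   qstat | qrun_commands         # review the commands
#   qstat | qrun_commands | sh    # actually run them
```

It may also be worth looking at the server's scheduling attribute with qmgr -c "list server", since jobs that only start under qrun usually mean the scheduler is not being reached.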
---
Regards


>From: "Bernard Li" <[EMAIL PROTECTED]>
>To: "X Y" <[EMAIL PROTECTED]>, <[email protected]>
>Subject: RE: [Oscar-users] PVM jobs need to be forced with qrun to run !
>Date: Wed, 15 Feb 2006 22:10:18 -0800
>
>So does the job just sit there in the queue and not run?  Do the logs
>(TORQUE, MAUI) say anything?
>
>Cheers,
>
>Bernard
>
>________________________________
>
>From: [EMAIL PROTECTED] on behalf of X Y
>Sent: Wed 15/02/2006 02:43
>To: [email protected]
>Subject: [Oscar-users] PVM jobs need to be forced with qrun to run !
>
>
>
>
>   Hi,
>   My cluster specs/config:
>          Oscar version: 4.1
>          OS : Redhat 9 (x86)
>          with Default Oscar installation
>          Compute Nodes: 32 nodes
>
>   I'm able to run my MPI jobs fine. As soon as I qsub my MPI jobs they get
>   queued up in the default queue (workq) and run.
>   But my PVM jobs won't run unless I su to root and manually (forcefully)
>   qrun them. So I doubt the problem is related to resources_default.nodes
>   being set, as the MPI ones are running fine
>   (btw, it's set with qmgr, right?). The PVM PBS job script is attached
>   below just in case.
>   Any suggestions/ideas are welcome.
>   Regards
>   --
>   SD.
>
>
>
>   pvmpbscript:
>   [EMAIL PROTECTED] server_priv]# cat /home/oscartst/pbs_script.pvm
>   ************************************
>   #!/bin/sh
>
>   ### Job name
>   #PBS -N pvmtest
>
>   ### Output files
>   #PBS -o pvmtest.out
>   #PBS -e pvmtest.err
>
>   ### Queue name
>   #PBS -q workq
>
>   ### Script Commands
>   cd $PBS_O_WORKDIR
>
>   # generate pvm nodes file
>   echo "* ep=$PBS_O_WORKDIR wd=$PBS_O_WORKDIR" > pvm_nodes
>   cat $PBS_NODEFILE >> pvm_nodes
>
>   # start pvm daemon & wait for slave daemons to start up
>   pvmd pvm_nodes &
>   #sleep 10
>
>   # run job
>   p=`pwd`
>   cp master1.c slave1.c /tmp
>   cd /tmp
>   gcc -I$PVM_ROOT/include master1.c -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3 -o master1
>   gcc -I$PVM_ROOT/include slave1.c -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3 -o slave1
>   cp master1 slave1 $p
>   cd $p
>   ./master1
>
>   # wait again to make sure everyone's finished
>   # then kill master pvm daemon
>   #sleep 5
>   /usr/bin/killall -TERM pvmd3
>
>   # get rid of lock files & nodes file
>   uid=`id -u`
>   tail -n +2 $PBS_NODEFILE > pvm_nodes
>   /bin/rm -f /tmp/pvm?.$uid
>   crm  pvm_nodes:/tmp/pvmd.$uid > /dev/null 2>&1
>   crm  pvm_nodes:/tmp/pvml.$uid > /dev/null 2>&1
>   /bin/rm -f pvm_nodes
>   exit
>   *************************************
>
>_______________________________________________
>Oscar-users mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/oscar-users
>
>
