I don't have my cluster handy to check, but if I'm recalling
correctly, pbs.conf is in the /etc directory.  I always have to run
"updatedb" before I can find anything OSCAR-related, since it is added
after install, so you might try that and then run find again.
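In shell terms, something like this; note the search roots in the find fallback are just guesses for a typical PBS layout, not known-good paths:

```shell
# Rebuild the locate database first; files added after install
# (like the OSCAR bits) are not in the stock index yet.
updatedb 2>/dev/null || true   # needs root; ignore failure otherwise

# Then try both locate and a direct find (search roots are guesses
# for a typical PBS install -- adjust as needed):
locate pbs.conf 2>/dev/null || true
find /etc /opt -maxdepth 3 -name 'pbs.conf' 2>/dev/null
```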

As far as this problem goes, I would suggest rerunning "Complete
Cluster Setup", rebooting the head node, then rebooting the compute
nodes.  That should recreate all your config files and restart all
your services.  It seems to me that the config files were not created
correctly for some reason.  If "Complete Cluster Setup" throws errors
or there are problems after you do this, please post your
oscarinstall.log.
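If a full reboot is inconvenient, restarting the daemons by hand may be enough. A rough sketch; the service names below are typical for an OSCAR-era PBS/Maui install but may differ on your machines, so check /etc/init.d first:

```shell
# Restart the PBS daemons on the head node; Maui is the scheduler
# on a default OSCAR install (use pbs_sched instead if you switched).
restart_pbs_head() {
    service pbs_server restart
    service maui restart
}

# Restart the MOMs on every compute node via C3's cexec.
restart_pbs_moms() {
    cexec service pbs_mom restart
}
```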

What kind of network cards are you using?

Did you do a standard install (default packages), and did you alter the
configuration files at all afterward?  Did you have any issues
getting the nodes to accept images?

On 6/29/05, Lefevre Jerome <[EMAIL PROTECTED]> wrote:
> 
> (You will find the laminfo and pbsnodes output in the attached file)
> 
> Hi,
> 
> More info about my trouble with PBS, LAM and the TM boot module.
> Jeff Squyres told me this is a PBS problem, not a LAM problem (please see
> the mail from Jeff below).
> So, I checked some output from PBS and C3, because it seems the C3 config
> depends on the same thing that the PBS config depends on
> (see http://sourceforge.net/mailarchive/message.php?msg_id=7646933).
> 
> First: I checked whether pbs.conf is present, and it is not (with: locate
> pbs.conf)!
> If I go inside <PBS_HOME>, I don't find any etc directory, nor is there a
> pbs.conf inside /etc either.
> [EMAIL PROTECTED] etc]# ls /opt/pbs
> bin include lib man sbin
> 
> However, if I check with this bunch of commands:
> qmgr -c "list queue workq", qstat -f -Q workq, qmgr -c "print server"
> workq seems to be correctly (?) set up (please look below).
> 
> Now, if I check c3.conf, here is the output:
> [EMAIL PROTECTED] etc]# cat /etc/c3.conf
> cluster oscar_cluster {
> editr.cluster.ird.nc
> dead remove_for_0-indexing
> node1
> node2
> node3
> node4
> }
> Is this line "dead remove_for_0-indexing" suspicious or not?
> Any idea ?
> Many thanks for your help!
> Best regards,
> Jérôme Lefevre
> 
> 
> ************** Output from qmgr -c "list queue workq"
> , qstat -f -Q workq
> , qmgr -c "print server" ****************************
> [EMAIL PROTECTED] etc]# qmgr -c "list queue workq"
> Queue workq
> queue_type = Execution
> total_jobs = 1
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
> resources_max.cput = 10000:00:00
> resources_max.ncpus = 10
> resources_max.nodect = 5
> resources_max.walltime = 10000:00:00
> resources_min.cput = 00:00:01
> resources_min.ncpus = 1
> resources_min.nodect = 1
> resources_min.walltime = 00:00:01
> resources_default.cput = 10000:00:00
> resources_default.ncpus = 1
> resources_default.nodect = 1
> resources_default.walltime = 10000:00:00
> resources_available.nodect = 5
> resources_assigned.ncpus = 1
> resources_assigned.nodect = 2
> enabled = True
> started = True
> [EMAIL PROTECTED] etc]# qstat -f -Q workq
> Queue: workq
> queue_type = Execution
> total_jobs = 1
> state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0
> resources_max.cput = 10000:00:00
> resources_max.ncpus = 10
> resources_max.nodect = 5
> resources_max.walltime = 10000:00:00
> resources_min.cput = 00:00:01
> resources_min.ncpus = 1
> resources_min.nodect = 1
> resources_min.walltime = 00:00:01
> resources_default.cput = 10000:00:00
> resources_default.ncpus = 1
> resources_default.nodect = 1
> resources_default.walltime = 10000:00:00
> resources_available.nodect = 5
> resources_assigned.ncpus = 1
> resources_assigned.nodect = 2
> enabled = True
> started = True
> So, PBS is properly configured somewhere!!!
> [EMAIL PROTECTED] etc]# qmgr -c "print server"
> #
> # Create queues and set their attributes.
> #
> #
> # Create and define queue workq
> #
> create queue workq
> set queue workq queue_type = Execution
> set queue workq resources_max.cput = 10000:00:00
> set queue workq resources_max.ncpus = 10
> set queue workq resources_max.nodect = 5
> set queue workq resources_max.walltime = 10000:00:00
> set queue workq resources_min.cput = 00:00:01
> set queue workq resources_min.ncpus = 1
> set queue workq resources_min.nodect = 1
> set queue workq resources_min.walltime = 00:00:01
> set queue workq resources_default.cput = 10000:00:00
> set queue workq resources_default.ncpus = 1
> set queue workq resources_default.nodect = 1
> set queue workq resources_default.walltime = 10000:00:00
> set queue workq resources_available.nodect = 5
> set queue workq enabled = True
> set queue workq started = True
> #
> # Set server attributes.
> #
> set server scheduling = True
> set server default_queue = workq
> set server log_events = 64
> set server mail_from = adm
> set server query_other_jobs = True
> set server resources_available.ncpus = 10
> set server resources_available.nodect = 5
> set server resources_available.nodes = 5
> set server resources_max.ncpus = 10
> set server resources_max.nodes = 5
> set server scheduler_iteration = 60
> set server node_ping_rate = 300
> set server node_check_rate = 600
> ***************************************************************
> 
> 
> ************** MAIL FROM JEFF SQUYRES ************************
>  > You will find my configure.log in this post.
>  > Thanks.
>  > Some clarification, first about the scheduler:
>  >
>  > With Maui running, PBS_server always complains with this message:
>  > 06/27/2005 10:14:06;0001;PBS_Server;Svr;PBS_Server;Connection refused
>  > (111)
>  > in contact_sched, Could not contact Scheduler
> You'll need to ask the OSCAR list about this error -- I'm not a PBS
> expert.
>  > If I stop Maui and start PBS_SCHED, PBS_SERVER tells me:
>  > 06/27/2005 16:04:37;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler
>  > sent
>  > command new
>  > 06/27/2005 16:04:38;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler
>  > sent
>  > command recyc
>  > 06/27/2005 16:04:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler
>  > sent
>  > command term
>  > 06/27/2005 16:05:43;0040;PBS_Server;Svr;editr.cluster.ird.nc;Scheduler
>  > sent
>  > command time
>  >
>  > So, in the next test, I stopped the Maui scheduler and kept PBS_SCHED
>  > running.
> LAM will not care whether you are running the PBS scheduler or the Maui
> scheduler; it does not interact with the scheduler. It only interacts
> with the PBS MOM's themselves (i.e., *after* all scheduling decisions
> have been made). So whichever scheduler you get running (from LAM's
> point of view) is fine.
>  > Now, if you look at the laminfo output, we see: SSI boot: tm (API v1.1,
>  > Module v1.1)
>  > Can we presume the PBS and LAM 7.1 interface is correct?
> Probably, especially since you showed that it worked in your first post
> (i.e., it did a lamboot successfully using TM). However, I would have
> liked to see the full output from laminfo to see the other modules and
> other configuration information.
>  > But if I submit a PBS job in interactive mode, like this:
>  >
>  > [EMAIL PROTECTED] SCRATCH]$ qsub -lnodes=2 -I
>  > qsub: waiting for job 43.editr.cluster.ird.nc to start
>  >
>  > After a long time, PBS is still waiting... If I check with "qstat -f",
>  > "Resource_List.ncpus" is always equal to 1. However, I asked for 2
>  > nodes! What is wrong?  Do you want other logs or output?
> This also seems to be a PBS problem, not a LAM problem. LAM will *use*
> PBS, but if PBS is configured incorrectly (e.g., PBS is only assigning
> you 1 node when you asked for 2), LAM cannot fix this -- it can only
> use the outputs from PBS. More specifically, PBS does all the
> scheduling and assignments -- once the job starts and all decisions
> have been made, LAM simply uses the results of those decisions to run
> your parallel application.
> The OSCAR list is probably your best bet for answers to these questions
> (because OSCAR will have configured and setup Torque on your cluster).
> They'll probably ask for more configuration information about your
> Torque setup to see if something went wrong during the OSCAR initialization.

