Title: Re: [Oscar-users] Torque fails creating work queue at step 7
Hi Umberto:
 
pbsnodes -a should tell you the status of all your nodes.  If they are all "unknown" then you probably just need to restart pbs_mom on all the nodes (because you changed your pbs_server's configuration).
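
To see at a glance which nodes need attention, the output of pbsnodes -a can be summarized with a small script. The following is only a sketch: it assumes the usual layout of one unindented line per node name followed by indented "key = value" lines, and the function names and the SAMPLE text are illustrative, not real output from this cluster.

```python
def node_states(pbsnodes_text):
    """Return {node_name: state} parsed from `pbsnodes -a`-style text."""
    states = {}
    current = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            continue  # blank lines separate node records
        if not line[0].isspace():
            # An unindented line is assumed to start a new node record.
            current = line.strip()
            states[current] = None
        elif current is not None:
            key, sep, value = line.strip().partition(" = ")
            if sep and key == "state":
                states[current] = value
    return states


def nodes_matching(states, word):
    """Return the sorted node names whose state string contains `word`."""
    return sorted(n for n, s in states.items() if s and word in s)


# Illustrative input only (hypothetical states, hostnames taken from this thread).
SAMPLE = (
    "lilligridfast1.na.iac.cnr.it\n"
    "     state = free\n"
    "     np = 2\n"
    "\n"
    "lilligridfast2.na.iac.cnr.it\n"
    "     state = state-unknown,down\n"
    "     np = 2\n"
)

if __name__ == "__main__":
    print(nodes_matching(node_states(SAMPLE), "unknown"))
```

On a live system the text would come from running pbsnodes -a (for example via subprocess) rather than from SAMPLE; the nodes reported are the ones where pbs_mom is worth restarting.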
 
BTW, you might want to get the nightly snapshot of OSCAR 4.2.1 from our website as it has a bunch of bugfixes (including the issue you're seeing with Ganglia).  There seems to be one outstanding TORQUE bug which we're pinning down right now though.
 
Cheers,
 
Bernard


From: [EMAIL PROTECTED] on behalf of Umberto Amato
Sent: Tue 03/01/2006 07:54
To: [email protected]
Subject: Re: [Oscar-users] Torque fails creating work queue at step 7

Dear Bernard,
I moved the pbs_oscar alias from the private to the public address in /etc/hosts,
and now Step 7 is successful (thanks again).
I'm now stuck at the Testing step:

[EMAIL PROTECTED] testing]# ./test_cluster
Performing root tests...
Connection refused
/opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111
Torque node check [PASSED]
Starting TORQUE Server: [ OK ]
Torque service check:pbs_server [PASSED]
Maui service check:maui [PASSED]
/home mounts [PASSED]
Preparing user tests...
Performing user tests...
SSH ping test [PASSED]
SSH server->node [PASSED]
SSH node->server [PASSED]
Checking for 18 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Checking for 18 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Can't find string terminator '"' anywhere before EOF at -e line 1.
Ganglia setup test [FAILED]
Torque default queue definition [PASSED]
Checking for 18 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
Checking for 18 free nodes: [FAILED]
Not enough free nodes. Tests incomplete.
There were issues running some user test scripts. Please check your logs
located in /home/oscartst.
Run APItests...
Running Installation tests for pvm
[PASS] 2006-01-03T15:46:26Z pvmd-path-ls.apt
[PASS] 2006-01-03T15:46:26Z envvar-pvm_arch.apt
[PASS] 2006-01-03T15:46:26Z envvar-pvm_root.apt
[PASS] 2006-01-03T15:46:26Z pvmd-path-which.apt
[PASS] 2006-01-03T15:46:26Z modulecmd-path-ls.apt
[PASS] 2006-01-03T15:46:26Z pvm-module-list.apt
[PASS] 2006-01-03T15:46:26Z pvm-module-show-pvm_rsh.apt
[PASS] 2006-01-03T15:46:26Z pvm-module-show-pvm_arch.apt
[PASS] 2006-01-03T15:46:26Z pvm-module-show-pvm_root.apt

More precisely, the problem is with the following (the rest is a consequence):

Connection refused
/opt/pbs/bin/pbsnodes: cannot connect to server pbs_oscar, error=111

A look at /var/spool/pbs/server_logs/pbs_server.log shows

01/03/2006 15:30:14;0002;PBS_Server;Svr;PBS_Server;Server Ready, pid = 8680, loglevel=0

01/03/2006 15:30:17;0001;PBS_Server;Svr;PBS_Server;Connection refused (111) in contact_sched, Could not contact Scheduler - port 15004

01/03/2006 15:31:14;0040;PBS_Server;Svr;lilligrid.na.iac.cnr.it;Scheduler sent command scheduler_first

01/03/2006 15:31:30;0002;PBS_Server;Svr;Log;Log opened

That is, the problem seems to arise in contact_sched.
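
Error 111 is ECONNREFUSED on Linux, meaning nothing accepted the TCP connection on the target port. A minimal sketch for checking this directly is shown here; port_status is a hypothetical helper name, and any port number other than the 15004 that appears in the log would be an assumption about the local configuration.

```python
import socket


def port_status(host, port, timeout=2.0):
    """Try a TCP connection and return 'open', 'refused', or 'error: ...'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Same condition that the log reports as "Connection refused (111)".
        return "refused"
    except OSError as exc:
        # Covers timeouts, unresolvable names, unreachable hosts, and so on.
        return f"error: {exc}"
```

For example, port_status("pbs_oscar", 15004) run on the head node would report whether a scheduler is listening on the port named in the log. A result of "refused" points at the daemon not running (or listening on another address), while an "error: ..." result points at name resolution or routing instead.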

Not much help from the PBS forum. Any further (highly appreciated) hints from
anyone?

Umberto



----- Original Message -----
From: "Bernard Li" <[EMAIL PROTECTED]>
To: "Umberto Amato" <[EMAIL PROTECTED]>;
<[email protected]>
Sent: Tuesday, January 03, 2006 12:52 AM
Subject: RE: [Oscar-users] Torque fails creating work queue at step 7


Hi Umberto:

Try putting "pbs_oscar" on your _external_ interface instead of your
internal interface and see if it works.  There are also a bunch of log
files in /var/spool/pbs which you can take a look at.

Cheers,

Bernard

________________________________

From: [EMAIL PROTECTED] on behalf of Umberto Amato
Sent: Mon 02/01/2006 08:53
To: [email protected]
Subject: [Oscar-users] Torque fails creating work queue at step 7



Dear all,
I'm installing OSCAR 4.1 on a cluster of dual 64-bit Opteron boards
running Scientific Linux 4.1. I have a problem already discussed on the
list, namely Torque failing to create the work queue at
Step 7. In http://sourceforge.net/mailarchive/message.php?msg_id=11522706
the issue was closed for lack of occurrences: here I am.

The relevant part of the oscarinstall.log is:

Updating pbs_server nodes
/opt/pbs/bin/pbsnodes: Server has no node list
qmgr obj=lilligridfast1.na.iac.cnr.it svr=default: Unauthorized Request
create node lilligridfast1.na.iac.cnr.it np = 2 , properties = all
qmgr obj=lilligridfast2.na.iac.cnr.it svr=default: Unauthorized Request
create node lilligridfast2.na.iac.cnr.it np = 2 , properties = all
qmgr obj=lilligridfast3.na.iac.cnr.it svr=default: Unauthorized Request
create node lilligridfast3.na.iac.cnr.it np = 2 , properties = all
qmgr obj=lilligridfast4.na.iac.cnr.it svr=default: Unauthorized Request
create node lilligridfast4.na.iac.cnr.it np = 2 , properties = all
qmgr obj=lilligridfast5.na.iac.cnr.it svr=default: Unauthorized Request
create node lilligridfast5.na.iac.cnr.it np = 2 , properties = all
Shutting down TORQUE Server: [ OK ]
Starting TORQUE Server: [ OK ]
Creating torque workq queue...
Max open servers: 4
qmgr obj=workq svr=default: Unauthorized Request
create queue workq
Configuration of Torque queues failed at /opt/oscar/packages/torque/scripts/post_install line 315
Script /opt/oscar/packages/torque/scripts/post_install exitted badly with exit code '2' at ./post_install line 44
Couldn't run 'post_install' script for torque at ./post_install line 45
Some of the post install scripts failed, please check your logs for more info at ./post_install line 50
--> Step 7: Failed to properly complete the cluster install; please check the logs

I also include the /etc/hosts file, because judging from the earlier mail
exchange it appears to be the source of the problem:

# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1 localhost.localdomain localhost
192.168.1.100 lilligridfast100.na.iac.cnr.it lilligridfast100 oscar_server nfs_oscar pbs_oscar
140.164.12.100 lilligrid.na.iac.cnr.it lilligrid
# These entries are managed by SIS, please don't modify them.
192.168.1.1 lilligridfast1.na.iac.cnr.it lilligridfast1
192.168.1.2 lilligridfast2.na.iac.cnr.it lilligridfast2
192.168.1.3 lilligridfast3.na.iac.cnr.it lilligridfast3
192.168.1.4 lilligridfast4.na.iac.cnr.it lilligridfast4
192.168.1.5 lilligridfast5.na.iac.cnr.it lilligridfast5

Pinging any of the aliases of 192.168.1.100 (including pbs_oscar) is
successful from the server and from the nodes, while the corresponding host
command fails.

Any help will be greatly appreciated

Umberto Amato
Istituto per le Applicazioni del Calcolo -Mauro Picone- CNR
Via Pietro Castellino 111
80131 Napoli

E-mail: [EMAIL PROTECTED]






-------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc. Do you grep through log files
for problems?  Stop!  Download the new AJAX search engine that makes
searching your log files as easy as surfing the  web.  DOWNLOAD SPLUNK!
http://ads.osdn.com/?ad_id=7637&alloc_id=16865&op=click
_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users





