Hi Olivier,
I checked the entries you mentioned and everything looks correct.
I somehow suspect the munge.key file. Let me explain:
On the master:
Eth0 -> configured with a private IP address, as on all the nodes,
corresponding to pbs_oscar, nfs_oscar; connected to the cluster
Eth1 -> configured with a public IP address and FQDN = hostname of the
master; connected to the outside world
/etc/hosts reflects the above situation and is synchronized on all nodes.
The nodes know and address nfs_oscar, pbs_oscar, etc.
NOTE: I followed what we had on the previous OSCAR cluster, which worked. But
some things have changed in the meantime, one of them being the use of munge.
Back to the configuration: the "hostname" command returns the FQDN on the
master node (eth1), and the munge.key was generated with this hostname and
copied to the nodes.
When I run munge -n | unmunge on the master, it returns the FQDN = hostname
(eth1).
I am wondering whether this is the issue.
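To make that check repeatable, here is a minimal Python sketch that pulls the
encoding host out of saved unmunge output. It assumes the usual "KEY: value"
layout with an ENCODE_HOST line; the sample text and the host
master.example.com are hypothetical, not taken from the cluster.

```python
def parse_unmunge(text):
    """Turn 'KEY: value' lines into a dict, ignoring anything else."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def encode_host(text):
    """Return the host name from the ENCODE_HOST field, or None if absent."""
    value = parse_unmunge(text).get("ENCODE_HOST", "")
    return value.split()[0] if value else None


# Hypothetical sample; real output contains more fields.
SAMPLE_UNMUNGE = """\
STATUS:           Success (0)
ENCODE_HOST:      master.example.com (192.0.2.10)
UID:              root (0)
"""

print(encode_host(SAMPLE_UNMUNGE))
```

The name printed here can then be compared with the name the nodes use for
the server (pbs_oscar in this setup).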
The error message from the pbs_server shows a conflict between the name
assigned to the private IP address (eth0) and the
hostname which corresponds to the public IP address (eth1).
Looking at the pbs_mom config with momctl, I see that all of the master's IP
addresses are configured to be accepted.
One idea was to change the hostname to the "private" name and regenerate the
munge.key. I did that, but I am having some trouble now, so I will probably
switch back.
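The suspected conflict can be illustrated with a small Python sketch that
checks whether two names in an /etc/hosts-style file land on the same IP
address. The addresses and the names master-priv and master.example.com are
hypothetical stand-ins for the eth0 and eth1 entries described above.

```python
def parse_hosts(text):
    """Return a list of (ip, [names]) entries from /etc/hosts-style text."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        if names:
            entries.append((ip, names))
    return entries


def ip_for_name(entries, name):
    """Return the IP of the first entry that lists `name`, or None."""
    for ip, names in entries:
        if name in names:
            return ip
    return None


def same_interface(entries, name_a, name_b):
    """True only if both names are known and map to the same IP address."""
    ip_a = ip_for_name(entries, name_a)
    ip_b = ip_for_name(entries, name_b)
    return ip_a is not None and ip_a == ip_b


# Hypothetical hosts file mirroring the layout described in this mail.
SAMPLE_HOSTS = """\
127.0.0.1    localhost
10.0.0.1     master-priv pbs_oscar nfs_oscar   # eth0, cluster side
192.0.2.10   master.example.com master         # eth1, public side
"""

entries = parse_hosts(SAMPLE_HOSTS)
print(same_interface(entries, "pbs_oscar", "master.example.com"))  # False
```

With this layout the server name used by the nodes and the name returned by
"hostname" sit on different interfaces, which matches the situation reported
by pbs_server.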
Kind Regards,
Costel
From: LAHAYE Olivier [mailto:olivier.lah...@cea.fr]
Sent: Wednesday, March 13, 2013 4:32 PM
To: Costel Seitan; oscar-users@lists.sourceforge.net
Subject: RE : [Oscar-users] RE : RE : OSCAR unstable News: yume finaly WORKS in
all situations:-) and new oscar-utils package.
Hi Costel,
Don't worry about the disable service opkg. If iptables is disabled, then it
is OK.
If I'm correct, your nodes are on a private network connected to eth1 on your
head (and eth0 is on the public network).
If this is the case, and if I remember correctly my old cluster, which had the
same architecture, the /etc/hosts pbs_oscar entry should point to the IP of
eth1.
Check (on the head *and* on the nodes) that /etc/torque/server_name contains a
hostname that can be resolved by all nodes and points to the eth1 IP. Check
that the /etc/hosts in the image, on the nodes and on the head have the correct
entry for pbs_oscar (or the host that is in /etc/torque/server_name).
Then restart all pbs_mom, trqauthd and pbs_server services.
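As a rough illustration of these checks, the Python sketch below takes the
content of a server_name file plus /etc/hosts copies from several machines
and reports where the server name is missing or points to an unexpected IP.
The machine labels, addresses and the function check_server_name are
hypothetical, and it assumes the same server_name content everywhere.

```python
def resolve(hosts_text, name):
    """Return the IP of the first /etc/hosts line that lists `name`, or None."""
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # ignore comments
        if len(fields) >= 2 and name in fields[1:]:
            return fields[0]
    return None


def check_server_name(server_name_text, hosts_by_machine, expected_ip):
    """Return a list of problems found; an empty list means all checks passed."""
    server_name = server_name_text.strip()
    if not server_name:
        return ["server_name file is empty"]
    problems = []
    for machine, hosts_text in sorted(hosts_by_machine.items()):
        ip = resolve(hosts_text, server_name)
        if ip is None:
            problems.append(f"{machine}: {server_name} not found in /etc/hosts")
        elif ip != expected_ip:
            problems.append(
                f"{machine}: {server_name} -> {ip}, expected {expected_ip}"
            )
    return problems


# Hypothetical copies of /etc/hosts gathered from the head, image and a node.
HOSTS_COPIES = {
    "head": "10.0.0.1 master-priv pbs_oscar\n",
    "image": "10.0.0.1 master-priv pbs_oscar\n",
    "node1": "192.0.2.10 master.example.com pbs_oscar\n",
}

for problem in check_server_name("pbs_oscar\n", HOSTS_COPIES, "10.0.0.1"):
    print(problem)
```

On a real cluster the file contents would have to be collected first, for
example by copying them from each node; that part is left out here.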
If restarting the services doesn't fix the issue, as a last resort, check the
output of the hostname command on the nodes and try to use that in
/var/lib/torque/server_priv/nodes. If the hostnames are not correct, fix that
in /etc/sysconfig/network.
Beyond that, I don't have any more ideas.
Best regards,
PS: Why did you have to manually edit the nodes file? Did step 7 fail to set
that up correctly? I almost completely rewrote the torque setup post install
and handled many unhandled error situations... It seems that I missed some :(
(If you can send me the log of the torque post install, it may help me.)
Olivier.
--
Olivier LAHAYE
CEA DRT/LIST/DCSI/DIR
________________________________
From: Costel Seitan [csei...@slb.com]
Sent: Wednesday, March 13, 2013 16:08
To: oscar-users@lists.sourceforge.net
Cc: LAHAYE Olivier
Subject: RE: [Oscar-users] RE : RE : OSCAR unstable News: yume finaly WORKS in
all situations:-) and new oscar-utils package.
Olivier,
I am not sure whether I selected the disable service opkg; I do not really
remember.
I checked line by line:
/var/lib/torque/server_priv/nodes: I created it myself and added the hostnames
of all present and future nodes, one per line.
/etc/torque/server_name: contains "pbs_oscar" on all the nodes and the master.
I ran cexec iptables -L and it seems to be disabled. I even ran telnet
masternode 15001 and it looks OK.
I restarted pbs_mom on nodes and pbs_server several times. I also restarted
trqauthd processes.
munge is running fine on all nodes and the server.
I changed the log level and the messages are more complete now. It looks like a
host resolution problem:
03/13/2013 15:51:28;0004;PBS_Server.4105;Svr;authenticate_user;Hosts do not
match: Requested host <eth0_hostname>: credential host: <eth1_hostname>
where
eth0_hostname is the first name appearing in the /etc/hosts file for the
master (on the same line as pbs_server)
and
eth1_hostname is the FQDN = DNS hostname of the master as seen from
outside the cluster.
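To spot this kind of message across a long server log, a small Python sketch
like the one below could extract the two host names. It only assumes the
semicolon-separated layout and the message wording seen in the line quoted
above; the dictionary keys are names chosen for this example.

```python
import re

# Wording copied from the pbs_server message quoted above.
MISMATCH = re.compile(
    r"Hosts do not match: Requested host (\S+): credential host: (\S+)"
)


def parse_mismatch(line):
    """Return details of a 'Hosts do not match' record, or None otherwise."""
    fields = line.strip().split(";", 5)  # the message is the sixth field
    if len(fields) != 6:
        return None
    match = MISMATCH.search(fields[5])
    if match is None:
        return None
    return {
        "timestamp": fields[0],
        "function": fields[4],
        "requested_host": match.group(1),
        "credential_host": match.group(2),
    }


# The quoted log record, joined back onto one line, placeholders kept as-is.
SAMPLE_LINE = (
    "03/13/2013 15:51:28;0004;PBS_Server.4105;Svr;authenticate_user;"
    "Hosts do not match: Requested host <eth0_hostname>: "
    "credential host: <eth1_hostname>"
)

print(parse_mismatch(SAMPLE_LINE))
```

Any record where requested_host and credential_host differ points at the same
name mismatch between the private and the public interface.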
Kind Regards,
Costel
From: LAHAYE Olivier [mailto:olivier.lah...@cea.fr]
Sent: Wednesday, March 13, 2013 2:27 PM
To: Costel Seitan
Cc: oscar-users@lists.sourceforge.net
Subject: [Oscar-users] RE : RE : OSCAR unstable News: yume finaly WORKS in all
situations:-) and new oscar-utils package.
Did you select the disable service opkg? I don't remember if I recommended it.
It'll disable iptables, if my memory is correct.
Can you check /var/lib/torque/server_priv/nodes?
Can you check /etc/torque/server_name?
Anyway, can you check that iptables is disabled on the nodes?
Can you restart pbs_mom on the nodes and pbs_server on the head?
Can you check that munge is running on the head and the nodes?
What does /opt/pbs/bin/pbsnodes report?
Note that it is recommended to avoid running step 7 unless all nodes are up
and running. I've fixed many post install scripts so they can be run multiple
times, but some things can only be run once. For example, cexec will
automatically disable nodes that are in /etc/c3.conf and that fail to respond.
There is no command to automatically re-enable dead nodes (I've asked for the
feature upstream and received positive feedback, but no timeline for when it
will be available).
Best regards,
Olivier.
PS: I forgot to reply to oscar-users the first time, but I think this may be
of use to other OSCAR users, so I am posting my answer to the list again.
Please accept my apologies for that.
--
Olivier LAHAYE
CEA DRT/LIST/DCSI/DIR