Re: [Oscar-users] corrupt passwd files on nodes

2012-06-01 Thread drla4
Hello ST,

 thanks for your help. At least that speeds up the recovery process a lot, if 
I can repair it in single-user mode. But I am still wondering if one could get 
to the root of the problem to keep it from recurring (it used to be fairly 
infrequent, but now it has happened 3 times in a few weeks!). The funny thing 
is that only one of our clusters seems to suffer from this problem, though the 
others are very similar in hardware and software setup.

 Does anyone have a tip on what one could try to fix this permanently?

 Regards,
 Lutz Ackermann
  From: st...@ntu.edu.tw
 Sent: 29.05.2012 17:55
 To: dr...@directbox.com
 Subject: Re: [Oscar-users] corrupt passwd files on nodes

 Hi Lutz,

 To boot into single-user mode, you need physical access to the node. 
How to boot into single-user mode depends on the boot loader you 
have. Since you have Red Hat, you may refer to the following for more details.
 
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/3/html/System_Administration_Guide/s1-rescuemode-booting-single.html

 As far as I know, there is no way to do this remotely.

 Regards,
 ST

 On 2012/5/29 ?? 05:38, dr...@directbox.com wrote:   

Hi ST,

 thanks for your response. That might be an interesting route for us, too, 
because currently all I can do is reboot each node from a live CD to get in at 
all. You say you log in to the node in single-user mode. Could you describe 
how exactly this is done, please? Could one do that via ssh, or can it only be 
done physically on the node?

 Regards,
 Lutz Ackermann


___ Oscar-users mailing list 
Oscar-users@lists.sourceforge.net 
https://lists.sourceforge.net/lists/listinfo/oscar-users  

--
Shiang-Tai Lin, Professor
Department of Chemical Engineering
National Taiwan University
TEL: +886-2-33661369
FAX: +886-2-23623040
Email: st...@ntu.edu.tw
Webpage: http://web.che.ntu.edu.tw/stlin/

Re: [Oscar-users] corrupt passwd files on nodes

2012-05-29 Thread Ibad Kureshi U0850037
We saw the same problem occur when we blocked the root account and moved to a 
sudo environment: sync_files would not work. Shiang-Tai's solution is what we 
followed as well.

-Ibad Kureshi
HPC Admin, University of Huddersfield




Re: [Oscar-users] corrupt passwd files on nodes

2012-05-28 Thread Shiang-Tai Lin

Hi Lutz,

Our cluster had the same problem once or twice in the past.

The password files on the nodes can be synchronized with those of the server 
with the following command:

/opt/sync_files/bin/sync_files

This command is executed every 15 min via cron (cat /etc/crontab):

*/15 * * * * root env USER=root /opt/sync_files/bin/sync_files > /dev/null 2>&1
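
Because the cron entry above appears to discard all output from sync_files, a failed run would leave no trace. A minimal sketch of a wrapper that keeps a timestamped log instead might look like this; `run_logged` and the log path in the usage note are hypothetical names, not part of OSCAR or sync_files.

```shell
#!/bin/sh
# Sketch only: run_logged is a hypothetical helper, not part of OSCAR.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
# Runs COMMAND, appends its output and exit status to LOGFILE,
# and returns the command's exit status.

run_logged() {
    log=$1
    shift

    echo "$(date '+%Y-%m-%d %H:%M:%S') start: $*" >> "$log"

    # Capture stdout and stderr instead of sending them to /dev/null.
    status=0
    "$@" >> "$log" 2>&1 || status=$?

    echo "$(date '+%Y-%m-%d %H:%M:%S') exit status: $status" >> "$log"
    return "$status"
}
```

Called from a small script run by cron, for example as `run_logged /var/log/sync_files.log /opt/sync_files/bin/sync_files`, this would make it possible to compare the time of a failed run with the modification time of a truncated passwd file.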


If something goes wrong during the sync process, the passwd file will be 
corrupted. In this case, we had to log in to the node in single-user mode 
(so that the root password is not needed), and then copy the password 
files from the server.
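
One general way to keep an interrupted copy from leaving a zero-length file is to validate the new file and then rename it into place. The following is a sketch of that technique only; `install_checked` is a hypothetical helper, and nothing here describes how sync_files itself writes files.

```shell
#!/bin/sh
# Sketch only: install_checked is a hypothetical helper, not part of OSCAR
# or sync_files.
# Usage: install_checked NEW_FILE TARGET

install_checked() {
    new=$1
    target=$2

    # Refuse a missing or empty candidate: an empty passwd locks everyone out.
    if [ ! -s "$new" ]; then
        echo "refusing to install empty file: $new" >&2
        return 1
    fi

    # Assumed sanity check: passwd, group and shadow should contain a root entry.
    if ! grep -q '^root:' "$new"; then
        echo "no root entry in $new, keeping existing $target" >&2
        return 1
    fi

    # Stage the copy next to the target so that mv is a rename within one
    # file system; readers then see either the old file or the new one.
    tmp="$target.new.$$"
    cp -p "$new" "$tmp" || { rm -f "$tmp"; return 1; }
    mv -f "$tmp" "$target"
}
```

With this pattern a failed transfer leaves the previous file untouched, so the node stays reachable and the copy can simply be retried.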


Regards,
ST

On 2012/5/28 09:01 PM, dr...@directbox.com wrote:

Hi all,

I have now repeatedly encountered a problem and would like to know if 
it is a known / widespread one:


Time and again some (or all) of the nodes of one of our clusters 
become completely inaccessible (i.e. one cannot ssh or console-login 
to nodes). By rebooting a node from a live medium one finds that 
/etc/passwd has size 0; since I also find that /etc/group and 
/etc/shadow have the same date, I assume that OSCAR has some 
mechanism to distribute these files according to some schedule and 
that corruption can occur while those files are being pushed 
down from the head node. Am I right?


Now my question is: how could one analyze why the cluster does this, 
and how could one fix it?
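
As a starting point for the analysis, one could check the account files for truncation and record their timestamps. The sketch below assumes nothing about OSCAR itself; `check_account_files` is a hypothetical helper that takes the root of a file system (for example a node's disk mounted from a live medium) and reports account files that are missing or empty.

```shell
#!/bin/sh
# Sketch only: check_account_files is a hypothetical helper, not an OSCAR tool.
# Usage: check_account_files [ROOT_DIR]   (ROOT_DIR defaults to /)
# Returns 0 if all files are non-empty, 1 otherwise.

check_account_files() {
    root=${1:-/}
    bad=0
    for f in etc/passwd etc/group etc/shadow; do
        path="${root%/}/$f"
        if [ ! -s "$path" ]; then
            echo "EMPTY OR MISSING: $path"
            # If the file exists, show its listing so the modification
            # time can be compared with the cron schedule.
            ls -l "$path" 2>/dev/null
            bad=1
        fi
    done
    return "$bad"
}
```

If the modification times of the empty files line up with the scheduled sync runs, that would support the assumption that the truncation happens during distribution from the head node.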


Regards
Dr Lutz Ackermann
MMC - UL


PS: It's an OSCAR 5 cluster installed on a RedHat derivative:
$ cat /proc/version
Linux version 2.6.9-78.ELsmp 
(brewbuil...@ls20-bc2-14.build.redhat.com) (gcc version 3.4.6 20060404 
(Red Hat 3.4.6-10)) #1 SMP Wed Jul 9 15:46:26 EDT 2008

$ cat /etc/*release
Red Hat Enterprise Linux AS release 4 (Nahant Update 7)





--
Shiang-Tai Lin, Professor
Department of Chemical Engineering
National Taiwan University
TEL: +886-2-33661369
FAX: +886-2-23623040
Email: st...@ntu.edu.tw
Webpage: http://web.che.ntu.edu.tw/stlin/
