Re: [Oscar-users] RE: adding nodes to the cluster

Richard C Ferri Fri, 19 Oct 2001 13:44:51 -0500 (CDT)

Gilbert,
     It seems like you're getting to the very end of the clone script (the
one that copies stuff from the server to the client and installs all the RPMs)
and then rdev is failing.  rdev is doing something really simple -- it's
setting the root device for the kernel that's permanently installed on your
local hard drive. The command that is failing is at line 469 of clone:


rdev /mnt/boot/vmlinuz $rootpart

where $rootpart is the root partition name (e.g. /dev/sda6).

I am having some trouble understanding how all those nice RPMs got
installed and lilo ran, but then the rdev command failed.  It is definitely
not normal to see all those errors on reboot. What is happening is that
when the kernel is loaded it doesn't know where to find its root file
system, and as we know, life is meaningless without root.

I'd like to a) see what your disk partition file looks like and b) have you
run the rdev command by hand on the node while it's still in network boot
mode (before you boot it from the hard drive). If you can debug a little
Perl, put a breakpoint on the rdev command in clone and display what
$rootpart is. My guess is that somehow clone is confused about what the root
partition is named, and as a result rdev is failing, causing root not to get
mounted on reboot (thus all those nasty error messages).
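
For the by-hand check, something like the following might do. It is only a
sketch: try_rdev is a made-up helper (it is not part of clone), and /dev/sda6
is just the example partition name from above, so substitute whatever clone
reports for $rootpart. It checks that the name looks like an existing block
device before running the same rdev command that clone runs, then prints
rdev's exit status.

```shell
#!/bin/sh
# Sketch: validate the root partition name before handing it to rdev.
# try_rdev is a hypothetical helper, not part of the clone script.
try_rdev() {
    image="$1"      # kernel image, e.g. /mnt/boot/vmlinuz
    rootpart="$2"   # root partition, e.g. /dev/sda6 (example value only)

    if [ -z "$rootpart" ]; then
        echo "rootpart is empty"
        return 1
    fi
    case "$rootpart" in
        /dev/*) ;;
        *) echo "rootpart is not a /dev path: $rootpart"; return 1 ;;
    esac
    if [ ! -b "$rootpart" ]; then
        echo "no such block device: $rootpart"
        return 1
    fi
    if [ ! -f "$image" ]; then
        echo "kernel image not found: $image"
        return 1
    fi

    # Same command that clone runs; report its exit status explicitly.
    rdev "$image" "$rootpart"
    status=$?
    echo "rdev exit status: $status"
    return $status
}
```

On the network-booted node this could be run as
"try_rdev /mnt/boot/vmlinuz /dev/sda6". If it stops at one of the checks, the
partition name is the more likely culprit than rdev itself.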

Rich

Richard Ferri
IBM Linux Technology Center
[EMAIL PROTECTED]
845.433.7920

"Chavez, Gilbert R SITI-IT-DSAS" <[EMAIL PROTECTED]> (by way of Jeremy
Enos <[EMAIL PROTECTED]>)@lists.sourceforge.net on 10/19/2001 12:59:05 PM

Sent by:  [EMAIL PROTECTED]


To:   [EMAIL PROTECTED]
cc:
Subject:  [Oscar-users] RE: adding nodes to the cluster



Well, I'm getting a little closer to succeeding on this one node. It's giving
me FITS! I'm so close I can smell it! Here's what it's doing now:

- I used the exact cylinder numbers from the 18gig disk table (old table)
for the 40gig disk table file and succeeded, or at least I got past this
stage. I will tweak the numbers to correctness later.

- The boot process got farther and started loading the RPMs. However, I
received an error (listed below) regarding rdev. Do you have any clues to
what is causing this error?

There are a lot of FAILED messages on the screen during the initial bootup
but the screen scrolls too fast for me to see what the failures are.
Is it normal to have these failures during the first go-around?

Here are some messages I did see and was able to write down:
/etc/rc.d/rc5.d/S99local: /proc/sys/net/ipv4/ip_forward - no such file (BUT
IT'S THERE ON THE SERVER)
/var/lib/nfs/etab - couldn't stat
nfssvc not supported
unable to open nfs
could open /mnt/etc/group
  "" ""  ""  "   "  /passwd
  "  "   "    "   " etc....

Also, there is a message saying the disk has 4870 cylinders and that having
more than 1024 may cause a problem.


Excerpt from the node09.log file

: about to read the client resource allocation table
: about to partition the harddrive
: about to execute part2 to partition the harddrive using /tar/40gig.disk as the file allocation table
: about to install rpms for an RPM type installation
: about to copy /tar/group.source to /mnt/etc/group
: about to copy /tar/myshadow.source to /mnt/etc/shadow
: about to copy /tar/rhosts.source to /mnt/root/.rhosts
: about to copy /tar/passwd.source to /mnt/etc/passwd
: about to copy /tar/gshadow.source to /mnt/etc/gshadow
: about to create the /etc/fstab in the permanent root file system
: about to copy any user exit scripts to /tmp/exit
: about to create the kernel system map
: about to create /etc/lilo.conf and run lilo
: the rdev command failed, exiting with error
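
To pull the failing step out of a longer log, a small sketch like this may
help. It assumes the log format matches the excerpt above (lines of
": about to ..." followed by a line containing "exiting with error");
last_step_before_error is a hypothetical helper name, not an OSCAR or LUI
command.

```shell
#!/bin/sh
# Sketch: print the last step logged before the failure, then the failing line.
# Assumes the failure line contains the text "exiting with error".
# Returns non-zero if no such line is found in the log.
last_step_before_error() {
    awk '
        /exiting with error/ {
            if (prev != "") print prev   # step logged just before the failure
            print                        # the failure line itself
            found = 1
            exit
        }
        { prev = $0 }
        END { exit(found ? 0 : 1) }
    ' "$1"
}
```

For example: last_step_before_error /tftpboot/lim/log/node09.log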

Any help would be appreciated.....

regards,

Gilbert



-----Original Message-----
From: Jeremy Enos
Sent: Thursday, October 11, 2001 4:27 PM
To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
Subject: RE: adding nodes to the cluster


Yep... looks like you're getting booted ok, but the "clone" script is
having trouble parsing the disktable file.  You may want to just use one of
the samples in OSCAR-1.0/oscarResources/.
You can edit an already created resource directly and probably save
yourself some overhead.  The resource files are generated in /tftpboot/tar/.
Do you know if you have the same ethernet adapter in your new systems as
your old systems?  If you're getting booted with a floppy disk, then I
suspect the NIC is the same or it wouldn't have worked.  While network
booted, the universal, support-everything kernel that is used will spew
many error messages that don't mean anything.... I'm not sure about the
network card errors you're seeing though.  I'd continue the way you're
going, because the problem you're running into right now seems to be with
parsing that disktable file.

          Jeremy


At 04:09 PM 10/11/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
 >Thanks for responding. How do you build a new ethernetboot diskette and
 >ramdisk? I'm able to boot the new node but I received an error (look at
 >the following) that the disk table is bad. Is this bad because of the
 >etherboot disk you suggested? Also, every once in a while I received
 >"eth0:card not receiving RX buffer" and "eth0:card no receiving
 >resources". Maybe this is due to the etherboot floppy we are using.
 >
 >: about to execute part2 to partition the harddrive using /tar/40gig.disk as the file allocation table
 >: an error occurred during disk partitioning, exiting with error
 >
 >regards,
 >
 >Gilbert
 >
 >-----Original Message-----
 >From: Jeremy Enos
 >Sent: Thursday, October 11, 2001 3:34 PM
 >To: Chavez, Gilbert R SITI-IT-DSAS
 >Subject: RE: adding nodes to the cluster
 >
 >
 >All PXE capable ethernet cards should work just fine with pxelinux.bin
 >(unless that card's PXE support is bad).  You should only be using the
 >tagged image if you don't have working PXE support, and you boot from a
 >floppy.  (I think this is what we did on the original nodes)  Now, that
 >floppy that we generated is specific to the ethernet card in the original
 >nodes.  If the new nodes have a different card, then you will have to
 >generate a new etherboot floppy.
 >
 >
 >          Jeremy
 >
 >At 06:59 PM 10/10/2001 -0500, you wrote:
 > >Well, I tried to boot from pxelinux.bin and with tagged, but no luck. With
 > >pxelinux.bin the system tries to boot but tells me that the "pxelinux.bin"
 > >is a wrong image tag. Was this feature tested when you were here? The
 > >pxelinux.bin file is a data file and the tagged file is an "x86 boot
 > >sector". I think the problem is with the data file pxelinux.bin; shouldn't
 > >it be a boot sector like the tagged file? How do I get the correct
 > >pxelinux.bin file? Can I download it? Trying to boot with tagged I receive
 > >"eth0: found no sources on card" and "no RX buffer" error messages. Have
 > >you seen these errors before?
 > >
 > >regards,
 > >
 > >Gilbert
 > >
 > >-----Original Message-----
 > >From: Jeremy Enos
 > >Sent: Wednesday, October 10, 2001 5:43 PM
 > >To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
 > >Subject: RE: adding nodes to the cluster
 > >
 > >
 > >Hi Gilbert-
 > >Sorry it's taking me so long... I'm at a conference in LA all week.
 > >Anyway...
 > >Sounds like you're doing pretty well... you basically just need to make new
 > >resources and groups for the new machines, and go from there with building
 > >them.
 > >In the dhcpd.conf file... pxelinux.bin is used if you're network booting
 > >with the PXE boot rom on the card.  tagged is used if you're using an
 > >etherboot floppy.
 > >The error you see about gdm is normal while you're booted on an NFS mounted
 > >filesystem (network booted).  I'm not sure what effect changing those
 > >permissions might have.
 > >Let me know how things progress...
 > >
 > >          Jeremy
 > >
 > >
 > >At 05:30 PM 10/10/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
 > > >Jeremy,
 > > >  Per my voice mail to you, I tried to add the new PC to the cluster. I
 > > >figured out some things, like defining a machine, allocating resources,
 > > >deallocating resources, etc. I also created a disk table for a 40 gig
 > > >disk and allocated it to the new PC. I tried so many things to get the
 > > >new PC working but to no avail. I once got it working to where I could at
 > > >least log into the node as root but it complained about ownership on
 > > >/var/gdm. After correcting the ownership on /var/gdm the screen on the new
 > > >node went blank and the system was in a hung state. I looked at the log
 > > >file /tftpboot/lim/log/node09.log and noticed that it complained that
 > > >the disk partitioning was not correct. I corrected the disktable file and
 > > >tried to reboot, but the system did not boot up properly at this point.
 > > >
 > > >I tried to use PXE and a floppy diskette install but had no luck. One
 > > >thing to mention, which may have messed things up, is that within the
 > > >oscar wizard I clicked on step 6. I looked at the scripts that start with
 > > >step 6 and it is pointing to the pre_install.part2 script. I noticed this
 > > >script made changes to the dhcpd.conf file under /etc and placed all
 > > >nodes as file /tftpboot/pxelinux.bin under filename instead of
 > > >/tftpboot/tagged. So, I updated the file manually and placed everything
 > > >back the way it was in the dhcpd.conf file. I noticed that the dhcpd
 > > >daemon was not running so I restarted it again. Is the pxelinux.bin for
 > > >PCs with PXE ready on them? Can I use this for the new PC? Is there
 > > >anything critical that step 6 (pre_install.part2) changes that I need to
 > > >be concerned about? I have listed the pre_install.part2 script below.
 > > >
 > > >I'm going to remove the new node completely and start over. I know I'm
 > > >real close to getting this system installed.
 > > >
 > > >Any help would be appreciated.......Please get back with me as soon as
 > > >you can.
 > > >
 > > >thanks,
 > > >
 > > >Gilbert
 > > >
 > > >[root@pleiades scripts]# more pre_install.part2
 > > >#!/bin/sh
 > > >
 > > ># pre_install.part2 - script to do part2 of the pre-client-install server setup
 > > ># Last Updated 11/16/00 by Michael Brim ([EMAIL PROTECTED])
 > > >
 > > ># Install C3 Tools & Supporting Programs/Files
 > > >
 > > >   echo "Installing C3 Tools"
 > > >   cd ../c3
 > > >   ./lui_to_ORNL -l /tftpboot/lim -ORNL /etc/ORNLcluster.def
 > > >   ./c3_install /etc/ORNLcluster.def /tftpboot/pxelinux.bin
 > > >
 > > ># Install PBS Server
 > > >
 > > >   echo "Installing PBS Server RPM"
 > > >   cd ../pbs
 > > >   ./pbs_server_install
 > > >
 > > ># Done
 > > >
 > > >   echo
 > > >   echo "Server Pre-Client-Install Complete - Begin booting client nodes"
 > > >
 > > >
 > > >
 > > >
 > > >-----Original Message-----
 > > >From: Jeremy Enos
 > > >Sent: Tuesday, October 09, 2001 8:32 PM
 > > >To: Chavez, Gilbert R SITI-IT-DSAS
 > > >Subject: Re: adding nodes to the cluster
 > > >
 > > >
 > > >I can give you some help with this...  I've got to run right now
 > > >though.  I'll send you something later.
 > > >
 > > >          Jeremy
 > > >
 > > >
 > > >At 06:42 PM 10/9/2001 -0500, you wrote:
 > > >
 > > > >Jeremy,
 > > > >  We want to add more systems to the cluster we have here at Shell. For
 > > > >now, we have one system that we want to add to the cluster for testing
 > > > >purposes, but the architecture is different. This machine is a clone
 > > > >(Systemax) which has a 1.8 GHz processor with a 40 gig disk. Since this
 > > > >PC is different from our other cluster systems (Dell), what do we need
 > > > >to do to get this machine added to the cluster properly (such as
 > > > >disktables, etc)?
 > > > >
 > > > >Our install procedures also need to be updated. I have listed them
 > > > >below, please let me know what step(s) are missing for a new install.
 > > > >
 > > > >- Run glui and define a machine
 > > > >
 > > > >- Define and group then allocate resources for the node, then boot the
 > > > >node (I don't remember how to do this step, please advise)
 > > > >
 > > > >- Once the system is at the login prompt remove the floppy and reset
 > > > >node.
 > > > >
 > > > >- Run oscar_wizard step 7 only
 > > > >
 > > > >- Run node_setup NODENAME
 > > > >
 > > > >Aren't we supposed to add the new host to the /etc/hosts file and update
 > > > >other files like the dhcpd.conf file before starting up glui? Do we
 > > > >still need to do an "upresources" and/or "upnodes", or maybe an
 > > > >upresourcesfast for a faster machine like the one we want to install?
 > > > >
 > > > >Any help would be appreciated...
 > > > >
 > > > >thanks,
 > > > >
 > > > >Gilbert Chavez


_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users



