RE: [Oscar-users] RE: adding nodes to the cluster

Jeremy Enos Tue, 30 Oct 2001 12:49:41 -0600 (CST)

<x-flowed>
Is this on the network boot or the HD boot?

        Jeremy


At 12:46 PM 10/30/2001 -0600, Chavez, Gilbert R SITI-ITDSAS wrote:

Jeremy,
 I noticed that during the boot up session of the additional node I'm adding
to the cluster, I received a message stating that "fsck.nfs" was not found.
Where is it trying to find this fsck.nfs?

regards,

Gilbert

-----Original Message-----
From: Jeremy Enos [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 22, 2001 3:55 PM
To: Chavez, Gilbert R SITI-ITDSAS; 'Richard C Ferri'; Chavez, Gilbert R
SITI-ITDSAS
Cc: [EMAIL PROTECTED]
Subject: RE: [Oscar-users] RE: adding nodes to the cluster


Hi Gilbert-
You shouldn't need to build a custom kernel in all likelihood.  I think you
have rpm installed version 2.2.16-3 on your older nodes.  You also
shouldn't need to replace any of the kernels in /tftpboot/.  Those are the
universal "support everything" kernels that are only used for booting over
the network for node builds.   It sounds like you're getting booted
successfully, and the problem is down the line from there.
I'll leave the rest to Rich-

         Jeremy

At 01:06 PM 10/22/2001 -0500, Chavez, Gilbert R SITI-ITDSAS wrote:
>Rich,
>  Thanks for responding. When the cluster was built for us it was built
>using 8 nodes and 1 head node. These 8 nodes all had the same architecture:
>Dells with 1gig memory, an 866MHz CPU, and 18gig SCSI disk drives. The
>additional system I'm trying to install to the cluster is a Systemax with
>1gig memory, a 1.8GHz CPU, and a 40gig IDE drive. As you can see, this new
>system has a different architecture. I added this system using glui, adding
>it to the existing group but creating a new resource for the disk table
>(40gig instead of 18gig). I made sure that this new node was listed in the
>proper files (/etc/MAC.info, dhcpd.conf, c3.conf, etc.). However, I have not
>rebuilt a kernel for this new machine. Do I need to rebuild a kernel? If so,
>how do I rebuild a kernel for a cluster system and where do I put it? The
>cluster is running Linux 6.2 and I know how to build a kernel on 7.1 (I
>recently attended a Linux class). I'm assuming kernel rebuilding is the
>same between versions? Below is the disktable I'm using. It is the same
>disktable I'm using for the 18gig drives, except that I changed the disk
>device from sda (in the original disk table for the 18gig drives) to hda
>for IDE, but the system keeps crashing while performing an rdev. What
>kernel am I using for the cluster systems? Is it the kernel in
>/tftpboot/bzImage? If I have to rebuild the kernel, do I rebuild it on the
>head node and copy it to /tftpboot/bzImage? There is also a file called
>upramdisk163.ramdisk in /tftpboot/tar. Do I need to rebuild this file too?
>The head node is also a Dell, but the difference is that it has two
>866MHz CPUs.
>
>I will try to boot the system in network mode and enter the rdev from the
>command line as you suggested. Do I hit the "tab" key to get to where I can
>enter the rdev command before it tries to boot from harddrive?
>
>Any help would be appreciated.
>
>Gilbert
>
>Disk table under /usr/local/oscar/lui_sources
>/dev/hda1       ext2            3       c       y       /boot
>/dev/hda2       extended        2210    c       n
>/dev/hda5       ext2            2190    c       n       /
>/dev/hda6       swap            20      c
>nfs             nfs             /home   rw      10.0.0.50
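A quick sanity check on a disk table like the one above can be sketched in shell. This is only a sketch under assumptions the thread does not confirm: that the third column is a size in cylinders (the fourth column being "c"), and that hda5 and up are logical partitions living inside the extended partition. The name check_disktable is made up for illustration and is not part of LUI or OSCAR.

```shell
#!/bin/sh
# check_disktable FILE TOTAL_CYLINDERS
# Hypothetical helper, not part of LUI or OSCAR.
# Assumed columns: device, type, size, unit (c = cylinders), bootable, mount.
check_disktable() {
    awk -v total="$2" '
        $1 !~ /^\/dev\// { next }            # skip the nfs line and blank lines
        $2 == "extended" { ext = $3; top += $3; next }
        {
            n = $1; sub(/^[^0-9]*/, "", n)   # partition number, e.g. 5 for hda5
            if (n + 0 >= 5) logical += $3    # logical partitions
            else            top += $3        # primary partitions
        }
        END {
            printf "primary+extended=%d logical=%d\n", top, logical
            if (logical > ext) { print "logical partitions overflow extended"; bad = 1 }
            if (top > total)   { print "table is larger than the disk";        bad = 1 }
            exit bad
        }
    ' "$1"
}
```

Saved without the leading ">" quote marks and checked against the 4870 cylinders reported for this disk, the table above comes out as primary+extended=2213 and logical=2210, so under these assumptions the sizes themselves are consistent and only use part of the disk.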
>
>
>
>
>-----Original Message-----
>From: Richard C Ferri
>Sent: Friday, October 19, 2001 1:21 PM
>To: Chavez, Gilbert R SITI-IT-DSAS
>Cc: [EMAIL PROTECTED]
>Subject: Re: [Oscar-users] RE: adding nodes to the cluster
>
>
>
>Gilbert,
>      It seems like you're getting to the very end of the clone script (the
>one that copies stuff from the server to client, and installs all the RPMs)
>and then rdev is failing.  rdev is doing something really simple -- it's
>setting the root device for the kernel that's permanently installed on your
>local harddrive. The command that is failing is line 469:
>
>rdev /mnt/boot/vmlinuz $rootpart
>
>where $rootpart is the root partition name (e.g. /dev/sda6).
>
>I am having some trouble understanding how all those nice RPMs got
>installed and lilo ran, but then the rdev command failed.  It is definitely
>not normal to see all those errors on reboot. What is happening is that
>when the kernel is loaded it doesn't know where to find its root file
>system, and as we know, life is meaningless without root.
>
>I'd like a) to see what your disk partition file looks like and b) for you
>to run the rdev command on the node while it's still in network boot mode
>(before you boot it from harddrive). If you can debug a little perl, put a
>breakpoint on the rdev command in clone, and display what $rootpart is.
>My guess is that somehow clone is confused about what the root partition is
>named, and as a result rdev is failing, causing root not to get mounted on
>reboot (thus all those nasty error messages).
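To see what rdev actually recorded, the root device can also be read straight out of the kernel image. The sketch below assumes the classic x86 image layout that rdev patches, with the root device stored as two bytes at offset 508 (minor number first, then major number). The function name root_dev_of is made up for illustration.

```shell
#!/bin/sh
# root_dev_of IMAGE
# Hypothetical helper: print the root device numbers stored in a kernel image.
# Assumption: root device lives at byte offset 508 (minor) and 509 (major).
root_dev_of() {
    image=$1
    # od -An: no address column; -tu1: unsigned one-byte decimals;
    # -j508: skip to offset 508; -N2: read exactly two bytes.
    set -- $(od -An -tu1 -j508 -N2 "$image")
    minor=$1
    major=$2
    echo "major=$major minor=$minor"
}
```

For example, root_dev_of /mnt/boot/vmlinuz should report major=3 minor=5 if the root device was set to /dev/hda5, and major=8 minor=6 for /dev/sda6. Seeing SCSI numbers on an IDE node would point at clone using the wrong $rootpart.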
>
>Rich
>
>Richard Ferri
>IBM Linux Technology Center
>[EMAIL PROTECTED]
>845.433.7920
>
>"Chavez, Gilbert R SITI-IT-DSAS" <[EMAIL PROTECTED]> (by way of Jeremy
>Enos <[EMAIL PROTECTED]>)@lists.sourceforge.net on 10/19/2001 12:59:05 PM
>
>Sent by:  [EMAIL PROTECTED]
>
>
>To:   [EMAIL PROTECTED]
>cc:
>Subject:  [Oscar-users] RE: adding nodes to the cluster
>
>
>
>Well, I'm getting a little closer to succeeding on this one node. It's
>giving me FITS! I'm so close I can smell it! Here's what it's doing now:
>
>- I used the exact cylinder numbers from the 18gig disk table (old table)
>for the 40gig disk table file and succeeded, or at least I got past this
>stage. I will tweak the numbers to correctness later.
>
>- The boot process got farther and started loading the RPMs. However, I
>received an error (listed below) regarding rdev. Do you have any clues to
>what is causing this error?
>
>There are a lot of FAILED messages on the screen during the initial bootup
>but the screen scrolls too fast to where I can't see what the failures are.
>Is this normal to have these failures during the first go-around?
>
>Here are some messages I did see and was able to write down:
>/etc/rc.d/rc5.d/S99local: /proc/sys/net/ipv4/ip_forward - no such file (BUT
>IT'S THERE ON THE SERVER)
>/var/lib/nfs/etab - couldn't stat
>nfssvc not supported
>unable to open nfs
>could open /mnt/etc/group
>   "" ""  ""  "   "  /passwd
>   "  "   "    "   " etc....
>
>Also, there is a message about the disk having 4870 cylinders and it being
>larger than 1024 may cause a problem.
>
>
>Excerpt from the node09.log file
>
>: about to read the client resource allocation table
>: about to partition the harddrive
>: about to execute part2 to partition the harddrive using /tar/40gig.disk as the file allocation table
>: about to install rpms for an RPM type installation
>: about to copy /tar/group.source to /mnt/etc/group
>: about to copy /tar/myshadow.source to /mnt/etc/shadow
>: about to copy /tar/rhosts.source to /mnt/root/.rhosts
>: about to copy /tar/passwd.source to /mnt/etc/passwd
>: about to copy /tar/gshadow.source to /mnt/etc/gshadow
>: about to create the /etc/fstab in the permanent root file system
>: about to copy any user exit scripts to /tmp/exit
>: about to create the kernel system map
>: about to create /etc/lilo.conf and run lilo
>: the rdev command failed, exiting with error
>
>Any help would be appreciated.....
>
>regards,
>
>Gilbert
>
>
>
>-----Original Message-----
>From: Jeremy Enos
>Sent: Thursday, October 11, 2001 4:27 PM
>To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
>Subject: RE: adding nodes to the cluster
>
>
>Yep... looks like you're getting booted ok, but the "clone" script is
>having trouble parsing the disktable file.  You may want to just use one of
>the samples in OSCAR-1.0/oscarResources/.
>You can edit an already created resource directly and probably save
>yourself some overhead.  The resource files are generated in /tftpboot/tar/.
>Do you know if you have the same ethernet adapter in your new systems as
>your old systems?  If you're getting booted with a floppy disk, then I
>suspect the NIC is the same or it wouldn't have worked.  While network
>booted, the universal, support everything kernel that is used will spew
>many error messages that don't mean anything.... I'm not sure about the
>network card errors you're seeing though.  I'd continue trying the way
>you're going though, because the problem you're running into right now
>seems to be with parsing that disktable file.
>
>           Jeremy
>
>
>At 04:09 PM 10/11/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
>  >Thanks for responding. How do you build a new etherboot diskette and
>  >ramdisk? I'm able to boot the new node but I received an error (see
>  >below) saying that the disk table is bad. Is this because of the
>  >etherboot disk you suggested? Also, every once in a while I receive
>  >"eth0:card not receiving RX buffer" and "eth0:card no receiving
>  >resources". Maybe this is due to the etherboot floppy we are using.
>  >
>  >: about to execute part2 to partition the harddrive using /tar/40gig.disk as the file allocation table
>  >: an error occurred during disk partitioning, exiting with error
>  >
>  >regards,
>  >
>  >Gilbert
>  >
>  >-----Original Message-----
>  >From: Jeremy Enos
>  >Sent: Thursday, October 11, 2001 3:34 PM
>  >To: Chavez, Gilbert R SITI-IT-DSAS
>  >Subject: RE: adding nodes to the cluster
>  >
>  >
>  >All PXE capable ethernet cards should work just fine with pxelinux.bin
>  >(unless that card's PXE support is bad).  You should only be using the
>  >tagged image if you don't have working PXE support, and you boot from a
>  >floppy.  (I think this is what we did on the original nodes)  Now, that
>  >floppy that we generated is specific to the ethernet card in the original
>  >nodes.  If the new nodes have a different card, then you will have to
>  >generate a new etherboot floppy.
>  >
>  >
>  >          Jeremy
>  >
>  >At 06:59 PM 10/10/2001 -0500, you wrote:
>  > >Well, I tried to boot from pxelinux.bin and with tagged, but no luck.
>  > >With pxelinux.bin the system tries to boot but tells me that
>  > >"pxelinux.bin" is a wrong image tag. Was this feature tested when you
>  > >were here? The pxelinux.bin file is a data file and the tagged file is
>  > >an "x86 boot sector". I think the problem is with the data file
>  > >pxelinux.bin; shouldn't it be a boot sector like the tagged file? How
>  > >do I get the correct pxelinux.bin file? Can I download it? Trying to
>  > >boot with tagged I receive "eth0: found no sources on card" and "no RX
>  > >buffer" error messages. Have you seen these errors before?
>  > >
>  > >regards,
>  > >
>  > >Gilbert
>  > >
>  > >-----Original Message-----
>  > >From: Jeremy Enos
>  > >Sent: Wednesday, October 10, 2001 5:43 PM
>  > >To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
>  > >Subject: RE: adding nodes to the cluster
>  > >
>  > >
>  > >Hi Gilbert-
>  > >Sorry it's taking me so long... I'm at a conference in LA all week.
>  > >Anyway...
>  > >Sounds like you're doing pretty well... you basically just need to make
>  > >new resources and groups for the new machines, and go from there with
>  > >building them.
>  > >In the dhcpd.conf file... pxelinux.bin is used if you're network booting
>  > >with the PXE boot rom on the card.  tagged is used if you're using an
>  > >etherboot floppy.
>  > >The error you see about gdm is normal while you're booted on an NFS
>  > >mounted filesystem (network booted).  I'm not sure what effect changing
>  > >those permissions might have.
>  > >Let me know how things progress...
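The pxelinux.bin versus tagged choice above can be sketched as a small shell helper that prints a dhcpd.conf host entry. This assumes the cluster's dhcpd.conf uses one host block per node, which the thread does not show. The function name dhcp_host_stanza and the MAC and IP values in the usage line are placeholders, not values from this cluster.

```shell
#!/bin/sh
# dhcp_host_stanza NAME MAC IP BOOTMODE
# Hypothetical helper; prints a dhcpd.conf host entry for one node.
# BOOTMODE is "pxe" (PXE boot ROM on the card) or "etherboot" (boot floppy).
dhcp_host_stanza() {
    case "$4" in
        pxe)       bootfile=/tftpboot/pxelinux.bin ;;
        etherboot) bootfile=/tftpboot/tagged ;;
        *)         echo "unknown boot mode: $4" >&2; return 1 ;;
    esac
    cat <<EOF
host $1 {
    hardware ethernet $2;
    fixed-address $3;
    filename "$bootfile";
}
EOF
}
```

Running dhcp_host_stanza node09 00:11:22:33:44:55 10.0.0.9 etherboot prints an entry whose filename is /tftpboot/tagged. After any change to dhcpd.conf the dhcpd daemon has to be restarted for it to take effect.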
>  > >
>  > >          Jeremy
>  > >
>  > >
>  > >At 05:30 PM 10/10/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
>  > > >Jeremy,
>  > > >  Per my voice mail to you, I tried to add the new PC to the cluster.
>  > > >I figured out some things, like defining a machine, allocating
>  > > >resources, deallocating resources, etc. I also created a disk table
>  > > >for a 40gig disk and allocated it to the new PC. I tried so many
>  > > >things to get the new PC working but to no avail. I once got it
>  > > >working to where I could at least log into the node as root, but it
>  > > >complained about ownership on /var/gdm. After I corrected the
>  > > >ownership on /var/gdm, the screen on the new node went blank and the
>  > > >system was in a hung state. I looked at the log file
>  > > >/tftpboot/lim/log/node09.log and noticed that it complained that the
>  > > >disk partitioning was not correct. I corrected the disktable file and
>  > > >tried to reboot, but the system did not boot up properly at this
>  > > >point.
>  > > >
>  > > >I tried to use PXE and a floppy diskette install but had no luck. One
>  > > >thing to mention, which may have messed things up, is that I went into
>  > > >the oscar wizard and clicked on step 6. I looked at the scripts behind
>  > > >step 6, and it points to the pre_install.part2 script. I noticed this
>  > > >script made changes to the dhcpd.conf file under /etc and set filename
>  > > >to /tftpboot/pxelinux.bin for all nodes instead of /tftpboot/tagged.
>  > > >So I updated the file manually and put everything in dhcpd.conf back
>  > > >the way it was. I noticed that the dhcpd daemon was not running, so I
>  > > >restarted it. Is pxelinux.bin for PCs that have PXE on them? Can I use
>  > > >this for the new PC? Is there anything critical that step 6
>  > > >(pre_install.part2) changes that I need to be concerned about? I have
>  > > >listed the pre_install.part2 script below.
>  > > >
>  > > >I'm going to remove the new node completely and start over. I know
>  > > >I'm real close to getting this system installed.
>  > > >
>  > > >Any help would be appreciated.......Please get back with me as soon
>  > > >as you can.
>  > > >
>  > > >thanks,
>  > > >
>  > > >Gilbert
>  > > >
>  > > >[root@pleiades scripts]# more pre_install.part2
>  > > >#!/bin/sh
>  > > >
>  > > ># pre_install.part2 - script to do part2 of the pre-client-install server setup
>  > > ># Last Updated 11/16/00 by Michael Brim ([EMAIL PROTECTED])
>  > > >
>  > > ># Install C3 Tools & Supporting Programs/Files
>  > > >
>  > > >   echo "Installing C3 Tools"
>  > > >   cd ../c3
>  > > >   ./lui_to_ORNL -l /tftpboot/lim -ORNL /etc/ORNLcluster.def
>  > > >   ./c3_install /etc/ORNLcluster.def /tftpboot/pxelinux.bin
>  > > >
>  > > ># Install PBS Server
>  > > >
>  > > >   echo "Installing PBS Server RPM"
>  > > >   cd ../pbs
>  > > >   ./pbs_server_install
>  > > >
>  > > ># Done
>  > > >
>  > > >   echo
>  > > >   echo "Server Pre-Client-Install Complete - Begin booting client nodes"
>  > > >
>  > > >
>  > > >
>  > > >
>  > > >-----Original Message-----
>  > > >From: Jeremy Enos
>  > > >Sent: Tuesday, October 09, 2001 8:32 PM
>  > > >To: Chavez, Gilbert R SITI-IT-DSAS
>  > > >Subject: Re: adding nodes to the cluster
>  > > >
>  > > >
>  > > >I can give you some help with this...  I've got to run right now
>  > > >though.  I'll send you something later.
>  > > >
>  > > >          Jeremy
>  > > >
>  > > >
>  > > >At 06:42 PM 10/9/2001 -0500, you wrote:
>  > > >
>  > > > >Jeremy,
>  > > > >  We want to add more systems to the cluster we have here at Shell.
>  > > > >For now, we have one system that we want to add to the cluster for
>  > > > >testing purposes, but the architecture is different. This machine is
>  > > > >a clone (Systemax) which has a 1.8 GHz processor with a 40 gig disk.
>  > > > >Since this PC is different from our other cluster systems (Dell),
>  > > > >what all do we need to do to get this machine added to the cluster
>  > > > >properly (such as disktables, etc)?
>  > > > >
>  > > > >Our install procedures also need to be updated. I have listed them
>  > > > >below; please let me know what step(s) are missing for a new install.
>  > > > >
>  > > > >- Run glui and define a machine
>  > > > >
>  > > > >- Define and group, then allocate resources for the node, then boot
>  > > > >the node (I don't remember how to do this step, please advise)
>  > > > >
>  > > > >- Once the system is at the login prompt remove the floppy and reset
>  > > > >node.
>  > > > >
>  > > > >- Run oscar_wizard step 7 only
>  > > > >
>  > > > >- Run node_setup NODENAME
>  > > > >
>  > > > >Aren't we supposed to add the new host to the /etc/hosts file and
>  > > > >update other files like the dhcpd.conf file before starting up glui?
>  > > > >Do we still need to do an "upresources" and/or "upnodes", or maybe an
>  > > > >upresourcesfast for a faster machine like the one we want to install?
>  > > > >
>  > > > >Any help would be appreciated...
>  > > > >
>  > > > >thanks,
>  > > > >
>  > > > >Gilbert Chavez
>
>
>_______________________________________________
>Oscar-users mailing list
>[EMAIL PROTECTED]
>https://lists.sourceforge.net/lists/listinfo/oscar-users
>
>
>



</x-flowed>
