RE: [Oscar-users] RE: adding nodes to the cluster

Chavez, Gilbert R SITI-ITDSAS Tue, 30 Oct 2001 12:56:54 -0600 (CST)

On the network boot... I also receive mountd errors after that. Copying
files to the local /etc dir, for example, cannot be done either. Nothing is
mounted at this point.


regards,

Gilbert

-----Original Message-----
From: Jeremy Enos [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 30, 2001 12:50 PM
To: Chavez, Gilbert R SITI-ITDSAS
Cc: [EMAIL PROTECTED]
Subject: RE: [Oscar-users] RE: adding nodes to the cluster


Is this on the network boot or the HD boot?

         Jeremy

At 12:46 PM 10/30/2001 -0600, Chavez, Gilbert R SITI-ITDSAS wrote:
>Jeremy,
>  I noticed that during the boot of the additional node I'm adding
>to the cluster, I received a message stating "fsck.nfs" not found. Where
>is it trying to find this fsck.nfs?
>
>regards,
>
>Gilbert
>
>-----Original Message-----
>From: Jeremy Enos [mailto:[EMAIL PROTECTED]]
>Sent: Monday, October 22, 2001 3:55 PM
>To: Chavez, Gilbert R SITI-ITDSAS; 'Richard C Ferri'; Chavez, Gilbert R
>SITI-ITDSAS
>Cc: [EMAIL PROTECTED]
>Subject: RE: [Oscar-users] RE: adding nodes to the cluster
>
>
>Hi Gilbert-
>You shouldn't need to build a custom kernel, in all likelihood.  I think you
>have the rpm-installed kernel, version 2.2.16-3, on your older nodes.  You also
>shouldn't need to replace any of the kernels in /tftpboot/.  Those are the
>universal "support everything" kernels that are only used for booting over
>the network for node builds.   It sounds like you're getting booted
>successfully, and the problem is down the line from there.
>I'll leave the rest to Rich-
>
>          Jeremy
>
>At 01:06 PM 10/22/2001 -0500, Chavez, Gilbert R SITI-ITDSAS wrote:
> >Rich,
> >  Thanks for responding. When the cluster was built for us it was built
> >using 8 nodes and 1 head node. The 8 nodes all have the same architecture:
> >Dells with 1gig memory, an 866MHz CPU, and an 18gig SCSI disk drive.
> >The additional system I'm trying to install to the cluster is a Systemax
> >with 1gig memory, a 1.8GHz CPU, and a 40gig IDE drive. As you can see, this
> >new system has a different architecture. I added this system using glui,
> >adding it to the existing group, but created a new resource for the disk
> >table (40gig instead of 18gig). I made sure that this new node was listed
> >in the proper files (/etc/MAC.info, dhcpd.conf, c3.conf, etc.). However, I
> >have not rebuilt a kernel for this new machine. Do I need to rebuild a
> >kernel? If so, how do I rebuild a kernel for a cluster system and where do
> >I put it? The cluster is running Linux 6.2 and I know how to build a kernel
> >on 7.1 (I recently attended a Linux class); I'm assuming kernel rebuilding
> >is the same between versions? Below is the disk table I'm using. It is the
> >same disk table I'm using for the 18gig drives, except that I changed the
> >disk device from sda to hda for IDE, but the system keeps crashing while
> >performing an rdev. What kernel am I using for the cluster systems? Is it
> >the kernel in /tftpboot/bzImage? If I have to rebuild the kernel, do I
> >rebuild it on the head node and copy it to /tftpboot/bzImage? There is also
> >a file called upramdisk163.ramdisk in /tftpboot/tar; do I need to rebuild
> >this file too? The head node is also a Dell, but the difference is that it
> >has two 866MHz CPUs.
> >
> >I will try to boot the system in network mode and enter the rdev command
> >from the command line as you suggested. Do I hit the "tab" key to get to
> >where I can enter the rdev command before it tries to boot from the hard drive?
> >
> >Any help would be appreciated.
> >
> >Gilbert
> >
> >Disk table under /usr/local/oscar/lui_sources
> >/dev/hda1       ext2            3       c       y       /boot
> >/dev/hda2       extended        2210    c       n
> >/dev/hda5       ext2            2190    c       n       /
> >/dev/hda6       swap            20      c
> >nfs             nfs             /home   rw      10.0.0.50
> >
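
The root device that rdev has to be given is the row of this disk table whose mount point is /. A minimal sketch of pulling that device name out of the table, assuming the columns are device, type, size, a flag, a bootable flag, and mount point as they appear above; the function name root_device and the temporary sample file are made up for illustration and are not part of LUI or OSCAR.

```shell
#!/bin/sh
# Print the device whose last field (assumed to be the mount point) is "/".
# The column meaning is an assumption read off the table quoted above.
root_device() {
    awk '$NF == "/" { print $1 }' "$1"
}

# Hypothetical scratch copy of the table from the message above.
table=$(mktemp)
cat > "$table" <<'EOF'
/dev/hda1       ext2            3       c       y       /boot
/dev/hda2       extended        2210    c       n
/dev/hda5       ext2            2190    c       n       /
/dev/hda6       swap            20      c
nfs             nfs             /home   rw      10.0.0.50
EOF

root_device "$table"    # prints /dev/hda5
rm -f "$table"
```

If the name printed here differs from what the clone script ends up passing to rdev, that would point at the table parsing rather than at the kernel.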
> >
> >
> >
> >-----Original Message-----
> >From: Richard C Ferri
> >Sent: Friday, October 19, 2001 1:21 PM
> >To: Chavez, Gilbert R SITI-IT-DSAS
> >Cc: [EMAIL PROTECTED]
> >Subject: Re: [Oscar-users] RE: adding nodes to the cluster
> >
> >
> >
> >Gilbert,
> >      It seems like you're getting to the very end of the clone script
> >(the one that copies stuff from the server to client, and installs all the
> >RPMs) and then rdev is failing.  rdev is doing something really simple -- it's
> >setting the root device for the kernel that's permanently installed on your
> >local harddrive. The command that is failing is line 469:
> >
> >rdev /mnt/boot/vmlinuz $rootpart
> >
> >where $rootpart is the root partition name (e.g. /dev/sda6).
> >
> >I am having some trouble understanding how all those nice RPMs got
> >installed and lilo ran, but then the rdev command failed.  It is definitely
> >not normal to see all those errors on reboot. What is happening is that
> >when the kernel is loaded it doesn't know where to find its root file
> >system, and as we know, life is meaningless without root.
> >
> >I'd like a) to see what your disk partition file looks like and b) you
> >to run the rdev command on the node while it's still in network boot mode
> >(before you boot it from harddrive). If you can debug a little perl, put a
> >breakpoint on the rdev command in clone, and display what $rootpart is.
> >My guess is that somehow clone is confused about what the root partition is
> >named, and as a result rdev is failing, causing root not to get mounted on
> >reboot (thus all those nasty error messages).
> >
> >Rich
> >
> >Richard Ferri
> >IBM Linux Technology Center
> >[EMAIL PROTECTED]
> >845.433.7920
> >
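
Before running the rdev command by hand in network boot mode, it may help to confirm that both of its arguments exist on the node. A minimal sketch, assuming the kernel path /mnt/boot/vmlinuz from the clone script line quoted above; the helper name check_rdev_inputs is hypothetical and the sketch does not call rdev itself.

```shell
#!/bin/sh
# Sanity-check the two arguments that would be passed to rdev.
# check_rdev_inputs is an illustrative helper, not part of clone or OSCAR.
check_rdev_inputs() {
    image=$1
    rootpart=$2
    if [ -z "$rootpart" ]; then
        echo "rootpart is empty" >&2
        return 1
    fi
    if [ ! -f "$image" ]; then
        echo "no kernel image at $image" >&2
        return 1
    fi
    if [ ! -b "$rootpart" ]; then
        echo "$rootpart is not a block device" >&2
        return 1
    fi
    echo "ok: rdev $image $rootpart"
}

# Example: an empty $rootpart, the case suspected above, is reported first.
check_rdev_inputs /mnt/boot/vmlinuz "" || true
```

On the node itself the call would be something like check_rdev_inputs /mnt/boot/vmlinuz /dev/hda5, followed by the real rdev command only if the check passes.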
> >"Chavez, Gilbert R SITI-IT-DSAS" <[EMAIL PROTECTED]> (by way of Jeremy
> >Enos <[EMAIL PROTECTED]>)@lists.sourceforge.net on 10/19/2001 12:59:05 PM
> >
> >Sent by:  [EMAIL PROTECTED]
> >
> >
> >To:   [EMAIL PROTECTED]
> >cc:
> >Subject:  [Oscar-users] RE: adding nodes to the cluster
> >
> >
> >
> >Well, I'm getting a little closer to succeeding on this one node. It's
> >giving me FITS! I'm so close I can smell it! Here's what it's doing now:
> >
> >- I used the exact cylinder numbers from the 18gig disk table (old table)
> >for the 40gig disk table file and succeeded, or at least I got past this
> >stage. I will tweak the numbers to correctness later.
> >
> >- The boot process got farther and started loading the RPMs. However, I
> >received an error (listed below) regarding rdev. Do you have any clues to
> >what is causing this error?
> >
> >There are a lot of FAILED messages on the screen during the initial bootup,
> >but the screen scrolls too fast for me to see what the failures are.
> >Is it normal to have these failures during the first go-around?
> >
> >Here are some messages I did see and was able to write down:
> >/etc/rc.d/rc5.d/S99local: /proc/sys/net/ipv4/ip_forward - no such file
> >(BUT IT'S THERE ON THE SERVER)
> >/var/lib/nfs/etab - couldn't stat
> >nfssvc not supported
> >unable to open nfs
> >could open /mnt/etc/group
> >   "" ""  ""  "   "  /passwd
> >   "  "   "    "   " etc....
> >
> >Also, there is a message about the disk having 4870 cylinders, and that a
> >value larger than 1024 may cause a problem.
> >
> >
> >Excerpt from the node09.log file
> >
> >: about to read the client resource allocation table
> >: about to partition the harddrive
> >: about to execute part2 to partition the harddrive using /tar/40gig.disk
> >as the file allocation table
> >: about to install rpms for an RPM type installation
> >: about to copy /tar/group.source to /mnt/etc/group
> >: about to copy /tar/myshadow.source to /mnt/etc/shadow
> >: about to copy /tar/rhosts.source to /mnt/root/.rhosts
> >: about to copy /tar/passwd.source to /mnt/etc/passwd
> >: about to copy /tar/gshadow.source to /mnt/etc/gshadow
> >: about to create the /etc/fstab in the permanent root file system
> >: about to copy any user exit scripts to /tmp/exit
> >: about to create the kernel system map
> >: about to create /etc/lilo.conf and run lilo
> >: the rdev command failed, exiting with error
> >
> >Any help would be appreciated.....
> >
> >regards,
> >
> >Gilbert
> >
> >
> >
> >-----Original Message-----
> >From: Jeremy Enos
> >Sent: Thursday, October 11, 2001 4:27 PM
> >To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
> >Subject: RE: adding nodes to the cluster
> >
> >
> >Yep... looks like you're getting booted ok, but the "clone" script is
> >having trouble parsing the disktable file.  You may want to just use one of
> >the samples in OSCAR-1.0/oscarResources/.
> >You can edit an already created resource directly and probably save
> >yourself some overhead.  The resource files are generated in /tftpboot/tar/.
> >Do you know if you have the same ethernet adapter in your new systems as
> >your old systems?  If you're getting booted with a floppy disk, then I
> >suspect the NIC is the same or it wouldn't have worked.  While network
> >booted, the universal, support everything kernel that is used will spew
> >many error messages that don't mean anything.... I'm not sure about the
> >network card errors you're seeing though.  I'd continue trying the way
> >you're going though, because the problem you're running into right now
> >seems to be with parsing that disktable file.
> >
> >           Jeremy
> >
> >
> >At 04:09 PM 10/11/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
> >  >Thanks for responding. How do you build a new etherboot diskette and
> >  >ramdisk? I'm able to boot the new node but I received an error (see
> >  >the following) that the disk table is bad. Is this bad because of the
> >  >etherboot disk you suggested? Also, every once in a while I receive
> >  >"eth0:card not receiving RX buffer" and "eth0:card no receiving
> >  >resources". Maybe this is due to the etherboot floppy we are using.
> >  >
> >  >: about to execute part2 to partition the harddrive using /tar/40gig.disk
> >  >as the file allocation table
> >  >: an error occurred during disk partitioning, exiting with error
> >  >
> >  >regards,
> >  >
> >  >Gilbert
> >  >
> >  >-----Original Message-----
> >  >From: Jeremy Enos
> >  >Sent: Thursday, October 11, 2001 3:34 PM
> >  >To: Chavez, Gilbert R SITI-IT-DSAS
> >  >Subject: RE: adding nodes to the cluster
> >  >
> >  >
> >  >All PXE capable ethernet cards should work just fine with pxelinux.bin
> >  >(unless that card's PXE support is bad).  You should only be using the
> >  >tagged image if you don't have working PXE support and you boot from a
> >  >floppy.  (I think this is what we did on the original nodes.)  Now, that
> >  >floppy that we generated is specific to the ethernet card in the original
> >  >nodes.  If the new nodes have a different card, then you will have to
> >  >generate a new etherboot floppy.
> >  >
> >  >
> >  >          Jeremy
> >  >
> >  >At 06:59 PM 10/10/2001 -0500, you wrote:
> >  > >Well, I tried to boot from pxelinux.bin and with tagged, but no luck.
> >  > >With pxelinux.bin the system tries to boot but tells me that
> >  > >"pxelinux.bin" is a wrong image tag. Was this feature tested when you
> >  > >were here? The pxelinux.bin file is a data file and the tagged file is
> >  > >an "x86 boot sector". I think the problem is with the data file
> >  > >pxelinux.bin; shouldn't it be a boot sector like the tagged file? How do
> >  > >I get the correct pxelinux.bin file? Can I download it? Trying to boot
> >  > >with tagged I receive messages that "eth0: found no sources on card",
> >  > >and "no RX buffer" error messages. Have you seen these errors before?
> >  > >
> >  > >regards,
> >  > >
> >  > >Gilbert
> >  > >
> >  > >-----Original Message-----
> >  > >From: Jeremy Enos
> >  > >Sent: Wednesday, October 10, 2001 5:43 PM
> >  > >To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
> >  > >Subject: RE: adding nodes to the cluster
> >  > >
> >  > >
> >  > >Hi Gilbert-
> >  > >Sorry it's taking me so long... I'm at a conference in LA all week.
> >  > >Anyway...
> >  > >Sounds like you're doing pretty well... you basically just need to make
> >  > >new resources and groups for the new machines, and go from there with
> >  > >building them.
> >  > >In the dhcpd.conf file... pxelinux.bin is used if you're network booting
> >  > >with the PXE boot rom on the card.  tagged is used if you're using an
> >  > >etherboot floppy.
> >  > >The error you see about gdm is normal while you're booted on an NFS
> >  > >mounted filesystem (network booted).  I'm not sure what effect changing
> >  > >those permissions might have.
> >  > >Let me know how things progress...
> >  > >
> >  > >          Jeremy
> >  > >
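
To see at a glance which boot image each node is handed, the host entries in dhcpd.conf can be listed together with their filename statement. A minimal sketch, assuming one host block per node with a filename line inside it; the function name boot_files, the node addresses, and the MAC addresses in the sample are hypothetical, and only the two image paths come from this thread.

```shell
#!/bin/sh
# List "host filename" pairs from a dhcpd.conf-style file.
# Assumes each host block carries its own filename statement.
boot_files() {
    awk '
        $1 == "host"     { host = $2 }
        $1 == "filename" { gsub(/[";]/, "", $2); print host, $2 }
    ' "$1"
}

# Hypothetical sample; real entries live in /etc/dhcpd.conf on the head node.
conf=$(mktemp)
cat > "$conf" <<'EOF'
host node01 {
    hardware ethernet 00:11:22:33:44:01;
    fixed-address 10.0.0.1;
    filename "/tftpboot/tagged";
}
host node09 {
    hardware ethernet 00:11:22:33:44:09;
    fixed-address 10.0.0.9;
    filename "/tftpboot/pxelinux.bin";
}
EOF

boot_files "$conf"
rm -f "$conf"
```

A node that boots from an etherboot floppy but is listed with pxelinux.bin, or the reverse, would stand out in this listing.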
> >  > >
> >  > >At 05:30 PM 10/10/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
> >  > > >Jeremy,
> >  > > >  Per my voice mail to you, I tried to add the new PC to the cluster.
> >  > > >I figured out some things, like defining a machine, allocating
> >  > > >resources, deallocating resources, etc. I also created a disk table
> >  > > >for a 40gig disk and allocated it to the new PC. I tried so many
> >  > > >things to get the new PC working, but to no avail. I once got it
> >  > > >working to where I could at least log into the node as root, but it
> >  > > >complained about ownership on /var/gdm. After I corrected the
> >  > > >ownership on /var/gdm, the screen on the new node went blank and the
> >  > > >system was in a hung state. I looked at the log file
> >  > > >/tftpboot/lim/log/node09.log and noticed that it complained that the
> >  > > >disk partitioning was not correct. I corrected the disktable file and
> >  > > >tried to reboot, but the system did not boot up properly at this point.
> >  > > >
> >  > > >I tried to use PXE and a floppy diskette install but had no luck. One
> >  > > >thing to mention, which may have messed things up: within the oscar
> >  > > >wizard I clicked on step 6. I looked at the scripts that start with
> >  > > >step 6, and it points to the pre_install.part2 script. I noticed this
> >  > > >script made changes to the dhcpd.conf file under /etc and set the
> >  > > >filename for all nodes to /tftpboot/pxelinux.bin instead of
> >  > > >/tftpboot/tagged. So, I updated the file manually and put everything
> >  > > >back the way it was in the dhcpd.conf file. I noticed that the dhcpd
> >  > > >daemon was not running, so I restarted it. Is pxelinux.bin for PCs
> >  > > >that have PXE on them? Can I use this for the new PC? Is there
> >  > > >anything critical that step 6 (pre_install.part2) changes that I need
> >  > > >to be concerned about? I have listed the pre_install.part2 script
> >  > > >below.
> >  > > >
> >  > > >I'm going to remove the new node completely and start over. I know
> >  > > >I'm real close to getting this system installed.
> >  > > >
> >  > > >Any help would be appreciated.......Please get back with me as soon
> >  > > >as you can.
> >  > > >
> >  > > >thanks,
> >  > > >
> >  > > >Gilbert
> >  > > >
> >  > > >[root@pleiades scripts]# more pre_install.part2
> >  > > >#!/bin/sh
> >  > > >
> >  > > ># pre_install.part2 - script to do part2 of the pre-client-install server setup
> >  > > ># Last Updated 11/16/00 by Michael Brim ([EMAIL PROTECTED])
> >  > > >
> >  > > ># Install C3 Tools & Supporting Programs/Files
> >  > > >
> >  > > >   echo "Installing C3 Tools"
> >  > > >   cd ../c3
> >  > > >   ./lui_to_ORNL -l /tftpboot/lim -ORNL /etc/ORNLcluster.def
> >  > > >   ./c3_install /etc/ORNLcluster.def /tftpboot/pxelinux.bin
> >  > > >
> >  > > ># Install PBS Server
> >  > > >
> >  > > >   echo "Installing PBS Server RPM"
> >  > > >   cd ../pbs
> >  > > >   ./pbs_server_install
> >  > > >
> >  > > ># Done
> >  > > >
> >  > > >   echo
> >  > > >   echo "Server Pre-Client-Install Complete - Begin booting client nodes"
> >  > > >
> >  > > >
> >  > > >
> >  > > >
> >  > > >-----Original Message-----
> >  > > >From: Jeremy Enos
> >  > > >Sent: Tuesday, October 09, 2001 8:32 PM
> >  > > >To: Chavez, Gilbert R SITI-IT-DSAS
> >  > > >Subject: Re: adding nodes to the cluster
> >  > > >
> >  > > >
> >  > > >I can give you some help with this...  I've got to run right now
> >  > > >though.  I'll send you something later.
> >  > > >
> >  > > >          Jeremy
> >  > > >
> >  > > >
> >  > > >At 06:42 PM 10/9/2001 -0500, you wrote:
> >  > > >
> >  > > > >Jeremy,
> >  > > > >  We want to add more systems to the cluster we have here at Shell.
> >  > > > >For now, we have one system that we want to add to the cluster for
> >  > > > >testing purposes, but the architecture is different. This machine is
> >  > > > >a clone (Systemax) which has a 1.8 GHz processor with a 40 gig disk.
> >  > > > >Since this PC is different from our other cluster systems (Dell),
> >  > > > >what do we need to do to get this machine added to the cluster
> >  > > > >properly (such as disktables, etc)?
> >  > > > >
> >  > > > >Our install procedures also need to be updated. I have listed them
> >  > > > >below; please let me know what step(s) are missing for a new install.
> >  > > > >
> >  > > > >- Run glui and define a machine
> >  > > > >
> >  > > > >- Define and group then allocate resources for the node, then boot
> >  > > > >the node (I don't remember how to do this step, please advise)
> >  > > > >
> >  > > > >- Once the system is at the login prompt remove the floppy and
> >  > > > >reset node.
> >  > > > >
> >  > > > >- Run oscar_wizard step 7 only
> >  > > > >
> >  > > > >- Run node_setup NODENAME
> >  > > > >
> >  > > > >Aren't we supposed to add the new host to the /etc/hosts file and
> >  > > > >update other files like the dhcpd.conf file before starting up glui?
> >  > > > >Do we still need to do an "upresources" and/or "upnodes", or maybe an
> >  > > > >upresourcesfast for a faster machine like the one we want to install?
> >  > > > >
> >  > > > >Any help would be appreciated...
> >  > > > >
> >  > > > >thanks,
> >  > > > >
> >  > > > >Gilbert Chavez
> >
> >
> >_______________________________________________
> >Oscar-users mailing list
> >[EMAIL PROTECTED]
> >https://lists.sourceforge.net/lists/listinfo/oscar-users
> >
> >
> >

_______________________________________________
Oscar-users mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/oscar-users
