On the network boot... I also receive mountd errors thereafter. Copying directories to the local /etc directory, for example, cannot be done either. No mounting is done at this point.
regards,
Gilbert

-----Original Message-----
From: Jeremy Enos [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 30, 2001 12:50 PM
To: Chavez, Gilbert R SITI-ITDSAS
Cc: [EMAIL PROTECTED]
Subject: RE: [Oscar-users] RE: adding nodes to the cluster

Is this on the network boot or the HD boot?

Jeremy

At 12:46 PM 10/30/2001 -0600, Chavez, Gilbert R SITI-ITDSAS wrote:
>Jeremy,
> I noticed that during the boot up session of the additional node I'm adding
>to the cluster, I received a message stating "fsck.nfs" not found. Where is
>it trying to find this fsck.nfs?
>
>regards,
>
>Gilbert
>
>-----Original Message-----
>From: Jeremy Enos [mailto:[EMAIL PROTECTED]]
>Sent: Monday, October 22, 2001 3:55 PM
>To: Chavez, Gilbert R SITI-ITDSAS; 'Richard C Ferri'; Chavez, Gilbert R SITI-ITDSAS
>Cc: [EMAIL PROTECTED]
>Subject: RE: [Oscar-users] RE: adding nodes to the cluster
>
>Hi Gilbert-
>You shouldn't need to build a custom kernel in all likelihood. I think you
>have rpm installed version 2.2.16-3 on your older nodes. You also
>shouldn't need to replace any of the kernels in /tftpboot/. Those are the
>universal "support everything" kernels that are only used for booting over
>the network for node builds. It sounds like you're getting booted
>successfully, and the problem is down the line from there.
>I'll leave the rest to Rich-
>
> Jeremy
>
>At 01:06 PM 10/22/2001 -0500, Chavez, Gilbert R SITI-ITDSAS wrote:
> >Rich,
> > Thanks for responding. When the cluster was built for us it was built using
> >8 nodes and 1 head node. These 8 node systems had the same architecture
> >using Dells (1gig memory, 866 MHz CPU, with 18gig SCSI disk drives).
> >This additional system I'm trying to install to the cluster is a SysteMax
> >with 1gig memory, 1.8 GHz CPU, and a 40gig IDE drive. As you can see, this
> >new system has a different architecture.
> >I added this system using glui, adding it to the existing group, but
> >created a new resource for the disk table (40gig instead of 18gig). I made
> >sure that this new node was listed in the proper files (/etc/MAC.info,
> >dhcpd.conf, c3.conf, etc.). However, I have not rebuilt a kernel for this
> >new machine. Do I need to rebuild a kernel? If so, how do I rebuild a
> >kernel for a cluster system and where do I put it? The cluster is running
> >Linux 6.2 and I know how to build a kernel on 7.1 (I recently attended a
> >Linux class). I'm assuming kernel rebuilding is the same between versions?
> >
> >Below is the disktable I'm using. This is the same disktable I'm using for
> >the 18gig drives, except that I changed the disk device from sda (in the
> >original disk table for the 18gig drives) to hda for IDE, but the system
> >keeps crashing while performing an rdev.
> >
> >What kernel am I using for the cluster systems? Is it the kernel in
> >/tftpboot/bzImage? If I have to rebuild the kernel, do I rebuild it on the
> >head node and copy it to /tftpboot/bzImage? There is also a file called
> >upramdisk163.ramdisk in /tftpboot/tar; do I need to rebuild this file too?
> >The head node is also a Dell, but the difference is this head node has two
> >866 MHz CPUs.
> >
> >I will try to boot the system in network mode and enter the rdev from the
> >command line as you suggested. Do I hit the "tab" key to get to where I can
> >enter the rdev command before it tries to boot from the hard drive?
> >
> >Any help would be appreciated.
> >
> >Gilbert
> >
> >Disk table under /usr/local/oscar/lui_sources
> >/dev/hda1 ext2 3 c y /boot
> >/dev/hda2 extended 2210 c n
> >/dev/hda5 ext2 2190 c n /
> >/dev/hda6 swap 20 c
> >nfs nfs /home rw 10.0.0.50
> >
> >-----Original Message-----
> >From: Richard C Ferri
> >Sent: Friday, October 19, 2001 1:21 PM
> >To: Chavez, Gilbert R SITI-IT-DSAS
> >Cc: [EMAIL PROTECTED]
> >Subject: Re: [Oscar-users] RE: adding nodes to the cluster
> >
> >Gilbert,
> > It seems like you're getting to the very end of the clone script (the
> >one that copies stuff from the server to client, and installs all the RPMs)
> >and then rdev is failing. rdev is doing something really simple -- it's
> >setting the root device for the kernel that's permanently installed on your
> >local harddrive. The command that is failing is line 469:
> >
> >rdev /mnt/boot/vmlinuz $rootpart
> >
> >where $rootpart is the root partition name (e.g. /dev/sda6).
> >
> >I am having some trouble understanding how all those nice RPMs got
> >installed and lilo ran, but then the rdev command failed. It is definitely
> >not normal to see all those errors on reboot. What is happening is that
> >when the kernel is loaded it doesn't know where to find its root file
> >system, and as we know, life is meaningless without root.
> >
> >I'd like to a) see what your disk partition file looks like and b) have you
> >run the rdev command on the node while it's still in network boot mode
> >(before you boot it from harddrive). If you can debug a little perl, put a
> >breakpoint on the rdev command in clone, and display what $rootpart is.
> >My guess is that somehow clone is confused about what the root partition is
> >named, and as a result rdev is failing, causing root not to get mounted on
> >reboot (thus all those nasty error messages).
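[Note: the check Rich asks for, finding out what $rootpart should be, can be sketched outside the clone script. The following is only a minimal illustration, not code from LUI or clone. It assumes the disk table columns are device, filesystem type, size, unit, bootable flag, and mount point, which is inferred solely from the table quoted in this thread; the function names are hypothetical.]

```python
# Sketch: derive the root partition from a LUI-style disk table and build
# the rdev command line that the clone script is reported to run.
# ASSUMPTION: six whitespace-separated columns per mounted partition
# (device, fstype, size, unit, bootable, mountpoint), inferred from the
# table quoted in this thread, not from LUI documentation.

DISK_TABLE = """\
/dev/hda1 ext2 3 c y /boot
/dev/hda2 extended 2210 c n
/dev/hda5 ext2 2190 c n /
/dev/hda6 swap 20 c
nfs nfs /home rw 10.0.0.50
"""


def find_root_partition(table_text):
    """Return the device whose mount point is '/', or None if absent."""
    for line in table_text.splitlines():
        fields = line.split()
        # Only local partitions with an explicit mount point have 6 fields.
        if len(fields) == 6 and fields[0].startswith("/dev/") and fields[5] == "/":
            return fields[0]
    return None


def rdev_command(table_text, kernel="/mnt/boot/vmlinuz"):
    """Build the argument list for: rdev <kernel> <rootpart>."""
    rootpart = find_root_partition(table_text)
    if rootpart is None:
        raise ValueError("no partition with mount point '/' in disk table")
    return ["rdev", kernel, rootpart]


if __name__ == "__main__":
    print(" ".join(rdev_command(DISK_TABLE)))
```

[With the table quoted above, the root entry is /dev/hda5, so that is the value to compare against whatever $rootpart turns out to be in clone.]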
> >
> >Rich
> >
> >Richard Ferri
> >IBM Linux Technology Center
> >[EMAIL PROTECTED]
> >845.433.7920
> >
> >"Chavez, Gilbert R SITI-IT-DSAS" <[EMAIL PROTECTED]> (by way of Jeremy
> >Enos <[EMAIL PROTECTED]>)@lists.sourceforge.net on 10/19/2001 12:59:05 PM
> >
> >Sent by: [EMAIL PROTECTED]
> >
> >To: [EMAIL PROTECTED]
> >cc:
> >Subject: [Oscar-users] RE: adding nodes to the cluster
> >
> >Well, I'm getting a little closer to succeeding on this one node. It's
> >giving me FITS! I'm so close I can smell it! Here's what it's doing now:
> >
> >- I used the exact cylinder numbers from the 18gig disk table (old table)
> >in the 40gig disk table file and succeeded, or at least I got past this
> >stage. I will tweak the numbers to the correct values later.
> >
> >- The boot process got farther and started loading the RPMs. However, I
> >received an error (listed below) regarding rdev. Do you have any clues to
> >what is causing this error?
> >
> >There are a lot of FAILED messages on the screen during the initial bootup
> >but the screen scrolls too fast for me to see what the failures are.
> >Is it normal to have these failures during the first go-around?
> >
> >Here are some messages I did see and was able to write down:
> >/etc/rc.d/rc5.d/S99local: /proc/sys/net/ipv4/ip_forward - no such file (BUT
> >ITS THERE ON THE SERVER)
> >/var/lib/nfs/etab - couldn't stat
> >nfssvc not supported
> >unable to open nfs
> >could open /mnt/etc/group
> > "" "" "" " " /passwd
> > " " " " " etc....
> >
> >Also, there is a message saying that the disk has 4870 cylinders and that
> >having more than 1024 may cause a problem.
> >
> >Excerpt from the node09.log file
> >
> >: about to read the client resource allocation table
> >: about to partition the harddrive
> >: about to execute part2 to partition the harddrive using /tar/40gig.disk as the file allocation table
> >: about to install rpms for an RPM type installation
> >: about to copy /tar/group.source to /mnt/etc/group
> >: about to copy /tar/myshadow.source to /mnt/etc/shadow
> >: about to copy /tar/rhosts.source to /mnt/root/.rhosts
> >: about to copy /tar/passwd.source to /mnt/etc/passwd
> >: about to copy /tar/gshadow.source to /mnt/etc/gshadow
> >: about to create the /etc/fstab in the permanent root file system
> >: about to copy any user exit scripts to /tmp/exit
> >: about to create the kernel system map
> >: about to create /etc/lilo.conf and run lilo
> >: the rdev command failed, exiting with error
> >
> >Any help would be appreciated.....
> >
> >regards,
> >
> >Gilbert
> >
> >-----Original Message-----
> >From: Jeremy Enos
> >Sent: Thursday, October 11, 2001 4:27 PM
> >To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
> >Subject: RE: adding nodes to the cluster
> >
> >Yep... looks like you're getting booted ok, but the "clone" script is
> >having trouble parsing the disktable file. You may want to just use one of
> >the samples in OSCAR-1.0/oscarResources/.
> >You can edit an already created resource directly and probably save
> >yourself some overhead. The resource files are generated in /tftpboot/tar/.
> >Do you know if you have the same ethernet adapter in your new systems as
> >your old systems? If you're getting booted with a floppy disk, then I
> >suspect the NIC is the same or it wouldn't have worked. While network
> >booted, the universal, support everything kernel that is used will spew
> >many error messages that don't mean anything.... I'm not sure about the
> >network card errors you're seeing though.
> >I'd continue trying the way you're going though, because the problem
> >you're running into right now seems to be with parsing that disktable file.
> >
> > Jeremy
> >
> >At 04:09 PM 10/11/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
> > >Thanks for responding. How do you build a new etherboot diskette and
> > >ramdisk? I'm able to boot the new node but I received an error (see
> > >below) that the disk table is bad. Is this bad because of the etherboot
> > >disk you suggested? Also, every once in a while I receive "eth0:card not
> > >receiving RX buffer" and "eth0:card no receiving resources". Maybe this
> > >is due to the etherboot floppy we are using.
> > >
> > >: about to execute part2 to partition the harddrive using /tar/40gig.disk as the file allocation table
> > >: an error occurred during disk partitioning, exiting with error
> > >
> > >regards,
> > >
> > >Gilbert
> > >
> > >-----Original Message-----
> > >From: Jeremy Enos
> > >Sent: Thursday, October 11, 2001 3:34 PM
> > >To: Chavez, Gilbert R SITI-IT-DSAS
> > >Subject: RE: adding nodes to the cluster
> > >
> > >All PXE capable ethernet cards should work just fine with pxelinux.bin
> > >(unless that card's PXE support is bad). You should only be using the
> > >tagged image if you don't have working PXE support, and you boot from a
> > >floppy. (I think this is what we did on the original nodes.) Now, that
> > >floppy that we generated is specific to the ethernet card in the original
> > >nodes. If the new nodes have a different card, then you will have to
> > >generate a new etherboot floppy.
> > >
> > > Jeremy
> > >
> > >At 06:59 PM 10/10/2001 -0500, you wrote:
> > > >Well, I tried to boot from pxelinux.bin and with tagged, but no luck.
> > > >With pxelinux.bin the system tries to boot but tells me that
> > > >"pxelinux.bin" is a wrong image tag. Was this feature tested when you
> > > >were here?
> > > >The pxelinux.bin file is a data file and the tagged file is an "x86
> > > >boot sector". I think the problem is with the data file pxelinux.bin;
> > > >shouldn't it be a boot sector like the tagged file? How do I get the
> > > >correct pxelinux.bin file? Can I download it? Trying to boot with
> > > >tagged I receive "eth0: found no sources on card" and "no RX buffer"
> > > >error messages. Have you seen these errors before?
> > > >
> > > >regards,
> > > >
> > > >Gilbert
> > > >
> > > >-----Original Message-----
> > > >From: Jeremy Enos
> > > >Sent: Wednesday, October 10, 2001 5:43 PM
> > > >To: Chavez, Gilbert R SITI-IT-DSAS; Chavez, Gilbert R SITI-IT-DSAS
> > > >Subject: RE: adding nodes to the cluster
> > > >
> > > >Hi Gilbert-
> > > >Sorry it's taking me so long... I'm at a conference in LA all week.
> > > >Anyway...
> > > >Sounds like you're doing pretty well... you basically just need to make
> > > >new resources and groups for the new machines, and go from there with
> > > >building them.
> > > >In the dhcpd.conf file... pxelinux.bin is used if you're network
> > > >booting with the PXE boot rom on the card. tagged is used if you're
> > > >using an etherboot floppy.
> > > >The error you see about gdm is normal while you're booted on an NFS
> > > >mounted filesystem (network booted). I'm not sure what effect changing
> > > >those permissions might have.
> > > >Let me know how things progress...
> > > >
> > > > Jeremy
> > > >
> > > >At 05:30 PM 10/10/2001 -0500, Chavez, Gilbert R SITI-IT-DSAS wrote:
> > > > >Jeremy,
> > > > > Per my voice mail to you, I tried to add the new PC to the cluster.
> > > > >I figured out some things, like defining a machine, allocating
> > > > >resources, deallocating resources, etc. I also created a disk table
> > > > >for a 40gig disk and allocated it to the new PC. I tried so many
> > > > >things to get the new PC working but to no avail.
> > > > >I once got it working to where I could at least log into the node as
> > > > >root, but it complained about ownership on /var/gdm. After I
> > > > >corrected the ownership on /var/gdm, the screen on the new node went
> > > > >blank and the system was in a hung state. I looked at the log file
> > > > >/tftpboot/lim/log/node09.log and noticed that it complained that the
> > > > >disk partitioning was not correct. I corrected the disktable file and
> > > > >tried to reboot, but the system did not boot up properly at this
> > > > >point.
> > > > >
> > > > >I tried to use PXE and a floppy diskette install but had no luck. One
> > > > >thing to mention, which may have messed things up, is that within the
> > > > >oscar wizard I clicked on step 6. I looked at the scripts that start
> > > > >with step 6 and it points to the pre_install.part2 script. I noticed
> > > > >this script made changes to the dhcpd.conf file under /etc and set
> > > > >the filename for all nodes to /tftpboot/pxelinux.bin instead of
> > > > >/tftpboot/tagged. So, I updated the file manually and put everything
> > > > >back the way it was in the dhcpd.conf file. I noticed that the dhcpd
> > > > >daemon was not running so I restarted it. Is pxelinux.bin for PCs
> > > > >that are PXE ready? Can I use this for the new PC? Is there anything
> > > > >critical that step 6 (pre_install.part2) changes that I need to be
> > > > >concerned about? I have listed the pre_install.part2 script below.
> > > > >
> > > > >I'm going to remove the new node completely and start over. I know
> > > > >I'm real close to getting this system installed.
> > > > >
> > > > >Any help would be appreciated.......Please get back with me as soon
> > > > >as you can.
> > > > >
> > > > >thanks,
> > > > >
> > > > >Gilbert
> > > > >
> > > > >[root@pleiades scripts]# more pre_install.part2
> > > > >#!/bin/sh
> > > > >
> > > > ># pre_install.part2 - script to do part2 of the pre-client-install server setup
> > > > ># Last Updated 11/16/00 by Michael Brim ([EMAIL PROTECTED])
> > > > >
> > > > ># Install C3 Tools & Supporting Programs/Files
> > > > >
> > > > > echo "Installing C3 Tools"
> > > > > cd ../c3
> > > > > ./lui_to_ORNL -l /tftpboot/lim -ORNL /etc/ORNLcluster.def
> > > > > ./c3_install /etc/ORNLcluster.def /tftpboot/pxelinux.bin
> > > > >
> > > > ># Install PBS Server
> > > > >
> > > > > echo "Installing PBS Server RPM"
> > > > > cd ../pbs
> > > > > ./pbs_server_install
> > > > >
> > > > ># Done
> > > > >
> > > > > echo
> > > > > echo "Server Pre-Client-Install Complete - Begin booting client nodes"
> > > > >
> > > > >-----Original Message-----
> > > > >From: Jeremy Enos
> > > > >Sent: Tuesday, October 09, 2001 8:32 PM
> > > > >To: Chavez, Gilbert R SITI-IT-DSAS
> > > > >Subject: Re: adding nodes to the cluster
> > > > >
> > > > >I can give you some help with this... I've got to run right now
> > > > >though. I'll send you something later.
> > > > >
> > > > > Jeremy
> > > > >
> > > > >At 06:42 PM 10/9/2001 -0500, you wrote:
> > > > > >Jeremy,
> > > > > > We want to add more systems to the cluster we have here at Shell.
> > > > > >For now, we have one system that we want to add to the cluster for
> > > > > >testing purposes, but the architecture is different. This machine
> > > > > >is a clone (Systemax) which has a 1.8 GHz processor with a 40gig
> > > > > >disk. Since this PC is different from our other cluster systems
> > > > > >(Dell), what all do we need to do to get this machine added to the
> > > > > >cluster properly (such as disktables, etc)?
> > > > > >
> > > > > >Our install procedures also need to be updated.
> > > > > >I have listed them below; please let me know what step(s) are
> > > > > >missing for a new install.
> > > > > >
> > > > > >- Run glui and define a machine
> > > > > >
> > > > > >- Define a group, then allocate resources for the node, then boot
> > > > > >the node (I don't remember how to do this step, please advise)
> > > > > >
> > > > > >- Once the system is at the login prompt, remove the floppy and
> > > > > >reset the node.
> > > > > >
> > > > > >- Run oscar_wizard step 7 only
> > > > > >
> > > > > >- Run node_setup NODENAME
> > > > > >
> > > > > >Aren't we supposed to add the new host to the /etc/hosts file and
> > > > > >update other files like the dhcpd.conf file before starting up
> > > > > >glui? Do we still need to do an "upresources" and/or "upnodes", or
> > > > > >maybe an upresourcesfast for a faster machine such as the one we
> > > > > >want to install?
> > > > > >
> > > > > >Any help would be appreciated...
> > > > > >
> > > > > >thanks,
> > > > > >
> > > > > >Gilbert Chavez
> >
> >_______________________________________________
> >Oscar-users mailing list
> >[EMAIL PROTECTED]
> >https://lists.sourceforge.net/lists/listinfo/oscar-users
