Sorry, here's the issue:

Almost two years ago I loaded a cluster of 33 Dell PowerEdge 1750s with
RedHat 9 and OSCAR 3.  Initially, I couldn't get the nodes to network
boot.  Searching through this list, I found Jason Hlady's post
concerning issues he was having loading a batch of 1750s and the fact
that he'd gotten an updated kernel from Frank Crawford which had worked
on some of his 1750s.  Jason was kind enough to share this tarball with
me and it worked great.  

Now, what we actually purchased at that time was 38 Dell PowerEdge
1750s.  Five of them were kept separate as a small test cluster to test
changes, etc.  I've loaded this cluster multiple times with the same
RedHat 9 and OSCAR 3 using Frank's updated kernel.  It has always
worked.  Well, in troubleshooting a kernel issue with some of our
software, we decided to take the four nodes from my test cluster and add
them to the original 32 node cluster (33 actually, but one is the
headnode).  This is where the problem begins.

I ran install_cluster eth0 (eth0 is connected to the cluster gb ethernet
back end) and added four nodes.  When I went to network boot these
nodes, it did not work.  Here are some of the errors...

(stuff omitted)
tg3: (02:00.0) phy probe failed, err -16
tg3: problem fetching invariants of chip, aborting
tg3: (02:00.1) phy probe failed, err -16
tg3: problem fetching invariants of chip, aborting
(stuff omitted)
FusionMPT base driver 2.03.00
mptbase: Initiating ioc0 bringup
mptbase: ioc0: WARNING: unexpected doorbell active
mptbase: ioc0: ERROR: doorbell ACK timeout (2)
(mptbase stuff repeats a couple times)

Kernel panic....

Some searching on the users list immediately brought me back to Jason
Hlady's problem.   I read through the string of messages, but didn't see
any solution other than to reload the cluster.  This isn't an option for
me.  

I removed the kernel and initrd.img from /tftpboot and verified that
without them the nodes don't get a kernel...so I know they are being
used.  It's just really strange that these have worked over and over for
me but suddenly on this cluster they've quit working.  I don't get it.

Anyway...if there's anything you suggest checking please let me know.  I
have Frank's tarball in /usr/share/systemimager/boot/i386/standard/ and
of course the kernel and initrd.img are in /tftpboot.  It just doesn't
work now for some reason.
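
One way to rule out a mismatch between the two copies is to compare
checksums.  Below is a minimal Python sketch, not something from the
thread: it assumes the files are named kernel and initrd.img in both
directories (the names under the SystemImager directory are an
assumption) and that a Python 3 interpreter is available, which a stock
RedHat 9 headnode will not have.  The helper names file_digest and
compare_boot_files are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_boot_files(dir_a, dir_b, names=("kernel", "initrd.img")):
    """Map each file name to 'same', 'different', or 'missing'."""
    result = {}
    for name in names:
        a, b = Path(dir_a) / name, Path(dir_b) / name
        if not (a.is_file() and b.is_file()):
            # At least one directory lacks this file (assumed file names).
            result[name] = "missing"
        elif file_digest(a) == file_digest(b):
            result[name] = "same"
        else:
            result[name] = "different"
    return result


if __name__ == "__main__":
    # Paths taken from the message above; file names are assumptions.
    report = compare_boot_files(
        "/tftpboot", "/usr/share/systemimager/boot/i386/standard"
    )
    for name, status in report.items():
        print(f"{name}: {status}")
```

A "different" or "missing" result would suggest the nodes are being
served something other than the files from Frank's tarball; "same" for
both would point the search elsewhere.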

John
-----Original Message-----
From: Bernard Li [mailto:[EMAIL PROTECTED] 
Posted At: Thursday, February 09, 2006 10:49 PM
Posted To: OSCAR
Conversation: [Oscar-users] Upgrading OSCAR Cluster
Subject: RE: [Oscar-users] Upgrading OSCAR Cluster


Hi John:

boel_binaries.tar.gz is located in
/usr/share/systemimager/boot/<arch>/standard and this gets pulled down
via rsync by SystemImager during node imaging.

Can you post your specific problem?  I get lost in the thread.

Cheers,

Bernard



From: OSCAR [mailto:[EMAIL PROTECTED]
Sent: Thu 09/02/2006 07:46
To: Bernard Li; [email protected]
Subject: RE: [Oscar-users] Upgrading OSCAR Cluster


Thanks Bernard.  I think this is what I'll end up doing long term.  We
have another cluster running OSCAR 4 and RHEL 3; with the support for
CentOS in later OSCAR versions, we'll definitely be switching to that.

But in the meantime I need to limp this cluster through the end of the
project it's currently for.  Do you have any thoughts for the other
thread I posted?  I'm running RH9 with OSCAR 3 and I am getting the
exact issue discussed in that thread.  Unfortunately, starting over
isn't an option.  I need to get these nodes loaded to test the new
kernel before I push it out to the rest of the cluster....but for some
reason the nodes don't get the right kernel when they pxe boot.  I can't
find where they are getting it from...it doesn't appear to be the one in
/tftpboot because that's the one I used to successfully load the
cluster initially.  (it's the boel_binaries one discussed in the
thread).  

Any and all thoughts are appreciated.  Thanks in advance for your
support.

John
-----Original Message-----
From: Bernard Li [mailto:[EMAIL PROTECTED] 
Posted At: Thursday, February 09, 2006 1:42 AM
Posted To: OSCAR
Conversation: Upgrading OSCAR Cluster
Subject: RE: [Oscar-users] Upgrading OSCAR Cluster


Hi John:

There is currently no upgrade path for OSCAR - i.e. if you want to
upgrade, you'll have to re-install the OS (on the headnode), re-install
OSCAR, re-create the images and re-deploy your compute nodes.

You don't have to do this on the production cluster.  If you have 2 spare
computers (hopefully with similar hardware as your cluster nodes), you
can build a test cluster and create/tweak your images before you perform
this on your production cluster.  You probably also want to back up your
user files, fstab as well as other configuration settings from Ganglia
and/or TORQUE, etc.

CentOS 3 is based on RHEL3 and should be quite similar to Red Hat Linux
9 - if you want a "newer" distribution I would recommend at least CentOS
4 (provided that your other software works under this OS).  Do
note that CentOS 4 runs a 2.6 kernel, whereas both CentOS 3 and RHL9 run
a 2.4 kernel.

Cheers,

Bernard



From: [EMAIL PROTECTED] on behalf of OSCAR
Sent: Wed 08/02/2006 12:04
To: [email protected]
Subject: [Oscar-users] Upgrading OSCAR Cluster


We're running OSCAR 3.0 on RedHat 9.  We're running into several issues
with some of our software and the solution seems to be upgrading to a
newer Linux distribution.  This cluster is shared by multiple projects
so I want to minimize the impact as much as possible.  What I'd like to
do is upgrade OSCAR, use it to build an image based on a newer
distribution, tweak the image with all the changes we've made for our
software, and then deploy it.  
I'm just starting to look at this, so if anyone has suggestions, please
let me know.  Can anybody point me to upgrade instructions?  Is it as
simple as just installing the latest OSCAR version and proceeding like a
new install?  Will the installation notice that I already have a cluster
deployed and pick up those settings?
Any advice on whether to upgrade the headnode OS?  Should I do this
first? 
I plan to use CentOS 3, primarily for compatibility with other systems
we use. 
All advice is greatly appreciated...I'll start digging around for this
information myself now too.  :)  
Thanks, 
John Artman 
CCNA, MCP, RHCE/CT 
Senior Systems Engineer 
ENSCO Inc. 


The information contained in this email message is intended only for the
use of the individuals to whom it is addressed and may contain
information that is privileged and sensitive. If the reader of this
message is not the intended recipient, you are hereby notified that any
dissemination, distribution or copying of this communication is strictly
prohibited. If you have received this communication in error, please
notify the sender immediately by email at the above referenced address.
Thank you.

_______________________________________________
Oscar-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/oscar-users