eek, a nasty state of affairs... here's hoping it'll stay up.
------- Forwarded Message
Date: Thu, 01 May 2008 03:54:02 -0700
From: Joe Schaefer <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: SVN commit restored
The host formerly known as eris served the ASF's
subversion repository for a bit over 2 years. The
machine was one of 4 IBM x345's that Sam Ruby was
able to convince his employer to loan to us, outfitted
with a ServeRAID 6i card purchased by the ASF. If
you can recall what svn service was like just prior
to bringing eris online, this week's difficulties
should ring a bell.
Eris served the foundation exceptionally well, with
only minor service outages throughout its lifespan. The
infra team was aware that eris had been struggling
lately, and part of our purchasing plan included
replacing all the x345's on loan. The final
replacement machines, 3 Dell 2950s, arrived late
Friday. As luck would have it, eris's first
major failure happened Saturday morning - the
RAID array had become non-responsive. We
were able to bring eris back online for the
weekend, but a repeat failure of the RAID
array happened late Sunday, at which point
we had little choice other than to disable
commits and resync the backups - at least
until we could rack the new Dells and
bring one of them online ASAP.
Bringing a new machine online to replace
eris has been fraught with setbacks. Initially
we made several botched bog-standard
installations, which we surmised were caused
by a bad disk. After replacing the disk, we
were able to bring up a fully functional
replacement for eris, only to have
it suffer an unrecoverable crash under load
testing. At this point, we were no longer certain
that the cause of our initial failures was limited
to a bad disk, but started suspecting a potential
flaw in the PERC6i controller we were using.
Nevertheless we dutifully reinstalled everything
on the same host, working in tandem and around the
clock. We were able to bring the machine back online,
load test it with rsync, and announce to committers@
on Tuesday that commits had been reactivated.
Unfortunately the machine crashed after less than 24
hours of uptime, leaving our zfs array in a degraded
state. We then started a vicious cycle of reboots,
retuning sysctls, and recompiling kernels, until we
had reached the point where the machine would no
longer boot on its own - deep, dark kernel loader voodoo
was necessary to get the machine back up. That's
when my note went out to committers@ stating that
commits were disabled again while the infra team
considered our options.
The first thing we pursued at that point was to
resync our backups again, followed by a plan to
swap out the working disks into another Dell chassis.
We did that, but were still stuck with a machine that
would not boot on its own. Recompiling a debugging
kernel at that point was fruitful - we had overshot
on one of our sysctls and that was causing the panics.
After backing it off to what we had in the original
config, the host was able to survive a reboot
unattended and has been sufficiently load tested now
that I'm comfortable making the following statement:
Subversion commit access has been (tentatively!)
restored.
I say tentatively because this is the same config we
were using when this installation first crashed, so
there may be brief periods of downtime while we
recover. But I don't expect any unrecoverable
errors at this point.
Kudos to the many, many hours of volunteer service
put forth by the infrastructure team and OSUOSL
in this effort. We stuck with the original host
name eris on the new Dell, so suffice it to say
eris is dead! long live the new eris!
------- End of Forwarded Message