eek, a nasty state of affairs... here's hoping it'll stay up.

------- Forwarded Message

Date:    Thu, 01 May 2008 03:54:02 -0700
From:    Joe Schaefer <[EMAIL PROTECTED]>
To:      [EMAIL PROTECTED]
Subject: SVN commit restored

The host formerly known as eris served the ASF's
subversion repository for a bit over 2 years.  The
machine was one of 4 IBM x345's that Sam Ruby was
able to convince his employer to loan to us, outfitted
with a ServerRaid 6i card purchased by the ASF.  If 
you can recall what svn service was like just prior 
to bringing eris online,  this week's difficulties 
should ring a bell.

Eris served the foundation exceptionally well, with 
only minor service outages throughout its lifespan.  The
infra team was aware that eris had been struggling
lately, and part of our purchasing plan included 
replacing all the x345's on loan.  The final 
replacement machines, 3 Dell 2950s, arrived late
Friday.  As luck would have it, eris's first 
major failure happened Saturday morning- the 
RAID array had become non-responsive.  We 
were able to bring eris back online for the 
weekend, but a repeat failure of the RAID 
array happened late Sunday, at which point
we had little choice other than to disable 
commits and resync the backups - at least 
until we could rack the new Dells and 
bring one of them online ASAP.

Bringing a new machine online to replace 
eris has been fraught with setbacks.  Initially
we made several botched bog-standard
installations, which we surmised were caused
by a bad disk.  After replacing the disk, we
were able to bring up a fully functional 
replacement for eris, only to have
it suffer an unrecoverable crash under load
testing.  At this point, we were no longer certain
that the cause of our initial failures was limited
to a bad disk, but started suspecting a potential
flaw in the PERC6i controller we were using.

Nevertheless we dutifully reinstalled everything 
on the same host, working in tandem and around the
clock.  We were able to bring the machine back online,
load test it with rsync, and announce to committers@
on Tuesday that commits had been reactivated. 
Unfortunately the machine crashed after less than 24
hours of uptime, leaving our zfs array in a degraded
state.  We then started a vicious cycle of reboots,
retuning sysctls, and recompiling kernels, until we
had reached the point where the machine would no
longer boot on its own - deep, dark kernel loader voodoo
was necessary to get the machine back up.  That's
when my note went out to committers@ stating that
commits were disabled again while the infra team
considered our options.

The first thing we pursued at that point was to
resync our backups again, followed by a plan to
swap out the working disks into another Dell chassis.
We did that, but were still stuck with a machine that
would not boot on its own.  Recompiling a debugging
kernel at that point was fruitful- we had overshot
on one of our sysctls and that was causing the panics.
After backing it off to what we had in the original
config, the host was able to survive a reboot
unattended and has been sufficiently load tested now
that I'm comfortable making the following statement:

   Subversion commit access has been (tentatively!) 
   restored.

I say tentatively because this is the same config we
were using when this installation first crashed, so
there may be brief periods of downtime while we
recover.  But I don't expect any unrecoverable 
errors at this point.

Kudos to the many, many hours of volunteer service 
put forth by the infrastructure team and OSUOSL 
in this effort.  We stuck with the original host
name eris on the new Dell, so suffice it to say

   eris is dead! long live the new eris!





------- End of Forwarded Message
