Ok, I fucked up...

Pier Fumagalli Tue, 22 Oct 2002 17:12:35 -0700

While trying to help with Daedalus being overloaded (Greg Ames and I were
trying to split the load from Daedalus onto Nagoya), I _seriously_ screwed
up Nagoya... I'm sorry...


I first noticed that Nagoya was collapsing (or rather, the network was
collapsing) when traffic reached roughly 5 mbps, although the system itself
was doing just fine (load average 1.4 with GUMP running).

I figured it had something to do with the interfaces, and therefore had
Justyna change some cables over in the lab. After seeing that the
interfaces were still running at 10 mbps half-duplex (for some odd reason
that I still don't understand), I forced them to 100 mbps full-duplex
(the switch apparently supports it)...

Well, that was the root of all the problems... Forcing the interfaces to
100 mbps made the whole network around Nagoya collapse (I suspect, then,
that the switch is broken), including Nagoya's console...

Without access to the console, and with the network interfaces sending
random packets on the ethernet segment, the only possible solution was to
physically power off the system and hope that the interfaces would
auto-configure properly at boot-up...

Once that was done (thanks again, Justy), I had access to the (serial)
console, but Nagoya didn't want to boot properly: it wasn't seeing the
SCSI/fiber-optic disk array...

That's where I noticed that I fucked up big time... Some time ago (like 6
months), I removed a bunch of Solaris packages as documented by Sun
Blueprints ("Security through System Minimization"), and since everything
was working, I never actually thought that something bad could have
happened...

Well, what happened was that although the running Solaris kernel still had
the modules for the disk array in memory, they were no longer available on
disk, and therefore, major pain at the next reboot...

Now, I managed to restore the modules, reconfigure the system and get it
up and running once again, but I still haven't fixed the problem
afflicting the network...

Therefore, Nagoya is up and running as it was before, but it can't really
hold more than 10 mbps half-duplex traffic (therefore 5 mbps of real
bandwidth), because of some random stuff happening on the ethernet
segment...

I'm sorry if everything got really fucked up this afternoon. Hopefully the
situation will get back to normal once the various queues on the different
mail systems get flushed...

Justy is going to file a request with Sun's hardware support to try and
figure out why, out of the 155 mbps network we have available for Nagoya,
we can use only 5 mbps (and why the switch and/or Nagoya's interfaces are
behaving so strangely). In the meantime, if anyone has experience with a
"3Com SuperStack II 1000 Switch" and Sun HME interfaces, please let me
know...

Really sorry about what happened, but I'm a moron and that's more than what
I need to say...

    Pier


