What did wreck the system? (Was "Re: Help: Login Failure")

Ralf Sun, 06 Jun 2004 05:00:32 -0500

Dear List of Experts :)

I confess to having entered 
(*) e2fsadm -L +10M /dev/vg_system/lv_var


without unmounting, and without getting the XXXXXXXX bar that usually 
indicates progress / success. At that point, /var was 98% full.

When I tried to log in from a Thin Client two hours later, the effect described 
above occurred (as if the password had been mistyped. Now I remember that a 
colleague described the same problem having occurred before!).

Checking the disk usage with df, /var was allegedly 101% used, and the 
absolute number of bytes in use was a large negative number! However, we 
could still browse /var in this state, although paging through some log files 
took strikingly long, so there was already a feeling of corruption.
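
A check like the following could have warned us before /var reached that 
state. It is only a minimal sketch: the function names usage_pct and 
check_usage and the 90% limit are my own assumptions, and it expects the 
single-line-per-filesystem output of "df -P" on stdin.

```shell
# usage_pct MOUNTPOINT
# Reads `df -P` output on stdin and prints the use percentage (without "%")
# of the line whose last field is MOUNTPOINT.
usage_pct() {
    awk -v mp="$1" 'NR > 1 && $NF == mp { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# check_usage MOUNTPOINT LIMIT
# Reads `df -P` output on stdin. Returns 1 and prints a warning when the
# use percentage is at or above LIMIT (a hypothetical threshold), 2 when the
# mount point is not listed, 0 otherwise.
check_usage() {
    pct=$(usage_pct "$1")
    if [ -z "$pct" ]; then
        echo "no entry for $1" >&2
        return 2
    fi
    if [ "$pct" -ge "$2" ]; then
        echo "WARNING: $1 is ${pct}% full"
        return 1
    fi
    echo "$1 is ${pct}% full"
}
```

Usage would be something like "df -P /var | check_usage /var 90", for example 
from a cron job that mails its output.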

Could it be that in this very state, even without the stupidity of (*), the 
overfull /var partition had led to corrupt ldap data, as there was no further 
way to write to /var/ldap (or whatever the exact location is)? 

Instead of backing up as much of /var as possible, we then unmounted /var 
and ran "fsck -fy /dev/vg_system/lv_var" on it (I regret the 'y'). After 
pages of fixing messages, we mounted /var again and found only a lost+found 
directory there. We managed to restore most of the data, but didn't get 
ldap running.

What do you think now:
[ ] The system got wrecked when /var ran out of space.
[ ] The system got wrecked when (*) was done.
[ ] The system got wrecked because fsck couldn't cope with the situation, as 
there was no free space on the drive.

Now, we have RC-3 running - and this is not too bad - but for further 
situations, one should learn some lessons. Please comment on those:

(1) As a matter of fact, /var ran out of space. This was due to two facts:
 (i) Considering that squid takes 100 MB of the 150 MB partitioned 
for /var, only 50 MB are left for logs AND ldap.
 (ii) All logs go to tjener's /var, even logs from attached workstations 
(this is what we believe, at least). Admittedly, our teacher's workstation is 
quite old and says "kernel: i8253 count too high! resetting" once per second! 
You can imagine how this message filled up /var/log/messages!
=> LESSON: Make /var larger, filter the above-mentioned message, trigger 
logrotate on size rather than on time.
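
To put numbers on (ii) and to illustrate the size-based rotation, here is a 
rough sketch. The function names, the 5M limit and the "rotate 4" count are 
assumptions of mine, and the stanza only helps if logrotate (and not some 
other cron script) is what rotates /var/log/messages on your system.

```shell
# noise_report PATTERN
# Reads a log on stdin and prints "<matching lines> <total lines>".
noise_report() {
    awk -v pat="$1" '$0 ~ pat { n++ } END { print n + 0, NR }'
}

# drop_noise PATTERN
# Reads a log on stdin and prints every line that does NOT match PATTERN.
# This filters after the fact; it does not stop syslog from writing the lines.
drop_noise() {
    grep -v -- "$1"
}

# logrotate_stanza LOGFILE SIZE
# Prints a logrotate stanza that rotates LOGFILE once it exceeds SIZE
# (for example 5M) instead of on a daily or weekly schedule.
logrotate_stanza() {
    cat <<EOF
$1 {
    size $2
    rotate 4
    compress
    missingok
}
EOF
}
```

For example, "noise_report 'i8253 count' < /var/log/messages" shows how much 
of the file that one message accounts for, and 
"logrotate_stanza /var/log/messages 5M" prints a stanza one could review 
before placing it in the logrotate configuration.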

(2) A full /var/log corrupts ldap!
=> LESSON: Put those on different partitions, and add /var/ldap (or whatever 
the path is) to the list of backup directories by default! (This was not the 
case with our system, was it, Klaus?)
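
Until the ldap directory is part of the regular backup, a dated tarball is 
better than nothing. This is only a sketch: backup_dir is a name I made up, 
and the source directory is passed as an argument because I do not know the 
real path. Note also that copying the files of a running LDAP server may give 
an inconsistent copy, so it is probably wise to stop the server first.

```shell
# backup_dir SRC DEST_DIR
# Packs the directory SRC into DEST_DIR/<name of SRC>-YYYYMMDD.tar.gz and
# prints the path of the archive. DEST_DIR should be on a different
# partition than SRC.
backup_dir() {
    src="$1"
    dest="$2"
    out="$dest/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest" &&
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" &&
    echo "$out"
}
```

It would be called as "backup_dir <ldap directory> <backup directory>", with 
both paths filled in for the local system.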

(3) LESSON: Never try to enlarge mounted partitions!
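
A small guard around the resize command would have stopped me at (*). The 
sketch below only prints the command instead of running it; is_mounted and 
guarded_resize are hypothetical helper names, and the optional last argument 
(a file in /proc/mounts format) exists just to make the check testable.

```shell
# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Succeeds if MOUNTPOINT appears as a mount point in MOUNTS_FILE
# (default: /proc/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# guarded_resize MOUNTPOINT DEVICE [MOUNTS_FILE]
# Refuses to go on while MOUNTPOINT is mounted; otherwise prints the resize
# command from (*) for DEVICE without executing it.
guarded_resize() {
    if is_mounted "$1" "${3:-/proc/mounts}"; then
        echo "refusing: $1 is still mounted, unmount it first" >&2
        return 1
    fi
    echo "ok to run: e2fsadm -L +10M $2"
}
```

Calling "guarded_resize /var /dev/vg_system/lv_var" while /var is mounted 
fails with a message instead of touching the volume.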

(4) LESSON: Never run fsck with the -y option on a full partition (rather, 
(re)move some files first and omit the -y switch)!
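
In hindsight, a more careful order of steps might have looked like the 
sketch below. It only prints the commands so they can be reviewed first; 
plan_fsck is a made-up name and the image path is a placeholder for a file 
on another disk with enough free space. I cannot say whether this would have 
saved our /var, only that nothing in it answers "yes" automatically.

```shell
# plan_fsck DEVICE IMAGE
# Prints (does not run) a cautious check sequence for an ext2/ext3 volume:
# unmount, raw copy to IMAGE, read-only check, then an interactive check.
plan_fsck() {
    dev="$1"      # block device holding the filesystem
    image="$2"    # hypothetical target file on ANOTHER disk
    echo "umount $dev"
    echo "dd if=$dev of=$image bs=1M"   # raw copy before any repair attempt
    echo "e2fsck -fn $dev"              # -n: open read-only, answer no to all
    echo "e2fsck -f $dev"               # interactive: confirm each fix yourself
}
```

For example, "plan_fsck /dev/vg_system/lv_var <image file>" lists the four 
steps for our /var volume.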

(5) LESSON: Always back up your system.

(6) LESSON: Don't do administration work under tight time constraints!

Please send me all your opinions: Which of these aspects do you think should 
be reported to bugzilla?

Regards
Ralf

On Friday, 04 June 2004 12:57, Frank Weißer wrote:
> The worst thing seems to be putting /usr on a too small LV, because to
> resize ext3, you need to umount it, but then you haven't got access to
> resize2fs any more :-(

As long as you refrain from installing X/KDE on your tjener, the given 
800MB should suffice :)
