Hello,
I had a RH 6.0 server running qmail quite successfully for a few
months, until I decided to install daemontools and the
supervise/multilog/svscan/... tools.
I mainly followed the LWQ docs to get the daemontools installed and
configured, and things seemed ok. Mail was going in and out, no problem.
Until I once had to stop the service manually (qmail stop), and when I
tried restarting it (qmail start), multilog started to continuously echo
to the console, about every second, that it didn't have write
permissions (sorry, I don't have the exact error log right now). When
rebooting the system, however, qmail would run 'supervised' without a
problem. Only a manual 'qmail start' would cause it.
So I was in this situation for about three days, when this morning I
found an error log being echoed to the console non-stop, every second.
The error was:
Feb 17 21:03:45 www kernel: scsi0: MEDIUM ERROR on channel 0, id 0, lun 0, CDB: Read (10) 00 00 26 6b 38 00 00 02 00
Feb 17 21:03:45 www kernel: Current error sd08:05: sense key Medium Error
Feb 17 21:03:45 www kernel: Additional sense indicates Unrecovered read error
Feb 17 21:03:45 www kernel: scsidisk I/O error: dev 08:05, sector 2244648
Feb 17 21:03:45 www kernel: EXT2-fs error (device sd(8,5)): ext2_write_inode: unable to read inode block - inode=280580, block=1122324
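As a sanity check on the log above, the sector and block numbers can be
cross-checked against each other. This is only a minimal sketch: it
assumes 512-byte sectors and a 1 KB ext2 block size (neither is stated
in the log), and the helper name sector_to_block is made up for
illustration.

```python
import re

SECTOR_SIZE = 512    # assumption: standard 512-byte disk sectors
BLOCK_SIZE = 1024    # assumption: ext2 filesystem created with 1 KB blocks

# The two relevant kernel messages, with the wrapped lines rejoined.
log = (
    "scsidisk I/O error: dev 08:05, sector 2244648\n"
    "EXT2-fs error (device sd(8,5)): ext2_write_inode: unable to read "
    "inode block - inode=280580, block=1122324\n"
)

sector = int(re.search(r"sector (\d+)", log).group(1))
block = int(re.search(r"block=(\d+)", log).group(1))


def sector_to_block(sector, sector_size=SECTOR_SIZE, block_size=BLOCK_SIZE):
    """Convert a sector number to the filesystem block that contains it."""
    return sector * sector_size // block_size


# True if the SCSI error and the ext2 error point at the same block.
same_spot = sector_to_block(sector) == block
print(sector_to_block(sector), block, same_spot)
```

Under those assumptions the failing sector falls in exactly the block
ext2 complains about, which suggests both messages describe one
unreadable spot on the partition rather than two separate problems.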
I tried to telnet to the machine and take a look at it (impossible to
work on the console with all those messages being spat out every second)
but it didn't work, so I managed to do a soft reboot, after which the
/dev/sda5 filesystem (where /var is) was reported as a bad
filesystem: "Attempting to read block from filesystem resulted in short
read while trying to open /dev/sda5. Could this be a zero-length
partition?"
I managed to recover the filesystem by giving e2fsck an alternate
superblock. Reboot, and the SCSI I/O error comes up again. /var ends up
screwed up again. Another fix with e2fsck. Fixed. Reboot. Same I/O error.
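For anyone repeating the alternate-superblock recovery: the post does
not say which backup superblock was used, so the following is only a
sketch of where ext2 backups usually sit. It assumes the mke2fs default
of 8 * block_size blocks per group; the function name, the total_blocks
value, and the sparse flag are illustrative assumptions, not values
taken from this system.

```python
def is_power_of(n, base):
    """Return True if n is base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_superblocks(block_size=1024, total_blocks=2_000_000, sparse=True):
    """List likely backup superblock locations (in filesystem blocks).

    total_blocks is a hypothetical partition size. With sparse=True only
    group 1 and groups that are powers of 3, 5 and 7 hold a backup (the
    sparse_super layout); with sparse=False every group does.
    """
    blocks_per_group = 8 * block_size              # assumed mke2fs default
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division: number of block groups on the filesystem.
    groups = -(-(total_blocks - first_data_block) // blocks_per_group)
    locations = []
    for g in range(1, groups):
        if sparse and not any(is_power_of(g, b) for b in (3, 5, 7)):
            continue
        locations.append(first_data_block + g * blocks_per_group)
    return locations


print(backup_superblocks()[:5])
print(backup_superblocks(sparse=False)[:3])
```

A location from this list would be passed to e2fsck with its -b option,
for example 'e2fsck -b 8193 /dev/sda5'. The real locations for a given
filesystem are better read from 'dumpe2fs' or 'mke2fs -n' output than
guessed.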
Now, going back to the subject of the e-mail, I found out how to stop
those I/O errors. All I need to do is 'qmail stop' and the error goes
away, and the rest of the system continues working as usual -
except that no mail is delivered or accepted, of course.
Now I wouldn't say "daemontools did this" but rather, what the heck have
I done wrong to get this odd error? I suspect that the first problem I
experienced with multilog might be related, but I don't really know. My
next step would be to stop daemontools from managing qmail and leave
qmail on its own, but I really don't think this problem is impossible to
fix, and I don't want to give up on these tools just because I can't get
it right the first time. Though it is kind of scary when you start to
see these errors on a machine that otherwise has run almost flawlessly
up till now (fairly new system as well).
Any hint would be greatly appreciated!!
PS: I've run all sorts of checks on the hard drive for bad sectors, the
filesystem tables always show the right info, etc. I don't think it's a
hardware failure.