I have a server running a vanilla 2.4.26 SMP kernel with Adaptec SCSI RAID5 and
reiserfs. It has been running fine for about 9 months.
Recently (in the last 3 days) I started doing some backups with the most current
rsync (2.6.3), just reading from the disk in this machine and writing to a
remote machine. That's the only thing I can think of that has changed, besides
adding an ntp server, but I am assuming that's not it. While this rsync job was
running, there was also a job reading and translating data from a samba share.
That job has been running since the machine was installed, though, and never
caused a problem.
This morning the machine locked up. It still responded to pings, but everything
else seemed completely dead. Checking the error logs after reboot, I see this:
Oct 21 04:15:04 hsa10 kernel: sd(8,3):vs-13060: reiserfs_update_sd: stat data of
object [25333038 23882091 0x0 SD] (nlink == 1) not found (pos 33)
Oct 21 04:15:04 hsa10 last message repeated 3 times
Oct 21 04:15:04 hsa10 kernel: sd(8,3):PAP-12350: do_balance: insert_size == 0,
mode == p
Oct 21 04:15:04 hsa10 kernel: 4sd(8,3):vs-13060: reiserfs_update_sd: stat data
of object [25333038 23882091 0x0 SD] (nlink == 1) not found (pos 33)
Oct 21 04:15:04 hsa10 kernel: sd(8,3):vs-13060: reiserfs_update_sd: stat data of
object [25333038 23882091 0x0 SD] (nlink == 1) not found (pos 33)
This goes on and on until an hour and 40 minutes later, when the logs end.
The numbers [25333038 23882091 0x0 SD] and the code vs-13060 do not change in
the rest of the log.
There are other messages mixed in that aren't exactly like the above, such as:
Oct 21 04:18:00 hsa10 kernel: sd(8,3):PAP-5660: reiserfs_do_truncate: wrong
result -1 of search for [25333038 23882091 0xfff DIRECT]
I also noticed that all these retries happen at around 00 seconds after the
minute (top of the minute).
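To check the timing pattern properly rather than by eye, something like the
following could tally the messages by reiserfs error code and by the seconds
field of the timestamp. This is only a minimal sketch: the regular expression is
a guess at the line layout based on the excerpts above, the function name
summarize is made up for illustration, and the sample lines are the ones quoted
above with the line wraps undone.

```python
import re
from collections import Counter

# Assumed layout: "Mon DD HH:MM:SS host kernel: [junk]sd(maj,min):CODE: message"
# The optional junk covers the stray "4" seen before "sd(8,3)" in one line.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s+\S+\s+kernel:\s+"
    r".*?sd\(\d+,\d+\):\s*([A-Za-z]+-\d+):"
)


def summarize(lines):
    """Count reiserfs messages per error code and per timestamp second."""
    codes = Counter()
    seconds = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            # e.g. "last message repeated 3 times" has no error code
            continue
        codes[m.group(4)] += 1
        seconds[int(m.group(3))] += 1
    return codes, seconds


# Sample lines copied from the log excerpts above (wraps joined).
SAMPLE = [
    "Oct 21 04:15:04 hsa10 kernel: sd(8,3):vs-13060: reiserfs_update_sd: "
    "stat data of object [25333038 23882091 0x0 SD] (nlink == 1) not found (pos 33)",
    "Oct 21 04:15:04 hsa10 last message repeated 3 times",
    "Oct 21 04:15:04 hsa10 kernel: sd(8,3):PAP-12350: do_balance: "
    "insert_size == 0, mode == p",
    "Oct 21 04:15:04 hsa10 kernel: 4sd(8,3):vs-13060: reiserfs_update_sd: "
    "stat data of object [25333038 23882091 0x0 SD] (nlink == 1) not found (pos 33)",
    "Oct 21 04:18:00 hsa10 kernel: sd(8,3):PAP-5660: reiserfs_do_truncate: "
    "wrong result -1 of search for [25333038 23882091 0xfff DIRECT]",
]

if __name__ == "__main__":
    codes, seconds = summarize(SAMPLE)
    print("by code:  ", dict(codes))
    print("by second:", dict(seconds))
```

To run it against a real log, pass an open file instead of SAMPLE, for example
summarize(open(path)) where path is wherever syslog writes kernel messages on
the machine in question.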
So what should I do, and what could cause this? The server has a write cache
(which apparently can lead to this kind of error during power loss), but it is
battery backed, so that shouldn't be an issue (according to what I read).
On reboot, dmesg shows:
reiserfs: found format 3.6 with standard journal
reiserfs: checking transaction log (device sd(8,3)) ...
for (sd(8,3))
reiserfs: replayed 63 transactions in 0 seconds
sd(8,3):Using r5 hash to sort names
I see that some people who have gotten this type of error were told to rebuild
the filesystem. Is that suggested in all cases? I don't want to have to do that
if it's not needed. We do have backups, but even so ...
The server has been rebooted and seems to be running fine now.
Any help is appreciated.
brian