Rustam wrote:
Today my production server crashed 4 times. THIS IS NIGHTMARE!
Self-healing file system?! For me ZFS is SELF-KILLING filesystem.
I cannot fsck it, there's no such tool. I cannot scrub it, it crashes
30-40 minutes after scrub starts. I cannot use it, it crashes a
number of times every day! And with every crash number of checksum
failures is growing:
NAME   STATE   READ  WRITE  CKSUM
box5   ONLINE     0      0      0
...after a few hours...
box5   ONLINE     0      0      4
...after a few hours...
box5   ONLINE     0      0     62
...after another few hours...
box5   ONLINE     0      0    120
...crash! and we start again...
box5   ONLINE     0      0      0
...etc...
Actually, 120 is the record; sometimes it crashes as soon as it boots.
And there's always a permanent error:
errors: Permanent errors have been detected in the following files:
        box5:0x0
and very wise self-healing advice: http://www.sun.com/msg/ZFS-8000-8A
Restore the file in question if possible. Otherwise restore the
entire pool from backup.
Thanks, but if I restore it from backup it won't be ZFS anymore,
that's for sure.
That's a bit harsh. ZFS is telling you that you have corrupted data
based on the checksums. Other types of filesystems would likely simply
pass the corrupted data on silently.
It's not an I/O problem. AFAIK, the default ZFS I/O error behavior is
to wait for repair (I'm on 10U4, where it is non-configurable). So why
does it panic?
Do you have the panic messages? ZFS won't panic because of bad
checksums. By default it will panic if it can't write data out to any
device, if it completely loses access to non-redundant devices, or if
it loses both redundant devices at the same time.
Recently there were discussions on failure of OpenSolaris community.
Now it's been more than half a month since I reported such an error.
Nobody even posted something like RTFM. Come on guys, I know you
are there and busy with enterprise customers... but at least give me
some troubleshooting ideas. i'm totally lost.
just to remind, it's heavily loaded fs with 3-4 million files and
folders.
Link to original post:
http://www.opensolaris.org/jive/thread.jspa?threadID=57425
This seems to show the same number of checksum errors across 2
different channels and 4 different drives. Given that, I'd assume that
this is likely a dual-channel HBA of some sort. It would appear that
you either have bad hardware or some sort of driver issue.
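
If you want to check for that pattern mechanically, here is a minimal
Python sketch. It assumes the usual NAME/STATE/READ/WRITE/CKSUM column
layout of `zpool status` and Solaris-style cXtYdZ disk names. The
helper names (parse_rows, shared_fault_suspected) are hypothetical, and
the device names and counts in SAMPLE are made up for illustration;
they are not taken from your pool.

```python
import re

# One device row of `zpool status`: NAME STATE READ WRITE CKSUM.
# Assumption: counters are plain integers (abbreviated values such as
# "1.2K" are not handled by this sketch).
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s*$"
)

# Assumption: leaf disks use Solaris-style names such as c1t0d0 or c1t0d0s0.
LEAF = re.compile(r"^c\d+(t\d+)?d\d+(s\d+)?$")


def parse_rows(text):
    """Return {device_name: cksum_count} for every device row in text."""
    counts = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            counts[m.group(1)] = int(m.group(5))
    return counts


def shared_fault_suspected(counts):
    """True if all leaf disks report the same nonzero checksum count.

    Identical counts on every disk point at something the disks have in
    common (controller, driver, memory) rather than at a single drive.
    """
    leaves = [n for name, n in counts.items() if LEAF.match(name)]
    return len(leaves) > 1 and leaves[0] > 0 and len(set(leaves)) == 1


# Hypothetical example output: two channels (c1, c2), four drives.
SAMPLE = """
        NAME        STATE     READ WRITE CKSUM
        box5        ONLINE       0     0   120
          mirror    ONLINE       0     0   120
            c1t0d0  ONLINE       0     0   120
            c2t0d0  ONLINE       0     0   120
          mirror    ONLINE       0     0   120
            c1t1d0  ONLINE       0     0   120
            c2t1d0  ONLINE       0     0   120
"""

if __name__ == "__main__":
    print(shared_fault_suspected(parse_rows(SAMPLE)))
```

This is only a heuristic. Matching counts make a shared component a
good suspect, but they don't prove it.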
Regards,
Phil
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss