[zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Thomas Nau
Dear all

As we wanted to patch one of our iSCSI Solaris servers, we had to offline 
the ZFS submirrors on the clients connected to that server. The devices 
connected to the second server stayed online, so the pools on the clients 
were still available, but in degraded mode. When the server came back 
up we onlined the devices on the clients, and the resilver completed pretty 
quickly as the filesystem is read-mostly (ftp, http server).
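
For reference, the sequence was roughly the following (pool and device 
names here are placeholders, not our actual configuration):

```shell
# Before patching the iSCSI server: take its half of each mirror offline
zpool offline tank c2t1d0

# ... patch and reboot the iSCSI server ...

# After the server is back: bring the device online; ZFS starts a resilver
zpool online tank c2t1d0

# Watch resilver progress and the per-device error counters
zpool status -v tank
```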

Nevertheless, during the first hour of operation after onlining we 
noticed numerous checksum errors on the formerly offlined device. We 
decided to scrub the pool, and after several hours we had about 3500 
errors in 600GB of data.
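
The scrub and the error counts came from the usual commands (again with a 
placeholder pool name):

```shell
# Re-verify every block in the pool against its checksums
zpool scrub tank

# Per-device READ/WRITE/CKSUM counters; -v also lists affected files
zpool status -v tank

# Once everything is repaired, reset the error counters
zpool clear tank
```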

I always thought that ZFS would sync the mirror immediately after bringing 
the device online, not requiring a scrub. Am I wrong?

Both servers and clients run s10u5 with the latest patches, but we 
saw the same behaviour with OpenSolaris clients.

Any hints?
Thomas

-
GPG fingerprint: B1 EE D2 39 2C 82 26 DA  A5 4D E0 50 35 75 9E ED
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Miles Nordin
 tn == Thomas Nau [EMAIL PROTECTED] writes:

tn Nevertheless during the first hour of operation after onlining
tn we recognized numerous checksum errors on the formerly
tn offlined device. We decided to scrub the pool and after
tn several hours we got about 3500 error in 600GB of data.

Did you use 'zpool offline' when you took them down, or did you
offline them some other way, like by breaking the network connection,
stopping the iSCSI target daemon, or 'iscsiadm remove
discovery-address ..' on the initiator?
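
For the record, the "some other way" I have in mind looks like this (the 
target address and service name are placeholders for whatever your setup 
uses):

```shell
# On the initiator: drop the target without telling ZFS first
iscsiadm remove discovery-address 192.168.1.10:3260

# Or, on the target host: stop the iSCSI target daemon out from under ZFS
svcadm disable iscsitgt
```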

This is my experience, too (but with old b71).  I'm also using iSCSI.
It might be a variant of this:

 http://bugs.opensolaris.org/view_bug.do?bug_id=6675685
 checksum errors after 'zfs offline ; reboot'

Aside from the fact that the checksum-errored blocks are silently not
redundant, it's also interesting because I think, in general, there
are a variety of things besides disk/cable/controller problems that can
cause checksum errors.  I wonder whether they're useful for diagnosing
disk problems only in very gently-used setups, or not at all?

Another iSCSI problem: for me, the targets I've 'zpool offline'd will
automatically ONLINE themselves when iSCSI rediscovers them.  But only
sometimes.  I haven't figured out how to predict when they will and
when they won't.




Re: [zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Thomas Nau
Miles

On Sat, 2 Aug 2008, Miles Nordin wrote:
 tn == Thomas Nau [EMAIL PROTECTED] writes:

tn Nevertheless during the first hour of operation after onlining
tn we recognized numerous checksum errors on the formerly
tn offlined device. We decided to scrub the pool and after
tn several hours we got about 3500 error in 600GB of data.

 Did you use 'zpool offline' when you took them down, or did you
 offline them some other way, like by breaking the network connection,
 stopping the iSCSI target daemon, or 'iscsiadm remove
 discovery-address ..' on the initiator?

We did a 'zpool offline', nothing else, before we took the iSCSI server 
down.


 Another iSCSI problem: for me, the targets I've 'zpool offline'd will
 automatically ONLINE themselves when iSCSI rediscovers them.  but only
 sometimes.  I haven't figured out how to predict when they will and
 when they won't.

I never experienced that one, but we usually don't touch any of the iSCSI 
settings as long as a device is offline; at least not unless we have to 
for some reason.

Thomas



Re: [zfs-discuss] checksum errors after online'ing device

2008-08-02 Thread Miles Nordin
 tn == Thomas Nau [EMAIL PROTECTED] writes:

tn I never experienced that one but we usually don't touch any of
tn the iSCSI settings as long as a devices is offline. At least
tn as long as we don't have to for any reason

Usually I do 'zpool offline' followed by 'iscsiadm remove
discovery-address ...'

This is for two reasons:

 1. At least with my old crappy Linux IET, it doesn't restore the
sessions unless I remove and add the discovery-address

 2. The auto-ONLINE-on-discovery problem.  Removing the discovery
address makes absolutely sure ZFS doesn't ONLINE something before
I want it to.
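
Spelled out, the ritual looks roughly like this (pool, device, and target 
address are placeholders):

```shell
# 1. Tell ZFS the device is going away on purpose
zpool offline tank c2t1d0

# 2. Make sure the initiator can't rediscover (and auto-ONLINE) the target
iscsiadm remove discovery-address 192.168.1.10:3260

# ... do the target-side maintenance ...

# 3. Re-add the discovery address so the sessions come back
iscsiadm add discovery-address 192.168.1.10:3260

# 4. Bring the device back only when you're ready for the resilver
zpool online tank c2t1d0
```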

If you have to do this maintenance again, you might want to try
removing the discovery address for reason #2.  Maybe when your iSCSI
target was coming back up, it bounced a bit.  So, as it came back up,
you might have done the equivalent of removing the target without
'zpool offline'ing first (and then immediately plugging it back in).

That's the ritual I've been using anyway.  If anything unexpected
happens, I still have to manually scrub the whole pool to seek out all
these hidden ``checksum'' errors.

Hopefully some day you will be able to just look in fmdump and see
``yup, the target bounced once as it was coming back up.''  And
targets will be able to bounce as much as they like with
failmode=wait, or for short reasonable timeouts with other failmodes,
and automatically do fully-adequate but efficient resilvers with
proper dirty-region logging without causing any latent checksum
errors.  And 'zpool offline'd devices will stay offline until reboot
as promised, and will never online themselves.  And iSCSI sessions
will always come up on their own without having to kick the initiator.

