Re: [zfs-discuss] Inconcistancies with scrub and zdb

2008-05-05 Thread Jonathan Loran


Jonathan Loran wrote:
> Since no one has responded to my thread, I have a question:  Is zdb 
> suitable to run on a live pool?  Or should it only be run on an exported 
> or destroyed pool?  In fact, I see that it has been asked before on this 
> forum, but is there a users guide to zdb? 
>
>   
Answering myself: I finally looked at the zdb source code, and I see that the
results of running it on a live pool are not consistent, hence the -L option.
OK, so I'm going to trust the scrub to tell me if there are errors, and as far
as I can tell, my pools are clean now.  But it was scary creating the mirror
from a pool with checksum errors.  I think there could be more verbosity about
what is going on, or some options offered to the user, when checksum errors
are found while resilvering a mirror for the first time.  Just a comment.

Thanks,

Jon

-- 


Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146 [EMAIL PROTECTED]
AST:7731^29u18e3
 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Inconcistancies with scrub and zdb

2008-05-05 Thread Jonathan Loran

Since no one has responded to my thread, I have a question:  Is zdb 
suitable to run on a live pool?  Or should it only be run on an exported 
or destroyed pool?  In fact, I see that it has been asked before on this 
forum, but is there a users guide to zdb? 

Thanks,

Jon


[zfs-discuss] Inconcistancies with scrub and zdb

2008-05-04 Thread Jonathan Loran

Hi List,

First of all:  S10u4 120011-14

So I have a weird situation.  Earlier this week, I finally mirrored two
iSCSI-based pools.  I had been wanting to do this for some time, because
the availability of the data in these pools is important.  One pool
mirrored just fine, but the other pool is another story.

First lesson (I think): you should scrub your pools, at least those
backed by a SAN, right before mirroring them.  The problem pool was
scrubbed about two weeks before I mirrored it, and it was clean.  I
assumed, wrongly, that no checksum errors had appeared in the time that
elapsed.  Well, guess again.  When I mirrored this pool, the source side
of the mirror had two checksum errors.  Interestingly, the target
inherited these errors, so both sides of the mirror now showed two
checksum errors in the counters.  I don't know if this was real, or if
the zpool attach operation just incremented the counters on the second
half of the mirror.
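
For anyone who wants to check those counters from a script before and
after an attach, here is a minimal Python sketch that pulls the CKSUM
column out of saved zpool status text.  The pool name (tank), the device
names, and the sample text are hypothetical, laid out the way zpool
status normally prints its config table; the layout can differ between
releases, so treat this as a starting point rather than a finished tool.

```python
# Hypothetical `zpool status` text; pool and device names are made up.
SAMPLE_STATUS = """\
  pool: tank
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     2
            c3t0d0  ONLINE       0     0     2

errors: No known data errors
"""


def checksum_counters(status_text):
    """Return {name: CKSUM count} for every row of the config table."""
    counters = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # header row of the config table
            continue
        if in_config:
            # A blank or non-table line ends the config table.
            if len(fields) < 5 or not fields[4].isdigit():
                in_config = False
                continue
            counters[fields[0]] = int(fields[4])
    return counters


def devices_with_checksum_errors(status_text):
    """Return only the rows whose CKSUM counter is non-zero."""
    return {name: count
            for name, count in checksum_counters(status_text).items()
            if count > 0}


if __name__ == "__main__":
    print(devices_with_checksum_errors(SAMPLE_STATUS))
```

Saving the status text right after a scrub and again right after the
attach, then comparing the two results, would show whether the new side
picked up counters that were not there before.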

My next mistake was to assume the counters were in error on the second
half of the mirror, so I zeroed the counters with zpool clear.  I then
scrubbed the pool, and no checksum errors were found on either side of
the mirror.  Huh?!?  What about those two checksum errors on the first
side?  So I ran zdb on the pool, and it found scads of errors:

Traversing all blocks to verify checksums and verify nothing leaked ...

zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--
...

and then tons of:

Error counts:
errno count
50 123
leaked space: vdev 0, offset 0x4deaed800, size 2048
...
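
To compare zdb runs between pools, or before and after breaking the
mirror, output like the above can be tallied with a short script.  This
is a rough Python sketch that assumes the line formats are exactly the
ones quoted in this message (the sample below is copied from it, minus
the elided lines); the function name summarize_zdb is made up for
illustration, and other zdb versions may print things differently.

```python
import re

# Sample copied from the zdb output quoted in this message.
SAMPLE = """\
Traversing all blocks to verify checksums and verify nothing leaked ...

zdb_blkptr_cb: Got error 50 reading <33, 727252, 0, 4a> -- skipping--

Error counts:
errno count
50 123
leaked space: vdev 0, offset 0x4deaed800, size 2048
"""

READ_ERR = re.compile(r"Got error (\d+) reading <([^>]*)>")
LEAK = re.compile(
    r"leaked space: vdev (\d+), offset (0x[0-9a-fA-F]+), size (\d+)")
COUNT_ROW = re.compile(r"^(\d+)\s+(\d+)$")


def summarize_zdb(text):
    """Tally read errors, the errno table, and leaked-space lines."""
    read_errors = []    # (errno, text between the angle brackets)
    error_counts = {}   # errno -> count from the "Error counts" table
    leaks = []          # (vdev, offset, size)
    in_counts = False
    for raw in text.splitlines():
        line = raw.strip()
        m = READ_ERR.search(line)
        if m:
            read_errors.append((int(m.group(1)), m.group(2)))
            continue
        if line == "Error counts:":
            in_counts = True
            continue
        if in_counts:
            if line.startswith("errno"):
                continue              # column header of the table
            m = COUNT_ROW.match(line)
            if m:
                error_counts[int(m.group(1))] = int(m.group(2))
                continue
            in_counts = False         # table ended; keep checking this line
        m = LEAK.search(line)
        if m:
            leaks.append(
                (int(m.group(1)), int(m.group(2), 16), int(m.group(3))))
    return {
        "read_errors": read_errors,
        "error_counts": error_counts,
        "leaks": leaks,
        "leaked_bytes": sum(size for _, _, size in leaks),
    }


if __name__ == "__main__":
    print(summarize_zdb(SAMPLE))
```

Running it over the saved output from each side of the mirror would make
it easy to see whether the same blocks and offsets show up in both runs.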


OK, this is odd, so I scrubbed the pool again, and this time it found 4
checksum errors on the initial side of the mirror, but none on the other
side.  That makes some sense (though I don't know what changed), so I
broke the mirror, removing the original side that had the checksum
errors.  I then scrubbed the pool: no errors found.  That's good, but
just to be sure, I ran zdb on it, and it found tons of the same errors
it had found on the original side of the mirror.  Argh!

In the meantime, I ran 4 passes of format -> analyze -> compare on the
initial half of the mirror that had the checksum errors, and it's
totally clean hardware-wise.

So my questions are these:

1) Does the leaked space reported by zdb mean trouble with the pool?
2) Is it possible that the errors got injected into the new half of the
mirror when I attached it?  For now, I'm going to assume that the new
half of the mirror is OK, hardware-wise.
3) I'm running a scrub and zdb on the other pool that lives on these SAN
boxes, because I want to see if they come up with the same problems.  If
not, what could be going on with this crazy pool?
4) Can I recover from this without copying the whole pool to new
storage?  If not, it will be painful for us.  We will have to reboot 350
servers and workstations on stale file handles, interrupting hundreds of
production processes.  My user base is losing faith in my team.

Oh sage ones, please advise. Thanks in advance.

Jon