[zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Albert Chin
Without doing a zpool scrub, what's the quickest way to find files in a
filesystem with cksum errors? Iterating over all files with find takes
quite a bit of time. Maybe there's some zdb fu that will perform the
check for me?
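
For whatever ZFS has already caught, a minimal first step (pool name
"tank" is only an example) is to read the pool's own per-file error
list, which needs no extra I/O:

   zpool status -v tank    # -v lists files with known permanent errors

But that only covers errors ZFS has already hit during reads or a prior
scrub; bad blocks that have not been touched since they went bad will
not show up there, hence the question.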

-- 
albert chin (ch...@thewrittenword.com)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Richard Elling

On Sep 28, 2009, at 2:41 PM, Albert Chin wrote:

Without doing a zpool scrub, what's the quickest way to find files in a
filesystem with cksum errors? Iterating over all files with find takes
quite a bit of time. Maybe there's some zdb fu that will perform the
check for me?


Scrub could be faster, but you can try
tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.
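
As a rough sketch (mountpoint and pool name are only examples), sweep
one filesystem and then see what ZFS flagged while reading:

   cd /tank/myfs
   tar cf - . > /dev/null   # forces every block of this filesystem to be read
   zpool status -v tank     # cksum errors hit during the read show up in the
                            # CKSUM counters and, if unrepairable, the file list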
 -- richard



Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Bob Friesenhahn

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For 
example, the duplicate metadata copy might be corrupt but the problem 
is not detected since it did not happen to be used.
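
If memory serves, zdb can cover part of that from userland. A hedged
sketch only -- the flags below are from memory, so check zdb(1M) before
relying on them:

   zdb -c tank     # traverse the pool, verifying metadata block checksums
   zdb -cc tank    # giving -c twice also verifies data block checksums (slow)

It runs outside the normal I/O path, so treat it as a diagnostic rather
than a scrub replacement.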


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Albert Chin
On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
 On Mon, 28 Sep 2009, Richard Elling wrote:

 Scrub could be faster, but you can try
  tar cf - . > /dev/null

 If you think about it, validating checksums requires reading the data.
 So you simply need to read the data.

 This should work but it does not verify the redundant metadata.  For
 example, the duplicate metadata copy might be corrupt but the problem
 is not detected since it did not happen to be used.

Too bad we cannot scrub a dataset/object.

-- 
albert chin (ch...@thewrittenword.com)


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Richard Elling

On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:


On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For
example, the duplicate metadata copy might be corrupt but the problem
is not detected since it did not happen to be used.


Too bad we cannot scrub a dataset/object.


Can you provide a use case? I don't see why scrub couldn't start and
stop at specific txgs for instance. That won't necessarily get you to a
specific file, though.
 -- richard



Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Tim Cook
On Mon, Sep 28, 2009 at 12:16 PM, Richard Elling
richard.ell...@gmail.com wrote:

 On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:

  On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:

 On Mon, 28 Sep 2009, Richard Elling wrote:


 Scrub could be faster, but you can try
tar cf - . > /dev/null

 If you think about it, validating checksums requires reading the data.
 So you simply need to read the data.


 This should work but it does not verify the redundant metadata.  For
 example, the duplicate metadata copy might be corrupt but the problem
 is not detected since it did not happen to be used.


 Too bad we cannot scrub a dataset/object.


 Can you provide a use case? I don't see why scrub couldn't start and
 stop at specific txgs for instance. That won't necessarily get you to a
 specific file, though.
  -- richard



I get the impression he just wants to check a single file in a pool without
waiting for it to check the entire pool.
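
For a single file, a hedged sketch (path and pool name invented for the
example) is simply to force a read of that one file and then look at
the error list:

   dd if=/tank/myfs/somefile of=/dev/null bs=1024k   # read the whole file
   zpool status -v tank                              # see if it got flagged

If the file is already cached in the ARC this may not touch the disks
at all, so it is only a rough check.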

--Tim


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Bob Friesenhahn

On Mon, 28 Sep 2009, Bob Friesenhahn wrote:


This should work but it does not verify the redundant metadata.  For example, 
the duplicate metadata copy might be corrupt but the problem is not detected 
since it did not happen to be used.


I am finding that your tar incantation is reading hardly any data from 
disk when testing my home directory and the 'tar' happens to be GNU 
tar:


# time tar cf - . > /dev/null
tar cf - . > /dev/null  2.72s user 12.43s system 96% cpu 15.721 total
# du -sh .
82G

Looks like the GNU folks slipped in a small performance enhancement 
if the output is to /dev/null.


Make sure to use /bin/tar, which seems to actually read the data.
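
If GNU tar is all that's handy, a hedged workaround is to keep the
archive stream off /dev/null so the shortcut doesn't kick in, or skip
tar entirely:

   tar cf - . | cat > /dev/null               # the pipe defeats GNU tar's /dev/null shortcut
   find . -type f -exec cat {} + > /dev/null  # or just read every file directly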

When actually reading the data via tar, read performance is very poor. 
Hopefully I will have a ZFS IDR to test with in the next few days 
which fixes the prefetch bug.


Zpool scrub reads the data at 360MB/second but this tar method is only 
reading at an average of 6MB/second to 42MB/second (according to zpool 
iostat).  Wups, I just saw a one-minute average of 105MB and then 
131MB.  Quite variable.
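
For reference, a minimal way to watch this yourself while the sweep
runs (pool name is an example):

   zpool iostat tank 60    # pool-wide bandwidth, one sample every 60 seconds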


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Albert Chin
On Mon, Sep 28, 2009 at 10:16:20AM -0700, Richard Elling wrote:
 On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:

 On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:
 On Mon, 28 Sep 2009, Richard Elling wrote:

 Scrub could be faster, but you can try
  tar cf - . > /dev/null

 If you think about it, validating checksums requires reading the data.
 So you simply need to read the data.

 This should work but it does not verify the redundant metadata.  For
 example, the duplicate metadata copy might be corrupt but the problem
 is not detected since it did not happen to be used.

 Too bad we cannot scrub a dataset/object.

 Can you provide a use case? I don't see why scrub couldn't start and
 stop at specific txgs for instance. That won't necessarily get you to a
 specific file, though.

If your pool is borked but mostly readable, yet some file systems have
cksum errors, you cannot zfs send that file system (err, snapshot of
filesystem). So, you need to manually fix the file system by traversing
it to read all files to determine which must be fixed. Once this is
done, you can snapshot and zfs send. If you have many file systems,
this is time consuming.

Of course, you could just rsync and be happy with what you were able to
recover, but if you have clones branched from the same parent, with a
few differences between snapshots, having to rsync *everything* rather
than just the differences is painful. Hence the reason to try to get
zfs send to work.
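
For the clone case, the workflow being aimed at is roughly the
following (dataset, snapshot and host names are all made up for the
sketch):

   zfs snapshot tank/fs@now
   zfs send -i tank/fs@prev tank/fs@now | ssh backuphost zfs recv backup/fs

which only moves the blocks changed between the two snapshots, but it
fails if it trips over a block with a cksum error, hence the need to
find and fix the bad files first.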

But, this is an extreme example and I doubt pools are often in this
state so the engineering time isn't worth it. In such cases though, a
zfs scrub would be useful.

-- 
albert chin (ch...@thewrittenword.com)


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Victor Latushkin

Richard Elling wrote:

On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:


On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For
example, the duplicate metadata copy might be corrupt but the problem
is not detected since it did not happen to be used.


Too bad we cannot scrub a dataset/object.


Can you provide a use case? I don't see why scrub couldn't start and
stop at specific txgs for instance. That won't necessarily get you to a
specific file, though.


With ever-increasing disk and pool sizes it takes more and more time for
scrub to complete its job. Let's imagine that you have a 100TB pool with
90TB of data in it, and there's a dataset with 10TB that is critical and
another dataset with 80TB that is not that critical, where you can
afford losing some blocks/files.


So being able to scrub an individual dataset would help: scrubs of
critical data could run more frequently and finish faster, while scrubs
of less frequently used and/or less important data could be scheduled to
happen much less often.


It may be useful to have a way to tell ZFS to scrub pool-wide metadata 
only (space maps etc), so that you can build your own schedule of scrubs.


Another interesting idea is to be able to scrub only blocks modified 
since last snapshot.
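
Until something like that exists, the closest approximation I can think
of is a userland read sweep over just the critical datasets. A rough
sketch only, with invented dataset names, and it shares the caveats
about redundant metadata and cached data raised in this thread:

   for fs in $(zfs list -H -o name -r tank/critical); do
       mp=$(zfs get -H -o value mountpoint "$fs")
       # -xdev keeps find out of child dataset mountpoints;
       # the loop visits them separately
       [ -d "$mp" ] && find "$mp" -xdev -type f -exec cat {} + > /dev/null
   done
   zpool status -v tank     # then see what got flagged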


victor


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Richard Elling

On Sep 28, 2009, at 10:31 AM, Victor Latushkin wrote:


Richard Elling wrote:

On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:

On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
   tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For
example, the duplicate metadata copy might be corrupt but the problem
is not detected since it did not happen to be used.


Too bad we cannot scrub a dataset/object.

Can you provide a use case? I don't see why scrub couldn't start and
stop at specific txgs for instance. That won't necessarily get you to a
specific file, though.


With ever-increasing disk and pool sizes it takes more and more time
for scrub to complete its job. Let's imagine that you have a 100TB
pool with 90TB of data in it, and there's a dataset with 10TB that is
critical and another dataset with 80TB that is not that critical,
where you can afford losing some blocks/files.


Personally, I have three concerns here.
  1. Gratuitous complexity, especially inside a pool -- aka creeping
     featurism.
  2. Wouldn't a better practice be to use two pools with different
     protection policies? The only protection policy differences inside
     a pool are copies. In other words, I am concerned that people
     replace good data protection practices with scrubs, expecting
     scrub to deliver better data protection (it won't).
  3. Since the pool contains the set of blocks, shared by datasets, it
     is not clear to me that scrubbing a dataset will detect all of the
     data corruption failures which can affect the dataset.  I'm
     thinking along the lines of phantom writes, for example.
  4. The time it takes to scrub lots of stuff.
...there are four concerns... :-)

For magnetic media, a yearly scrub interval should suffice for most
folks.  I know some folks who scrub monthly. More frequent scrubs
won't buy much.
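
For what it's worth, scheduling that is a one-liner in root's crontab
(pool name is only an example):

   # scrub "tank" at 03:00 on the 1st of each month
   0 3 1 * * /usr/sbin/zpool scrub tank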

Scrubs are also useful for detecting broken hardware. However, normal
activity will also detect broken hardware, so it is better to think of
scrubs as finding degradation of old data rather than being a hardware
checking service.



So being able to scrub an individual dataset would help: scrubs of
critical data could run more frequently and finish faster, while
scrubs of less frequently used and/or less important data could be
scheduled to happen much less often.


It may be useful to have a way to tell ZFS to scrub pool-wide
metadata only (space maps etc), so that you can build your own
schedule of scrubs.


Another interesting idea is to be able to scrub only blocks modified
since last snapshot.


This can be relatively easy to implement. But remember that scrubs are
most useful for finding data which has degraded from the media. In
other words, old data. New data is not likely to have degraded yet,
and since ZFS is COW, all of the new data is, well, new.  This is why
having the ability to bound the start and end of a scrub by txg can be
easy and perhaps useful.
 -- richard



Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Victor Latushkin

On 28.09.09 22:01, Richard Elling wrote:

On Sep 28, 2009, at 10:31 AM, Victor Latushkin wrote:


Richard Elling wrote:

On Sep 28, 2009, at 3:42 PM, Albert Chin wrote:

On Mon, Sep 28, 2009 at 12:09:03PM -0500, Bob Friesenhahn wrote:

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
   tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For
example, the duplicate metadata copy might be corrupt but the problem
is not detected since it did not happen to be used.


Too bad we cannot scrub a dataset/object.

Can you provide a use case? I don't see why scrub couldn't start and
stop at specific txgs for instance. That won't necessarily get you to a
specific file, though.


With ever-increasing disk and pool sizes it takes more and more time
for scrub to complete its job. Let's imagine that you have a 100TB
pool with 90TB of data in it, and there's a dataset with 10TB that is
critical and another dataset with 80TB that is not that critical,
where you can afford losing some blocks/files.


Personally, I have three concerns here.
1. Gratuitous complexity, especially inside a pool -- aka creeping
   featurism.


There's the idea of priority-based resilvering (though not implemented yet, see 
http://blogs.sun.com/bonwick/en_US/entry/smokin_mirrors) that can be simply 
extended to scrubs as well.


2. Wouldn't a better practice be to use two pools with different
   protection policies? The only protection policy differences inside
   a pool are copies. In other words, I am concerned that people
   replace good data protection practices with scrubs, expecting
   scrub to deliver better data protection (it won't).


It may be better, it may not be... With two pools you split your
bandwidth and IOPS and space, and have more entities to care about...


3. Since the pool contains the set of blocks, shared by datasets, it
   is not clear to me that scrubbing a dataset will detect all of the
   data corruption failures which can affect the dataset.  I'm
   thinking along the lines of phantom writes, for example.


That is why it may be useful to always scrub pool-wide metadata or have a way to 
specifically request it.



4. The time it takes to scrub lots of stuff.
...there are four concerns... :-)

For magnetic media, a yearly scrub interval should suffice for most
folks.  I know some folks who scrub monthly. More frequent scrubs
won't buy much.


It won't buy you much in terms of magnetic media decay discovery.
Unfortunately, there are other sources of corruption as well (including
the phantom writes you are thinking about), and being able to discover
corruption and recover from backup as quickly as possible is a good
thing.



Scrubs are also useful for detecting broken hardware. However, normal
activity will also detect broken hardware, so it is better to think of
scrubs as finding degradation of old data rather than being a hardware
checking service.



So being able to scrub an individual dataset would help: scrubs of
critical data could run more frequently and finish faster, while
scrubs of less frequently used and/or less important data could be
scheduled to happen much less often.


It may be useful to have a way to tell ZFS to scrub pool-wide
metadata only (space maps etc), so that you can build your own
schedule of scrubs.


Another interesting idea is to be able to scrub only blocks modified
since last snapshot.


This can be relatively easy to implement. But remember that scrubs are
most useful for finding data which has degraded from the media. In
other words, old data. New data is not likely to have degraded yet,
and since ZFS is COW, all of the new data is, well, new.




This is why having the ability to bound the start and end of a scrub by txg
can be easy and perhaps useful.


This requires exporting the concept of transaction group numbers to the
user, and I do not see how it is less complex from the user-interface
perspective than being able to request a scrub of an individual
dataset, pool-wide metadata, or newly-written data.
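
(Strictly for illustration, and from memory, so double-check the man
pages: txg numbers do already leak out to the user in a couple of
places today, even if it is not a friendly interface:

   zdb -u tank              # dumps the current uberblock, txg included
   zpool history -i tank    # internal events are logged with their txg
)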


regards,
victor


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Bob Friesenhahn

On Mon, 28 Sep 2009, Richard Elling wrote:

   In other words, I am concerned that people replace good data
   protection practices with scrubs, expecting scrub to deliver
   better data protection (it won't).


Many people here would profoundly disagree with the above.  There is 
no substitute for good backups, but a periodic scrub helps validate 
that a later resilver would succeed.  A periodic scrub also helps find 
system problems early when they are less likely to crater your 
business.  It is much better to find an issue during a scrub rather 
than during resilver of a mirror or raidz.


Scrubs are also useful for detecting broken hardware. However, 
normal activity will also detect broken hardware, so it is better to 
think of scrubs as finding degradation of old data rather than being 
a hardware checking service.


Do you have a scientific reference for this notion that old data is 
more likely to be corrupt than new data or is it just a gut-feeling? 
This hypothesis does not sound very supportable to me.  Magnetic 
hysteresis lasts quite a lot longer than the recommended service life 
for a hard drive.  Studio audio tapes from the '60s are still being 
used to produce modern remasters of old audio recordings which sound 
better than they ever did before (other than the master tape).  Some 
forms of magnetic hysteresis are known to last millions of years. 
Media failure is more often than not mechanical or chemical and not 
related to loss of magnetic hysteresis.  Head failures may be 
construed to be media failures.


See http://en.wikipedia.org/wiki/Ferromagnetic for information on 
ferromagnetic materials.


It would be most useful if zfs incorporated a slow-scan scrub which 
validates data at a low rate of speed which does not hinder active 
I/O.  Of course this is not a green energy efficient solution.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/


Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Richard Elling

On Sep 28, 2009, at 11:41 AM, Bob Friesenhahn wrote:


On Mon, 28 Sep 2009, Richard Elling wrote:

   In other words, I am concerned that people replace good data
   protection practices with scrubs, expecting scrub to deliver
   better data protection (it won't).


Many people here would profoundly disagree with the above.  There is  
no substitute for good backups, but a periodic scrub helps validate  
that a later resilver would succeed.  A periodic scrub also helps
find system problems early when they are less likely to crater your  
business.  It is much better to find an issue during a scrub rather  
than during resilver of a mirror or raidz.


As I said, I am concerned that people would mistakenly expect that
scrubbing offers data protection. It doesn't.  I think you proved my
point? ;-)

Scrubs are also useful for detecting broken hardware. However,  
normal activity will also detect broken hardware, so it is better  
to think of scrubs as finding degradation of old data rather than  
being a hardware checking service.


Do you have a scientific reference for this notion that old data is
more likely to be corrupt than new data or is it just a gut-feeling?
This hypothesis does not sound very supportable to me.  Magnetic
hysteresis lasts quite a lot longer than the recommended service life
for a hard drive.  Studio audio tapes from the '60s are still being
used to produce modern remasters of old audio recordings which sound
better than they ever did before (other than the master tape).


Those are analog tapes... they just fade away...
For data, it depends on the ECC methods, quality of the media,
environment, etc. You will find considerable attention spent on
verification of data on tapes in archiving products. In the tape
world, there are slightly different conditions than the magnetic disk
world, but I can't think of a single study which shows that magnetic
disks get more reliable over time, while there are dozens which show
that they get less reliable and that latent sector errors dominate, as
much as 5x, over full disk failures.  My studies of Sun disk failure
rates have shown similar results.

 Some forms of magnetic hysteresis are known to last millions of  
years. Media failure is more often than not mechanical or chemical  
and not related to loss of magnetic hysteresis.  Head failures may  
be construed to be media failures.


Here is a good study from the University of Wisconsin-Madison which
clearly shows the relationship between disk age and latent sector
errors. It also shows how the increase in areal density also increases
the latent sector error (LSE) rate. Additionally, this gets back to
the ECC method, which we observe to be different on consumer-grade and
enterprise-class disks. The study shows a clear win for
enterprise-class drives wrt latent errors.  The paper suggests a
2-week scrub cycle and recognizes that many RAID arrays have such
policies.  There are indeed many studies which show latent sector
errors are a bigger problem as the disk ages.

An Analysis of Latent Sector Errors in Disk Drives
www.cs.wisc.edu/adsl/Publications/latent-sigmetrics07.ps



See http://en.wikipedia.org/wiki/Ferromagnetic for information on  
ferromagnetic materials.


For disks we worry about the superparamagnetic effect.
http://en.wikipedia.org/wiki/Superparamagnetism

Quoting US Patent 6987630,
... the superparamagnetic effect is a thermal relaxation of information
stored on the disk surface. Because the superparamagnetic effect may
occur at room temperature, over time, information stored on the disk
surface will begin to decay. Once the stored information decays beyond
	a threshold level, it will be unable to be properly read by the read
	head and the information will be lost.

The superparamagnetic effect manifests itself by a loss in amplitude in
the readback signal over time or an increase in the mean square error
(MSE) of the read back signal over time. In other words, the readback
signal quality metrics are mean square error and amplitude as measured
by the read channel integrated circuit. Decreases in the quality of the
readback signal cause bit error rate (BER) increases. As is well known,
the BER is the ultimate measure of drive performance in a disk drive.

This effect is based on the time since written. Hence, older data can
have higher MSE and subsequent BER, leading to a UER.

To be fair, newer disk technology is constantly improving. But what is
consistent with the physics is that increase in bit densities leads to
more space and rebalancing the BER. IMHO, this is why we see densities
increase, but UER does not increase (hint: marketing always wins these
sorts of battles).

FWIW, flash memories are not affected by superparamagnetic decay.

It would be most useful if zfs incorporated a 

Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread David Magda

On Sep 28, 2009, at 19:39, Richard Elling wrote:

Finally, there are two basic types of scrubs: read-only and rewrite.
ZFS does read-only. Other scrubbers can do rewrite. There is evidence
that rewrites are better for attacking superparamagnetic decay issues.


Something that may be possible when bp rewrite is eventually committed.

Educating post. Thanks.



Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Robert Milkowski

Bob Friesenhahn wrote:

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For 
example, the duplicate metadata copy might be corrupt but the problem 
is not detected since it did not happen to be used.




Not only that - it also won't read all the copies of the data if zfs
has redundancy configured at the pool level. Scrubbing the pool will.
And that's the main reason behind the scrub - to be able to detect and
repair checksum errors (if any) while a redundant copy is still fine.


--
Robert Milkowski
http://milek.blogspot.com




Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Robert Milkowski

Robert Milkowski wrote:

Bob Friesenhahn wrote:

On Mon, 28 Sep 2009, Richard Elling wrote:


Scrub could be faster, but you can try
tar cf - . > /dev/null

If you think about it, validating checksums requires reading the data.
So you simply need to read the data.


This should work but it does not verify the redundant metadata.  For 
example, the duplicate metadata copy might be corrupt but the problem 
is not detected since it did not happen to be used.




Not only that - it also won't read all the copies of the data if zfs
has redundancy configured at the pool level. Scrubbing the pool will.
And that's the main reason behind the scrub - to be able to detect and
repair checksum errors (if any) while a redundant copy is still fine.




Also, doing tar means reading from the ARC and/or L2ARC if the data is
cached, which won't verify that the data is actually fine on disk.
Scrub won't use the cache and will always go to the physical disks.
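
The crude way around that, if a userland sweep is all you have, is to
make sure the pool's cached data is gone first. A disruptive sketch
only (it unmounts every dataset in the pool; the name is an example):

   zpool export tank    # dropping the pool also drops its data from the ARC
   zpool import tank
   # ...then run the read sweep and check zpool status -v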


--
Robert Milkowski
http://milek.blogspot.com



Re: [zfs-discuss] Quickest way to find files with cksum errors without doing scrub

2009-09-28 Thread Bob Friesenhahn

On Mon, 28 Sep 2009, Richard Elling wrote:


Many people here would profoundly disagree with the above.  There is no 
substitute for good backups, but a periodic scrub helps validate that a 
later resilver would succeed.  A periodic scrub also helps find system 
problems early when they are less likely to crater your business.  It is 
much better to find an issue during a scrub rather than during resilver of 
a mirror or raidz.


As I said, I am concerned that people would mistakenly expect that scrubbing
offers data protection. It doesn't.  I think you proved my point? ;-)


It does not specifically offer data protection but if you have only 
duplex redundancy, it substantially helps find and correct a failure 
which would have caused data loss during a resilver.  The value 
substantially diminishes if you have triple redundancy.


I hope it does not offend that I scrub my mirrored pools once a week.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,http://www.GraphicsMagick.org/