Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Stephan Budach
On 19.10.2010 at 22:36, Tuomas Leikola tuomas.leik...@gmail.com wrote:

 On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey sh...@nedharvey.com 
 wrote:
 Thank you, but, the original question was whether a scrub would identify
 just corrupt blocks, or if it would be able to map corrupt blocks to a list
 of corrupt files.
 
 
 Just in case this wasn't already clear.
 
 After scrub sees read or checksum errors, zpool status -v will list
 filenames that are affected. At least in my experience.
 -- 
 - Tuomas

That didn't do it for me. I used scrub and afterwards zpool status -v didn't 
show any additional corrupted files, although there were the same three files 
corrupted in a number of snapshots, which of course zfs send detected when 
trying to actually send them.

budy


Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Edward Ned Harvey
 From: Stephan Budach [mailto:stephan.bud...@jvm.de]
 
  Just in case this wasn't already clear.
 
  After scrub sees read or checksum errors, zpool status -v will list
  filenames that are affected. At least in my experience.
  --
  - Tuomas
 
 That didn't do it for me. I used scrub and afterwards zpool status -v
 didn't show any additional corrupted files, although there were the
 same three files corrupted in a number of snapshots, which of course
 zfs send detected when trying to actually send them.

Budy, we've been over this.

The behavior you experienced is explained by having corrupt data inside a
hardware raid, and during the scrub you luckily read the good copy of
redundant data.  During zfs send, you unluckily read the bad copy of
redundant data.  This is a known problem as long as you use hardware raid.
It's one of the big selling points, reasons for ZFS to exist.  You should
always give ZFS JBOD devices to work on, so ZFS is able to scrub both of the
redundant sides of the data, and when a checksum error occurs, ZFS is able
to detect *and* correct it.  Don't use hardware raid.
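
For illustration, a minimal sketch of the layout being argued for, with
hypothetical disk names; because ZFS owns both sides of the mirror, a scrub
can rewrite a bad copy from the good one:

   zpool create tank mirror c4t0d0 c4t1d0   # ZFS-managed mirror of two raw disks
   zpool scrub tank                         # checksum errors found here can be
                                            # repaired from the other side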



Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Darren J Moffat

On 20/10/2010 12:20, Edward Ned Harvey wrote:

It's one of the big selling points, reasons for ZFS to exist.  You should
always give ZFS JBOD devices to work on, so ZFS is able to scrub both of the
redundant sides of the data, and when a checksum error occurs, ZFS is able
to detect *and* correct it.  Don't use hardware raid.


That isn't the recommended best practice, you are stating it far too 
strongly.


The recommended best practice is to always create ZFS pools with 
redundancy in the control of ZFS.  That doesn't require that the back 
end storage be JBOD or full disks nor does it require you not to use 
hardware raid. Some or all of which are impossible if you are using SAN 
or other remote block storage devices in many cases - and certainly the 
case if the SAN is provided by a Sun ZFS Storage appliance.


--
Darren J Moffat


Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Stephan Budach


 From: Stephan Budach [mailto:stephan.bud...@jvm.de]
 
 Just in case this wasn't already clear.
 
 After scrub sees read or checksum errors, zpool status -v will list
 filenames that are affected. At least in my experience.
 --
 - Tuomas
 
 That didn't do it for me. I used scrub and afterwards zpool status -v
 didn't show any additional corrupted files, although there were the
 same three files corrupted in a number of snapshots, which of course
 zfs send detected when trying to actually send them.
 
 Budy, we've been over this.
 
 The behavior you experienced is explained by having corrupt data inside a
 hardware raid, and during the scrub you luckily read the good copy of
 redundant data.  During zfs send, you unluckily read the bad copy of
 redundant data.  This is a known problem as long as you use hardware raid.
 It's one of the big selling points, reasons for ZFS to exist.  You should
 always give ZFS JBOD devices to work on, so ZFS is able to scrub both of the
 redundant sides of the data, and when a checksum error occurs, ZFS is able
 to detect *and* correct it.  Don't use hardware raid.
 

Edward - I am working on that! 

Although, I have to say that I had exactly 3 files that were corrupt in each 
snapshot until I finally deleted them and restored them from their original 
source.

zfs send will abort when trying to send them, while scrub doesn't notice this.
If zfs send had sent any of these snapshots successfully, or if any of 
my read attempts for these files had worked one time and failed another, 
I'd agree.
I can't see how this behaviour could be explained otherwise. Or better: what 
are the chances that only scrub gets the clean blocks from the h/w raids, 
while zfs send or cp always gets the corrupted blocks?
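
A quick way to exercise that code path without actually transferring
anything, since zfs send has to read every block of the snapshot and aborts
with an I/O error on an unreadable one (dataset and snapshot names are
placeholders):

   zfs send tank/data@snap1 > /dev/null && echo snapshot readable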



Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Edward Ned Harvey
 From: Stephan Budach [mailto:stephan.bud...@jvm.de]
 
 Although, I have to say that I do have exactly 3 files that are corrupt
 in each snapshot until I finally deleted them and restored them from
 their original source.
 
 zfs send will abort when trying to send them, while scrub doesn't
 notice this.

That cannot be consistently repeatable.  If anything will notice corrupt
data, scrub will too.  The only way you will find corrupt data with
something else and not with scrub is ... If the corrupt data didn't exist
during the scrub.

I'm glad you're working to change the raid setup to jbod, because, although
that's not the only possible explanation, it is the most obvious
explanation.



Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Edward Ned Harvey
 -Original Message-
 From: Darren J Moffat [mailto:darr...@opensolaris.org]
  It's one of the big selling points, reasons for ZFS to exist.  You
 should
  always give ZFS JBOD devices to work on, so ZFS is able to scrub both
 of the
  redundant sides of the data, and when a checksum error occurs, ZFS is
 able
  to detect *and* correct it.  Don't use hardware raid.
 
 That isn't the recommended best practice, you are stating it far too
 strongly.

 The recommended best practice is to always create ZFS pools with
 redundancy in the control of ZFS.  That doesn't require that the back
 end storage be JBOD or full disks nor does it require you not to use
  hardware raid. Some or all of which are impossible if you are using SAN
 or other remote block storage devices in many cases - and certainly the
 case if the SAN is provided by a Sun ZFS Storage appliance.

You're right though, I'm stating that too strongly.  Never say never.  And
never say always.  The truth is exactly as you said.  Even if you have
redundancy in hardware, make sure you also have redundancy in ZFS.

If you allow hardware to manage redundancy ... Then just as Budy has
experienced, when corruption is found, it's not consistently repeatable, and
it could appear anywhere in the storage unit randomly.  ZFS is unable to
isolate the individual failing disk.  After enough checksum failures, the
whole storage unit will be marked failed and taken offline.  So much for
your redundancy.  

It is a problem if your only redundancy is hardware.  It is not a problem if
you also have redundancy managed by ZFS.  So a more correct conclusion would
be: whenever possible, don't use hardware raid, and whenever possible use
JBOD managed by ZFS.  But whatever you do, make sure ZFS has some redundancy
it can manage.
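
A minimal sketch of that conclusion applied to the striped pool shown
elsewhere in this thread (backupPool_01 and its two existing LUNs); the new
LUN names are placeholders for a second LUN per RAID box, and zpool attach
turns each plain top-level device into a ZFS-managed mirror:

   zpool attach backupPool_01 c3t211378AC0253d0 cXtYYYYYYYYYYYYd0
   zpool attach backupPool_01 c3t211378AC026Ed0 cXtZZZZZZZZZZZZd0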



Re: [zfs-discuss] Finding corrupted files

2010-10-20 Thread Stephan Budach

On 20.10.10 15:11, Edward Ned Harvey wrote:

From: Stephan Budach [mailto:stephan.bud...@jvm.de]

Although, I have to say that I do have exactly 3 files that are corrupt
in each snapshot until I finally deleted them and restored them from
their original source.

zfs send will abort when trying to send them, while scrub doesn't
notice this.

That cannot be consistently repeatable.  If anything will notice corrupt
data, scrub will too.  The only way you will find corrupt data with
something else and not with scrub is ... If the corrupt data didn't exist
during the scrub.


I will do some more scrubbing - it only takes a couple of hours, and then 
scrub should at least show some of the errors.
When I use zpool clear on that pool, why does zpool status still show 
the errors that have been encountered? I'd figure it would be a lot 
easier to tell whether scrub finds new errors if zpool status -v 
didn't show the old ones.
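
For reference, a minimal sketch of that cycle (pool name taken from the
zpool status output elsewhere in this thread); in my understanding zpool
clear resets the per-device error counters, while the list of files with
permanent errors is only refreshed once a later scrub completes:

   zpool clear backupPool_01       # reset READ/WRITE/CKSUM counters
   zpool scrub backupPool_01       # re-read and verify every allocated block
   zpool status -v backupPool_01   # error list now reflects the new scrub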


--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.bud...@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380



Re: [zfs-discuss] Finding corrupted files

2010-10-19 Thread Tuomas Leikola
On Mon, Oct 18, 2010 at 4:55 PM, Edward Ned Harvey sh...@nedharvey.com wrote:
 Thank you, but, the original question was whether a scrub would identify
 just corrupt blocks, or if it would be able to map corrupt blocks to a list
 of corrupt files.


Just in case this wasn't already clear.

After scrub sees read or checksum errors, zpool status -v will list
filenames that are affected. At least in my experience.
-- 
- Tuomas
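
A minimal sketch of that workflow, with a placeholder pool name; the exact
output wording varies, but affected paths typically show up in the errors
section of zpool status -v:

   zpool scrub tank        # read and checksum every allocated block
   zpool status -v tank    # after errors, lists the affected file paths under
                           # "Permanent errors have been detected in the following files:"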


Re: [zfs-discuss] Finding corrupted files

2010-10-18 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
 On Oct 17, 2010, at 6:17 AM, Edward Ned Harvey wrote:
 
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
  If scrub is operating at a block-level (and I think it is), then how
  can
  checksum failures be mapped to file names?  For example, this is a
  long-requested feature of zfs send which is fundamentally
 difficult
  or
  impossible to implement.
 
  How about that.  I recently learned that zfs diff does exist
 already, in
  b147 of openindiana.  That means it's already in the oracle opened-
 source
  zfs code, but apparently too new to be included in any of the present
  releases.
 
  So it seems, zfs does have some ability to figure out which file owns
 a
  particular block on disk.
 
 uhm... of course this exists.  The problem is that the efficient
 mapping
 goes the other way: files to blocks.  Snapshots further complicate this
 because a block may belong to a filename in one snapshot but the file
 got renamed in another snapshot.  Deduplication also complicates this
 because a block may be referenced in multiple files. Maintaining this
 mapping live is probably not worth the effort.

Thank you, but, the original question was whether a scrub would identify
just corrupt blocks, or if it would be able to map corrupt blocks to a list
of corrupt files.  

Until I wrote this comment about zfs diff no answer existed in this
thread.  (Unless I overlooked it somehow.)

So thank you for the information about dedup and difficulty maintaining live
information.  Although it was irrelevant to the discussion at hand.



Re: [zfs-discuss] Finding corrupted files

2010-10-17 Thread Orvar Korvar
budy,
here are some links. Remember, the reason you see corrupted files is that 
ZFS detects them. You probably got corruption earlier as well, but your hardware 
did not notice it. This is called silent corruption. ZFS is designed to 
detect and correct silent corruption, which no normal hardware is designed for.

The thing is, ZFS does end-to-end checksumming. Is the data in RAM identical to 
what is on disc? From RAM down to the controller to the disk, there can be errors 
in the passing between the realms. Normally there are checksums within each realm 
(checksums on the disc), but no checksums from the beginning of the chain to the 
end - end-to-end checksums:
http://jforonda.blogspot.com/2007/01/faulty-fc-port-meets-zfs.html

Here are some links. CERN did a data integrity survey on 3000 hw raid setups and 
saw silent corruptions.
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/


In another CERN paper, they say such data corruption is found in all 
solutions, no matter the price (even very expensive Enterprise solutions)!!! From 
that paper (cannot find the link now):
Conclusions
-silent corruptions are a fact of life
-first step towards a solution is detection
-elimination seems impossible
-existing datasets are at the mercy of Murphy
-correction will cost time AND money
-effort has to start now (if not started already)
-multiple cost-schemes exist
--trade time and storage space (à la Google)
--trade time and CPU power (correction codes)

CERN writes: checksumming - not necessarily enough; you need to use 
end-to-end checksumming (ZFS has a point)


See the specifications on a new SAS Enterprise disk; typically it says:
one irrecoverable error in 10^15 bits. With today's large and fast raids, you 
quickly reach 10^15 bits in a short time.
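
As a rough back-of-the-envelope check (decimal units, my own arithmetic): 
10^15 bits = 1.25 x 10^14 bytes, roughly 125 TB. An array streaming at 
1 GB/s reads that much in about 1.25 x 10^5 seconds, i.e. roughly a day and 
a half of sustained I/O.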


Greenplum's database solution faces one such bit error every 15 min:
http://queue.acm.org/detail.cfm?id=1317400


Ordinary filesystems such as XFS, ReiserFS, JFS, etc. do not protect your 
data, nor detect all errors (here is a PhD thesis link):
http://www.zdnet.com/blog/storage/how-microsoft-puts-your-data-at-risk/169


ZFS data integrity tested by researchers:
http://www.zdnet.com/blog/storage/zfs-data-integrity-tested/811?tag=rbxccnbzd1
(if they had run ZFS raid, ZFS would have corrected all artificially injected 
errors. As it was, ZFS only detected all errors - which is already very difficult 
to do. The first step is detection, then repairing the errors.)


Companies try to hide silent corruption:
http://www.enterprisestorageforum.com/sans/features/article.php/3704666


http://www.miracleas.com/BAARF/RAID5_versus_RAID10.txt
When a drive returns garbage, since RAID5 does not EVER check parity on read 
(RAID3 & RAID4 do BTW and both perform better for databases than RAID5 to boot) 
if you write a garbage sector back garbage parity will be calculated and your 
RAID5 integrity is lost! Similarly if a drive fails and one of the remaining 
drives is flaky the replacement will be rebuilt with garbage also propagating 
the problem to two blocks instead of just one.


http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
The paper explains that the best RAID-6 can do is use probabilistic methods to 
distinguish between single and dual-disk corruption, eg. there are 95% chances 
it is single-disk corruption so I am going to fix it assuming that, but there 
are 5% chances I am going to actually corrupt more data, I just can't tell. I 
wouldn't want to rely on a RAID controller that takes gambles :-)


Researchers write regarding hw-raid:
http://www.cs.wisc.edu/adsl/Publications/parity-fast08.html
We use the model checker to evaluate a number of different approaches found in 
real RAID systems, focusing on parity-based protection and single errors. We 
find holes in all of the schemes examined, where systems potentially exposes 
data to loss or returns corrupt data to the user. In data loss scenarios, the 
error is detected, but the data cannot be recovered, while in the rest, the 
error is not detected and therefore corrupt data is returned to the user. For 
example, we examine a combination of two techniques – block-level checksums 
(where checksums of the data block are stored within the same disk block as 
data and verified on every read) and write-verify (where data is read back 
immediately after it is written to disk and verified for correctness), and show 
that the scheme could still fail to detect certain error conditions, thus 
returning corrupt data to the user.

We discover one particularly interesting and general problem that we call 
parity pollution. In this situation, corrupt data in one block of a stripe 
spreads to other blocks through various parity calculations. We find a number 
of cases where parity pollution occurs, and show how pollution can lead to data 
loss. Specifically, we find that data scrubbing (which is used to reduce the 
chances of double disk failures) tends to be one of the main causes of parity 
pollution.



Re: [zfs-discuss] Finding corrupted files

2010-10-17 Thread Kees Nuyt
On Sun, 17 Oct 2010 03:05:34 PDT, Orvar Korvar
knatte_fnatte_tja...@yahoo.com wrote:

 here are some links. 

Wow, that's a great overview, thanks!
-- 
  (  Kees Nuyt
  )
c[_]


Re: [zfs-discuss] Finding corrupted files

2010-10-17 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 If scrub is operating at a block-level (and I think it is), then how
 can
 checksum failures be mapped to file names?  For example, this is a
 long-requested feature of zfs send which is fundamentally difficult
 or
 impossible to implement.

How about that.  I recently learned that zfs diff does exist already, in
b147 of openindiana.  That means it's already in the Oracle open-sourced
zfs code, but apparently too new to be included in any of the present
releases.

So it seems, zfs does have some ability to figure out which file owns a
particular block on disk.
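
For anyone who hasn't seen it, a minimal illustration of zfs diff usage
(dataset and snapshot names are placeholders, and the sample output lines
are only indicative):

   zfs diff tank/data@monday tank/data@tuesday
   M       /tank/data/projects/report.odt
   +       /tank/data/projects/new-file.txt

The first column marks the change: M modified, + added, - removed, R renamed.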



Re: [zfs-discuss] Finding corrupted files

2010-10-17 Thread Richard Elling
On Oct 17, 2010, at 6:17 AM, Edward Ned Harvey wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 If scrub is operating at a block-level (and I think it is), then how
 can
 checksum failures be mapped to file names?  For example, this is a
 long-requested feature of zfs send which is fundamentally difficult
 or
 impossible to implement.
 
 How about that.  I recently learned that zfs diff does exist already, in
 b147 of openindiana.  That means it's already in the oracle opened-source
 zfs code, but apparently too new to be included in any of the present
 releases.
 
 So it seems, zfs does have some ability to figure out which file owns a
 particular block on disk.

uhm... of course this exists.  The problem is that the efficient mapping
goes the other way: files to blocks.  Snapshots further complicate this 
because a block may belong to a filename in one snapshot but the file 
got renamed in another snapshot.  Deduplication also complicates this
because a block may be referenced in multiple files. Maintaining this
mapping live is probably not worth the effort.
 -- richard



Re: [zfs-discuss] Finding corrupted files

2010-10-16 Thread Richard Elling
On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:

 So, what would you suggest, if I wanted to create really big pools? Say in 
 the 100 TB range? That would be quite a number of single drives then, 
 especially when you want to go with zpool raid-1.

For 100 TB, the methods change dramatically.  You can't just reload 100 TB from 
CD
or tape. When you get to this scale you need to be thinking about raidz2+ *and*
mirroring.

I will be exploring these issues of scale at the "Techniques for Managing Huge
Amounts of Data" tutorial at the USENIX LISA '10 Conference.
http://www.usenix.org/events/lisa10/training/
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] Finding corrupted files

2010-10-16 Thread Pasi Kärkkäinen
On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
 
  So, what would you suggest, if I wanted to create really big pools? Say
  in the 100 TB range? That would be quite a number of single drives then,
  especially when you want to go with zpool raid-1.
 
For 100 TB, the methods change dramatically.  You can't just reload 100 TB
from CD
or tape. When you get to this scale you need to be thinking about raidz2+
*and*
mirroring.
I will be exploring these issues of scale at the Techniques for Managing
Huge
Amounts of Data tutorial at the USENIX LISA '10 Conference.
[1]http://www.usenix.org/events/lisa10/training/

Hopefully your presentation will be available online after the event!

-- Pasi

 -- richard
 
--
OpenStorage Summit, October 25-27, Palo Alto, CA
[2]http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16
ZFS and performance consulting
[3]http://www.RichardElling.com
 
 References
 
Visible links
1. http://www.usenix.org/events/lisa10/training/
2. http://nexenta-summit2010.eventbrite.com/
3. http://www.richardelling.com/



Re: [zfs-discuss] Finding corrupted files

2010-10-16 Thread Richard Elling
On Oct 16, 2010, at 4:13 PM, Pasi Kärkkäinen wrote:
 On Sat, Oct 16, 2010 at 08:38:28AM -0700, Richard Elling wrote:
   On Oct 15, 2010, at 6:18 AM, Stephan Budach wrote:
 
 So, what would you suggest, if I wanted to create really big pools? Say
 in the 100 TB range? That would be quite a number of single drives then,
 especially when you want to go with zpool raid-1.
 
   For 100 TB, the methods change dramatically.  You can't just reload 100 TB
   from CD
   or tape. When you get to this scale you need to be thinking about raidz2+
   *and*
   mirroring.
   I will be exploring these issues of scale at the Techniques for Managing
   Huge
   Amounts of Data tutorial at the USENIX LISA '10 Conference.
   [1]http://www.usenix.org/events/lisa10/training/
 
 Hopefully your presentation will be available online after the event!

Sure, though I would encourage everyone to attend :-)
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference November 8-16, 2010
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] Finding corrupted files

2010-10-15 Thread Stephan Budach

On 14.10.10 17:48, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Toby Thain


I don't want to heat up the discussion about ZFS managed discs vs.
HW raids, but if RAID5/6 would be that bad, no one would use it
anymore.

It is. And there's no reason not to point it out. The world has

Well, neither one of the above statements is really fair.

The truth is:  raid5/6 are generally not that bad.  Data integrity failures
are not terribly common (maybe one bit per year out of 20 large disks or
something like that.)

And in order to reach the conclusion nobody would use it, the people using
it would have to first *notice* the failure.  Which they don't.  That's kind
of the point.

Since I started using ZFS in production, about a year ago, on three servers
totaling approx 1.5TB used, I have had precisely one checksum error, which
ZFS corrected.  I have every reason to believe, if that were on a raid5/6,
the error would have gone undetected and nobody would have noticed.


Point taken!

So, what would you suggest, if I wanted to create really big pools? Say 
in the 100 TB range? That would be quite a number of single drives then, 
especially when you want to go with zpool raid-1.


Cheers,
budy

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.bud...@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380



Re: [zfs-discuss] Finding corrupted files

2010-10-15 Thread Ross Walker
On Oct 15, 2010, at 9:18 AM, Stephan Budach stephan.bud...@jvm.de wrote:

 Am 14.10.10 17:48, schrieb Edward Ned Harvey:
 
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Toby Thain
 
 I don't want to heat up the discussion about ZFS managed discs vs.
 HW raids, but if RAID5/6 would be that bad, no one would use it
 anymore.
 It is. And there's no reason not to point it out. The world has
 Well, neither one of the above statements is really fair.
 
 The truth is:  raid5/6 are generally not that bad.  Data integrity failures
 are not terribly common (maybe one bit per year out of 20 large disks or
 something like that.)
 
 And in order to reach the conclusion nobody would use it, the people using
 it would have to first *notice* the failure.  Which they don't.  That's kind
 of the point.
 
 Since I started using ZFS in production, about a year ago, on three servers
 totaling approx 1.5TB used, I have had precisely one checksum error, which
 ZFS corrected.  I have every reason to believe, if that were on a raid5/6,
 the error would have gone undetected and nobody would have noticed.
 
 Point taken!
 
 So, what would you suggest, if I wanted to create really big pools? Say in 
 the 100 TB range? That would be quite a number of single drives then, 
 especially when you want to go with zpool raid-1.

A pool consisting of 4 disk raidz vdevs (25% overhead) or 6 disk raidz2 vdevs 
(33% overhead) should deliver the storage and performance for a pool that size, 
versus a pool of mirrors (50% overhead).

You need a lot of spindles to reach 100TB.

-Ross
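
Rough spindle counts for ~100 TB usable with 2 TB drives, ignoring metadata
overhead and free-space headroom (my own back-of-the-envelope arithmetic):

   mirrors        (50% usable): 100 TB / (0.50 x 2 TB) = 100 drives
   4-disk raidz1  (75% usable): 100 TB / (0.75 x 2 TB) =  67 drives (17 vdevs = 68)
   6-disk raidz2  (67% usable): 100 TB / (0.67 x 2 TB) =  75 drives (13 vdevs = 78)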



Re: [zfs-discuss] Finding corrupted files

2010-10-15 Thread Stephan Budach

On 12.10.10 14:21, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Stephan Budach

   c3t211378AC0253d0  ONLINE   0 0 0

How many disks are there inside of c3t211378AC0253d0?

How are they configured?  Hardware raid 5?  A mirror of two hardware raid
5's?  The point is:  This device, as seen by ZFS, is not a pure storage
device.  It is a high level device representing some LUN or something, which
is configured & controlled by hardware raid.

If there's zero redundancy in that device, then scrub would probably find
the checksum errors consistently and repeatably.

If there's some redundancy in that device, then all bets are off.  Sometimes
scrub might read the good half of the data, and other times, the bad half.


But then again, the error might not be in the physical disks themselves.
The error might be somewhere in the raid controller(s) or the interconnect.
Or even some weird unsupported driver or something.

Both raid boxes run raid6 with 16 drives each. This is the reason I was 
running a non-mirrored pool in the first place.
I fully understand that zfs' power comes into play when you're running 
with multiple independent drives, but that was what I had at hand.


I now also got what you meant by the "good half", but I don't dare to say 
whether or not this is also the case in a raid6 setup.


Regards

--
Stephan Budach
Jung von Matt/it-services GmbH
Glashüttenstraße 79
20357 Hamburg

Tel: +49 40-4321-1353
Fax: +49 40-4321-1114
E-Mail: stephan.bud...@jvm.de
Internet: http://www.jvm.com

Geschäftsführer: Ulrich Pallas, Frank Wilhelm
AG HH HRB 98380



Re: [zfs-discuss] Finding corrupted files

2010-10-15 Thread Edward Ned Harvey
 From: Stephan Budach [mailto:stephan.bud...@jvm.de]
 
 Point taken!
 
 So, what would you suggest, if I wanted to create really big pools? Say
 in the 100 TB range? That would be quite a number of single drives
 then, especially when you want to go with zpool raid-1.

You have a lot of disks.  You either tell the hardware to manage a lot of
disks, and then tell ZFS to manage a single device, and you take unnecessary
risk and performance degradation for no apparent reason ...

Or you tell ZFS to manage a lot of disks.  Either way, you have a lot of
disks that need to be managed by something.  Why would you want to make that
hardware instead of ZFS?

For 100TB ... I suppose you have 2TB disks.  I suppose you have 12 buses.  I
would make a raidz1 using 1 disk from bus0, bus1, ... bus5.  I would make
another raidz1 vdev using a disk from bus6, bus7, ... bus11.  And so forth.
Then, even if you lose a whole bus, you still haven't lost your pool.  Each
raidz1 vdev would be 6 disks with a capacity of 5, so you would have a total
of 10 vdev's and that means 5 disks on each bus.

Or do whatever you want.  The point is yes, give all the individual disks to
ZFS.
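
A minimal sketch of the first two vdevs in that layout (all device names are
hypothetical placeholders, one disk per bus); the remaining eight vdevs would
be added the same way with zpool add:

   zpool create bigpool \
     raidz1 c0t0d0 c1t0d0 c2t0d0 c3t0d0 c4t0d0 c5t0d0 \
     raidz1 c6t0d0 c7t0d0 c8t0d0 c9t0d0 c10t0d0 c11t0d0

   # e.g. zpool add bigpool raidz1 c0t1d0 c1t1d0 c2t1d0 c3t1d0 c4t1d0 c5t1d0

Each 6-disk raidz1 vdev of 2TB drives yields about 10TB usable, so ten of
them lands at roughly 100TB.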



Re: [zfs-discuss] Finding corrupted files

2010-10-14 Thread Stephan Budach
I'd like to see those docs as well.
As all HW raids are driven by software, of course - and software can be buggy.

I don't want to heat up the discussion about ZFS managed discs vs. HW raids, 
but if RAID5/6 would be that bad, no one would use it anymore.

So… just post the link and I will take a close look at the docs.

Thanks,
budy


Re: [zfs-discuss] Finding corrupted files

2010-10-14 Thread Toby Thain


On 14-Oct-10, at 3:27 AM, Stephan Budach wrote:


I'd like to see those docs as well.
As all HW raids are driven by software, of course - and software can  
be buggy.




It's not that the software 'can be buggy' - that's not the point here.  
The point being made is that conventional RAID just doesn't offer data  
*integrity* - it's not a design factor. The necessary mechanisms  
simply aren't there.


Contrariwise, with ZFS, end to end integrity is *designed in*. The  
'papers' which demonstrate this difference are the design documents;  
anyone could start with Mr Bonwick's blog - with which I am sure most  
list readers are already familiar.


http://blogs.sun.com/bonwick/en_US/category/ZFS
e.g. http://blogs.sun.com/bonwick/en_US/entry/zfs_end_to_end_data

I don't want to heat up the discussion about ZFS managed discs vs.  
HW raids, but if RAID5/6 would be that bad, no one would use it  
anymore.


It is. And there's no reason not to point it out. The world has  
changed a lot since RAID was 'state of the art'. It is important to  
understand its limitations (most RAID users apparently don't).


The saddest part is that your experience clearly shows these  
limitations. As expected, the hardware RAID didn't protect your data,  
since it's designed neither to detect nor repair such errors.


If you had been running any other filesystem on your RAID you would  
never even have found out about it until you accessed a damaged part  
of it. Furthermore, backups would probably have been silently corrupt,  
too.


As many other replies have said: The correct solution is to let ZFS,  
and not conventional RAID, manage your redundancy. That's the bottom  
line of any discussion of ZFS managed discs vs. HW raids. If still  
unclear, read Bonwick's blog posts, or the detailed reply to you from  
Edward Harvey (10/6).


--Toby



So… just post the link and I will take a close look at the docs.

Thanks,
budy


Re: [zfs-discuss] Finding corrupted files

2010-10-14 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Toby Thain
 
  I don't want to heat up the discussion about ZFS managed discs vs.
  HW raids, but if RAID5/6 would be that bad, no one would use it
  anymore.
 
 It is. And there's no reason not to point it out. The world has

Well, neither one of the above statements is really fair.

The truth is:  raid5/6 are generally not that bad.  Data integrity failures
are not terribly common (maybe one bit per year out of 20 large disks or
something like that.)

And in order to reach the conclusion nobody would use it, the people using
it would have to first *notice* the failure.  Which they don't.  That's kind
of the point.

Since I started using ZFS in production, about a year ago, on three servers
totaling approx 1.5TB used, I have had precisely one checksum error, which
ZFS corrected.  I have every reason to believe, if that were on a raid5/6,
the error would have gone undetected and nobody would have noticed.



Re: [zfs-discuss] Finding corrupted files

2010-10-13 Thread Orvar Korvar
Budy, if you are using raid-5 or raid-6 underneath ZFS, then you should know 
that raid-5/6 might corrupt data. See here for lots of technical articles why 
raid-5 is bad:
http://www.baarf.com/
raid-6 is not better. I can show you links about raid-6 being not safe.

It is a good thing you run ZFS, because ZFS can detect those errors, whereas 
raid-5/6 cannot. There is lots of research from computer scientists that shows 
this. You want to see some research papers on data corruption and hardware raid?

On the other hand, ZFS is safe. There are research papers showing that ZFS 
detects and corrects all errors. You want to see them?

The bottom line is: ZFS should manage the discs directly. Do not let hardware 
raid (which can not detect all errors) run the discs. ZFS can detect and repair 
those errors. That is the reason to use ZFS, for data safety. Not for 
performance (that is secondary). 

You do have problems with your discs, only ZFS detects those errors. Your 
hardware raid did not detect those errors. ZFS can not repair the errors, 
unless ZFS runs the discs.


Re: [zfs-discuss] Finding corrupted files

2010-10-13 Thread Richard Elling
On Oct 13, 2010, at 12:59 PM, Orvar Korvar wrote:
 On the other hand, ZFS is safe. There are research papers showing that ZFS 
 detects and corrects all errors. You want to see them?

I would.  URLs please?
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
ZFS Tutorial at USENIX LISA'10 Conference
November 8, 2010  San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Stephan Budach
You are implying that the issues resulted from the H/W raid(s) and I don't 
think that this is appropriate.

I configured a striped pool using two raids - this is exactly the same as using 
two single hard drives without mirroring them. I simply cannot see what zfs 
would be able to do in case of a block corruption in that matter.

You are not stating that a single hard drive is more reliable than a HW raid 
box, are you? Actually my pool has no mirror capabilities at all, unless I am 
seriously mistaken.

What scrub has found out is that none of the blocks had any issue, but the 
filesystem was not clean either, so if scrub does it's job right and doesn't 
report any errors, the error must have occurred somewhere else up the stack, 
way before the checksum had been calculated.

No?


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Tuomas Leikola
On Tue, Oct 12, 2010 at 9:39 AM, Stephan Budach stephan.bud...@jvm.de wrote:
 You are implying that the issues resulted from the H/W raid(s) and I don't 
 think that this is appropriate.


Not exactly. Because the raid is managed in hardware, and not by zfs,
zfs cannot fix these errors when it encounters them.

 I configured a striped pool using two raids - this is exactly the same as 
 using two single hard drives without mirroring them. I simply cannot see what 
 zfs would be able to do in case of a block corruption in that matter.

It cannot, exactly.

 You are not stating that a single hard drive is more reliable than a HW raid 
 box, are you? Actually my pool has no mirror capabilities at all, unless I am 
 seriously mistaken.

no, but zfs-managed raid is more reliable than hardware raid.

 What scrub has found out is that none of the blocks had any issue, but the 
 filesystem was not clean either, so if scrub does it's job right and 
 doesn't report any errors, the error must have occurred somewhere else up the 
 stack, way before the checksum had been calculated.

If the case is, as speculated, that one mirror has bad data and one
has good, scrub or any IO has a 50% chance of seeing the corruption.
Scrub does verify checksums.

Tuomas


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Stephan Budach
 If the case is, as speculated, that one mirror has bad data and one
has good, scrub or any IO has 50% chances of seeing the corruption.
scrub does verify checksums.

Yes, if the vdev were a mirrored one, which it wasn't. There weren't any 
mirrors set up. Plus, if the checksums had been bad, scrub would have 
detected that. It would not have been able to resolve it, but that wasn't the case.

zpool status backupPool_01
  pool: backupPool_01
 state: ONLINE
 scrub: none requested
config:

NAME                   STATE     READ WRITE CKSUM
backupPool_01          ONLINE       0     0     0
  c3t211378AC0253d0    ONLINE       0     0     0
  c3t211378AC026Ed0    ONLINE       0     0     0

errors: No known data errors

If one of the two devices went bad, boom - that would be it for the entire 
pool, but as long as the two devices work, it's okay.


Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach
 
 You are implying that the issues resulted from the H/W raid(s) and I
 don't think that this is appropriate.

Please quote originals when you reply.  If you don't - then it's easy to
follow the thread on the web forum, but not in email.  So if you don't
quote, you'll be losing a lot of the people following the thread.  

I think it's entirely appropriate to imply that your problem this time stems
from hardware.  I'll say it outright.  You have a hardware problem.  Because
if there is a repeatable checksum failure (bad disk) then if anything can
find it, scrub can.  And scrub is the best way to find it.

If you have a nonrepeatable checksum failure (such as you have) then there
is only one possibility.  You are experiencing a hardware problem.

One possibility is that there's a failing disk in your hardware raid set,
and your hardware raid controller is unable to detect it, because hardware
raid doesn't do checksumming.  Sometimes ZFS reads the device, and gets an
error.  Sometimes the hardware raid controller reads the other side of the
mirror, and there is no error.

This is not the only possibility.  There could be some other piece of
hardware yielding your intermittent checksum errors.  But there's one
absolute conclusion:  Your intermittent checksum errors are caused by
hardware.

If scrub didn't find an error, then there was no error at the time of scrub.

If scrub didn't find an error, and then something else *did* find an error,
it means one of two things.  (a) Maybe the error only occurred after the
scrub.  or (b) the hardware raid controller or some other piece of hardware
didn't produce corrupted data during the scrub, but will produce corrupted
data at some other time.



Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach

   c3t211378AC0253d0  ONLINE   0 0 0

How many disks are there inside of c3t211378AC0253d0?

How are they configured?  Hardware raid 5?  A mirror of two hardware raid
5's?  The point is:  This device, as seen by ZFS, is not a pure storage
device.  It is a high level device representing some LUN or something, which
is configured & controlled by hardware raid.

If there's zero redundancy in that device, then scrub would probably find
the checksum errors consistently and repeatably.

If there's some redundancy in that device, then all bets are off.  Sometimes
scrub might read the good half of the data, and other times, the bad half.


But then again, the error might not be in the physical disks themselves.
The error might be somewhere in the raid controller(s) or the interconnect.
Or even some weird unsupported driver or something.



Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Ross Walker
On Oct 12, 2010, at 8:21 AM, Edward Ned Harvey sh...@nedharvey.com wrote:

 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach
 
  c3t211378AC0253d0  ONLINE   0 0 0
 
 How many disks are there inside of c3t211378AC0253d0?
 
 How are they configured?  Hardware raid 5?  A mirror of two hardware raid
 5's?  The point is:  This device, as seen by ZFS, is not a pure storage
 device.  It is a high level device representing some LUN or something, which
  is configured & controlled by hardware raid.
 
 If there's zero redundancy in that device, then scrub would probably find
 the checksum errors consistently and repeatably.
 
 If there's some redundancy in that device, then all bets are off.  Sometimes
 scrub might read the good half of the data, and other times, the bad half.
 
 
 But then again, the error might not be in the physical disks themselves.
 The error might be somewhere in the raid controller(s) or the interconnect.
 Or even some weird unsupported driver or something.

If it were a parity based raid set then the error would most likely be 
reproducible, if not detected by the raid controller.

The biggest problem is from hardware mirrors where the hardware can't detect an 
error on one side vs the other.

For mirrors it's always best to use ZFS' built-in mirrors; otherwise, if I were 
to use HW RAID I would use RAID5/6/50/60, since errors encountered can be 
reproduced. Two parity raids mirrored in ZFS would probably provide the best of 
both worlds, at a steep cost though.

-Ross



Re: [zfs-discuss] Finding corrupted files

2010-10-12 Thread Edward Ned Harvey
 From: Stephan Budach [mailto:stephan.bud...@jvm.de]
 
 I now also got what you meant by good half but I don't dare to say
 whether or not this is also the case in a raid6 setup.

The same concept applies to raid5 or raid6.  When you read the device, you
never know if you're actually reading the data or the parity and in
fact, they're mixed together in order to fully utilize all the hardware
available.  (Assuming you have some decently smart hardware.)

But all of that is mostly irrelevant.  One fact remains:

You have checksum errors.  There is only one cause for checksum errors:
Hardware failure.

It may be the physical disks themselves, or the raid card, or ram, or cpu,
or any of the interconnect in between.  I suppose it could be a driver
problem, but that's less likely.




Re: [zfs-discuss] Finding corrupted files

2010-10-11 Thread David Dyer-Bennet

On Fri, October 8, 2010 04:47, Stephan Budach wrote:
 So, I decided to give tar a whirl, after zfs send encountered the next
 corrupted file, resulting in an I/O error, even though scrub ran
 successfully w/o any erors.

I must say that this concept of scrub running w/o error when corrupted
files, detectable to zfs send, apparently exist, is very disturbing. 
Background scrubbing, and the block checksums to make it more meaningful
than just reading the disk blocks, was the key thing that drew me into
ZFS, and this seems to suggest that it doesn't work.

Does your sequence of tests happen to provide evidence that the problem
isn't new errors appearing, sometimes after a scrub and before the send? 
For example, have you done 1) scrub finds no error, 2) send finds error,
3) scrub finds no error?  (with nothing in between that could have cleared
or fixed the error).

-- 
David Dyer-Bennet, d...@dd-b.net; http://dd-b.net/
Snapshots: http://dd-b.net/dd-b/SnapshotAlbum/data/
Photos: http://dd-b.net/photography/gallery/
Dragaera: http://dragaera.info



Re: [zfs-discuss] Finding corrupted files

2010-10-11 Thread Richard Elling

On Oct 6, 2010, at 1:26 PM, Stephan Budach wrote:

 Hi Cindy,
 
 thanks for bringing that to my attention. I checked fmdump and found a lot of 
 these entries
 
 Okt 06 2010 17:52:12.862812483 ereport.io.scsi.cmd.disk.tran

...

 Okt 06 2010 17:52:12.862813713 ereport.io.scsi.cmd.disk.recovered

...

 Googling about these errors brought me directly to this document:
 
 http://dsc.sun.com/solaris/articles/scsi_disk_fma2.html
 
 which talks about these scsi errors. Since we're talking FC here, it seems to 
 point to some FC issue I have not been aware of. Furthermore, it's always the 
 same FC device that show these errors, so I will try to check the device and 
 it's connections to the fabric first.

SCSI transport errors occur between the HBA and the target.  These are
reported up the stack to Solaris.  As you can see, a retry was successful.
However, these will have negative impacts on performance, so it is best
to solve the problem.
 -- richard
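
For anyone following along, a minimal sketch of pulling up those reports with
the standard Solaris FMA tools (the ereport class names quoted above come from
the same log):

   fmdump -e     # one-line summary of each error report (ereport)
   fmdump -eV    # full detail, including the device path of the affected target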



Re: [zfs-discuss] Finding corrupted files

2010-10-11 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of David Dyer-Bennet
 
 I must say that this concept of scrub running w/o error when corrupted
 files, detectable to zfs send, apparently exist, is very disturbing.

As previously mentioned, the OP is using a hardware raid system.  It is
impossible for ZFS to read both sides of the mirror, which means it's pure
chance.  The hardware raid may fetch data from a bad disk one time, and
fetch good data from another disk the next time.  Or vice-versa.

You should always configure JBOD and allow ZFS to manage the raid.  Don't do
it in hardware, as the OP of this thread is soundly demonstrating the
reasons why.




Re: [zfs-discuss] Finding corrupted files

2010-10-11 Thread Stephan Budach
I think one has to accept that zfs send apparently is able to detect such 
errors while scrub is not. Scrub operates only on the block level and makes 
sure that each block can be read and matches its checksum.

However, zfs send seems to have detected some errors in the file system 
structure itself, resulting in a couple of files being unreadable.
I have no idea what caused these errors, but deleting the affected files 
and replacing them did the job.

I think that my understanding of zfs send/recv only operating on the block 
level, bypassing the higher level fs stuff, has been too simple.

Now to answer your question: I did 1), 2) and 3), but between 2) and 3) I 
verified using tar that all files were accessible.
Also, I haven't had any problems since.

Cheers,
budy


Re: [zfs-discuss] Finding corrupted files

2010-10-08 Thread Stephan Budach
So, I decided to give tar a whirl, after zfs send encountered the next 
corrupted file, resulting in an I/O error, even though scrub ran successfully 
w/o any errors.

I then issued a 
/usr/gnu/bin/tar -cf /dev/null /obelixData/…/.zfs/snapshot/actual snapshot/DTP

which finished without any issue and I have now issued a zfs send of this 
snapshot to my remote host.

Let's see, what happens in approx. 9 hrs.

budy
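
A minimal sketch of extending that check to every snapshot of a dataset
(the path is a placeholder; tar simply reads every file and discards the
archive, reporting any snapshot that hits a read error):

   for s in /pool/dataset/.zfs/snapshot/*; do
       /usr/gnu/bin/tar -cf /dev/null "$s" || echo "read error in $s"
   done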


Re: [zfs-discuss] Finding corrupted files

2010-10-08 Thread Stephan Budach
So - after 10 hrs and 21 mins. the incremental zfs send/recv finished without a 
problem. ;)

Seems that using tar for checking all files is an appropriate action.

Cheers,
budy


Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Stephan Budach
Hi Edward,

well that was exactly my point, when I raised this question. If zfs send is 
able to identify corrupted files while it transfers a snapshot, why shouldn't 
scrub be able to do the same?

ZFS send quit with an I/O error and zpool status -v showed me the file that 
indeed had problems. Since I thought that zfs send also operates on the block 
level, I thought whether or not scrub would basically do the same thing.

On the other hand scrub really doesn't care about what to read from the device 
- it simply reads all blocks, which is not the case when running zfs send.

Maybe, if zfs send could just go on and not halt on an I/O error and instead 
just print out the errors…

Cheers,
budy


Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Ian Collins

On 10/ 7/10 06:22 PM, Stephan Budach wrote:

Hi Edward,

these are interesting points. I have considered a couple of them, when I 
started playing around with ZFS.

I am not sure whether I disagree with all of your points, but I conducted a 
couple of tests, where I configured my raids as jbods and mapped each drive out 
as a separate LUN, and I couldn't notice a difference in performance in any way.

   
The time you will notice is when a cable falls out or becomes loose and 
you get corrupted data and lose the pool due to lack of redundancy.  
Even though your LUNs are RAID, there are still numerous single points 
of failure between them and the target system.


--
Ian.



Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Stephan Budach
Ian,

I know - and I will address this by upgrading the vdevs to mirrors - but 
there are a lot of other SPOFs around. So I started out by reducing the most 
common failures, which I have found to be the disc drives, not the chassis.

The beauty is: one can work their way up until the desired level of security is 
reached or until there is no more money to spend.

Cheers,
budy


Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach
 
 I
 conducted a couple of tests, where I configured my raids as jbods and
  mapped each drive out as a separate LUN and I couldn't notice a
 difference in performance in any way.

Not sure if my original points were communicated clearly.  Giving JBOD's to
ZFS is not for the sake of performance.  The reason for JBOD is reliability.
Because hardware raid cannot detect or correct checksum errors.  ZFS can.
So it's better to skip the hardware raid and use JBOD, to enable ZFS access
to each separate side of the redundant data.




Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Edward Ned Harvey
 From: edmud...@mail.bounceswoosh.org
 [mailto:edmud...@mail.bounceswoosh.org] On Behalf Of Eric D. Mudama
 
 On Wed, Oct  6 at 22:04, Edward Ned Harvey wrote:
  * Because ZFS automatically buffers writes in ram in order to
  aggregate as previously mentioned, the hardware WB cache is not
  beneficial.  There is one exception.  If you are doing sync writes
  to spindle disks, and you don't have a dedicated log device, then
  the WB cache will benefit you, approx half as much as you would
  benefit by adding dedicated log device.  The sync write sort-of
  by-passes the ram buffer, and that's the reason why the WB is able
  to do some good in the case of sync writes.
 
 All of your comments made sense except for this one.
 
 (etc)

Your points about long-term fragmentation and significant drive emptiness are
well received.  I never let a pool get over 90% full, for several reasons
including this one.  My target is 70%, which seems to be sufficiently empty.

Also, as you indicated, blocks of 128K are not sufficiently large for
reordering to benefit.  There's another thread here, where I calculated, you
need blocks approx 40MB in size, in order to reduce random seek time below
1% of total operation time.  So all that I said will only be relevant or
accurate if within 30sec (or 5 sec in the future) there exists at least 40M
of aggregatable sequential writes.

It's really easy to measure and quantify what I was saying.  Just create a
pool, and benchmark it in each configuration.  Results that I measured were:

(stripe of 2 mirrors) 
721  IOPS without WB or slog.  
2114 IOPS with WB
2722 IOPS with WB and slog
2927 IOPS with slog, and no WB

There's a whole spreadsheet full of results that I can't publish, but the
trend of WB versus slog was clear and consistent.

I will admit the above were performed on relatively new, relatively empty
pools.  It would be interesting to see if any of that changes, if the test
is run on a system that has been in production for a long time, with real
user data in it.



Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Cindy Swearingen

I would not discount the performance issue...

Depending on your workload, you might find that performance increases
with ZFS on your hardware RAID in JBOD mode.

Cindy

On 10/07/10 06:26, Edward Ned Harvey wrote:

From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
boun...@opensolaris.org] On Behalf Of Stephan Budach

I
conducted a couple of tests, where I configured my raids as jbods and
mapped each drive out as a separate LUN and I couldn't notice a
difference in performance in any way.


Not sure if my original points were communicated clearly.  Giving JBODs to
ZFS is not for the sake of performance; the reason for JBOD is reliability.
Hardware RAID cannot detect or correct checksum errors, but ZFS can.  So it's
better to skip the hardware RAID and use JBOD, so that ZFS has access to each
separate side of the redundant data.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Toby Thain


On 7-Oct-10, at 1:22 AM, Stephan Budach wrote:


Hi Edward,

these are interesting points. I have considered a couple of them,  
when I started playing around with ZFS.


I am not sure whether I disagree with all of your points, but I  
conducted a couple of tests, where I configured my raids as jbods  
and mapped each drive out as a separate LUN and I couldn't notice a
difference in performance in any way.





The integrity issue is, however, clear cut. ZFS must manage the  
redundancy.


ZFS just alerted you that your 'FC RAID' doesn't actually provide data
integrity; you just lost the 'calculated' bet. :)


--Toby


I'd love to discuss this in a separate thread, but first I will have
to check the archives and Google. ;)


Thanks,
budy
--
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-07 Thread Edward Ned Harvey
 From: Cindy Swearingen [mailto:cindy.swearin...@oracle.com]
 
 I would not discount the performance issue...
 
 Depending on your workload, you might find that performance increases
 with ZFS on your hardware RAID in JBOD mode.

Depends on the raid card you're comparing to.  I've certainly seen some raid
cards that were too dumb to read from 2 disks in a mirror simultaneously for
the sake of read performance enhancement.  And many other similar
situations.

But I would not say that's generally true anymore.  In the last several
years, all the hardware raid cards that I've bothered to test were able to
utilize all the hardware available.  Just like ZFS.

There are performance differences...  like ... the hardware raid might be
able to read 15% faster in raid5, while ZFS is able to write 15% faster in
raidz, and so forth.  Differences that roughly balance each other out.

For example, here's one data point I can share (2 mirrors striped, results
normalized):
                 8 initial writers   8 rewriters         8 readers
ZFS              1.43                2.99                5.05
HW               2.00                2.54                2.96

                 8 re-readers        8 reverse readers   8 stride readers
ZFS              4.19                3.59                3.93
HW               3.02                2.80                2.90

                 8 random readers    8 random mix        8 random writers
ZFS              2.57                2.40                1.69
HW               1.99                1.70                1.73

average
ZFS              3.09
HW               2.40

There were some categories where ZFS was faster, and some where HW was faster.
On average, ZFS was faster, but they were all in the same ballpark, and the
results were highly dependent on specific details and tunables.  In other
words, not a place you should explore unless you have a highly specialized use
case that you wish to optimize.
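
If anybody wants to reproduce that kind of comparison, the categories above
look like iozone's throughput mode; something along these lines should do it
(sizes and the working directory are only examples, and the flags are from
memory, so check iozone(1)):

# 8 threads, 1 GB per thread, 128 KB records; tests 0/1/2/3/5/8 cover
# write/rewrite, read/re-read, random read/write, backwards read, stride read
# and the mixed workload.  Temp files land in the current directory, so run it
# once from a ZFS filesystem and once from the HW raid volume.
cd /tank/bench && iozone -t 8 -s 1g -r 128k -i 0 -i 1 -i 2 -i 3 -i 5 -i 8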

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Finding corrupted files

2010-10-06 Thread Stephan Budach
Hi,

I recently discovered some - or at least one - corrupted file on one of my ZFS 
datasets, which caused an I/O error when trying to send a ZFS snapshot to 
another host:


zpool status -v obelixData
  pool: obelixData
 state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
obelixData   ONLINE   4 0 0
  c4t21D023038FA8d0  ONLINE   0 0 0
  c4t21D02305FF42d0  ONLINE   4 0 0

errors: Permanent errors have been detected in the following files:

0x949:0x12b9b9

obelixData/jvmprepr...@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in 
CI vor ET 10.6.2010/13404_41_07008 Estate 
HandelsMarketing/Dealer_Launch_Invitations 
Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps

obelixData/jvmprepr...@backupsnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ
 in CI vor ET 10.6.2010/13404_41_07008 Estate 
HandelsMarketing/Dealer_Launch_Invitations 
Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps

obelixData/jvmprepr...@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in 
CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations 
Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
/obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor ET 
10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations 
Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps

Now, scrub would reveal corrupted blocks on the devices, but is there a way to 
identify damaged files as well?

Thanks,
budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Tomas Ögren
On 06 October, 2010 - Stephan Budach sent me these 2,1K bytes:

 Hi,
 
 I recently discovered some - or at least one - corrupted file on one of my ZFS 
 datasets, which caused an I/O error when trying to send a ZFS snapshot to 
 another host:
 
 
 zpool status -v obelixData
   pool: obelixData
  state: ONLINE
 status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
 action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: none requested
 config:
 
 NAME STATE READ WRITE CKSUM
 obelixData   ONLINE   4 0 0
   c4t21D023038FA8d0  ONLINE   0 0 0
   c4t21D02305FF42d0  ONLINE   4 0 0
 
 errors: Permanent errors have been detected in the following files:
 
 0x949:0x12b9b9
 
 obelixData/jvmprepr...@2010-10-02_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in 
 CI vor ET 10.6.2010/13404_41_07008 Estate 
 HandelsMarketing/Dealer_Launch_Invitations 
 Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
 
 obelixData/jvmprepr...@backupsnapshot_2010-10-05-08:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ
  in CI vor ET 10.6.2010/13404_41_07008 Estate 
 HandelsMarketing/Dealer_Launch_Invitations 
 Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
 
 obelixData/jvmprepr...@2010-09-24_2359:/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in 
 CI vor 6_210/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations 
 Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
 /obelixData/JvMpreprint/DTP/Jobs/Mercedes-Benz/C_Klasse/RZ in CI vor 
 ET 10.6.2010/13404_41_07008 Estate HandelsMarketing/Dealer_Launch_Invitations 
 Fremddokumente/Dealer_Launch_S204/Images/Vorhang_Innen.eps
 
 Now, scrub would reveal corrupted blocks on the devices, but is there a way 
 to identify damaged files as well?

Is this a trick question or something? The filenames are right above
your question...?

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Stephan Budach
No - not a trick question, but maybe I didn't make myself clear.
Is there a way to discover such bad files other than trying to actually read 
from them one by one, say using cp or by sending a snapshot elsewhere?

I am well aware that the file shown in zpool status -v is damaged and I have 
already restored it, but I wanted to know if there are more of them.

Regards,
budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Scott Meilicke
Scrub?

On Oct 6, 2010, at 6:48 AM, Stephan Budach wrote:

 No - not a trick question, but maybe I didn't make myself clear.
 Is there a way to discover such bad files other than trying to actually read 
 from them one by one, say using cp or by sending a snapshot elsewhere?
 
 I am well aware that the file shown in zpool status -v is damaged and I have 
 already restored it, but I wanted to know if there are more of them.
 
 Regards,
 budy
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Scott Meilicke



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Jim Dunham
Budy,

 No - not a trick question, but maybe I didn't make myself clear.
 Is there a way to discover such bad files other than trying to actually read 
 from them one by one, say using cp or by sending a snapshot elsewhere?

As noted in your original email, ZFS reports any corruption via the zpool 
status command.

ZFS detects corruption as part of its normal filesystem operations, which may 
be triggered by cp, send-recv, etc., or by a forced reading of the entire 
filesystem by scrub.
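
A minimal sketch of that, using the pool name from the original post:

# Force a read of every allocated block now, rather than waiting for cp or
# zfs send to stumble over a bad one.
zpool scrub obelixData

# Once the scrub finishes, -v lists the affected files (or object IDs where
# the path can no longer be resolved, as in the 0x949:0x12b9b9 entry above).
zpool status -v obelixData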

 I am well aware that the file shown in zpool status -v is damaged and I have 
 already restored it, but I wanted to know if there are more of them.

Assuming that the ZFS filesystem in question is not degrading further (as in a 
disk going bad), then upon completion of a successful scrub, zpool status 
reports the complete state of the pool.

- Jim

 
 Regards,
 budy
 -- 
 This message posted from opensolaris.org
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Stephan Budach
Well, I think that answers my question then: after a successful scrub, zpool 
status -v should list all damaged files on the entire zpool.

I only asked because I read a thread in this forum where one guy had a problem 
with different files, even after a successful scrub.

Thanks,
budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Cindy Swearingen

Budy,

Your previous zpool status output shows a non-redundant pool with data 
corruption.


You should use the fmdump -eV command to find out the underlying cause
of this corruption.
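
For example (just a sketch -- the event classes you see will depend on your
hardware):

# Summary of the error events FMA has logged, then the full verbose detail:
fmdump -e
fmdump -eV | more

# Any faults FMA has already diagnosed from those errors:
fmdump -V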

You can review the hardware-level monitoring tools, here:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide

Thanks,

Cindy

On 10/06/10 13:09, Stephan Budach wrote:

Well I think, that answers my question then: after a successful scrub, zpool 
status -v should then list all damaged files on an entire zpool.

I only asked because I read a thread in this forum where one guy had a problem 
with different files, even after a successful scrub.

Thanks,
budy

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Ian Collins

On 10/ 6/10 09:52 PM, Stephan Budach wrote:

Hi,

I recently discovered some - or at least one - corrupted file on one of my ZFS 
datasets, which caused an I/O error when trying to send a ZFS snapshot to 
another host:


zpool status -v obelixData
   pool: obelixData
  state: ONLINE
status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
see: http://www.sun.com/msg/ZFS-8000-8A
  scrub: none requested
config:

 NAME STATE READ WRITE CKSUM
 obelixData   ONLINE   4 0 0
   c4t21D023038FA8d0  ONLINE   0 0 0
   c4t21D02305FF42d0  ONLINE   4 0 0

   

Are you aware that this is a very dangerous configuration?

Your pool lacks redundancy and you will lose it if one of the devices 
fails.


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Stephan Budach
Hi Cindy,

thanks for bringing that to my attention. I checked fmdump and found a lot of 
these entries:


Okt 06 2010 17:52:12.862812483 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
class = ereport.io.scsi.cmd.disk.tran
ena = 0x514dc67d57e1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = 
/p...@0,0/pci8086,3...@7/pci1077,1...@0,1/f...@0,0/d...@w21d02305ff42,0
(end detector)

driver-assessment = retry
op-code = 0x88
cdb = 0x88 0x0 0x0 0x0 0x0 0x2 0xac 0xd4 0x3d 0x80 0x0 0x0 0x0 0x80 0x0 
0x0
pkt-reason = 0x3
pkt-state = 0x0
pkt-stats = 0x20
__ttl = 0x1
__tod = 0x4cac9b2c 0x336d7943

Okt 06 2010 17:52:12.862813713 ereport.io.scsi.cmd.disk.recovered
nvlist version: 0
class = ereport.io.scsi.cmd.disk.recovered
ena = 0x514dc67d57e1
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = 
/p...@0,0/pci8086,3...@7/pci1077,1...@0,1/f...@0,0/d...@w21d02305ff42,0
devid = id1,s...@n600d02310005ff42712ab96c
(end detector)

driver-assessment = recovered
op-code = 0x88
cdb = 0x88 0x0 0x0 0x0 0x0 0x2 0xac 0xd4 0x3d 0x80 0x0 0x0 0x0 0x80 0x0 
0x0
pkt-reason = 0x0
pkt-state = 0x1f
pkt-stats = 0x0
__ttl = 0x1
__tod = 0x4cac9b2c 0x336d7e11

Googling about these errors brought me directly to this document:

http://dsc.sun.com/solaris/articles/scsi_disk_fma2.html

which talks about these SCSI errors. Since we're talking FC here, it seems to 
point to some FC issue I have not been aware of. Furthermore, it's always the 
same FC device that shows these errors, so I will try to check the device and 
its connections to the fabric first.

Thanks,
budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Stephan Budach
Ian,

yes, although these vdevs are FC raids themselves, so the risk is… uhm… 
calculated.

Unfortunately, one of the devices seems to have some issues, as stated in my 
previous post.
I will, nevertheless, add redundancy to my pool asap.

Thanks,
budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach
 
 Ian,
 
 yes, although these vdevs are FC raids themselves, so the risk is… uhm…
 calculated.

Whenever possible, you should always JBOD the storage and let ZFS manage the 
raid, for several reasons (see below).  Also, as counter-intuitive as this 
sounds (see below), you should disable hardware write-back cache (even with 
BBU) because it hurts performance in any of these situations:

  (a) Disable WB if you have access to SSD or other nonvolatile dedicated log 
      device.
  (b) Disable WB if you know all of your writes to be async mode and not sync 
      mode.
  (c) Disable WB if you've opted to disable ZIL.

* Hardware raid blindly assumes the redundant data written to disk is written 
correctly.  So later, if you experience a checksum error (such as you have) 
then it's impossible for ZFS to correct it.  The hardware raid doesn't know a 
checksum error has occurred, and there is no way for the OS to read the other 
side of the mirror to attempt correcting the checksum via redundant data.

* ZFS has knowledge of both the filesystem, and the block level devices, while 
hardware raid has only knowledge of block level devices.  Which means ZFS is 
able to optimize performance in ways that hardware cannot possibly do.  For 
example, whenever there are many small writes taking place concurrently, ZFS is 
able to remap the physical disk blocks of those writes, to aggregate them into 
a single sequential write.  Depending on your metric, this yields 1-2 orders of 
magnitude higher IOPS.

* Because ZFS automatically buffers writes in ram in order to aggregate as 
previously mentioned, the hardware WB cache is not beneficial.  There is one 
exception.  If you are doing sync writes to spindle disks, and you don't have a 
dedicated log device, then the WB cache will benefit you, approx half as much 
as you would benefit by adding dedicated log device.  The sync write sort-of 
by-passes the ram buffer, and that's the reason why the WB is able to do some 
good in the case of sync writes.  

Ironically, if you have WB enabled, and you have a SSD log device, then the WB 
hurts you.  You get the best performance with SSD log, and no WB.  Because the 
WB lies to the OS, saying some tiny chunk of data has been written... then 
the OS will happily write another tiny chunk, and another, and another.  The WB 
is only buffering a lot of tiny random writes, and in aggregate, it will only 
go as fast as the random writes.  It undermines ZFS's ability to aggregate 
small writes into sequential writes.
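
If you do go the dedicated log route, adding one is a one-liner (pool and
device names here are only placeholders):

# Attach an SSD as a dedicated ZIL device; sync writes land on it directly
# instead of being staged through the controller's write-back cache.
zpool add tank log c5t0d0

# Or, instead, a mirrored log if you have two devices to spare:
zpool add tank log mirror c5t0d0 c6t0d0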

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Stephan Budach
 
 Now, scrub would reveal corrupted blocks on the devices, but is there a
 way to identify damaged files as well?

I saw a lot of people offering the same knee-jerk reaction that I had:
Scrub.  And that is the only correct answer, to make a best effort at
salvaging data.  But I think there is a valid question here which was
neglected.

*Does* scrub produce a list of all the names of all the corrupted files?
And if so, how does it do that?

If scrub is operating at a block-level (and I think it is), then how can
checksum failures be mapped to file names?  For example, this is a
long-requested feature of zfs send which is fundamentally difficult or
impossible to implement.

Zfs send operates at a block level.  There is a desire to produce a list of
all the incrementally changed files in a zfs incremental send, but no
capability of doing so.

It seems, if scrub is able to list the names of files that correspond to
corrupted blocks, then zfs send should be able to list the names of files
that correspond to changed blocks, right?

I am reaching the opposite conclusion of what's already been said.  I think
you should scrub, but don't expect file names as a result.  I think if you
want file names, then tar > /dev/null will be your best friend.
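
Something along these lines, for instance (the path is the dataset from the
original post, used purely as an example):

# Read every file; the archive stream is discarded, anything unreadable shows
# up as a tar error naming the file, and the reads themselves give ZFS a
# chance to log checksum errors against the affected files.
tar cf - /obelixData/JvMpreprint > /dev/null 2> /tmp/read-errors.txt
grep -i 'error' /tmp/read-errors.txt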

I didn't answer anything at first, cuz I was hoping somebody would have that
answer.  I only know that I don't know, and the above is my best guess.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Stephan Budach
Hi Edward,

these are interesting points. I have considered a couple of them, when I 
started playing around with ZFS.

I am not sure whether I disagree with all of your points, but I conducted a 
couple of tests, where I configured my raids as jbods and mapped each drive out 
as a separate LUN and I couldn't notice a difference in performance in any way.

I'd love to discuss this in a separate thread, but first I will have to check 
the archives and Google. ;)

Thanks,
budy
-- 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Finding corrupted files

2010-10-06 Thread Eric D. Mudama

On Wed, Oct  6 at 22:04, Edward Ned Harvey wrote:

* Because ZFS automatically buffers writes in ram in order to
aggregate as previously mentioned, the hardware WB cache is not
beneficial.  There is one exception.  If you are doing sync writes
to spindle disks, and you don't have a dedicated log device, then
the WB cache will benefit you, approx half as much as you would
benefit by adding dedicated log device.  The sync write sort-of
by-passes the ram buffer, and that's the reason why the WB is able
to do some good in the case of sync writes.


All of your comments made sense except for this one.

Every N seconds when the system decides to burst writes to media from
RAM, those writes are only sequential in the case where the underlying
storage devices are significantly empty.

Once you're in a situation where your allocations are scattered across
the disk due to longer-term fragmentation, I don't see any way that a
write cache would hurt performance on the devices, since it'd allow
the drive to reorder writes to the media within that burst of data.

Even though ZFS is issuing writes of ~256 sectors if it can, that is
only a fraction of a revolution on a modern drive, so random writes of
128KB still have significant opportunity for reordering optimization.

Granted, with NCQ or TCQ you can get back much of the cache-disabled
performance loss, however, in any system that implements an internal
queue depth greater than the protocol-allowed queue depth, there is
opportunity for improvement, to an asymptotic limit driven by servo
settle speed.

Obviously this performance improvement comes with the standard WB
risks, and YMMV, IANAL, etc.

--eric

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss