[zfs-discuss] strange pool disks usage pattern
Hi, I have a PC with a MARVELL AOC-SAT2-MV8 controller and a pool made up of six disks in a raidz configuration with a hot spare. (Output below translated from Italian locale messages.)

-bash-3.2$ /sbin/zpool status
  pool: nas
 state: ONLINE
 scrub: scrub in progress for 9h4m, 81,59% done, 2h2m to go
config:

        NAME        STATE     READ WRITE CKSUM
        nas         ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c2t1d0  ONLINE       0     0     0
            c2t4d0  ONLINE       0     0     0
            c2t5d0  ONLINE       0     0     0
            c2t3d0  ONLINE       0     0     0
            c2t2d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
        spares
          c2t7d0    AVAIL

errors: no known data errors

Now, the problem is that when running iostat -Cmnx 10 (or with any other time interval), I have sometimes seen a complete stall of disk I/O due to one disk in the pool (not always the same one) being 100% busy.

$ iostat -Cmnx 10
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    0,3     0,0     2,0  0,0  0,0    0,0    0,1   0   0 c1
    0,0    0,3     0,0     2,0  0,0  0,0    0,0    0,1   0   0 c1t0d0
 1852,1  297,0 13014,9  4558,4  9,2  1,6    4,3    0,7   2 158 c2
  311,8   61,3  2185,3   750,7  2,0  0,3    5,5    0,7  17  25 c2t0d0
  309,5   34,7  2207,2   769,5  1,6  0,5    4,7    1,4  41  47 c2t1d0
  309,3   36,3  2173,0   770,0  1,0  0,3    2,9    0,7  18  26 c2t2d0
  296,0   65,5  2057,3   749,2  2,1  0,2    5,9    0,6  16  23 c2t3d0
  313,3   64,1  2187,3   748,8  1,7  0,2    4,6    0,5  15  21 c2t4d0
  311,9   35,1  2204,8   770,1  0,7  0,2    2,1    0,5  11  17 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,4   14,7     3,2    30,4  0,0  0,2    0,0   13,2   0   2 c1
    0,4   14,7     3,2    30,4  0,0  0,2    0,0   13,2   0   2 c1t0d0
    1,7    0,0    58,9     0,0  3,0  1,0 1766,4  593,1   2 101 c2
    0,3    0,0     7,7     0,0  0,0  0,0    0,3    0,4   0   0 c2t0d0
    0,3    0,0    11,5     0,0  0,0  0,0    4,4    8,4   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,4    0,0    14,1     0,0  0,0  0,0    0,4    6,6   0   0 c2t3d0
    0,4    0,0    14,1     0,0  0,0  0,0    0,3    2,5   0   0 c2t4d0
    0,3    0,0    11,5     0,0  0,0  0,0    3,6    6,9   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    3,1     0,0     3,1  0,0  0,0    0,0    0,7   0   0 c1
    0,0    3,1     0,0     3,1  0,0  0,0    0,0    0,7   0   0 c1t0d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0   2 100 c2
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t3d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t4d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    0,1     0,0     0,4  0,0  0,0    0,0    1,2   0   0 c1
    0,0    0,1     0,0     0,4  0,0  0,0    0,0    1,2   0   0 c1t0d0
    0,0   29,5     0,0   320,2  3,4  1,0  113,9   34,6   2 102 c2
    0,0    6,9     0,0    63,3  0,1  0,0   12,6    0,7   0   0 c2t0d0
    0,0    4,4     0,0    65,5  0,0  0,0    8,7    0,8   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,0    7,4     0,0    62,7  0,1  0,0   15,4    0,8   1   1 c2t3d0
    0,0    6,8     0,0    63,6  0,1  0,0   13,2    0,7   0   0 c2t4d0
    0,0    4,0     0,0    65,1  0,0  0,0    7,9    0,7   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w  %b device
    0,0    0,3     0,0     2,4  0,0  0,0    0,0    0,1   0   0 c1
    0,0    0,3     0,0     2,4  0,0  0,0    0,0    0,1   0   0 c1t0d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0   2 100 c2
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
    0,0    0,0     0,0     0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t3d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t4d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t5d0
    0,0    0,0     0,0     0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
    r/s    w/s    kr/s    kw/s wait actv wsvc_t asvc_t  %w
Re: [zfs-discuss] RAIDZ v. RAIDZ1
Cindy: I believe I was mistaken. When I recreated the zpools, you are correct: zpool list and zfs list report different numbers for the sizes. I must have typed one command and then the other when creating the different pools. Thanks for the assist. Sheepish grin. David -- This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] zpool is very slow
I created a raidz zpool and shares, and now the OS is very slow. I timed it: I get about eight seconds of use before ten seconds of a frozen screen. This happens whether I am doing real work or barely anything (moving the mouse an inch from side to side repeatedly). It makes the machine unusable. If I detach the SATA card that the raidz zpool is attached to, everything is fine. The slowdown occurs regardless of which user I log in as (admin, regular user), and the speed-up occurs only when the SATA card is removed. This leads me to believe that something is going on with the zpool. There are no files on the zpool (I don't have the patience, given the constant freezing, to copy files over to it). The zpool is 4TB in size. I previously had the system up and running for a week before I did something stupid and decided to start from scratch, reinstalling and recreating the zpool. zpool status shows no errors with the zpool, and zpool iostat reports:

               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
mediaz       470K  5.44T      0      0     37     44

How do I find what is accessing the zpool and stop it? David
Re: [zfs-discuss] zpool is very slow
David, maybe you can use the iosnoop script from the DTrace toolkit: http://www.solarisinternals.com/wiki/index.php/DTraceToolkit#Scripts ..Remco

David Stewart wrote: I created a raidz zpool and shares and now the OS is very slow. ... How do I find what is accessing the zpool and stop it? David
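A session with the toolkit might look like the following transcript; the install path is an assumption, and iosnoop must run as root since it uses DTrace:

```
cd /opt/DTT        # wherever the DTraceToolkit was unpacked (assumed path)
./iosnoop -e       # trace each disk I/O as it happens; -e adds a
                   # device-name column, so I/O landing on the pool's
                   # disks (and the process issuing it) stands out
```

Once the offending process is visible, it can be stopped or investigated directly.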
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: Hi, I have a PC with a MARVELL AOC-SAT2-MV8 controller and a pool made up of six disks in a raidz configuration with a hot spare. ... Now, the problem is that when running iostat -Cmnx 10 (or with any other time interval), I have sometimes seen a complete stall of disk I/O due to one disk in the pool (not always the same one) being 100% busy. ... In this case it was c2t2d0, and it blocked the pool for 30 or 40 seconds. /var/adm/messages does not contain anything related to the pool. What can it be?

This usually means you have either a driver bug, a bad controller, or a bad disk. The Marvell driver bug sometimes manifested in this way, but you would have seen bus resets in your error logs. Given that you have exactly one outstanding transaction on the stuck disk, I suspect the disk is busy doing error recovery. Speaking from my recent, extremely painful experience: replace that disk ASAP. -- Carson
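Spotting the stuck disk in a long iostat listing is easier with a filter. Here is a small sketch (the 95% threshold and the function name are my own choices, not from the thread) that prints any device whose %b column is at or above the threshold in `iostat -xn`-style output; the gsub handles the decimal commas of the Italian locale shown above:

```shell
# Print devices pegged at/above 95% busy from iostat -xn style lines.
# Data lines have 11 fields; %b is the second-to-last, device name last.
busy() {
    awk '$NF != "device" && NF == 11 {
        pct = $(NF-1); gsub(",", ".", pct)   # it_IT locale uses commas
        if (pct + 0 >= 95) print $NF
    }'
}

# Two sample lines from the output above: a healthy disk and the stuck one.
printf '%s\n' \
  '    0,3    0,0   11,5    0,0  0,0  0,0    4,4    8,4   0   0 c2t1d0' \
  '    0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0' | busy
# prints: c2t2d0
```

In practice one would pipe `iostat -xn 10` straight into the filter.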
Re: [zfs-discuss] strange pool disks usage pattern
Carson, the strange thing is that this is happening on several disks (can it be that they are all failing?). What is the controller bug you're talking about? I'm running snv_114 on this PC, so it is fairly recent. Best regards. Maurilio.
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: the strange thing is that this is happening on several disks (can it be that they are all failing?)

Possible, but less likely. I'd suggest running some disk I/O tests and looking at the drive error counters before and after.

What is the controller bug you're talking about? I'm running snv_114 on this pc, so it is fairly recent.

There was a bug in the marvell driver for the controller used on the X4500 that caused bus hangs/resets. It was fixed around U6, so it should be long gone from OpenSolaris. But perhaps there's a different bug? You could also have a firmware bug on your disks. You might try lowering the number of tagged commands per disk and see if that helps at all. -- Carson
Re: [zfs-discuss] strange pool disks usage pattern
Possible, but less likely. I'd suggest running some disk I/O tests, looking at the drive error counters before/after.

These disks are only a few months old and are scrubbed weekly; no errors so far. I did try to use smartmontools, but it can neither report SMART logs nor start SMART tests, so I don't know how to look at their internal state.

You could also have a firmware bug on your disks. You might try lowering the number of tagged commands per disk and see if that helps at all.

From man marvell88sx I read that this driver has no tunable parameters, so I don't know how I could change the NCQ depth. Best regards. Maurilio.
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: I did try to use smartmontools, but it can neither report SMART logs nor start SMART tests, so I don't know how to look at their internal state.

Really? That's odd...

You could also have a firmware bug on your disks. You might try lowering the number of tagged commands per disk and see if that helps at all.

From man marvell88sx I read that this driver has no tunable parameters, so I don't know how I could change the NCQ depth.

ZFS has a per-block-device outstanding-I/O tunable - I think it's in the Evil Tuning Guide. -- Carson
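For reference, the tunable Carson seems to mean is zfs_vdev_max_pending from the ZFS Evil Tuning Guide, which caps the number of I/Os ZFS keeps queued per leaf vdev. A hedged /etc/system sketch follows; the value 10 is only an illustration (the default in builds of this era was 35), and a reboot is required for it to take effect:

```
* /etc/system fragment (sketch): limit per-vdev queued I/Os so a disk
* stuck in error recovery holds fewer commands hostage. The value 10
* is illustrative only; the era's default was 35.
set zfs:zfs_vdev_max_pending = 10
```

Lowering this trades some throughput on healthy disks for shorter queues behind a misbehaving one.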
Re: [zfs-discuss] strange pool disks usage pattern
Maurilio Longo wrote: Carson, the strange thing is that this is happening on several disks (can it be that they are all failing?) ...

See 'iostat -En' output.
Re: [zfs-discuss] ZFS caching of compressed data
Stuart Anderson wrote: I am wondering if the following idea makes any sense as a way to get ZFS to cache compressed data in DRAM. In particular, given a 2-way zvol mirror of highly compressible data on persistent storage devices, what would go wrong if I dynamically added a ramdisk as a third mirror device at boot time? Would ZFS route most (or all) of the reads to the lower-latency DRAM device? In the case of an unclean shutdown, where there was no opportunity to actively remove the ramdisk from the pool before shutdown, would there be any problem at boot time when the ramdisk is still registered but unavailable? Note, this Gedankenexperiment is for highly compressible (~9x) metadata for a non-ZFS filesystem.

You would only get about 33% of I/Os served from the ramdisk. However, at the KCA conference Bill and Jeff mentioned just-in-time decompression/decryption planned for ZFS. If I understand it correctly, some percentage of pages in the ARC will be kept compressed/encrypted and will be decompressed/decrypted only when accessed. This could be especially useful with prefetch. I would imagine that one will be able to tune what percentage of the ARC should keep compressed pages. I don't remember whether they mentioned the L2ARC here, but it would probably be useful to have a tunable which puts compressed or uncompressed data onto the L2ARC depending on its value. Which approach is better always depends on a given environment and on where the actual bottleneck is. -- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Unable to import pool: invalid vdev configuration
Osvald Ivarsson wrote: On Thu, Oct 1, 2009 at 7:40 PM, Victor Latushkin victor.latush...@sun.com wrote: On 01.10.09 17:54, Osvald Ivarsson wrote: I'm running OpenSolaris build svn_101b. I have 3 SATA disks connected to my motherboard. The raid, a raidz, which is called rescamp, had worked well before a power failure yesterday. I'm now unable to import the pool. I can't export the raid, since it isn't imported.

# zpool import rescamp
cannot import 'rescamp': invalid vdev configuration

# zpool import
  pool: rescamp
    id: 12297694211509104163
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        rescamp     UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  corrupted data
            c15d0   ONLINE
            c14d0   ONLINE
            c14d1   ONLINE

I've tried using zdb -l on all three disks, but in all cases it fails to unpack the labels.

# zdb -l /dev/dsk/c14d0
LABEL 0
failed to unpack label 0
LABEL 1
failed to unpack label 1
LABEL 2
failed to unpack label 2
LABEL 3
failed to unpack label 3

If I run # zdb -l /dev/dsk/c14d0s0 I do find 4 labels, but c14d0, c14d1 and c15d0 is what I created the raid with. I do find labels this way for all three disks. Is this of any help?

# zdb -l /dev/dsk/c14d1s0
LABEL 0
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
        nparity=1
        metaslab_array=23
        metaslab_shift=34
        ashift=9
        asize=3000574672896
        is_log=0
        children[0]
            type='disk'
            id=0
            guid=9020535344824299914
            path='/dev/dsk/c15d0s0'
            devid='id1,c...@ast31000333as=9te0dglf/a'
            phys_path='/p...@0,0/pci-...@11/i...@1/c...@0,0:a'
            whole_disk=1
            DTL=102
        children[1]
            type='disk'
            id=1
            guid=14384361563876398475
            path='/dev/dsk/c14d0s0'
            devid='id1,c...@asamsung_hd103uj=s13pjdws690618/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@0,0:a'
            whole_disk=1
            DTL=216
        children[2]
            type='disk'
            id=2
            guid=17774184411399278071
            path='/dev/dsk/c14d1s0'
            devid='id1,c...@ast31000333as=9te0de8w/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@1,0:a'
            whole_disk=1
            DTL=100
LABEL 1
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
        nparity=1
        metaslab_array=23
        metaslab_shift=34
        ashift=9
        asize=3000574672896
        is_log=0
        children[0]
            type='disk'
            id=0
            guid=9020535344824299914
            path='/dev/dsk/c15d0s0'
            devid='id1,c...@ast31000333as=9te0dglf/a'
            phys_path='/p...@0,0/pci-...@11/i...@1/c...@0,0:a'
            whole_disk=1
            DTL=102
        children[1]
            type='disk'
            id=1
            guid=14384361563876398475
            path='/dev/dsk/c14d0s0'
            devid='id1,c...@asamsung_hd103uj=s13pjdws690618/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@0,0:a'
            whole_disk=1
            DTL=216
        children[2]
            type='disk'
            id=2
            guid=17774184411399278071
            path='/dev/dsk/c14d1s0'
            devid='id1,c...@ast31000333as=9te0de8w/a'
            phys_path='/p...@0,0/pci-...@11/i...@0/c...@1,0:a'
            whole_disk=1
            DTL=100
LABEL 2
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
Re: [zfs-discuss] cachefile for snail zpool import mystery?
Max Holm wrote: Hi, we are seeing more long delays in zpool import - say 4-5 or even 25-30 minutes - especially when backup jobs are running in the FC SAN where the LUNs reside (no iSCSI LUNs yet). On the same node, for LUNs of the same array, some pools take a few seconds but others take minutes; the pattern seems random to me so far. We first noticed it soon after being upgraded to Solaris 10 U6 (10/08, on SPARC: M4000, Vx90, using some IBM and Sun arrays). I'd appreciate it if someone could comment on this. Thanks. We have a few VCS clusters, each with a set of service groups that import/export some zpools at the proper events on the proper node (with the '-R /' option). To fix the long delays, it seems I can use 'zpool set cachefile=/x/... ...' for each pool, deploy all cachefiles to every node of a cluster in a persistent location /y/, and then have the agent's online script do 'zpool import -c /y/...' if /y/... exists. Any better fix? 1. Why would it ever take so long (20-30 minutes!) to import a pool? I/O on the FC SAN seemed just fine, and there were no error messages either. Is it a problem in other stacks, or because I deleted some LUNs on the array without removing them from the device trees?

This is probably your problem. Try devfsadm -vC.

2. We now have the burden of maintaining these cachefiles whenever we change a zpool, say add/drop a LUN. Any advice? It'd be nice if ZFS kept a cache file (other than /etc/zfs/zpool.cache) for pools imported under an altroot, made it persistent, and verified/updated its entries at the proper events.

IIRC, when you change a pool config, its cache file is automatically updated on the same node.

At the least, I wish ZFS allowed us to create the cachefiles while the pools are not currently imported, so that a simple daily job could maintain the cache files on every node of a cluster automatically.
What you can do is put a script in a crontab which checks whether a pool is currently imported on this node and, if it is, copies the pool's cache file to the other nodes. BTW: IIRC, the Sun Cluster HAS+ agent will automatically make use of cache files. -- Robert Milkowski http://milek.blogspot.com
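Robert's crontab idea might look something like the sketch below. The pool name, cachefile path, and peer node names are all assumptions, not from the thread, and the script is only an outline:

```
#!/bin/sh
# Sketch: if the pool is imported on this node, push its cachefile
# to the other cluster nodes. POOL, CACHE and PEERS are placeholders.
POOL=nas01
CACHE=/var/cluster/zfs/${POOL}.cache
PEERS="node2 node3"

if zpool list -H -o name "$POOL" >/dev/null 2>&1; then
    for peer in $PEERS; do
        scp -q "$CACHE" "${peer}:${CACHE}" || echo "copy to $peer failed" >&2
    done
fi
```

Run from cron on every node; only the node currently holding the pool will push its copy.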
Re: [zfs-discuss] Can't rm file when No space left on device...
Chris Ridd wrote: On 1 Oct 2009, at 19:34, Andrew Gabriel wrote: Pick a file which isn't in a snapshot (either because it was created since the most recent snapshot, or because it has been rewritten since the most recent snapshot, so it no longer shares blocks with the snapshot version). Out of curiosity, is there an easy way to find such a file?

Find files with a modification or creation time later than the last snapshot's creation. Files modified after the snapshot may still have most of their blocks referenced by it, though. -- Robert Milkowski http://milek.blogspot.com
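That check can be sketched with find -newer, using a file carrying the snapshot's creation time as the reference. The demo below is self-contained: a temporary directory and a fake snapshot-time marker stand in for the real filesystem and its .zfs/snapshot/<name> directory (those paths, and the timestamps, are assumptions for illustration):

```shell
# Demo: list files modified after a snapshot-time reference.
# In real use the reference would be e.g. /tank/fs/.zfs/snapshot/lastsnap
tmp=$(mktemp -d)
touch -t 200909010000 "$tmp/snap-reference"   # stand-in for snapshot time
touch -t 200908310000 "$tmp/old-file"         # modified before the snapshot
touch -t 200909020000 "$tmp/new-file"         # modified after the snapshot

find "$tmp" -type f -newer "$tmp/snap-reference"   # lists only new-file
```

Files that the listing reports are candidates for freeing space with rm, since their current blocks are not (fully) pinned by the snapshot.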
Re: [zfs-discuss] Unable to import pool: invalid vdev configuration
Osvald Ivarsson wrote: On Fri, Oct 2, 2009 at 2:36 PM, Victor Latushkin victor.latush...@sun.com wrote: Osvald Ivarsson wrote: On Thu, Oct 1, 2009 at 7:40 PM, Victor Latushkin victor.latush...@sun.com wrote: On 01.10.09 17:54, Osvald Ivarsson wrote: I'm running OpenSolaris build svn_101b. I have 3 SATA disks connected to my motherboard. The raid, a raidz, which is called rescamp, had worked well before a power failure yesterday. I'm now unable to import the pool. ... Is this of any help? ...
Re: [zfs-discuss] strange pool disks usage pattern
Milek, here it is:

# iostat -En
c1t0d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST3808110AS      Revision: D    Serial No:
Size: 80,03GB <80026361856 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 91 Predictive Failure Analysis: 0
c2t0d0   Soft Errors: 0 Hard Errors: 11 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 101 Predictive Failure Analysis: 0
c2t1d0   Soft Errors: 0 Hard Errors: 4 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t2d0   Soft Errors: 0 Hard Errors: 69 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 105 Predictive Failure Analysis: 0
c2t3d0   Soft Errors: 0 Hard Errors: 5 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t4d0   Soft Errors: 0 Hard Errors: 90 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t5d0   Soft Errors: 0 Hard Errors: 30 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 96 Predictive Failure Analysis: 0
c2t7d0   Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: ST31000333AS     Revision: CC1H Serial No:
Size: 1000,20GB <1000204886016 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 94 Predictive Failure Analysis: 0
#

What are hard errors? Maurilio.
Re: [zfs-discuss] strange pool disks usage pattern
Erratum: they're ST31000333AS, not 340AS. Maurilio.
Re: [zfs-discuss] Best way to convert checksums
Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-sized filesystems, and toward Solaris by the features of ZFS. At the time I tried to dig up information on the tradeoffs between fletcher2 vs. fletcher4 vs. SHA-256 and found nothing. Studying the algorithms, I decided that fletcher2 would tend to be weak for periodic data, which characterizes my data. I ran throughput tests and got 67MB/sec for fletcher2 and fletcher4, and 48MB/sec for SHA-256. I projected (perhaps without basis) SHA-256's cryptographic strength to also mean strength as a hash, and chose it, since 48MB/sec is more than I need. 21 months later (9/15/09) I lost everything to corrupt metadata (not sure where this was printed): ZFS-8000-CS. No clue why to date; I will never know. The person who restored from tape was not informed to set checksum=sha256, so it all went in with the default, fletcher2. Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one-bit parity on the entire block: http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30 While this is twice as good as any other filesystem in the world, which has NO such checksum, it does not provide the security I migrated for. Especially given that I do not know what caused the original data loss, it is all I have to lean on. Convinced that I need to convert all of the checksums to sha256 to get the data security ZFS purports to deliver, and in the absence of a checksum-conversion capability, I need to copy the data.
It appears that all of the implementations of the various means of copying data - tar, cpio, cp, rsync, pax - have ghosts in their closets, each living in a glass house and throwing stones at the others over various issues with file sizes, filename lengths, pathname lengths, ACLs, extended attributes, sparse files, etc. It seems like zfs send/receive *should* be safe from all such issues, being part of the ZFS family, but the questions raised here become ambiguous once one starts to think about it. If the filesystem were faithfully duplicated, it would also duplicate all properties, including the checksum used on each block. It appears (to my advantage) that this is not what is done. This enables the filesystem spontaneously created by zfs receive to inherit from the pool, which evidently can be set to sha256 even though it is a pool, not a filesystem in the pool. The present question is the protection on the base pool. This can be set when the pool is created, though not with U4, which I am running. It is not clear (yet) whether this is simply not documented in the current release, or whether the version that supports it has not been released yet. If I were to upgrade (which I cannot do in a timely fashion), it would only be to U7. I cannot run a weekly-build type of OS on my production server. Any way it goes, I am hosed. In short, there is surely some structure - some blocks with stuff written in them - when a pool is created but before anything else is done, or else it would be a blank disk, not a ZFS pool. Are these protected by fletcher2 as the default? I have learned that the uberblock is protected by SHA-256 and other parts by fletcher4. Is this everything? In U4 was it fletcher4, or was this a recent change stemming from Schlie's report? In short, what is the situation with regard to the data security I switched to Solaris/ZFS for, and what can I do to achieve it? What *do* the tools do?
Are there tools for what needs to be done - to convert things, to copy things, to verify things - and to do so completely and correctly? So here is where I am: I should use zfs send/receive, but I cannot be confident that there are no fletcher2-protected blocks (one-bit parity) at the most fundamental levels of the zpool. To verify the data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot-check other properties as best I can. Given this rather full perspective, help or comments are very much appreciated. I still think ZFS is the way to go, but the road is a little bumpy at the moment.
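The MD5 pass described above can be sketched with standard tools: hash every file under each tree, then diff the sorted digest lists. The demo below builds two small stand-in trees (with one deliberately corrupted file) in place of the real original and restored filesystems; GNU md5sum is assumed, and on Solaris `digest -a md5` would be substituted:

```shell
# Sketch of the per-file MD5 verification pass. The mktemp trees are
# stand-ins for the original tree and the restored copy.
old=$(mktemp -d); new=$(mktemp -d)
printf 'hello' > "$old/a.txt"; printf 'hello' > "$new/a.txt"   # intact copy
printf 'AAAA'  > "$old/b.txt"; printf 'BBBB'  > "$new/b.txt"   # corrupted copy

# Hash every regular file, sort by path so the lists line up.
( cd "$old" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$old.sums"
( cd "$new" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$new.sums"

# Any differing or missing file shows up in the diff.
diff "$old.sums" "$new.sums" || echo "trees differ"
```

For millions of files this is slow but mechanical, and it sidesteps diff's large-file limits since only digests are compared.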
Re: [zfs-discuss] Best way to convert checksums
Apologies that the preceding post appears out of context. I expected it to indent when I pushed the reply button on myxiplx's Oct 1, 2009 1:47 post; it was in response to his question. I will try to remember to provide links internal to my messages.
Re: [zfs-discuss] Best way to convert checksums
On 02 October, 2009 - Ray Clark sent me these 4,4K bytes: Data security. I migrated my organization from Linux to Solaris, driven away from Linux by the shortfalls of fsck on TB-sized filesystems, and toward Solaris by the features of ZFS. [...] Before taking rather disruptive actions to correct this, I decided to question my original decision and found schlie's post stating that a bug in fletcher2 makes it essentially a one-bit parity on the entire block: http://opensolaris.org/jive/thread.jspa?threadID=69655&tstart=30 While this is twice as good as any other filesystem in the world, which has NO such checksum, it does not provide the security I migrated for. [...]

That post refers to bug 6740597 http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6740597 which also refers to http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=2178540 So it seems it's fixed in snv_114 and s10u8, which won't help your s10u4 unless you update. /Tomas -- Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/ |- Student at Computing Science, University of Umeå `- Sysadmin at {cs,acc}.umu.se
Re: [zfs-discuss] Best way to convert checksums
Replying to Cindys' Oct 1, 2009 3:34 PM post: Thank you. The second part was my attempt to guess my way out of this. If the fundamental structure of the pool (that which was created before I set the checksum=sha256 property) is using fletcher2, then perhaps, as I use the pool, all of this structure will be rewritten and therefore automatically migrate to the new checksum. It would be very difficult for me to recreate the pool, but I have space to duplicate the user files (and so get the new checksum). Perhaps this will also result in the underlying structure of the pool being converted in the course of normal use. Comments for or against?
Re: [zfs-discuss] Best way to convert checksums
Replying to relling's October 1, 2009 3:34 post: Richard, regarding "when a pool is created, there is only metadata, which uses fletcher4": was this true in U4, or is this a new change of default, with U4 using fletcher2? Similarly, did the uberblock use sha256 in U4? I am running U4. --Ray
Re: [zfs-discuss] Unable to import pool: invalid vdev configuration
On Thu, Oct 1, 2009 at 7:40 PM, Victor Latushkin victor.latush...@sun.com wrote: On 01.10.09 17:54, Osvald Ivarsson wrote:

I'm running OpenSolaris build svn_101b. I have 3 SATA disks connected to my motherboard. The raid, a raidz, which is called rescamp, had worked well until a power failure yesterday. I'm now unable to import the pool. I can't export the raid, since it isn't imported.

# zpool import rescamp
cannot import 'rescamp': invalid vdev configuration

# zpool import
  pool: rescamp
    id: 12297694211509104163
 state: UNAVAIL
action: The pool cannot be imported due to damaged devices or data.
config:

        rescamp     UNAVAIL  insufficient replicas
          raidz1    UNAVAIL  corrupted data
            c15d0   ONLINE
            c14d0   ONLINE
            c14d1   ONLINE

I've tried using zdb -l on all three disks, but in all cases it fails to unpack the labels.

# zdb -l /dev/dsk/c14d0
--------------------------------------------
LABEL 0
--------------------------------------------
failed to unpack label 0
--------------------------------------------
LABEL 1
--------------------------------------------
failed to unpack label 1
--------------------------------------------
LABEL 2
--------------------------------------------
failed to unpack label 2
--------------------------------------------
LABEL 3
--------------------------------------------
failed to unpack label 3

If I run # zdb -l /dev/dsk/c14d0s0 I do find 4 labels, but c14d0, c14d1 and c15d0 is what I created the raid with. I do find labels this way for all three disks. Is this of any help?
# zdb -l /dev/dsk/c14d1s0
--------------------------------------------
LABEL 0
--------------------------------------------
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
    top_guid=9479723326726871122
    guid=17774184411399278071
    vdev_tree
        type='raidz'
        id=0
        guid=9479723326726871122
        nparity=1
        metaslab_array=23
        metaslab_shift=34
        ashift=9
        asize=3000574672896
        is_log=0
        children[0]
                type='disk'
                id=0
                guid=9020535344824299914
                path='/dev/dsk/c15d0s0'
                devid='id1,c...@ast31000333as=9te0dglf/a'
                phys_path='/p...@0,0/pci-...@11/i...@1/c...@0,0:a'
                whole_disk=1
                DTL=102
        children[1]
                type='disk'
                id=1
                guid=14384361563876398475
                path='/dev/dsk/c14d0s0'
                devid='id1,c...@asamsung_hd103uj=s13pjdws690618/a'
                phys_path='/p...@0,0/pci-...@11/i...@0/c...@0,0:a'
                whole_disk=1
                DTL=216
        children[2]
                type='disk'
                id=2
                guid=17774184411399278071
                path='/dev/dsk/c14d1s0'
                devid='id1,c...@ast31000333as=9te0de8w/a'
                phys_path='/p...@0,0/pci-...@11/i...@0/c...@1,0:a'
                whole_disk=1
                DTL=100
--------------------------------------------
LABEL 1
--------------------------------------------
    [same as LABEL 0]
--------------------------------------------
LABEL 2
--------------------------------------------
    version=13
    name='rescamp'
    state=0
    txg=218097573
    pool_guid=12297694211509104163
    hostid=4925114
    hostname='slaskvald'
Re: [zfs-discuss] Can't rm file when No space left on device...
It seems like the appropriate solution would be to have a tool that allows removing a file from one or more snapshots at the same time as removing the source ... That would make them not really snapshots. And such a tool would have to fix clones too. While I concur that being able to remove files from snapshots is somewhat against the concept behind snapshots, I feel that there is a tradeoff here for the administrator: Let's say we accidentally snapshotted a very large temporary file. We don't need the file and we don't need its snapshot. Yet the only way to free the space taken up by this accidentally snapshotted file is to delete the WHOLE snapshot, including all the files of which snapshots may be required. To paraphrase: that would make this snapshot not really a snapshot ANYMORE. At this point having a separate tool that allows you to do spring cleaning and deleting files from snapshots would quite possibly be more in the spirit of snapshotting than having to delete snapshots. Just my $.02, Rudolf
[zfs-discuss] Find out file changes by comparing snapshots?
Hi, Is there a way or script that helps to find out what files have changed by comparing two snapshots? Thanks, Simon
Re: [zfs-discuss] Best way to convert checksums
Interesting answer, thanks :) I'd like to dig a little deeper if you don't mind, just to further my own understanding (which is usually rudimentary compared to a lot of the guys on here). My belief is that ZFS stores two copies of the metadata for any block, so corrupt metadata really shouldn't happen often. Could I ask what the structure of your pool is, and what level of redundancy you have there? The very fact that you had a 'corrupt metadata' error implies to me that the checksums have done their job in finding an error, and I'm wondering if the true cause could be further down the line. I'm still taking all this in though - we'll be using sha256 on our secondary system, just in case :)
Re: [zfs-discuss] ZFS caching of compressed data
On Oct 2, 2009, at 5:05 AM, Robert Milkowski wrote:

Stuart Anderson wrote: I am wondering if the following idea makes any sense as a way to get ZFS to cache compressed data in DRAM? In particular, given a 2-way zvol mirror of highly compressible data on persistent storage devices, what would go wrong if I dynamically added a ramdisk as a 3rd mirror device at boot time? Would ZFS route most (or all) of the reads to the lower latency DRAM device? In the case of an un-clean shutdown where there was no opportunity to actively remove the ramdisk from the pool before shutdown, would there be any problem at boot time when the ramdisk is still registered but unavailable? Note, this Gedanken experiment is for highly compressible (~9x) metadata for a non-ZFS filesystem.

You would only get about 33% of IOs served from the ram-disk.

With SVM you are allowed to specify a read policy on sub-mirrors for just this reason, e.g., http://wikis.sun.com/display/BigAdmin/Using+a+SVM+submirror+on+a+ramdisk+to+increase+read+performance Is there no equivalent in ZFS?

However, at the KCA conference Bill and Jeff mentioned just-in-time decompression/decryption planned for ZFS. If I understand it correctly, some % of pages in the ARC will be kept compressed/encrypted and will be decompressed/decrypted only if accessed. This could be especially useful with prefetch.

I thought the optimization being discussed there was simply to avoid decompressing/decrypting unused data. I missed the part about keeping compressed data around in the ARC.

Now I would imagine that one will be able to tune what percentage of the ARC should keep compressed pages.

That would be nice.

Now I don't remember if they mentioned L2ARC here, but it would probably be useful to have a tunable which would put compressed or uncompressed data onto the L2ARC depending on its value. Which approach is better would always depend on a given environment and on where the actual bottleneck is.
I agree something like this would be preferable to the SVM ramdisk solution. Thanks. -- Stuart Anderson ander...@ligo.caltech.edu http://www.ligo.caltech.edu/~anderson
Re: [zfs-discuss] Find out file changes by comparing snapshots?
Simon Gao wrote: Hi, Is there a way or script that helps to find out what files have changed by comparing two snapshots? http://blogs.sun.com/chrisg/entry/zfs_versions_of_a_file is something along those lines, but since the snapshots are visible under .zfs/snapshot/snapshot_name/ as filesystems, you could just use basic UNIX tools like find/diff etc. -- Darren J Moffat
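Darren's find/diff suggestion can be sketched as a small script. The snapshot directories default to empty temp dirs here so the sketch is runnable anywhere; on a real pool they would be paths like /storagepool/.zfs/snapshot/monday (snapshot names hypothetical):

```shell
#!/bin/sh
# Diff two snapshot trees of the same filesystem. OLD/NEW default to
# empty temp dirs; on a real pool, point them at the two snapshots,
# e.g. OLD=/storagepool/.zfs/snapshot/monday (hypothetical name).
OLD=${OLD:-$(mktemp -d)}
NEW=${NEW:-$(mktemp -d)}

# -r walks both trees; -q prints one line per file that differs or
# exists on only one side, instead of full content diffs.
diff -rq "$OLD" "$NEW"
```

This walk reads both snapshot trees, so it is slow on large filesystems, but it needs nothing beyond POSIX tools; a dedicated snapshot-diffing command only appeared in later ZFS releases.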
[zfs-discuss] Performance issue on a zpool
I have an HP DL380G4 w/ 3 GB of RAM and a slow MSA15 (SATA discs to a single u320 interface). I was using this with 10u7 as an SMB-over-ZFS file server for a few clients with mild needs. I never benchmarked it, as these unattended workstations just wrote a slow, steady stream of data and had no issues. I now need to use it for backup storage and noticed how absolutely bad the performance is. Just as a non-technical way to grasp how bad: a dd of 4 GB from /dev/zero on the root rpool, a mirrored u320 pair of 72 GB discs on the Smart Array 6i, takes ~1.5 minutes. The same dd on a raidz2 of 6 discs, all exported as simple volumes on a Smart Array 6400 controller, takes almost 20 minutes. With Windows installed on this aging server, the times are nearly identical. The data I intend to write out to this machine will be Bacula disk volumes, so it needs to sustain large amounts of streamed data; I don't think even a ZIL will help, as the files will be 300-400 GB :) What's my next best option? Thanks! jlc
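For reference, the rough throughput check described above looks like the following. This is a scaled-down, hedged sketch: the target path and sizes are examples only, /dev/zero data is maximally compressible, and a single dd stream is only a crude proxy for real backup traffic:

```shell
#!/bin/sh
# Crude sequential-write check of the kind used in the post. TARGET
# defaults to a temp file so this is runnable anywhere; for the real
# test, point it at a file on the pool in question and raise COUNT
# (4096 x 1 MB blocks is roughly the 4 GB test described above).
# Prefix the dd with time(1) to get elapsed-time numbers.
TARGET=${TARGET:-$(mktemp)}
COUNT=${COUNT:-64}                  # 64 x 1 MB = 64 MB for the sketch

dd if=/dev/zero of="$TARGET" bs=1048576 count="$COUNT"
sync
ls -l "$TARGET"
```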
Re: [zfs-discuss] Best way to convert checksums
My pool was created with the default checksum (fletcher2). The default has two copies of all metadata (as I understand it), and one copy of user data. It was a raidz2 with eight 750GB drives, yielding just over 4TB of usable space. I am not happy with the situation, but I recognize that I am 2x better off (1-bit parity) than I would be with any other file system.
Re: [zfs-discuss] strange pool disks usage pattern
For the archives... On Oct 2, 2009, at 12:41 AM, Maurilio Longo wrote:

[...]

                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,4   14,7    3,2   30,4  0,0  0,2    0,0   13,2   0   2 c1
   0,4   14,7    3,2   30,4  0,0  0,2    0,0   13,2   0   2 c1t0d0
   1,7    0,0   58,9    0,0  3,0  1,0 1766,4  593,1   2 101 c2
   0,3    0,0    7,7    0,0  0,0  0,0    0,3    0,4   0   0 c2t0d0
   0,3    0,0   11,5    0,0  0,0  0,0    4,4    8,4   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0

This is a symptom of an I/O getting dropped in the data path. You can clearly see 1 IOP in the actv queue (which is the queue between the interface card and the target). The %busy is calculated by counting the percentage of time that at least one IOP is in the actv queue.
The higher level device drivers have timeouts and will try to reset and re-issue IOPs as needed. -- richard

   0,4    0,0   14,1    0,0  0,0  0,0    0,4    6,6   0   0 c2t3d0
   0,4    0,0   14,1    0,0  0,0  0,0    0,3    2,5   0   0 c2t4d0
   0,3    0,0   11,5    0,0  0,0  0,0    3,6    6,9   0   0 c2t5d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,0    3,1    0,0    3,1  0,0  0,0    0,0    0,7   0   0 c1
   0,0    3,1    0,0    3,1  0,0  0,0    0,0    0,7   0   0 c1t0d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0   2 100 c2
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t3d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t4d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t5d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,0    0,1    0,0    0,4  0,0  0,0    0,0    1,2   0   0 c1
   0,0    0,1    0,0    0,4  0,0  0,0    0,0    1,2   0   0 c1t0d0
   0,0   29,5    0,0  320,2  3,4  1,0  113,9   34,6   2 102 c2
   0,0    6,9    0,0   63,3  0,1  0,0   12,6    0,7   0   0 c2t0d0
   0,0    4,4    0,0   65,5  0,0  0,0    8,7    0,8   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0 100 100 c2t2d0
   0,0    7,4    0,0   62,7  0,1  0,0   15,4    0,8   1   1 c2t3d0
   0,0    6,8    0,0   63,6  0,1  0,0   13,2    0,7   0   0 c2t4d0
   0,0    4,0    0,0   65,1  0,0  0,0    7,9    0,7   0   0 c2t5d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t7d0
                    extended device statistics
   r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   0,0    0,3    0,0    2,4  0,0  0,0    0,0    0,1   0   0 c1
   0,0    0,3    0,0    2,4  0,0  0,0    0,0    0,1   0   0 c1t0d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0   2 100 c2
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t0d0
   0,0    0,0    0,0    0,0  0,0  0,0    0,0    0,0   0   0 c2t1d0
   0,0    0,0    0,0    0,0  3,0  1,0    0,0    0,0
Re: [zfs-discuss] Best way to convert checksums
webcl...@rochester.rr.com said: To verify data, I cannot depend on existing tools, since diff is not large-file aware. My best idea at this point is to calculate and compare MD5 sums of every file and spot-check other properties as best I can.

Ray, I recommend that you use rsync's -c option to compare copies. It reads all the source files, computes a checksum for each, then does the same for the destination and compares checksums. As far as I know, the only thing rsync can't handle in your situation is the ZFS/NFSv4 ACLs. I've used it to migrate many TBs of data. Regards, Marion
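Ray's plan of comparing per-file digests can be sketched with POSIX tools. This sketch uses cksum (a CRC, not MD5) purely so it runs anywhere; on Solaris, `digest -a md5`, or Marion's `rsync -c`, would do the real job. SRC and DST default to empty temp dirs and stand in for the two copies of the data:

```shell
#!/bin/sh
# Build a sorted per-file checksum manifest for each tree and compare.
# SRC/DST default to empty temp dirs; point them at the original and
# the duplicated copy of the data.
SRC=${SRC:-$(mktemp -d)}
DST=${DST:-$(mktemp -d)}

manifest() {
    # CRC, size and relative path of every regular file, sorted by path.
    ( cd "$1" && find . -type f -exec cksum {} + < /dev/null | sort -k3 )
}

if [ "$(manifest "$SRC")" = "$(manifest "$DST")" ]; then
    echo "trees match"
else
    echo "trees differ"
fi
```

Note that, like rsync -c, this verifies file content only; ownership, timestamps, and ACLs still need separate spot checks.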
Re: [zfs-discuss] Replacing a failed drive
Does the same thing apply for a failing drive? I have a drive that has not failed, but by all indications it's about to. Can I do the same thing here? -dan

Jeff Bonwick wrote: Yep, you got it. Jeff

On Fri, Jun 19, 2009 at 04:15:41PM -0700, Simon Breden wrote: Hi, I have a ZFS storage pool consisting of a single RAIDZ2 vdev of 6 drives, and I have a question about replacing a failed drive, should it occur in future. If a drive fails in this double-parity vdev, then am I correct in saying that I would need to (1) unplug the old drive once I've identified the drive id (c1t0d0 etc), (2) plug in the new drive on the same SATA cable, and (3) issue a 'zpool replace pool_name drive_id' command etc, at which point ZFS will resilver the new drive from the parity data? Thanks, Simon

-- http://www.java.com * Dan Transue * *Sun Microsystems, Inc.* 495 S. High Street, #200 Columbus, OH 43215 US Phone x30944 / 877-932-9964 Mobile 484-554-6951 Fax 877-932-9964 Email dan.tran...@sun.com
[zfs-discuss] .zfs snapshots on subdirectories?
Suppose I have a storage pool, /storagepool, and I have snapshots on it. Then I can access the snaps under /storagepool/.zfs/snapshot. But is there any way to enable this within all the subdirs? For example: cd /storagepool/users/eharvey/some/foo/dir; cd .zfs. I don't want to create a new filesystem for every subdir. I just want to automatically have the .zfs hidden directory available within all the existing subdirs, if that's possible. Thanks..
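For what it's worth, there is no per-subdirectory .zfs: the control directory exists only at the root of each filesystem. But a snapshot contains the whole tree, so the snapshot view of any subdirectory is reachable by path from the root's .zfs. A sketch (the snapshot name is hypothetical; ZFS-only commands, shown for illustration rather than run here):

```shell
#!/bin/sh
# .zfs exists only at each filesystem root, but the snapshot tree
# mirrors the live tree, so a deep subdir's snapshot is just a path:
#
#   <fs-mountpoint>/.zfs/snapshot/<snapname>/<relative-subdir-path>
#
# For the directory in the question ('mysnap' is a hypothetical name):
ls /storagepool/.zfs/snapshot/mysnap/users/eharvey/some/foo/dir

# Optionally make .zfs show up in directory listings at the fs root
# (it is always reachable by name even while hidden):
zfs set snapdir=visible storagepool
```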
Re: [zfs-discuss] Replacing a failed drive
Yes, you can use the zpool replace process with any kind of drive: failed, failing, or even healthy. cs

On 10/02/09 12:15, Dan Transue wrote: [...]
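The replace procedure from the thread, as a sketch (pool and device names are hypothetical; ZFS-only commands, shown for illustration rather than run here):

```shell
#!/bin/sh
# Replacing a drive in a pool (names hypothetical).

# Case 1: the new disk goes into the same slot and keeps the same
# device name. Physically swap the disk, then:
zpool replace tank c1t0d0

# Case 2: the replacement is attached under a different name. ZFS
# resilvers onto the new device and detaches the old one when done:
zpool replace tank c1t0d0 c2t0d0

# Either way, watch the resilver:
zpool status -v tank
```

Case 2 is the safer route for a failing-but-not-failed drive, since the old disk stays in the pool and can still serve reads until the resilver completes.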
Re: [zfs-discuss] bigger zfs arc
zfs will use as much memory as is necessary, but how is "necessary" calculated? Using arc_summary.pl from http://www.cuddletech.com/blog/pivot/entry.php?id=979 my tiny system shows:

Current Size:             4206 MB (arcsize)
Target Size (Adaptive):   4207 MB (c)
Min Size (Hard Limit):     894 MB (zfs_arc_min)
Max Size (Hard Limit):    7158 MB (zfs_arc_max)

so arcsize is close to the desired c; no pressure here, but it would be nice to know how c is calculated, as it's much smaller than zfs_arc_max on a system like yours with nothing else on it. When an L2ARC is attached, does it get used if there is no memory pressure? My guess is no, for the same reason an L2ARC takes so long to fill. arc_summary.pl from the same system shows:

Most Recently Used Ghost:   0%  9367837 (mru_ghost)  [ Return Customer Evicted, Now Back ]
Most Frequently Used Ghost: 0% 11138758 (mfu_ghost)  [ Frequent Customer Evicted, Now Back ]

so with no ghosts, this system wouldn't benefit from an L2ARC even if one were added. In review (audit welcome):

- if arcsize = c and is much less than zfs_arc_max, there is no point in adding system RAM in hopes of increasing the ARC.
- if m?u_ghost is a small %, there is no point in adding an L2ARC.
- if you do add an L2ARC, you must have RAM between c and zfs_arc_max for its pointers.

Rob
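Rob's numbers can also be read straight from the kstats on Solaris, without arc_summary.pl (statistic names as found under zfs:0:arcstats; Solaris-only commands, shown for illustration rather than run here):

```shell
#!/bin/sh
# The raw counters behind arc_summary.pl. size vs c vs c_max answers
# "is the ARC under pressure?"; the ghost-list hits estimate how many
# misses would have been hits with a bigger cache, which is the same
# signal used above to decide whether more RAM or an L2ARC would pay.
kstat -p zfs:0:arcstats:size
kstat -p zfs:0:arcstats:c
kstat -p zfs:0:arcstats:c_max
kstat -p zfs:0:arcstats:mru_ghost_hits
kstat -p zfs:0:arcstats:mfu_ghost_hits
```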
Re: [zfs-discuss] ZFS caching of compressed data
Stuart Anderson wrote: [...] With SVM you are allowed to specify a read policy on sub-mirrors for just this reason, e.g., http://wikis.sun.com/display/BigAdmin/Using+a+SVM+submirror+on+a+ramdisk+to+increase+read+performance Is there no equivalent in ZFS?

Nope, at least not right now. -- Robert Milkowski http://milek.blogspot.com
Re: [zfs-discuss] Best way to convert checksums
Ray, The checksums are set on the file systems, not the pool. If a new checksum is set and *you* rewrite the data, then the rewritten data will carry the new checksum. If your pool has the space for you to duplicate the user data after the new checksum is set, then the duplicated data will have the new checksum. ZFS doesn't rewrite existing data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool, the disk is replaced, all the data is resilvered, and a new checksum is set, you'll still see data with the previous checksum as well as the new checksum. Cindy

On 10/02/09 08:44, Ray Clark wrote: [...]
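Since only newly written blocks pick up the new checksum, the duplicate-and-replace idea can be sketched as below. The loop operates on an empty temp directory by default so the sketch itself is runnable; on the real system DIR would be the filesystem's mountpoint and the zfs command would be run first (the filesystem name is hypothetical). Note this breaks hard links and cannot rewrite blocks pinned by snapshots:

```shell
#!/bin/sh
# Force a rewrite of file data so it picks up a newly set checksum.
# On the real system you would first run (filesystem name hypothetical):
#   zfs set checksum=sha256 tank/data
# DIR defaults to an empty temp dir so the loop itself is runnable.
DIR=${DIR:-$(mktemp -d)}
cd "$DIR" || exit 1

# Copy each file and rename the copy over the original: same content,
# freshly written blocks (and therefore the new checksum).
find . -type f | while IFS= read -r f; do
    cp -p "$f" "$f.rw$$" && mv "$f.rw$$" "$f"
done
```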
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 7:46 AM, Ray Clark wrote: [...]

ZFS uses different checksums for different things. Briefly:

use          checksum
-----------  ---------------------------------------------------
uberblock    SHA-256, self-checksummed
labels       SHA-256
metadata     fletcher4
data         fletcher2 (default), set with the checksum parameter
ZIL log      fletcher2, self-checksummed
gang block   SHA-256, self-checksummed

The parent holds the checksum for any entity that is not self-checksummed. The big question, which is currently unanswered, is: do we see single-bit faults in disk-based storage systems? The answer to this question must be known before the effectiveness of a checksum can be evaluated. The overwhelming empirical evidence suggests that fletcher2 catches many storage system corruptions. -- richard
Re: [zfs-discuss] Best way to convert checksums
re == Richard Elling richard.ell...@gmail.com writes: r == Ross myxi...@googlemail.com writes: re The answer to this question must be known before the re effectiveness of a checksum can be evaluated. ...well...we can use math to know that a checksum is effective. What you are really suggesting we evaluate ``empirically'' is the degree of INeffectiveness of the broken checksum. r ZFS stores two copies of the metadata for any block, so r corrupt metadata really shouldn't happen often. the other copy probably won't be read if the first copy read has a valid checksum. I think it'll more likely just lazy-panic instead. If that's the case, the two copies won't help cover up the broken checksum bug. but Richard's table says metadata has fletcher4 which the OP said is as good as the correct algorithm would have been, even in its broken implementation, so long as it's only used up to 128kByte. It's only data and ZIL that has the relevantly-broken checksum, according to his math. re The overwhelming empirical evidence suggests that fletcher2 re catches many storage system corruptions. What do you mean by the word ``many''? It's a weasel-word. It basically means, AFAICT, ``the broken checksum still trips sometimes.'' But have you any empirical evidence about the fraction of real world errors which are still caught by the broken checksum vs. those that are not? I don't see how you could. How about cases where checksums are not used to correct bit-flip gremlins but relied upon to determine whether a data structure is fully present (committed) yet, like in the ZIL, or to determine which half of a mirror is stale---these are cases where checksums could be wrong even if the storage subsystem is functioning in an ideal way. Checksum weakness on ZFS where checksums are presumed good by other parts of the design could potentially be worse overall than a checksumless design. That's not my impression, but it's the right place to put the bar. 
Ray's ``well at least it's better than no checksums'' is wrong because it presumes ZFS could function as well as another filesystem if ZFS were using a hypothetical null checksum. It couldn't.

Anyway, I'm glad the problem is both fixed and also avoidable on the broken systems. I just think the doublespeak after the fact is, once again, not helping anyone.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Best way to convert checksums
Hi Miles, good to hear from you again.

On Oct 2, 2009, at 1:20 PM, Miles Nordin wrote:

>     re> The answer to this question must be known before the
>     re> effectiveness of a checksum can be evaluated.
>
> ...well... we can use math to know that a checksum is effective. What
> you are really suggesting we evaluate ``empirically'' is the degree of
> INeffectiveness of the broken checksum.

By your logic, SECDED ECC for memory is broken because it only corrects 1 bit per symbol and only detects brokenness of 2 bits per symbol. However, the empirical evidence suggests that ECC provides a useful function for many people. Do we know how many triple-bit errors occur in memories? I can compute the probability, but have never seen a field failure analysis. So, if ECC is good enough for DRAM, is fletcher2 good enough for storage? NB: for DRAM the symbol size is usually 64 bits; for the ZFS case, the symbol size is 4,096 to 1,048,576 bits. AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist.

>     r> ZFS stores two copies of the metadata for any block, so
>     r> corrupt metadata really shouldn't happen often.
>
> The other copy probably won't be read if the first copy read has a
> valid checksum. I think it'll more likely just lazy-panic instead. If
> that's the case, the two copies won't help cover up the broken-checksum
> bug.
>
> But Richard's table says metadata has fletcher4, which the OP said is
> as good as the correct algorithm would have been, even in its broken
> implementation, so long as it's only used on blocks up to 128 kByte.
> It's only data and the ZIL that have the relevantly-broken checksum,
> according to his math.
>
>     re> The overwhelming empirical evidence suggests that fletcher2
>     re> catches many storage system corruptions.
>
> What do you mean by the word ``many''? It's a weasel-word.

I'll blame the lawyers.
They are causing me to remove certain words from my vocabulary :-(

> It basically means, AFAICT, ``the broken checksum still trips
> sometimes.'' But have you any empirical evidence about the fraction of
> real-world errors which are still caught by the broken checksum vs.
> those that are not? I don't see how you could.

Question for the zfs-discuss participants: have you seen a data corruption that was not detected when using fletcher2? Personally, I've seen many corruptions of data stored on file systems lacking checksums.

> How about cases where checksums are not used to correct bit-flip
> gremlins but relied upon to determine whether a data structure is
> fully present (committed) yet, like in the ZIL, or to determine which
> half of a mirror is stale---these are cases where checksums could be
> wrong even if the storage subsystem is functioning in an ideal way.
> Checksum weakness on ZFS where checksums are presumed good by other
> parts of the design could potentially be worse overall than a
> checksumless design. That's not my impression, but it's the right
> place to put the bar.
>
> Ray's ``well at least it's better than no checksums'' is wrong because
> it presumes ZFS could function as well as another filesystem if ZFS
> were using a hypothetical null checksum. It couldn't.

I'm in Ray's camp. I've got far too many scars from data corruption, and I'd rather not add more.
 -- richard

> Anyway I'm glad the problem is both fixed and also avoidable on the
> broken systems. I just think the doublespeak after the fact is, once
> again, not helping anyone.
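[Editor's note: Richard's ECC analogy can be made concrete with a toy model. The sketch below is an illustration only -- a Hamming(8,4) SECDED code, the same single-error-correct / double-error-detect scheme ECC DRAM applies per word -- and shows the failure mode he alludes to: a triple-bit error is silently "corrected" into wrong data.]

```python
# Toy Hamming(8,4) SECDED: 4 data bits, 3 Hamming parity bits (positions
# 1, 2, 4), plus one overall-parity bit at position 0.

def encode(d3, d5, d6, d7):
    """Build a codeword with data bits at positions 3, 5, 6, 7."""
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d3, d5, d6, d7
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c

def decode(c):
    """Return (status, data).  SECDED corrects 1 flip, detects 2 --
    but 3 flips can masquerade as a correctable single error."""
    c = list(c)
    s = 0                               # Hamming syndrome
    for i in range(1, 8):
        if c[i]:
            s ^= i
    p = c[0]                            # overall parity check (0 = holds)
    for i in range(1, 8):
        p ^= c[i]
    if s == 0 and p == 0:
        status = "ok"
    elif p == 1:                        # odd flip count: assume single error
        if s:
            c[s] ^= 1                   # "correct" the indicated position
        status = "corrected"
    else:                               # even flips, s != 0: double error
        status = "uncorrectable"
    return status, (c[3], c[5], c[6], c[7])

clean = encode(1, 0, 1, 1)
bad = list(clean)
for pos in (1, 2, 3):                   # a triple-bit error
    bad[pos] ^= 1
status, data = decode(bad)
# status is "corrected", yet data != (1, 0, 1, 1): silent corruption.
```

A single flip really is repaired and a double flip really is flagged; only the three-bit pattern slips through as falsely "corrected" -- which is exactly why the rate of triple-bit faults in the field matters to the analogy.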
Re: [zfs-discuss] Best way to convert checksums
Replying to hakanson's Oct 2, 2009 2:01 post:

Thanks. I suppose it is true that I am not even trying to compare the peripheral stuff, and the simple presence of a file with matching data covers some of it. Using these tools to move data, one encounters a longer list: sparse files, ACL handling, extended attributes, length of filenames, length of pathnames, large files. And probably other interesting things that can be handled incorrectly.

Most information on misbehavior of the various archive / backup / data-movement utilities is very old. One wonders how they behave today. This would be a useful compilation, but I can't do it.
--
This message posted from opensolaris.org
Re: [zfs-discuss] Can't rm file when No space left on device...
Rudolf Potucek wrote:
>>> It seems like the appropriate solution would be to have a tool that
>>> allows removing a file from one or more snapshots at the same time
>>> as removing the source ...
>>
>> That would make them not really snapshots. And such a tool would have
>> to fix clones too.
>
> While I concur that being able to remove files from snapshots is
> somewhat against the concept behind snapshots, I feel that there is a
> tradeoff here for the administrator: let's say we accidentally
> snapshotted a very large temporary file. We don't need the file and we
> don't need its snapshot. Yet the only way to free the space taken up
> by this accidentally snapshotted file is to delete the WHOLE snapshot,
> including all the files of which snapshots may be required. To
> paraphrase: that would make this snapshot not really a snapshot
> ANYMORE. At this point, having a separate tool that allows you to do
> spring cleaning and delete files from snapshots would quite possibly
> be more in the spirit of snapshotting than having to delete snapshots.
>
> Just my $.02,
> Rudolf

NO. Snapshotting is sacred - once you break the model where a snapshot is a point-in-time picture, all sorts of bad things can happen. You've changed a fundamental assumption of snapshots, and this then impacts how we view them from all sorts of angles; it's a huge loss to trade away for a very small gain.

Should you want to modify a snapshot for some reason, that's what the 'zfs clone' function is for: clone your snapshot, promote it, and make your modifications.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Re: [zfs-discuss] Best way to convert checksums
Cindys Oct 2, 2009 2:59: Thanks for staying with me.

Re: "The checksums are set on the file systems, not the pool": but previous responses seem to indicate that I can set them, at the pool level, for files stored in the filesystem that appears to be the pool, before I create any new ones. One post seems to indicate that there is a checksum property for this file system, and independently for the pool. (This topic needs a picture.)

Re: "If a new checksum is set and *you* rewrite the data ... then the duplicated data will have the new checksum": understood. Now I am on to being concerned about the blocks that comprise the zpool that *contains* the file system.

Re: "ZFS doesn't rewrite data as part of normal operations. I confirmed with a simple test (like Darren's) that even if you have a single-disk pool and the disk is replaced and all the data is resilvered and a new checksum is set, you'll see data with the previous checksum and the new checksum": yes ... a resilver duplicates exactly.

Darren's example showed that without the -R, no properties were sent, and the zfs receive had no choice but to use the pool default for the zfs filesystem that it created. This also implies that there was a property associated with the pool. So my previous comment about zfs send/receive not duplicating exactly was not fair. The man page / admin guide should be clear about what is sent without -R. I would have guessed everything, just not descendent file systems.

It is a shame that zdb is totally undocumented. I thought I had discovered a gold mine when I first read Darren's note!

--Ray
Re: [zfs-discuss] Best way to convert checksums
Re: relling's Oct 2, 2009 3:26 post:

(1) Is this list everything?
(2) Is this the same for U4?
(3) If I change the zpool checksum property on creation as you indicated in your Oct 1, 12:51 post (evidently very recent versions only), does this change the checksums used for this list? Why would the strongest checksum not be used for the most fundamental data, rather than fooling around, allowing the user to compromise only when the tradeoff pays back on the 99% bulk of the data?

Re: "The big question, that is currently unanswered, is do we see single bit faults in disk-based storage systems?": I don't think this is the question. I believe the implication of schlie's post is not that single-bit faults will get through, but that the current fletcher2 is equivalent to a single-bit checksum. You could have 1,000 bits in error, or 4,095, and still have a 50-50 chance of detecting it. A single-bit error would be certain to be detected (I think), even with the current code.
Re: [zfs-discuss] Best way to convert checksums
Re: Miles Nordin Oct 2, 2009 4:20:

Re: "Anyway, I'm glad the problem is both fixed...": I want to know HOW it can be fixed. If they fixed it, this would invalidate every pool that has not been changed from the default (probably almost all of them!). This can't be! So what WAS done? In the interest of honesty in advertising, and of enabling people to evaluate their own risks, I think we should know how it was fixed. Something either ingenious or potentially misleading must have been done. I am not suggesting that it was not the best way to handle a difficult situation, but I don't see how it can be transparent. If the string fletcher2 does the same thing, it is not fixed. If it does something different, it is misleading.

Re: "... and avoidable on the broken systems": please tell me how! Without destroying and recreating my zpool, I can only fix the zfs file system blocks, not the underlying zpool blocks. WITH destroying and recreating my zpool, I can only control the checksum on the underlying zpool using a version of Solaris that is not yet available. And even then (pending relling's response), it may or may not *still* affect the blocks I am concerned about. So how is this avoidable? It is partially avoidable (so far) IF I have the luxury of doing significant rebuilding. No?
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 3:05 PM, Ray Clark wrote:
> Re: relling's Oct 2, 2009 3:26 post:
>
> (1) Is this list everything?

AFAIK.

> (2) Is this the same for U4?

Yes. This hasn't changed in a very long time.

> (3) If I change the zpool checksum property on creation as you
> indicated in your Oct 1, 12:51 post (evidently very recent versions
> only), does this change the checksums used for this list? Why would
> not the strongest checksum be used for the most fundamental data
> rather than fool around, allowing the user to compromise only when
> the tradeoff pays back on the 99% bulk of the data?

Performance. Many people value performance over dependability.

> Re: "The big question, that is currently unanswered, is do we see
> single bit faults in disk-based storage systems?": I don't think this
> is the question. I believe the implication of schlie's post is not
> that single bit faults will get through, but that the current
> fletcher2 is equivalent to a single bit checksum. You could have
> 1,000 bits in error, or 4095, and still have a 50-50 chance of
> detecting it. A single bit error would be certain to be detected (I
> think) even with the current code.

I don't believe schlie posted the number of fletcher2 collisions for the symbol size used by ZFS. I do not believe it will be anywhere near 50% collisions.
 -- richard
Re: [zfs-discuss] Best way to convert checksums
Re: relling's Oct 2 5:06 post, the analogy to ECC memory:

I appreciate the support, but the ECC memory analogy does not hold water. ECC memory is designed to correct for multiple independent events, such as electrical noise, bits flipped due to alpha particles from the DRAM package, or cosmic rays. The probability of these independent events coinciding in time and space is very small indeed. It works well.

ZFS does purport to cover errors such as these in the crummy double-layer boards without sufficient decoupling, the microcontrollers and memories without parity or ECC, etc., found in the cost-reduced-to-the-razor's-edge hardware most of us run on, but it also covers system-level errors such as entire blocks being replaced, or large fractions of them being corrupted by high-level bugs. With the current fletcher2 we have only a 50-50 chance of catching these multi-bit errors. The probability of multiple bits being changed is not small, because the probabilities of the error mechanism affecting the 4,096 to 1,048,576 bits in the block are not independent. Indeed, in many of the show-cased mechanisms it is a sure bet: the entire disk sector is written with the wrong data, for sure! Although there is a good chance that many of the bits in the sector happen to match, there is an excellent chance that many are different. And the mechanisms that caused these differences were not independent.

Re: "AFAIK, no collisions have been found in SHA-256 digests for symbols of size 1,048,576, but it has not been proven that they do not exist": for sure they exist. By the pigeonhole principle, there are 2^1,048,576 possible 1,048,576-bit blocks but only 2^256 digests, so on average 2^1,048,320 blocks map to every SHA-256 digest. One hopes that the same properties that make SHA-256 a good cryptographic hash also make it a good hash, period. This, I admit, is a leap of ignorance (at least I know what cliff I am leaping off of).
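[Editor's note: the pigeonhole count is plain arithmetic, nothing ZFS-specific, and easy to check -- on average, each SHA-256 digest is shared by 2^1,048,320 maximum-size ZFS records:]

```python
# Count the average number of 128 KB blocks per SHA-256 digest.
block_bits = 128 * 1024 * 8             # largest ZFS record: 1,048,576 bits
digest_bits = 256
avg_preimages = 2 ** (block_bits - digest_bits)   # pigeonhole average
print(avg_preimages.bit_length() - 1)   # prints 1048320, i.e. 2**1048320
```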
Regarding the question of what people have seen: I have seen lots of unexplained things happen, and by definition one never knows why. I am not interested in seeing any more. I see the potential for disaster, and my time, and the time of my group, is better spent doing other things. That is why I moved to ZFS.
Re: [zfs-discuss] Best way to convert checksums
>>>>> re == Richard Elling richard.ell...@gmail.com writes:

    re> By your logic, SECDED ECC for memory is broken because it only
    re> corrects

ECC is not a checksum. Go ahead, get out your dictionary, enter severe-pedantry-mode, but it is relevantly different. In, for example, data transmission scenarios, FECs like ECC are often used along with a strong non-correcting checksum over a larger block. The OP further described scenarios plausible for storage, like ``long string of zeroes with 1 bit flipped,'' that produce collisions with the misimplemented fletcher2 (but, obviously, not with any strong checksum like correct-fletcher2).

    re> is fletcher2 good enough for storage?

yes, it probably is good enough, but ZFS implements some other broken algorithm and calls it fletcher2. so, please stop saying fletcher2.

    re> I'll blame the lawyers. They are causing me to remove certain
    re> words from my vocabulary :-(

yeah, well, allow me to add a word back to the vocabulary: BROKEN. If you are not legally allowed to use words like broken and working, then find another identity from which to talk, please.

    re> Question for the zfs-discuss participants, have you seen a
    re> data corruption that was not detected when using fletcher2?

This is ridiculous. It's not fletcher2, it's brokenfletcher2. It's avoidably extremely weak. It's reasonable to want to use a real checksum, and this PR game you are playing is frustrating and confidence-harming for people who want that. This does not have to become a big deal, unless you try to spin it with a 7200rpm PR machine like IBM did with their broken Deathstar drives before they became HGST. Please: what we need to do is admit that the checksum is relevantly broken in a way that compromises the integrity guarantees with which ZFS was sold to many customers, fix the checksum, and learn how to conveniently migrate our data.
Based on the table you posted, I guess file data can be set to fletcher4 or sha256 using filesystem properties to work around the bug on Solaris versions with the broken implementation.

1. What's needed to avoid fletcher2 on the ZIL on broken Solaris versions?

2. I understand the workaround, but not the fix. How does the fix included in S10u8 and snv_114 work? Is there a ZFS version bump? Does the fix work by implementing fletcher2 correctly? Or does it just disable fletcher2 and force everything to use brokenfletcher4, which is good enough? If the former, how are the broken and correct versions of fletcher2 distinguished---do they show up with different names in the pool properties? Once you have the fixed software, how do you make sure fixed checksums are actually covering data blocks originally written by old broken software? I assume you have to use rsync or zfs send/recv to rewrite all the data with the new checksum? If yes, what do you have to do before rewriting---upgrade Solaris and then 'zfs upgrade' each filesystem one by one? Will zfs send/recv work across the filesystem versions, or does the copying have to be done with rsync?

3. Speaking of which, what about the checksum in zfs send streams? Is it also fletcher2, and if so, was it also fixed in s10u8/snv_114, and how does this affect compatibility for people who have ignored my advice and stored streams instead of zpools? Will a newer 'zfs recv' always work with an older 'zfs send', but not the other way around?

There is basically no information about implementing the fix in the bug, and we can't write to the bug from outside Sun. Whatever sysadmins need to do to get their data under the strength of checksum they thought it was under, it might be nice to describe it in the bug for whoever gets referred to the bug and has an affected version.
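[Editor's note: the weakness Miles describes is easy to demonstrate. The sketch below assumes the widely reported structure of the pre-fix implementation -- two interleaved second-order sums over 64-bit words, overflowing mod 2^64 with no Fletcher-style modular reduction. Under that assumption, flipping the top bit of two words in the same lane, an even number of chunks apart, collides with the untouched block:]

```python
MASK = (1 << 64) - 1   # lazy mod-2**64 arithmetic, the crux of the bug

def zfs_fletcher2(words):
    """The checksum ZFS calls fletcher2 (assumed structure): two
    interleaved second-order sums over 64-bit words, two words
    (one 128-bit chunk) per iteration, all sums overflowing mod 2**64."""
    a0 = a1 = b0 = b1 = 0
    for i in range(0, len(words), 2):
        a0 = (a0 + words[i]) & MASK
        a1 = (a1 + words[i + 1]) & MASK
        b0 = (b0 + a0) & MASK
        b1 = (b1 + a1) & MASK
    return a0, a1, b0, b1

# Miles's ``long string of zeroes'' scenario, two-bit variant: each MSB
# flip contributes 2**63, the pair sums to 2**64 in a0, and because the
# flips sit an even number of chunks apart, b0's accumulated carries
# cancel mod 2**64 as well.  The corruption is completely undetected.
clean = [0] * 8                  # tiny all-zero block: four 128-bit chunks
corrupt = list(clean)
corrupt[0] ^= 1 << 63            # MSB of word 0 (chunk 0, lane 0)
corrupt[4] ^= 1 << 63            # MSB of word 4 (chunk 2, lane 0)
assert zfs_fletcher2(corrupt) == zfs_fletcher2(clean)   # collision!
```

A correct Fletcher (with modular reduction below the word size) or fletcher4's wider accumulators would catch this pattern; a single MSB flip is still detected even by the broken sum, which is why the constructed collision needs two flips.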
Re: [zfs-discuss] Best way to convert checksums
Let me try to refocus. Given that I have a U4 system with a zpool created with fletcher2: what blocks in the system are protected by fletcher2, or even fletcher4 (although that does not worry me so much)?

Given that I only have 1.6 TB of data in a 4 TB pool, what can I do to change those blocks to sha256 or fletcher4:

(1) Without destroying and recreating the zpool under U4
(2) With destroying and recreating the zpool under U4 (which I don't really have the resources to pull off)
(3) With upgrading to U7 (perhaps in a few months)
(4) With upgrading to U8

Thanks.
Re: [zfs-discuss] Can't rm file when No space left on device...
> NO. Snapshotting is sacred

LOL! Ok, ok, I admit that snapshotting the whole ZFS root filesystem (yes, we have ZFS root in production, oops) instead of creating individual snapshots for *each* individual ZFS is against the code of good sysadmin-ing. I bow to the developer gods and will only follow the approved gospel in the future ;)

> once you break the model where a snapshot is a point-in-time picture,
> all sorts of bad things can happen. You've changed a fundamental
> assumption of snapshots, and this then impacts how we view them from
> all sorts of angles; it's a huge loss to trade away for a very small
> gain.

Hmm ... I can see how the assumption of a snapshot being unalterable could provide some programming shortcuts and opportunities for optimization of ZFS code. Not sure that I understand the huge-loss perspective, though. I think at the point where I am desperately scrabbling to free the 30% of my root FS held hostage by an accidental snapshot, while keeping my on-line backup strategy intact, I won't be too worried about performance ;)

> Should you want to modify a snapshot for some reason, that's what the
> 'zfs clone' function is for. clone your snapshot, promote it, and make
> your modifications.

Err ... hello ... filesystem already full ... hello?
Re: [zfs-discuss] bigger zfs arc
On Fri, Oct 2, 2009 at 1:45 PM, Rob Logan r...@logan.com wrote:
> zfs will use as much memory as is necessary but how is necessary
> calculated? using arc_summary.pl from
> http://www.cuddletech.com/blog/pivot/entry.php?id=979 my tiny system
> shows:
>
> Current Size: 4206 MB (arcsize)
> Target Size (Adaptive): 4207 MB (c)

That looks a lot like ~ 4 * 1024 MB. Is this a 64-bit capable system that you have booted from a 32-bit kernel?

--
Mike Gerdts
http://mgerdts.blogspot.com/
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 3:44 PM, Ray Clark wrote:
> Let me try to refocus: Given that I have a U4 system with a zpool
> created with fletcher2: What blocks in the system are protected by
> fletcher2, or even fletcher4, although that does not worry me so much.
> Given that I only have 1.6 TB of data in a 4 TB pool, what can I do to
> change those blocks to sha256 or fletcher4:
> (1) Without destroying and recreating the zpool under U4
> (2) With destroying and recreating the zpool under U4 (which I don't
> really have the resources to pull off)
> (3) With upgrading to U7 (perhaps in a few months)
> (4) With upgrading to U8

This has been answered several times in this thread already:

    zfs set checksum=sha256 <filesystem>

then copy your files -- all newly written data will have the sha256 checksums.
 -- richard
Re: [zfs-discuss] Best way to convert checksums
On Oct 2, 2009, at 3:36 PM, Miles Nordin wrote:
>     re> By your logic, SECDED ECC for memory is broken because it only
>     re> corrects
>
> ECC is not a checksum.

SHA-256 is not a checksum either, but that isn't the point. The concern is that corruption can be detected. ECC has very, very limited detection capabilities, yet it is good enough for many people. We know that MOS memories have certain failure modes that cause bit flips, and by using ECC and interleaving, the dependability is improved. The big question is: what does the corrupted data look like in storage? Random bit flips? Big chunks of zeros? 55aa patterns? Since the concern with the broken fletcher2 is restricted to the most significant bits, we are most concerned with failures where the most significant bits are set to ones. But as I said, we have no real idea what the corrupted data should look like, and if it is zero-filled, then fletcher2 will catch it.

> Go ahead, get out your dictionary, enter severe-pedantry-mode, but it
> is relevantly different. In for example data transmission scenarios,
> FECs like ECC are often used along with a strong noncorrecting
> checksum over a larger block. The OP further described scenarios
> plausible for storage, like ``long string of zeroes with 1 bit
> flipped'', that produce collisions with the misimplemented fletcher2
> (but, obviously, not with any strong checksum like correct-fletcher2).
>
>     re> is fletcher2 good enough for storage?
>
> yes, it probably is good enough, but ZFS implements some other broken
> algorithm and calls it fletcher2. so, please stop saying fletcher2.

If I were to refer to Fletcher's algorithm, I would use "Fletcher." When I am referring to the ZFS checksum setting of fletcher2, I will continue to use "fletcher2."

>     re> I'll blame the lawyers. They are causing me to remove certain
>     re> words from my vocabulary :-(
>
> yeah, well, allow me to add a word back to the vocabulary: BROKEN.
> If you are not legally allowed to use words like broken and working,
> then find another identity from which to talk, please.
>
>     re> Question for the zfs-discuss participants, have you seen a
>     re> data corruption that was not detected when using fletcher2?
>
> This is ridiculous. It's not fletcher2, it's brokenfletcher2. It's
> avoidably extremely weak. It's reasonable to want to use a real
> checksum, and this PR game you are playing is frustrating and
> confidence-harming for people who want that.

There is no PR campaign. It is what it is. What is done is done.

> This does not have to become a big deal, unless you try to spin it
> with a 7200rpm PR machine like IBM did with their broken Deathstar
> drives before they became HGST. Please, what we need to do is admit
> that the checksum is relevantly broken in a way that compromises the
> integrity guarantees with which ZFS was sold to many customers, fix
> the checksum, and learn how to conveniently migrate our data.

Unfortunately, there is a backwards-compatibility issue that requires the current fletcher2 to live for a very long time. The only question for debate is whether it should be the default. To date, I see no field data that suggests it is not detecting corruption.

> Based on the table you posted, I guess file data can be set to
> fletcher4 or sha256 using filesystem properties to work around the bug
> on Solaris versions with the broken implementation.
>
> 1. What's needed to avoid fletcher2 on the ZIL on broken Solaris
> versions?

Please file RFEs at bugs.opensolaris.org.

> 2. I understand the workaround, but not the fix. How does the fix
> included in S10u8 and snv_114 work? Is there a ZFS version bump? Does
> the fix work by implementing fletcher2 correctly? or does it just
> disable fletcher2 and force everything to use brokenfletcher4 which is
> good enough? If the former, how are the broken and correct versions of
> fletcher2 distinguished---do they show up with different names in the
> pool properties?
The best I can tell, the comments are changed to indicate fletcher2 is deprecated. However, it must live on (forever) because of backwards compatibility. I presume one day the default will change to fletcher4 or something else. This is implied by zfs(1m):

    checksum=on | off | fletcher2 | fletcher4 | sha256

        Controls the checksum used to verify data integrity. The
        default value is on, which automatically selects an
        appropriate algorithm (currently fletcher2, but this may
        change in future releases). The value off disables integrity
        checking on user data. Disabling checksums is NOT a
        recommended practice.

> Once you have the fixed software, how do you make sure fixed checksums
> are actually covering data blocks originally written by old broken
> software? I assume you have to use rsync or zfs send/recv to rewrite
> all the data with the new checksum? If yes, what do you have to do
> before rewriting---upgrade
Re: [zfs-discuss] bigger zfs arc
On Oct 2, 2009, at 11:45 AM, Rob Logan wrote:
> zfs will use as much memory as is necessary but how is necessary
> calculated? using arc_summary.pl from
> http://www.cuddletech.com/blog/pivot/entry.php?id=979 my tiny system
> shows:
>
> Current Size: 4206 MB (arcsize)
> Target Size (Adaptive): 4207 MB (c)
> Min Size (Hard Limit): 894 MB (zfs_arc_min)
> Max Size (Hard Limit): 7158 MB (zfs_arc_max)
>
> so arcsize is close to the desired c, no pressure here but it would be
> nice to know how c is calculated as its much smaller than zfs_arc_max
> on a system like yours with nothing else on it.

c is the current target size of the ARC. c will change dynamically, as memory pressure and demand change.

> When an L2ARC is attached does it get used if there is no memory
> pressure? My guess is no, for the same reason an L2ARC takes so long
> to fill. arc_summary.pl from the same system is:
>
> Most Recently Used Ghost: 0% 9367837 (mru_ghost) [ Return Customer
> Evicted, Now Back ]
> Most Frequently Used Ghost: 0% 11138758 (mfu_ghost) [ Frequent
> Customer Evicted, Now Back ]
>
> so with no ghosts, this system wouldn't benefit from an L2ARC even if
> added.

You want to cache stuff closer to where it is being used. Expect the L2ARC to contain ARC evictions.

> In review: (audit welcome) if arcsize = c and is much less than
> zfs_arc_max, there is no point in adding system ram in hopes of
> increasing arc.

If you add RAM, arc_c_max will change unless you limit it by setting zfs_arc_max. In other words, c will change dynamically between the limits: arc_c_min <= c <= arc_c_max. By default for 64-bit machines, arc_c_max is the greater of 3/4 of physical memory or all but 1 GB. If zfs_arc_max is set and is less than arc_c_max and greater than 64 MB, then arc_c_max is set to zfs_arc_max. This allows you to reasonably cap arc_c_max. Note: if you pick an unreasonable value for zfs_arc_max, you will not be notified -- check current values with kstat -n arcstats.

> if m?u_ghost is a small %, there is no point in adding an L2ARC.

Yes, to the first order.
Ghosts are those whose data is evicted, but whose pointer remains.

> if you do add a L2ARC, one must have ram between c and zfs_arc_max for
> its pointers.

No. The pointers are part of c. Herein lies the rub: if you have a very large L2ARC and limited RAM, then you could waste L2ARC space because the pointers run out of space. SWAG the pointers at 200 bytes each per record. For example, suppose you use a Seagate 2 TB disk for L2ARC:

+ Disk size = 3,907,029,168 512-byte sectors - 4.5 MB for labels and reserve
+ Workload uses an 8 KB fixed record size (e.g. Oracle OLTP database)
+ RAM needed to support this L2ARC on this workload is approximately
  1 GB + application space + ((3,907,029,168 - 9,232) * 200 / 16) bytes,
  or at least 48 GBytes, practically speaking

Do not underestimate the amount of RAM needed to address lots of stuff :-)
 -- richard
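[Editor's note: Richard's 2 TB example works out as follows. This restates his own numbers as a sketch; the 200-byte per-record header size is his SWAG, not a measured value.]

```python
# Back-of-envelope RAM cost of the ARC headers needed to address a 2 TB
# L2ARC full of 8 KB records.
SECTOR = 512
disk_sectors = 3_907_029_168        # Seagate 2 TB disk
reserved = 9_232                    # sectors of labels + reserve (~4.5 MB)
recordsize = 8 * 1024               # fixed 8 KB records (OLTP workload)
header_bytes = 200                  # per-record ARC header (SWAG)

records = (disk_sectors - reserved) * SECTOR // recordsize
header_ram = records * header_bytes
print(f"{records:,} records -> {header_ram / 1e9:.1f} GB of headers")
# prints "244,188,746 records -> 48.8 GB of headers"
```

So the header overhead alone is roughly 49 GB, before the 1 GB baseline and application memory; a larger recordsize divides the figure proportionally, which is why recordsize matters so much when sizing RAM for a big L2ARC.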