Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?

2011-04-25 Thread Erik Trimble

On 4/25/2011 6:23 PM, Ian Collins wrote:

  On 04/26/11 01:13 PM, Fred Liu wrote:

Hmm, it seems dedup is pool-based, not filesystem-based.

That's correct. Although it can be turned off and on at the filesystem
level (assuming it is enabled for the pool).
Which is effectively the same as choosing per-filesystem dedup, just 
the inverse. You turn it on at the pool level and off at the filesystem 
level, which is equivalent to the "off at the pool level, on at the 
filesystem level" approach that NetApp takes.



If it could have finer-grained granularity (like per filesystem), that would be great!
It's a pity! NetApp is sweet in this respect.


So what happens to user B's quota if user B stores a ton of data that is
a duplicate of user A's data and then user A deletes the original?
Actually, right now, nothing happens to B's quota. He's always charged 
the un-deduped amount for his quota usage, whether or not dedup is 
enabled, and regardless of how much of his data is actually deduped. 
That is as it should be: quotas are about limiting how much a user is 
consuming, not how much the backend needs to store for that consumption.


e.g.

A, B, C, & D all have 100MB of data in the pool, with dedup on.

20MB of storage has a dedup factor of 3:1 (common to A, B, & C)
50MB of storage has a dedup factor of 2:1 (common to A & B)

Thus, the amount of unique data would be:

A: 100 - 20 - 50 = 30MB
B: 100 - 20 - 50 = 30MB
C: 100 - 20 = 80MB
D: 100MB

Summing it all up, you would have an actual storage consumption of 70 
(50+20 deduped) + 30+30+80+100 (unique data) = 310MB of actual storage 
for 400MB of apparent storage (i.e. a dedup ratio of 1.29:1).


A, B, C, & D would each still have a quota usage of 100MB.
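
You can see both views side by side; a rough sketch (the pool and dataset names here are made up):

# zfs get used,quota tank/home/userB        # logical usage, i.e. what the quota is charged against
# zfs list -r -o name,used,refer tank/home  # same logical view, per filesystem
# zpool list -o name,size,alloc,free,dedupratio tank   # physical view, pool-wide dedup ratio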


--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-25 Thread Lamp Zy

Thanks Brandon,

On 04/25/2011 05:47 PM, Brandon High wrote:

On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy  wrote:

I'd expect the spare drives to auto-replace the failed one but this is not
happening.

What am I missing?


Is the autoreplace property set to 'on'?
# zpool get autoreplace fwgpool0
# zpool set autoreplace=on fwgpool0


Yes, autoreplace is on. I should have mentioned it in my original post:

# zpool get autoreplace fwgpool0
NAME      PROPERTY     VALUE  SOURCE
fwgpool0  autoreplace  on     local



I really would like to get the pool back in a healthy state using the spare
drives before trying to identify which one is the failed drive in the
storage array and trying to replace it. How do I do this?


Turning on autoreplace might start the replace. If not, the following
will replace the failed drive with the first spare. (I'd suggest
verifying the device names before running it.)
# zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0


I thought about doing that. My understanding is that this command should 
be used to replace a drive with a brand new one, i.e. a drive that is not 
known to the raidz configuration.


Should I somehow unconfigure one of the spare drives to be just a loose 
drive and not a raidz spare before running the command (and how do I do 
it)? Or, is it safe to just run the replace command and let zfs take 
care of the details, like noticing that one of the spares has been 
manually re-purposed to replace a failed drive?
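
In other words, something like this (using the same device names as in my zpool status above), if that is indeed the supported path:

# zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0   # spare should show as INUSE
# zpool status fwgpool0                                                # watch the resilver onto the spare
# zpool detach fwgpool0 c4t5000C5001128FE4Dd0   # afterwards, make the spare a permanent member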


Thank you
Peter
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive replacement speed

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 5:26 PM, Brandon High  wrote:
> Setting zfs_resilver_delay seems to have helped some, based on the
> iostat output. Are there other tunables?

I found zfs_resilver_min_time_ms while looking. I've tried bumping it
up considerably, without much change.
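
For reference, I'm reading and writing the tunables like this (the 5000 below is just an example value):

echo zfs_resilver_min_time_ms/D | pfexec mdb -k          # print the current value
echo zfs_resilver_min_time_ms/W0t5000 | pfexec mdb -kw   # e.g. bump it to 5000ms
echo zfs_resilver_delay/D | pfexec mdb -k                # check the throttle delay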

'zpool status' is still showing:
 scan: resilver in progress since Sat Apr 23 17:03:13 2011
6.06T scanned out of 6.40T at 36.0M/s, 2h46m to go
769G resilvered, 94.64% done

'iostat -xn' shows asvc_t under 10ms still.

Increasing the per-device queue depth has increased the asvc_t but
hasn't done much to affect the throughput. I'm using:
echo zfs_vdev_max_pending/W0t35 | pfexec mdb -kw

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?

2011-04-25 Thread Ian Collins
 On 04/26/11 01:13 PM, Fred Liu wrote:
> Hmm, it seems dedup is pool-based, not filesystem-based.

That's correct. Although it can be turned off and on at the filesystem
level (assuming it is enabled for the pool).

> If it could have finer-grained granularity (like per filesystem), that would be great!
> It's a pity! NetApp is sweet in this respect.
>
So what happens to user B's quota if user B stores a ton of data that is
a duplicate of user A's data and then user A deletes the original?

-- 
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?

2011-04-25 Thread Fred Liu
Hmm, it seems dedup is pool-based, not filesystem-based.
If it could have finer-grained granularity (like per filesystem), that would be great!
It's a pity! NetApp is sweet in this respect.

Thanks.

Fred 

> -Original Message-
> From: Brandon High [mailto:bh...@freaks.com]
> Sent: Tuesday, April 26, 2011 8:50
> To: Fred Liu
> Cc: cindy.swearin...@oracle.com; ZFS discuss
> Subject: Re: [zfs-discuss] How does ZFS dedup space accounting work
> with quota?
> 
> On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu  wrote:
> > So how can I set the quota size on a file system with dedup enabled?
> 
> I believe the quota applies to the non-dedup'd data size. If a user
> stores 10G of data, it will use 10G of quota, regardless of whether it
> dedups at 100:1 or 1:1.
> 
> -B
> 
> --
> Brandon High : bh...@freaks.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] How does ZFS dedup space accounting work with quota?

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 4:53 PM, Fred Liu  wrote:
> So how can I set the quota size on a file system with dedup enabled?

I believe the quota applies to the non-dedup'd data size. If a user
stores 10G of data, it will use 10G of quota, regardless of whether it
dedups at 100:1 or 1:1.
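
So the quota gets set the usual way and is checked against that logical size, e.g. (made-up dataset name):

# zfs set quota=100G tank/home/fred    # caps the un-deduped (logical) usage
# zfs get quota,used tank/home/fred    # 'used' is what counts against the quota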

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 4:56 PM, Lamp Zy  wrote:
> I'd expect the spare drives to auto-replace the failed one but this is not
> happening.
>
> What am I missing?

Is the autoreplace property set to 'on'?
# zpool get autoreplace fwgpool0
# zpool set autoreplace=on fwgpool0

> I really would like to get the pool back in a healthy state using the spare
> drives before trying to identify which one is the failed drive in the
> storage array and trying to replace it. How do I do this?

Turning on autoreplace might start the replace. If not, the following
will replace the failed drive with the first spare. (I'd suggest
verifying the device names before running it.)
# zpool replace fwgpool0 c4t5000C5001128FE4Dd0 c4t5000C50014D70072d0

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive replacement speed

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 4:45 PM, Richard Elling
 wrote:
> If there is other work going on, then you might be hitting the resilver
> throttle. By default, it will delay 2 clock ticks, if needed. It can be turned

There is some other access to the pool from nfs and cifs clients, but
not much, and mostly reads.

Setting zfs_resilver_delay seems to have helped some, based on the
iostat output. Are there other tunables?

> Probably won't work because it does not make the resilvering drive
> any faster.

It doesn't seem like the devices are the bottleneck, even with the
delay turned off.

$ iostat -xn 60 3
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  369.2   11.5 5577.0   71.3  0.7  0.7    1.9    1.9  14  29 c2t0d0
  371.9   11.5 5570.3   71.3  0.7  0.7    1.7    1.8  13  29 c2t1d0
  369.9   11.5 5574.4   71.3  0.7  0.7    1.8    1.9  14  29 c2t2d0
  370.7   11.5 5573.9   71.3  0.7  0.7    1.8    1.9  14  29 c2t3d0
  368.0   11.5 5553.1   71.3  0.7  0.7    1.8    1.9  14  29 c2t4d0
  196.1  172.8 2825.5 2436.6  0.3  1.1    0.8    3.0   6  26 c2t5d0
  183.6  184.9 2717.6 2674.7  0.5  1.3    1.4    3.5  11  31 c2t6d0
  393.0   11.2 5540.7   71.3  0.5  0.6    1.3    1.5  12  26 c2t7d0
   95.8    1.2   95.6   16.2  0.0  0.0    0.2    0.2   0   1 c0t0d0
    0.9    1.2    3.6   16.2  0.0  0.0    7.5    1.9   0   0 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  891.2   11.8 2386.9   64.4  0.0  1.2    0.0    1.3   1  36 c2t0d0
  919.9   12.1 2351.8   64.6  0.0  1.1    0.0    1.2   0  35 c2t1d0
  906.9   12.1 2346.1   64.6  0.0  1.2    0.0    1.3   0  36 c2t2d0
  877.9   11.6 2351.0   64.5  0.7  0.5    0.8    0.6  23  35 c2t3d0
  883.4   12.0 2322.0   64.4  0.2  1.0    0.2    1.1   7  35 c2t4d0
    0.8  758.0    0.8 1910.4  0.2  5.0    0.2    6.6   3  72 c2t5d0
  882.7   11.4 2355.1   64.4  0.8  0.4    0.9    0.4  27  34 c2t6d0
  907.8   11.4 2373.1   64.5  0.7  0.3    0.8    0.4  23  30 c2t7d0
 1607.8    9.4 1568.2   83.0  0.1  0.2    0.1    0.1   3  18 c0t0d0
    7.3    9.1   23.5   83.0  0.1  0.0    6.0    1.4   2   2 c0t1d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  960.3   12.7 2868.0   59.0  1.1  0.7    1.2    0.8  37  52 c2t0d0
  963.2   12.7 2877.5   59.1  1.1  0.8    1.1    0.8  36  51 c2t1d0
  960.3   12.6 2844.7   59.1  1.1  0.7    1.1    0.8  37  52 c2t2d0
 1000.1   12.8 2827.1   59.0  0.6  1.2    0.6    1.2  21  52 c2t3d0
  960.9   12.3 2811.1   59.0  1.3  0.6    1.3    0.6  42  51 c2t4d0
    0.5  962.2    0.4 2418.3  0.0  4.1    0.0    4.3   0  59 c2t5d0
 1014.2   12.3 2820.6   59.1  0.8  0.8    0.8    0.8  28  48 c2t6d0
 1031.2   12.5 2822.0   59.1  0.8  0.8    0.7    0.8  26  45 c2t7d0
 1836.4    0.0 1783.4    0.0  0.0  0.2    0.0    0.1   1  19 c0t0d0
    5.3    0.0    5.3    0.0  0.0  0.0    1.1    1.5   1   1 c0t1d0


-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Spare drives sitting idle in raidz2 with failed drive

2011-04-25 Thread Lamp Zy

Hi,

One of my drives failed in Raidz2 with two hot spares:

# zpool status
  pool: fwgpool0
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: resilver completed after 0h0m with 0 errors on Mon Apr 25 14:45:44 2011

config:

NAME   STATE READ WRITE CKSUM
fwgpool0   DEGRADED 0 0 0
  raidz2   DEGRADED 0 0 0
c4t5000C500108B406Ad0  ONLINE   0 0 0
c4t5000C50010F436E2d0  ONLINE   0 0 0
c4t5000C50011215B6Ed0  ONLINE   0 0 0
c4t5000C50011234715d0  ONLINE   0 0 0
c4t5000C50011252B4Ad0  ONLINE   0 0 0
c4t5000C500112749EDd0  ONLINE   0 0 0
c4t5000C5001128FE4Dd0  UNAVAIL  0 0 0  cannot open
c4t5000C500112C4959d0  ONLINE   0 0 0
c4t5000C50011318199d0  ONLINE   0 0 0
c4t5000C500113C0E9Dd0  ONLINE   0 0 0
c4t5000C500113D0229d0  ONLINE   0 0 0
c4t5000C500113E97B8d0  ONLINE   0 0 0
c4t5000C50014D065A9d0  ONLINE   0 0 0
c4t5000C50014D0B3B9d0  ONLINE   0 0 0
c4t5000C50014D55DEFd0  ONLINE   0 0 0
c4t5000C50014D642B7d0  ONLINE   0 0 0
c4t5000C50014D64521d0  ONLINE   0 0 0
c4t5000C50014D69C14d0  ONLINE   0 0 0
c4t5000C50014D6B2CFd0  ONLINE   0 0 0
c4t5000C50014D6C6D7d0  ONLINE   0 0 0
c4t5000C50014D6D486d0  ONLINE   0 0 0
c4t5000C50014D6D77Fd0  ONLINE   0 0 0
spares
  c4t5000C50014D70072d0AVAIL
  c4t5000C50014D7058Dd0AVAIL

errors: No known data errors


I'd expect the spare drives to auto-replace the failed one but this is 
not happening.


What am I missing?

I really would like to get the pool back in a healthy state using the 
spare drives before trying to identify which one is the failed drive in 
the storage array and trying to replace it. How do I do this?


Thanks for any hints.

--
Peter
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] How does ZFS dedup space accounting work with quota?

2011-04-25 Thread Fred Liu
Cindy,

Following is quoted from ZFS Dedup FAQ:

"Deduplicated space accounting is reported at the pool level. You must use the 
zpool list command rather than the zfs list command to identify disk space 
consumption when dedup is enabled. If you use the zfs list command to review 
deduplicated space, you might see that the file system appears to be increasing 
because we're able to store more data on the same physical device. Using the 
zpool list will show you how much physical space is being consumed and it will 
also show you the dedup ratio. The df command is not dedup-aware and will not 
provide accurate space accounting."

So how can I set the quota size on a file system with dedup enabled?

Thanks.

Fred

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Drive replacement speed

2011-04-25 Thread Richard Elling
On Apr 25, 2011, at 2:52 PM, Brandon High wrote:

> I'm in the process of replacing drive in a pool, and the resilver
> times seem to have increased with each device. The way that I'm doing
> this is by pulling a drive, physically replacing it, then doing
> 'cfgadm -c configure  ; zpool replace tank '. I don't have any
> hot-swap bays available, so I'm physically replacing the device before
> doing a 'zpool replace'.
> 
> I'm replacing Western Digital WD10EADS 1TB drives with Hitachi 5K3000
> 3TB drives. Neither device is fast, but they aren't THAT slow. wsvc_t
> and asvc_t both look fairly healthy giving the device types.

Look for 10-12 ms for asvc_t.  In my experience, SATA disks tend to not 
handle NCQ as well as SCSI disks handle TCQ -- go figure. In your iostats
below, you are obviously not bottlenecking on the disks.

> 
> Replacing the first device (took about 20 hours) went about as
> expected. The second took about 44 hours. The third is still running
> and should finish in slightly over 48 hours.

If there is other work going on, then you might be hitting the resilver
throttle. By default, it will delay 2 clock ticks, if needed. It can be turned 
off temporarily using:
echo zfs_resilver_delay/W0t0 | mdb -kw

to return to normal:
echo zfs_resilver_delay/W0t2 | mdb -kw

> I'm wondering if the following would help for the next drive:
> # zpool offline tank c2t4d0
> # cfgadm -c unconfigure sata3/4::dsk/c2t4d0
> 
> At this point pull the drive and put it into an external USB adapter.
> Put the new drive in the hot-swap bay. The USB adapter shows up as
> c4t0d0.
> 
> # zpool online tank c4t0d0
> 
> This should re-add it to the pool and resilver the last few
> transactions that may have been missed, right?
> 
> Then I want to actually replace the drive in the zpool:
> # cfgadm -c configure sata3/4
> # zpool replace tank c4t0d0 c2t4d0
> 
> Will this work? Will the replace go faster, since it won't need to
> resilver from the parity data?

Probably won't work because it does not make the resilvering drive
any faster.
 -- richard

> 
> 
> $ zpool list tank
> NAME   SIZE  ALLOC   FREE    CAP  DEDUP    HEALTH  ALTROOT
> tank  7.25T  6.40T   867G    88%  1.11x  DEGRADED  -
> $ zpool status -x
>  pool: tank
> state: DEGRADED
> status: One or more devices is currently being resilvered.  The pool will
>continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
> scan: resilver in progress since Sat Apr 23 17:03:13 2011
>5.91T scanned out of 6.40T at 38.0M/s, 3h42m to go
>752G resilvered, 92.43% done
> config:
> 
>NAME  STATE READ WRITE CKSUM
>tank  DEGRADED 0 0 0
>  raidz2-0DEGRADED 0 0 0
>c2t0d0ONLINE   0 0 0
>c2t1d0ONLINE   0 0 0
>c2t2d0ONLINE   0 0 0
>c2t3d0ONLINE   0 0 0
>c2t4d0ONLINE   0 0 0
>replacing-5   DEGRADED 0 0 0
>  c2t5d0/old  FAULTED  0 0 0  corrupted data
>  c2t5d0  ONLINE   0 0 0  (resilvering)
>c2t6d0ONLINE   0 0 0
>c2t7d0ONLINE   0 0 0
> 
> errors: No known data errors
> $ zpool iostat -v tank 60 3
>                  capacity     operations    bandwidth
> pool            alloc   free   read  write   read  write
> --------------  -----  -----  -----  -----  -----  -----
> tank            6.40T   867G    566     25  32.2M   156K
>   raidz2        6.40T   867G    566     25  32.2M   156K
>     c2t0d0          -      -    362     11  5.56M  71.6K
>     c2t1d0          -      -    365     11  5.56M  71.6K
>     c2t2d0          -      -    363     11  5.56M  71.6K
>     c2t3d0          -      -    363     11  5.56M  71.6K
>     c2t4d0          -      -    361     11  5.54M  71.6K
>     replacing       -      -      0    492  8.28K  4.79M
>       c2t5d0/old    -      -    202      5  2.84M  36.7K
>       c2t5d0        -      -      0    315  8.66K  4.78M
>     c2t6d0          -      -    170    190  2.68M  2.69M
>     c2t7d0          -      -    386     10  5.53M  71.6K
> --------------  -----  -----  -----  -----  -----  -----
>
>                  capacity     operations    bandwidth
> pool            alloc   free   read  write   read  write
> --------------  -----  -----  -----  -----  -----  -----
> tank            6.40T   867G    612     14  8.43M  70.7K
>   raidz2        6.40T   867G    612     14  8.43M  70.7K
>     c2t0d0          -      -    411     11  1.51M  57.9K
>     c2t1d0          -      -    414     11  1.50M  58.0K
>     c2t2d0          -      -    385     11  1.51M  57.9K
>     c2t3d0          -      -    412     11  1.50M  58.0K
>     c2t4d0          -      -    412     11  1.45M  57.8K
>

[zfs-discuss] arcstat updates

2011-04-25 Thread Richard Elling
Hi ZFSers,
I've been working on merging the Joyent arcstat enhancements with some of my own
and am now to the point where it is time to broaden the requirements gathering. 
The result
is to be merged into the illumos tree.

arcstat is a perl script to show the value of ARC kstats as they change over 
time. This is
similar to the ideas behind mpstat, iostat, vmstat, and friends.

The current usage is:

Usage: arcstat [-hvx] [-f fields] [-o file] [interval [count]]

Field definitions are as follows:
 mtxmis : mutex_miss per second
  arcsz : ARC size
   mrug : MRU ghost list hits per second
 l2hit% : L2ARC access hit percentage
mh% : Metadata hit percentage
l2miss% : L2ARC access miss percentage
   read : Total ARC accesses per second
  l2hsz : L2ARC header size
  c : ARC target size
   mfug : MFU ghost list hits per second
   miss : ARC misses per second
dm% : Demand data miss percentage
hsz : ARC header size
   dhit : Demand data hits per second
  pread : Prefetch accesses per second
  dread : Demand data accesses per second
 l2miss : L2ARC misses per second
   pmis : Prefetch misses per second
   time : Time
l2bytes : Bytes read per second from the L2ARC
pm% : Prefetch miss percentage
mm% : Metadata miss percentage
   hits : ARC reads per second
  throt : Memory throttles per second
mfu : MFU list hits per second
 l2read : Total L2ARC accesses per second
   mmis : Metadata misses per second
   rmis : recycle_miss per second
   mhit : Metadata hits per second
   dmis : Demand data misses per second
mru : MRU list hits per second
ph% : Prefetch hits percentage
  eskip : evict_skip per second
 l2size : L2ARC size
 l2hits : L2ARC hits per second
   hit% : ARC hit percentage
  miss% : ARC miss percentage
dh% : Demand data hit percentage
  mread : Metadata accesses per second
   phit : Prefetch hits per second
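
For example, a typical invocation (combining fields from the list above with -f) might be:

arcstat -f time,read,dread,pread,miss,hit%,arcsz,c 5 10

which prints those columns every 5 seconds, 10 times.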

Some questions for the community:
1. Should there be flag compatibility with vmstat, iostat, mpstat, and friends?

2. What is missing?

3. Is it ok if the man page explains the meanings of each field, even though it
might be many pages long?

4. Is there a common subset of columns that are regularly used that would 
justify
a shortcut option? Or do we even need shortcuts? (eg -x)

5. Who wants to help with this little project?


-- 

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Nexenta European User Conference, Amsterdam, May 20
www.nexenta.com/corp/european-user-conference-2011







___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Brandon High
On Mon, Apr 25, 2011 at 8:20 AM, Edward Ned Harvey
 wrote:
> and 128k assuming default recordsize.  (BTW, recordsize seems to be a zfs
> property, not a zpool property.  So how can you know or configure the
> blocksize for something like a zvol iscsi target?)

zvols use the 'volblocksize' property, which defaults to 8k. A 1TB
zvol is therefore 2^27 blocks and would require ~ 34 GB for the ddt
(assuming that a ddt entry is 270 bytes).
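
The arithmetic, for anyone who wants to plug in their own numbers (plain bc, sizes in bytes):

echo 'scale=1; 2^40 / 8192 * 270 / 2^30' | bc   # 1TB of 8k blocks at 270 B/entry => ~33.7 GiB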

The zfs man page for the property reads:

volblocksize=blocksize

 For volumes, specifies the block size of the volume. The
 blocksize  cannot  be  changed  once the volume has been
 written, so it should be set at  volume  creation  time.
 The default blocksize for volumes is 8 Kbytes. Any power
 of 2 from 512 bytes to 128 Kbytes is valid.

 This property can also be referred to by  its  shortened
 column name, volblock.
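
So if DDT overhead matters more than small-block efficiency, the block size can be chosen up front when the zvol is created (a sketch; 'tank/vol1' is a made-up name):

# zfs create -V 1T -o volblocksize=64k tank/vol1
# zfs get volblocksize tank/vol1

A 64k volblocksize means 8x fewer blocks, and therefore roughly 8x fewer DDT entries, than the 8k default.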

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Drive replacement speed

2011-04-25 Thread Brandon High
I'm in the process of replacing drives in a pool, and the resilver
times seem to have increased with each device. The way that I'm doing
this is by pulling a drive, physically replacing it, then doing
'cfgadm -c configure  ; zpool replace tank '. I don't have any
hot-swap bays available, so I'm physically replacing the device before
doing a 'zpool replace'.

I'm replacing Western Digital WD10EADS 1TB drives with Hitachi 5K3000
3TB drives. Neither device is fast, but they aren't THAT slow. wsvc_t
and asvc_t both look fairly healthy given the device types.

Replacing the first device (took about 20 hours) went about as
expected. The second took about 44 hours. The third is still running
and should finish in slightly over 48 hours.

I'm wondering if the following would help for the next drive:
# zpool offline tank c2t4d0
# cfgadm -c unconfigure sata3/4::dsk/c2t4d0

At this point pull the drive and put it into an external USB adapter.
Put the new drive in the hot-swap bay. The USB adapter shows up as
c4t0d0.

# zpool online tank c4t0d0

This should re-add it to the pool and resilver the last few
transactions that may have been missed, right?

Then I want to actually replace the drive in the zpool:
# cfgadm -c configure sata3/4
# zpool replace tank c4t0d0 c2t4d0

Will this work? Will the replace go faster, since it won't need to
resilver from the parity data?


$ zpool list tank
NAME   SIZE  ALLOC   FREE    CAP  DEDUP    HEALTH  ALTROOT
tank  7.25T  6.40T   867G    88%  1.11x  DEGRADED  -
$ zpool status -x
  pool: tank
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Sat Apr 23 17:03:13 2011
5.91T scanned out of 6.40T at 38.0M/s, 3h42m to go
752G resilvered, 92.43% done
config:

NAME  STATE READ WRITE CKSUM
tank  DEGRADED 0 0 0
  raidz2-0DEGRADED 0 0 0
c2t0d0ONLINE   0 0 0
c2t1d0ONLINE   0 0 0
c2t2d0ONLINE   0 0 0
c2t3d0ONLINE   0 0 0
c2t4d0ONLINE   0 0 0
replacing-5   DEGRADED 0 0 0
  c2t5d0/old  FAULTED  0 0 0  corrupted data
  c2t5d0  ONLINE   0 0 0  (resilvering)
c2t6d0ONLINE   0 0 0
c2t7d0ONLINE   0 0 0

errors: No known data errors
$ zpool iostat -v tank 60 3
                 capacity     operations    bandwidth
pool            alloc   free   read  write   read  write
--------------  -----  -----  -----  -----  -----  -----
tank            6.40T   867G    566     25  32.2M   156K
  raidz2        6.40T   867G    566     25  32.2M   156K
    c2t0d0          -      -    362     11  5.56M  71.6K
    c2t1d0          -      -    365     11  5.56M  71.6K
    c2t2d0          -      -    363     11  5.56M  71.6K
    c2t3d0          -      -    363     11  5.56M  71.6K
    c2t4d0          -      -    361     11  5.54M  71.6K
    replacing       -      -      0    492  8.28K  4.79M
      c2t5d0/old    -      -    202      5  2.84M  36.7K
      c2t5d0        -      -      0    315  8.66K  4.78M
    c2t6d0          -      -    170    190  2.68M  2.69M
    c2t7d0          -      -    386     10  5.53M  71.6K
--------------  -----  -----  -----  -----  -----  -----

                 capacity     operations    bandwidth
pool            alloc   free   read  write   read  write
--------------  -----  -----  -----  -----  -----  -----
tank            6.40T   867G    612     14  8.43M  70.7K
  raidz2        6.40T   867G    612     14  8.43M  70.7K
    c2t0d0          -      -    411     11  1.51M  57.9K
    c2t1d0          -      -    414     11  1.50M  58.0K
    c2t2d0          -      -    385     11  1.51M  57.9K
    c2t3d0          -      -    412     11  1.50M  58.0K
    c2t4d0          -      -    412     11  1.45M  57.8K
    replacing       -      -      0    574    366   852K
      c2t5d0/old    -      -      0      0      0      0
      c2t5d0        -      -      0    324    366   852K
    c2t6d0          -      -    427     11  1.45M  57.8K
    c2t7d0          -      -    431     11  1.49M  57.9K
--------------  -----  -----  -----  -----  -----  -----

                 capacity     operations    bandwidth
pool            alloc   free   read  write   read  write
--------------  -----  -----  -----  -----  -----  -----
tank            6.40T   867G  1.02K     12  11.1M  69.4K
  raidz2        6.40T   867G  1.02K     12  11.1M  69.4K
    c2t0d0          -      -    772     10  1.99M  59.3K
    c2t1d0          -      -    771     10  1.99M  59.4K
    c2t2d0          -  

Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Freddie Cash
On Mon, Apr 25, 2011 at 10:55 AM, Erik Trimble  wrote:
> Min block size is 512 bytes.

Technically, isn't the minimum block size 2^(ashift value)?  Thus, on
4 KB disks where the vdevs have an ashift=12, the minimum block size
will be 4 KB.
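
A quick way to check what a given pool's vdevs are actually using (assuming the pool is present in the zpool.cache file; the grep just cuts the noise):

# zdb -C tank | grep ashift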

-- 
Freddie Cash
fjwc...@gmail.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Neil Perrin

On 04/25/11 11:55, Erik Trimble wrote:

On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:


And one more comment:  Based on what's below, it seems that the DDT 
gets stored on the cache device and also in RAM.  Is that correct?  
What if you didn't have a cache device?  Shouldn't it *always* be in 
ram?  And doesn't the cache device get wiped every time you reboot?  
It seems to me like putting the DDT on the cache device would be 
harmful...  Is that really how it is?


 

Nope. The DDT is stored only in one place: cache device if present, 
/or/ RAM otherwise (technically, ARC, but that's in RAM).  If a cache 
device is present, the DDT is stored there, BUT RAM also must store a 
basic lookup table for the DDT (yea, I know, a lookup table for a 
lookup table).


No, that's not true. The DDT is just like any other ZFS metadata and can 
be split over the ARC, cache device (L2ARC) and the main pool devices. 
An infrequently referenced DDT block will get evicted from the ARC to 
the L2ARC then evicted from the L2ARC.
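
For the curious, zdb's dedup statistics show how big the table actually is, with entry counts and their in-core and on-disk sizes (substitute your pool name):

# zdb -DD tank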

Neil.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Erik Trimble

On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:


There are a lot of conflicting references on the Internet, so I'd 
really like to solicit actual experts (ZFS developers or people who 
have physical evidence) to weigh in on this...


After searching around, the reference I found to be the most seemingly 
useful was Erik's post here:


http://opensolaris.org/jive/thread.jspa?threadID=131296

Unfortunately it looks like there's an arithmetic error (1TB of 4k 
blocks means 268 million blocks, not 1 billion).  Also, IMHO it seems 
important to make the distinction, #files != #blocks.  Due to the 
existence of larger files, there will sometimes be more than one block 
per file; and if I'm not mistaken, thanks to write aggregation, there 
will sometimes be more than one file per block.  YMMV.  Average block 
size could be anywhere between 1 byte and 128k assuming default 
recordsize.  (BTW, recordsize seems to be a zfs property, not a zpool 
property.  So how can you know or configure the blocksize for 
something like a zvol iscsi target?)


I said 2^30, but it's actually 2^28, which is roughly a quarter billion.  I 
should have been more exact.  And, the file != block difference is important to note.


zvols take an analogous attribute, volblocksize. And, zvols tend to be 
sticklers about all blocks being /exactly/ that value, unlike 
filesystems, which use recordsize as a *maximum* block size.


Min block size is 512 bytes.


(BTW, is there any way to get a measurement of number of blocks 
consumed per zpool?  Per vdev?  Per zfs filesystem?)  The calculations 
below are based on assumption of 4KB blocks adding up to a known total 
data consumption.  The actual thing that matters is the number of 
blocks consumed, so the conclusions drawn will vary enormously when 
people actually have average block sizes != 4KB.




you need to use zdb to see what the current block usage is for a 
filesystem. I'd have to look up the particular CLI usage for that, as I 
don't know what it is off the top of my head.
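
Something along these lines should do it for a whole pool, though (I don't recall a per-filesystem summary):

# zdb -b tank     # traverses the pool and prints block statistics, including the total bp count
# zdb -bb tank    # same, with a per-object-type breakdown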


And one more comment:  Based on what's below, it seems that the DDT 
gets stored on the cache device and also in RAM.  Is that correct?  
What if you didn't have a cache device?  Shouldn't it *always* be in 
ram?  And doesn't the cache device get wiped every time you reboot?  
It seems to me like putting the DDT on the cache device would be 
harmful...  Is that really how it is?


Nope. The DDT is stored only in one place: cache device if present, /or/ 
RAM otherwise (technically, ARC, but that's in RAM).  If a cache device 
is present, the DDT is stored there, BUT RAM also must store a basic 
lookup table for the DDT (yea, I know, a lookup table for a lookup table).



My minor corrections here:

The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for every 
L2ARC entry, since the DDT is stored on the cache device.


the DDT itself doesn't consume any ARC space if stored in an L2ARC 
cache


E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out 
that I have about a 5:1 dedup ratio. I'd also like to see how much ARC 
usage I eat up with using a 160GB L2ARC to store my DDT on.


(1) How many entries are there in the DDT?

1TB of 4k blocks means there are 268million blocks.  However, at a 
5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 
54 million blocks.  Thus, I need a DDT of about 270bytes * 54 million =~ 
14GB in size


(2) How much ARC space does this DDT take up?
The 54 million entries in my DDT take up about 200bytes * 54 
million =~ 10G of ARC space, so I need to have 10G of RAM dedicated just 
to storing the references to the DDT in the L2ARC.



(3) How much space do I have left on the L2ARC device, and how many 
blocks can that hold?
Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the 
device, which, assuming I'm still using 4k blocks, means I can cache 
about 37 million 4k blocks, or about 66% of my total data. This extra 
cache of blocks in the L2ARC would eat up 200 b * 37 million =~ 7.5GB of 
ARC entries.


Thus, for the aforementioned dedup scenario, I'd better spec it with 
(whatever base RAM for basic OS and ordinary ZFS cache and application 
requirements) at least a 14G L2ARC device for dedup + 10G more of RAM 
for the DDT L2ARC requirements + 1GB of RAM for every 20GB of additional 
space in the L2ARC cache beyond that used by the DDT.
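
The same arithmetic as one-liners, for anyone who wants to plug in their own numbers (plain bc; 4k blocks, 5:1 dedup, and a 160GB cache device assumed, mirroring steps (1)-(3) above):

echo 'scale=1; 2^40/4096/5 * 270 / 2^30' | bc     # (1) DDT size, GiB
echo 'scale=1; 2^40/4096/5 * 200 / 2^30' | bc     # (2) ARC needed to track the DDT in L2ARC, GiB
echo 'scale=1; 146*2^30/4096 * 200 / 2^30' | bc   # (3) ARC needed for the remaining 146GB of L2ARC, GiB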




--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs problem vdev I/O failure

2011-04-25 Thread Konstantin Kuklin
So, I installed FreeBSD 8.2 with the ZFS v28 patch and get this error message,
along with a full ZFS system freeze:
Solaris: Warning: can`t open object for zroot/var/crash
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
Solaris: Warning: can`t open object for zroot/var/crash
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement
log_sysevent: type19 is not emplement


2011/4/24 Pawel Tyll 

> Hi Konstantin,
>
> > zpool status:
> > Flash# zpool status
> >   pool: zroot
> >  state: DEGRADED
> > status: One or more devices are faulted in response to IO failures.
> > action: Make sure the affected devices are connected, then run 'zpool
> > clear'.
> >see: http://www.sun.com/msg/ZFS-8000-HC
> >  scrub: resilver in progress for 0h6m, 0.00% done, 1582566h29m to go
> > config:
>
> >         NAME                     STATE     READ WRITE CKSUM
> >         zroot                    DEGRADED    12     0     1
> >           mirror                 DEGRADED    36     0     4
> >             7159451150335751026  UNAVAIL      0     0     0  was /dev/gpt/disk0
> >             gpt/disk1            ONLINE       0     0    40
>
> > errors: 12 data errors, use '-v' for a list
>
> > Zpool scrub freeze and time to resilver up in time...
> > How i can repair it, if zpool scrub -s zroot and detach don`t work...and
> > don`t work all of zfs commands =\
>
> Try booting mfsBSD and fixing there, http://mfsbsd.vx.sk/
>
> http://mfsbsd.vx.sk/iso/mfsbsd-8.2-zfsv28-i386.iso
> http://mfsbsd.vx.sk/iso/mfsbsd-se-8.2-zfsv28-amd64.iso
>
>
>


-- 
Best regards,
Konstantin Kuklin.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Roy Sigurd Karlsbakk
> After modifications that I hope are corrections, I think the post
> should look like this:
> 
> The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for
> every L2ARC entry.
> 
> DDT doesn't count for this ARC space usage
> 
> E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out
> that I have about a 5:1 dedup ratio. I'd also like to see how much ARC
> usage I eat up with a 160GB L2ARC.
> 
> (1) How many entries are there in the DDT:
> 
> 1TB of 4k blocks means there are 268million blocks. However, at a 5:1
> dedup ratio, I'm only actually storing 20% of that, so I have about 54
> million blocks. Thus, I need a DDT of about 270bytes * 54 million =~
> 14GB in size
> 
> (2) My L2ARC is 160GB in size, but I'm using 14GB for the DDT. Thus, I
> have 146GB free for use as a data cache. 146GB / 4k =~ 38 million
> blocks can be stored in the
> remaining L2ARC space. However, 38 million blocks take up: 200 bytes *
> 38 million =~ 7GB of space in ARC.
> 
> Thus, I better spec my system with (whatever base RAM for basic OS and
> cache and application requirements) + 14G because of dedup + 7G
> because of L2ARC.

Thanks, but one more thing: add a tuning parameter such as "set 
zfs:zfs_arc_meta_limit = somevalue" in /etc/system to help zfs use more memory 
for its metadata (like the DDT), as it won't use more than (RAM-1GB)/4 by 
default.
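
For example (the value here is just an example; size it above the expected DDT plus other metadata, and note that /etc/system changes need a reboot):

set zfs:zfs_arc_meta_limit = 0x300000000

The current limit and usage can then be checked with:

# echo ::arc | mdb -k | grep meta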

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. 
It is an elementary imperative for all pedagogues to avoid excessive use of 
idioms of foreign origin. In most cases adequate and relevant synonyms exist 
in Norwegian.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Dedup and L2ARC memory requirements (again)

2011-04-25 Thread Edward Ned Harvey
There are a lot of conflicting references on the Internet, so I'd really
like to solicit actual experts (ZFS developers or people who have physical
evidence) to weigh in on this...

 

After searching around, the reference I found to be the most seemingly
useful was Erik's post here:

http://opensolaris.org/jive/thread.jspa?threadID=131296

 

Unfortunately it looks like there's an arithmetic error (1TB of 4k blocks
means 268 million blocks, not 1 billion).  Also, IMHO it seems important to make
the distinction, #files != #blocks.  Due to the existence of larger files,
there will sometimes be more than one block per file; and if I'm not
mistaken, thanks to write aggregation, there will sometimes be more than one
file per block.  YMMV.  Average block size could be anywhere between 1 byte
and 128k assuming default recordsize.  (BTW, recordsize seems to be a zfs
property, not a zpool property.  So how can you know or configure the
blocksize for something like a zvol iscsi target?)

 

(BTW, is there any way to get a measurement of number of blocks consumed per
zpool?  Per vdev?  Per zfs filesystem?)  The calculations below are based on
assumption of 4KB blocks adding up to a known total data consumption.  The
actual thing that matters is the number of blocks consumed, so the
conclusions drawn will vary enormously when people actually have average
block sizes != 4KB.  

 

And one more comment:  Based on what's below, it seems that the DDT gets
stored on the cache device and also in RAM.  Is that correct?  What if you
didn't have a cache device?  Shouldn't it *always* be in ram?  And doesn't
the cache device get wiped every time you reboot?  It seems to me like
putting the DDT on the cache device would be harmful...  Is that really how
it is?

 

After modifications that I hope are corrections, I think the post should
look like this:

 

The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for every
L2ARC entry.

DDT doesn't count for this ARC space usage

E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out that
I have about a 5:1 dedup ratio. I'd also like to see how much ARC usage I
eat up with a 160GB L2ARC.

(1) How many entries are there in the DDT: 

1TB of 4k blocks means there are 268million blocks.  However, at a 5:1 dedup
ratio, I'm only actually storing 20% of that, so I have about 54 million
blocks.  Thus, I need a DDT of about 270bytes * 54 million =~ 14GB in size

(2) My L2ARC is 160GB in size, but I'm using 14GB for the DDT. Thus, I have
146GB free for use as a data cache.  146GB / 4k =~ 38 million blocks can be
stored in the 
remaining L2ARC space.  However, 38 million blocks take up: 200 bytes * 38
million =~ 7GB of space in ARC.

 

Thus, I better spec my system with (whatever base RAM for basic OS and cache
and application requirements) + 14G because of dedup + 7G because of L2ARC.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs problem vdev I/O failure

2011-04-25 Thread Pawel Tyll
Hi Konstantin,

> zpool status:
> Flash# zpool status
>   pool: zroot
>  state: DEGRADED
> status: One or more devices are faulted in response to IO failures.
> action: Make sure the affected devices are connected, then run 'zpool
> clear'.
>see: http://www.sun.com/msg/ZFS-8000-HC
>  scrub: resilver in progress for 0h6m, 0.00% done, 1582566h29m to go
> config:

>         NAME                     STATE     READ WRITE CKSUM
>         zroot                    DEGRADED    12     0     1
>           mirror                 DEGRADED    36     0     4
>             7159451150335751026  UNAVAIL      0     0     0  was /dev/gpt/disk0
>             gpt/disk1            ONLINE       0     0    40

> errors: 12 data errors, use '-v' for a list

> Zpool scrub freeze and time to resilver up in time...
> How i can repair it, if zpool scrub -s zroot and detach don`t work...and
> don`t work all of zfs commands =\

Try booting mfsBSD and fixing there, http://mfsbsd.vx.sk/

http://mfsbsd.vx.sk/iso/mfsbsd-8.2-zfsv28-i386.iso
http://mfsbsd.vx.sk/iso/mfsbsd-se-8.2-zfsv28-amd64.iso


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs problem vdev I/O failure

2011-04-25 Thread Konstantin Kuklin
Good morning, I have a problem with ZFS:

ZFS filesystem version 4

ZFS storage pool version 15


Yesterday my machine running FreeBSD 8.2-RELENG shut down with an "ad4
detached" error while I was copying a big file...
and after the reboot my two WD Green 1TB drives said goodbye. One of them died
and the other shows zfs errors:

Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=187921768448 size=512 error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=187921768960 size=512 error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=311738368 size=21504 error=6
Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=
size= error=
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=635155456 size=3072 error=6
Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=
size= error=
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=635158528 size=12288 error=6
Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=
size= error=
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=635170816 size=512 error=6
Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path= offset=
size= error=
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=635171328 size=512 error=6
Apr 24 04:53:41 Flash root: ZFS: vdev I/O failure, zpool=zroot path=
offset=635171840 size=512 error=6
Apr 24 04:53:41 Flash root: ZFS: zpool I/O failure, zpool=zroot error=6

zpool status:
Flash# zpool status
  pool: zroot
 state: DEGRADED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool
clear'.
   see: http://www.sun.com/msg/ZFS-8000-HC
 scrub: resilver in progress for 0h6m, 0.00% done, 1582566h29m to go
config:

        NAME                     STATE     READ WRITE CKSUM
        zroot                    DEGRADED    12     0     1
          mirror                 DEGRADED    36     0     4
            7159451150335751026  UNAVAIL      0     0     0  was /dev/gpt/disk0
            gpt/disk1            ONLINE       0     0    40

errors: 12 data errors, use '-v' for a list

Zpool scrub freezes and the estimated resilver time keeps going up...
How can I repair it, if "zpool scrub -s zroot" and detach don't work... and
none of the zfs commands work =\
Thx
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss