Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Brandon High
On Thu, May 5, 2011 at 8:50 PM, Edward Ned Harvey
 wrote:
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS.  At this rate, it becomes
> increasingly difficult to get a justification to enable the dedup.  But it's
> certainly possible.

You're forgetting that zvols use an 8k volblocksize by default. If
you're currently exporting volumes with iSCSI, it's only a 2x
increase.
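
For anyone checking their own setup, a minimal sketch (the pool/volume names
here are hypothetical, and volblocksize can only be set when the zvol is
created):
	zfs get volblocksize tank/myvol
	zfs create -V 100G -o volblocksize=8k tank/newvol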

The tradeoff is that you should have more duplicate blocks, and reap
the rewards there. I'm fairly certain that it won't offset the large
increase in the size of the DDT however. Dedup with zvols is probably
never a good idea as a result.

Only if you're hosting your VM images in .vmdk files will you get 128k
blocks. Of course, your chance of getting many identical blocks gets
much, much smaller. You'll have to worry about the guests' block
alignment in the context of the image file, since two identical files
may not create identical blocks as seen from ZFS. This means you may
get only fractional savings and have an enormous DDT.

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Edward Ned Harvey
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
> 
> If you have to use the 4k recordsize, it is likely to consume 32x more
> memory than the default 128k recordsize of ZFS.  At this rate, it becomes
> increasingly difficult to get a justification to enable the dedup.  But it's
> certainly possible.

Sorry, I didn't realize ... RE just said (and I take his word for it) that
the default block size (volblocksize) for a zvol is 8k, while of course the
default recordsize for a ZFS filesystem is 128k.

The point of emphasis is that the memory requirement is a constant multiplied
by the number of blocks, so smaller blocks ==> more blocks ==> more memory
consumption.
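
For a rough feel, a back-of-envelope sketch only (assuming ~376 bytes per DDT
entry, the figure measured in the related summary thread, and 1TB of unique
data):
	# 128k records: 2^40 / 2^17 =   8388608 entries ->  ~3 GB of DDT
	# 8k blocks:    2^40 / 2^13 = 134217728 entries -> ~47 GB of DDT
	# 4k blocks:    2^40 / 2^12 = 268435456 entries -> ~94 GB of DDT
	echo "scale=1; (2^40 / 2^13) * 376 / 2^30" | bc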

This could be a major difference in implementation ... If you are going to
use ZFS over NFS as your VM storage backend, that would default to the 128k
recordsize, while if you're going to use ZFS over iSCSI as your VM storage
backend, that would default to the 8k block size.

In either case, you really want to be aware of your recordsize, and tune it
appropriately for the guest(s) that you are running.
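
A minimal sketch of the two cases (dataset/volume names are made up;
volblocksize must be chosen at zvol creation time, and recordsize only
affects newly written files):
	# NFS-backed VM storage: a filesystem, default recordsize=128k
	zfs create -o recordsize=8k tank/vmnfs
	# iSCSI-backed VM storage: a zvol, default volblocksize=8k
	zfs create -V 200G -o volblocksize=4k tank/vmlun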

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Edward Ned Harvey
> From: Brandon High [mailto:bh...@freaks.com]
> 
> On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
>  wrote:
> > Generally speaking, dedup doesn't work on VM images.  (Same is true for
> ZFS
> > or netapp or anything else.)  Because the VM images are all going to
have
> > their own filesystems internally with whatever blocksize is relevant to
the
> > guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
> > whatever FS) host blocks...  Then even when you write duplicated data
> inside
> > the guest, the host won't see it as a duplicated block.
> 
> A zvol with 4k blocks should give you decent results with Windows
> guests. Recent versions use 4k alignment by default and 4k blocks, so
> there should be lots of duplicates for a base OS image.


I agree with everything Brandon said.

The one thing I would add is:  The "correct" recordsize for each guest
machine would depend on the filesystem that the guest machine is using.
Without knowing a specific filesystem on a specific guest OS, the 4k
recordsize sounds like a reasonable general-purpose setting.  But if you
know more details of the guest, you could hopefully use a larger recordsize
and therefore consume less ram on the host.
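
For example, with a Windows guest you could check the NTFS cluster size inside
the guest and match the zvol to it -- a hedged sketch, with made-up names:
	C:\> fsutil fsinfo ntfsinfo C:            (look at "Bytes Per Cluster")
	# on the ZFS host, match the cluster size:
	zfs create -V 100G -o volblocksize=4k tank/winguest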

If you have to use the 4k recordsize, it is likely to consume 32x more
memory than the default 128k recordsize of ZFS.  At this rate, it becomes
increasingly difficult to get a justification to enable the dedup.  But it's
certainly possible.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Richard Elling
On May 4, 2011, at 7:56 PM, Edward Ned Harvey wrote:

> This is a summary of a much longer discussion "Dedup and L2ARC memory
> requirements (again)"
> Sorry even this summary is long.  But the results vary enormously based on
> individual usage, so any "rule of thumb" metric that has been bouncing
> around on the internet is simply not sufficient.  You need to go into this
> level of detail to get an estimate that's worth the napkin or bathroom
> tissue it's scribbled on.
> 
> This is how to (reasonably) accurately estimate the hypothetical ram
> requirements to hold the complete data deduplication tables (DDT) and L2ARC
> references in ram.  Please note both the DDT and L2ARC references can be
> evicted from memory according to system policy, whenever the system decides
> some other data is more valuable to keep.  So following this guide does not
> guarantee that the whole DDT will remain in ARC or L2ARC.  But it's a good
> start.

As the size of the data grows, the need to have the whole DDT in RAM or L2ARC
decreases, with one notable exception: destroying a dataset or snapshot requires
the DDT entries for the destroyed blocks to be updated. This is why people can
go for months or years and not see a problem, until they try to destroy a
dataset.

> 
> I am using a solaris 11 express x86 test system for my example numbers
> below.  
> 
> --- To calculate size of DDT ---
> 
> Each entry in the DDT is a fixed size, which varies by platform.  You can
> find it with the command:
>   echo ::sizeof ddt_entry_t | mdb -k
> This will return a hex value, that you probably want to convert to decimal.
> On my test system, it is 0x178 which is 376 bytes
> 
> There is one DDT entry per non-dedup'd (unique) block in the zpool.

The workloads which are nicely dedupable tend not to have unique blocks.
So this is another way of saying, "if your workload isn't dedupable, don't
bother with deduplication." For years now we have been trying to convey this
message. One way to help convey the message is...

>  Be
> aware that you cannot reliably estimate #blocks by counting #files.  You can
> find the number of total blocks including dedup'd blocks in your pool with
> this command:
>   zdb -bb poolname | grep 'bp count'

Ugh. A better method is to simulate dedup on existing data:
	zdb -S poolname
or measure dedup efficacy:
	zdb -DD poolname
Both offer similar tabular analyses.

> Note:  This command will run a long time and is IO intensive.  On my systems
> where a scrub runs for 8-9 hours, this zdb command ran for about 90 minutes.
> On my test system, the result is 44145049 (44.1M) total blocks.
> 
> To estimate the number of non-dedup'd (unique) blocks (assuming average size
> of dedup'd blocks = average size of blocks in the whole pool), use:
>   zpool list
> Find the dedup ratio.  In my test system, it is 2.24x.  Divide the total
> blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

Or just count the unique and non-unique blocks with:
zdb -D poolname

> 
> In my test system:
>   44145049 total blocks / 2.24 dedup ratio = 19707611 (19.7M) approx
> non-dedup'd (unique) blocks
> 
> Then multiply by the size of a DDT entry.
> 	19707611 * 376 = 7410061736 bytes = ~7G total DDT size

A minor gripe about zdb -D output is that it doesn't do the math.
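
Until it does, the arithmetic is easy enough to script yourself -- a sketch
using the block count quoted above:
	echo "19707611 * 376" | bc                  # 7410061736 bytes
	echo "scale=1; 19707611 * 376 / 2^30" | bc  # ~6.9 GiB of DDT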

> 
> --- To calculate size of ARC/L2ARC references ---
> 
> Each reference to a L2ARC entry requires an entry in ARC (ram).  This is
> another fixed size, which varies by platform.  You can find it with the
> command:
>   echo ::sizeof arc_buf_hdr_t | mdb -k
> On my test system, it is 0xb0 which is 176 bytes

Better yet, without need for mdb privilege, measure the current L2ARC header
size in use. Normal user accounts can:
kstat -p zfs::arcstats:hdr_size
kstat -p zfs::arcstats:l2_hdr_size

arcstat will allow you to easily track this over time.
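
For example, a minimal sketch for watching those counters over time (the
10-second interval is arbitrary):
	while true; do
		kstat -p zfs::arcstats:hdr_size zfs::arcstats:l2_hdr_size
		sleep 10
	done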

> 
> We need to know the average block size in the pool, to estimate the number
> of blocks that will fit into L2ARC.  Find the amount of space ALLOC in the
> pool:
>   zpool list
> Divide by the number of non-dedup'd (unique) blocks in the pool, to find the
> average block size.  In my test system:
>   790G / 19707611 = 42K average block size
> 
> Remember:  If your L2ARC were only caching average size blocks, then the
> payload ratio of L2ARC vs ARC would be excellent.  In my test system, every
> 42K L2ARC would require 176bytes ARC (a ratio of 244x).  This would result
> in a negligible ARC memory consumption.  But since your DDT can be pushed
> out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC memory
> consumption.  In my test system every 376bytes DDT entry in L2ARC consumes
> 176bytes ARC (a ratio of 2.1x).  Yes, it is approximately possible to have
> the complete DDT present in ARC and L2ARC, thus consuming tons of ram.

This is a good thing for those cases when you need to quickly reference la

Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Edward Ned Harvey
> From: Karl Wagner [mailto:k...@mouse-hole.com]
> 
> so there's an ARC entry referencing each individual DDT entry in the L2ARC?!
> I had made the assumption that DDT entries would be grouped into at least
> minimum block sized groups (8k?), which would have led to a much more
> reasonable ARC requirement.
> 
> seems like a bad design to me, which leads to dedup only being usable by
> those prepared to spend a LOT of dosh... which may as well go into more
> storage (I know there are other benefits too, but that's my opinion)

The whole point of the DDT is that it needs to be structured and very fast to
search.  So no, you're not going to consolidate it into an unstructured
memory block as you said.  You pay the memory consumption price for the sake of 
performance.  Yes it consumes a lot of ram, but don't call it a "bad design."  
It's just a different design than what you expected, because what you expected 
would hurt performance while consuming less ram.

And we're not talking crazy dollars here.  So your emphasis on a LOT of dosh 
seems exaggerated.  I just spec'd out a system where upgrading from 12 to 24G 
of ram to enable dedup effectively doubled the storage capacity of the system, 
and that upgrade cost the same as one of the disks.  (This is a 12-disk 
system.)   So it was actually a 6x cost reducer, at least.  It all depends on 
how much mileage you get out of the dedup.  Your mileage may vary.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Richard Elling
On May 5, 2011, at 6:02 AM, Edward Ned Harvey wrote:
> Is this a zfs discussion list, or a nexenta sales & promotion list?

Obviously, this is a Nexenta sales & promotion list. And Oracle. And OSX.
And BSD. And Linux. And anyone who needs help or can offer help with ZFS
technology :-) This list has never been more diverse. The only sad part is the
unnecessary assassination of the OpenSolaris brand. But life moves on, and so
does good technology.
 -- richard-who-is-proud-to-work-at-Nexenta

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Richard Elling
On May 5, 2011, at 2:58 PM, Brandon High wrote:
> On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
> 
>> Or if you're intimately familiar with both the guest & host filesystems, and
>> you choose blocksizes carefully to make them align.  But that seems
>> complicated and likely to fail.
> 
> Using a 4k block size is a safe bet, since most OSs use a block size
> that is a multiple of 4k. It's the same reason that the new "Advanced
> Format" drives use 4k sectors.

Yes, 4KB block sizes are replacing the 512B blocks of yesteryear. However,
the real reason the HDD manufacturers headed this way is because they can
get more usable bits per platter. The tradeoff is that your workload may consume
more real space on the platter than before. TANSTAAFL.

The trick for best performance and best opportunity for dedup (alignment
notwithstanding) is to have a block size that is smaller than your workload's.
Or, don't bring a 128KB block to a 4KB block battle. For this reason, the
default 8KB block size for a zvol is a reasonable choice, but perhaps 4KB is
better for many workloads.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Permanently using hot spare?

2011-05-05 Thread Ray Van Dolson
On Thu, May 05, 2011 at 03:13:06PM -0700, TianHong Zhao wrote:
> Just detach the faulty disk, then the spare will become the "normal"
> disk once it's finished resilvering.
> 
> # zpool detach <pool> <faulty-disk>
> 
> Then you need to add the new spare:
> # zpool add <pool> spare <new-disk>
> 
> There seems to be a new feature in the illumos project to support a zpool
> property like "spare promotion",
> which would not require the manual "detach" operation.
>  
> Tianhong

Thanks!  Great tip.

Ray
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multipl disk failures cause zpool hang

2011-05-05 Thread TianHong Zhao
Thanks again.

 

No, I don't see any bio functions, but you have shed very useful light on the
issue.

 

My test platform is b147; the pool disks are from a storage system attached
via a QLogic Fibre Channel HBA.

 

My test case is:

1. zpool set failmode=continue pool1
2. dd if=/dev/zero of=/pool1/fs/myfile count=1000 &
3. Unplug the fiber cable, wait about 30 sec.
4. zpool status  (hangs)
5. Wait about 1 min.
6. Cannot open a new ssh session to the box, but existing ssh sessions are
   still alive.
7. Use the existing session to get into mdb and get a threadlist.
8. Eventually, I have to power cycle the box.

 

Tianhong

 

From: Steve Gonczi [mailto:gon...@comcast.net] 
Sent: Thursday, May 05, 2011 6:32 PM
To: TianHong Zhao
Subject: Re: [zfs-discuss] multipl disk failures cause zpool hang

 

You are most welcome.

The zio_wait just indicates that the sync thread is waiting for an io to 
complete.

Search through the threadlist and see if there is a thread that is stuck 
in "biowait".   zio is asynchronous, so the thread performing the actual io
will be a different thread.
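
A quick way to do that search from mdb (a sketch; ::stacks may not be available
on older builds, in which case grep the full ::threadlist -v output instead):
	> ::stacks -c biowait
	> ::threadlist -v ! grep -i biowait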

But first let's just verify again that you are not deleting large files or
large snapshots, or zfs destroy-ing large file systems when this hang happens,
and that you are running a fairly modern zfs version (something b145+). If I am
reading your posts correctly, you can repeatably make this happen on a mostly
idle system, just by disconnecting and reconnecting your cable, correct?

In that case, maybe this is a lost "biodone" problem.
If  you find a thread sitting in biowait for a long time, that would be my 
suspicion.

When you unplug the cable, the strategy routine that would normally complete,
time out, or fail the io could be taking a rare exit path, and on that
particular path it fails to issue a biodone() like it is supposed to.

The next step after this would be figuring out which function is the device's
strategy routine, and giving it a good thorough review, esp. the different
exit paths.

Steve

/sG/

- "TianHong Zhao"  wrote: 

Thanks for the information.

 

I think you're right that the spa_sync thread is blocked in zio_wait while
holding scl_lock, which blocks all zpool-related commands (such as zpool
status).

 

The question is why zio_wait is blocked forever. If the underlying device is
offline, could the zio service just bail out?

What if I set "zfs sync=disabled"?

 

Here is the "threadlist" output I collected:

#mdb -K
   >::threadlist -v

ff02d9627400 ff02f05f80a8 ff02d95f2780   1  59 ff02d57e585c
  PC: _resume_from_idle+0xf1CMD: zpool status
  stack pointer for thread ff02d9627400: ff00108a3a70
  [ ff00108a3a70 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
spa_config_enter+0x86()
spa_vdev_state_enter+0x3c()
spa_vdev_set_common+0x37()
spa_vdev_setpath+0x22()
zfs_ioc_vdev_setpath+0x48()
zfsdev_ioctl+0x15e()
cdev_ioctl+0x45()
spec_ioctl+0x5a()
fop_ioctl+0x7b()
ioctl+0x18e()
_sys_sysenter_post_swapgs+0x149()

…

ff0010378c40 fbc2e3300   0  60 ff034935bcb8
  PC: _resume_from_idle+0xf1THREAD: txg_sync_thread()
  stack pointer for thread ff0010378c40: ff00103789b0
  [ ff00103789b0 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
dsl_pool_sync+0xe1()
spa_sync+0x38d()
txg_sync_thread+0x247()
thread_start+8()

 

Tianhong

 

From: Steve Gonczi [mailto:gon...@comcast.net] 
Sent: Wednesday, May 04, 2011 10:43 AM
To: TianHong Zhao
Subject: Re: [zfs-discuss] multipl disk failures cause zpool hang

 

Hi TianHong,

I have seen similar apparent hangs, all related to destroying large snapshots
or file systems, or deleting large files (with dedup enabled; by large I mean
in the terabyte range).

In the cases I have looked at, the root problem is the sync taking way too
long; because of the sync interlock with keeping the current txg open, zfs
eventually runs out of space in the current txg and is unable to accept any
more transactions.

In those cases, the system would come back to life eventually, 
but it may take a long time ( days potentially).

Looks like yours is a reproducible scenario, and I think the 
disconnect-reconnect
triggered hang may be new. It would be good to root cause this. 

I recommend loading the kernel debugger and generating a crash dump.  It
would be pretty straightforward to verify whether this is the "sync taking a
long time" failure or not.  The output from ::threadlist -v would be telling.

There have been posts earlier as to how to load the debugger and create a
crash dump.
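
A minimal sketch of that procedure (assuming a dump device is already
configured; adjust to whatever dumpadm reports on your box):
	dumpadm                  # confirm the dump device and savecore directory
	mdb -K                   # drop into the in-kernel debugger
	[0]> $<systemdump        # force a panic and crash dump for later analysis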

Best wishes

Steve
 
/sG/

 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Permanently using hot spare?

2011-05-05 Thread Ian Collins

 On 05/ 6/11 09:53 AM, Ray Van Dolson wrote:

> Have a failed drive on a ZFS pool (three RAIDZ2 vdevs, one hot spare).
> The hot spare kicked in and all is well.
>
> Is it possible to just make that hot spare disk -- already silvered
> into the pool -- a permanent part of the pool?  We could then throw
> in a new disk and mark it as a spare and avoid what would seem to be an
> unnecessary resilver (twice, once when the spare is brought in and
> again when we replace the failed disk).


Yes, as Tianhong just posted, just detach the faulted device.

What you describe is what I normally do: add the original drive back as
a spare when it is replaced.
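
For the archives, the whole sequence looks roughly like this (a sketch with
made-up pool and device names):
	zpool status tank             # spare c2t5d0 is INUSE for faulted c1t3d0
	zpool detach tank c1t3d0      # the resilvered spare becomes a permanent member
	zpool add tank spare c1t3d0   # later, add the replaced disk back as the new spare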


--
Ian.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Permanently using hot spare?

2011-05-05 Thread TianHong Zhao
Just detach the faulty disk, then the spare will become the "normal"
disk once it's finished resilvering.

# zpool detach <pool> <faulty-disk>

Then you need to add the new spare:
# zpool add <pool> spare <new-disk>

There seems to be a new feature in the illumos project to support a zpool
property like "spare promotion",
which would not require the manual "detach" operation.
 

Tianhong


-Original Message-
From: zfs-discuss-boun...@opensolaris.org
[mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Ray Van Dolson
Sent: Thursday, May 05, 2011 5:53 PM
To: zfs-discuss@opensolaris.org
Subject: [zfs-discuss] Permanently using hot spare?

Have a failed drive on a ZFS pool (three RAIDZ2 vdevs, one hot spare).
The hot spare kicked in and all is well.

Is it possible to just make that hot spare disk -- already silvered into
the pool -- a permanent part of the pool?  We could then throw in a
new disk and mark it as a spare and avoid what would seem to be an
unnecessary resilver (twice, once when the spare is brought in and again
when we replace the failed disk).

This document[1] seems to make it sound like it can be done, but I'm not
really seeing how... 

Can I "add" the spare disk to the pool when it's already in use?
Probably not...

Note this is on Solaris 10 U9.

Thanks,
Ray

[1] http://dlc.sun.com/osol/docs/content/ZFSADMIN/gayrd.html#gcvcw
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] multipl disk failures cause zpool hang

2011-05-05 Thread TianHong Zhao
Thanks for the information.

 

I think you're right that the spa_sync thread is blocked in zio_wait while
holding scl_lock, which blocks all zpool-related commands (such as zpool
status).

 

The question is why zio_wait is blocked forever. If the underlying device is
offline, could the zio service just bail out?

What if I set "zfs sync=disabled"?

 

Here is the "threadlist" output I collected:

#mdb -K
   >::threadlist -v

ff02d9627400 ff02f05f80a8 ff02d95f2780   1  59 ff02d57e585c
  PC: _resume_from_idle+0xf1CMD: zpool status
  stack pointer for thread ff02d9627400: ff00108a3a70
  [ ff00108a3a70 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
spa_config_enter+0x86()
spa_vdev_state_enter+0x3c()
spa_vdev_set_common+0x37()
spa_vdev_setpath+0x22()
zfs_ioc_vdev_setpath+0x48()
zfsdev_ioctl+0x15e()
cdev_ioctl+0x45()
spec_ioctl+0x5a()
fop_ioctl+0x7b()
ioctl+0x18e()
_sys_sysenter_post_swapgs+0x149()

…

ff0010378c40 fbc2e3300   0  60 ff034935bcb8
  PC: _resume_from_idle+0xf1THREAD: txg_sync_thread()
  stack pointer for thread ff0010378c40: ff00103789b0
  [ ff00103789b0 _resume_from_idle+0xf1() ]
swtch+0x145()
cv_wait+0x61()
zio_wait+0x5d()
dsl_pool_sync+0xe1()
spa_sync+0x38d()
txg_sync_thread+0x247()
thread_start+8()

 

Tianhong

 

From: Steve Gonczi [mailto:gon...@comcast.net] 
Sent: Wednesday, May 04, 2011 10:43 AM
To: TianHong Zhao
Subject: Re: [zfs-discuss] multipl disk failures cause zpool hang

 

Hi TianHong,

I have seen similar apparent hangs, all related to destroying large snapshots
or file systems, or deleting large files (with dedup enabled; by large I mean
in the terabyte range).

In the cases I have looked at, the root problem is the sync taking way too
long; because of the sync interlock with keeping the current txg open, zfs
eventually runs out of space in the current txg and is unable to accept any
more transactions.

In those cases, the system would come back to life eventually, 
but it may take a long time ( days potentially).

Looks like yours is a reproducible scenario, and I think the 
disconnect-reconnect
triggered hang may be new. It would be good to root cause this. 

I recommend loading the kernel debugger and generating a crash dump.  It
would be pretty straightforward to verify whether this is the "sync taking a
long time" failure or not.  The output from ::threadlist -v would be telling.

There have been posts earlier as to how to load the debugger and create a
crash dump.

Best wishes

Steve
 
/sG/

- "TianHong Zhao"  wrote: 

Thanks for the reply.

This sounds like a serious issue if we have to reboot a machine in such a
case; I am wondering if anybody is working on this.
BTW, the zpool failmode is set to continue in my test case.

Tianhong Zhao

 

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Brandon High
On Wed, May 4, 2011 at 8:23 PM, Edward Ned Harvey
 wrote:
> Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
> or netapp or anything else.)  Because the VM images are all going to have
> their own filesystems internally with whatever blocksize is relevant to the
> guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
> whatever FS) host blocks...  Then even when you write duplicated data inside
> the guest, the host won't see it as a duplicated block.

A zvol with 4k blocks should give you decent results with Windows
guests. Recent versions use 4k alignment by default and 4k blocks, so
there should be lots of duplicates for a base OS image.

> There are some situations where dedup may help on VM images...  For example
> if you're not using sparse files and you have a zero-filed disk...  But in

compression=zle works even better for these cases, since it doesn't
require DDT resources.
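
For reference, enabling it is a one-liner (dataset name is hypothetical, and
like recordsize it only affects newly written data):
	zfs set compression=zle tank/vmstore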

> Or if you're intimately familiar with both the guest & host filesystems, and
> you choose blocksizes carefully to make them align.  But that seems
> complicated and likely to fail.

Using a 4k block size is a safe bet, since most OSs use a block size
that is a multiple of 4k. It's the same reason that the new "Advanced
Format" drives use 4k sectors.

Windows uses 4k alignment and 4k (or larger) clusters.
ext3/ext4 uses 1k, 2k, or 4k blocks. Filesystems over 512MB should use 4k
by default. The block alignment is determined by the partitioning, so
some care needs to be taken there.
zfs uses 'ashift' size blocks. I'm not sure what ashift works out to
be when using a zvol though, so it could be as small as 512b but may
be set to the same as the blocksize property (a quick way to check is
sketched after this list).
ufs is 4k or 8k on x86 and 8k on sun4u. As with ext4, block alignment
is determined by partitioning and slices.
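
Regarding the ashift question above, a quick way to check is to read it out
of the pool configuration, and to read a zvol's block size from its properties
(a sketch; pool/volume names are hypothetical):
	zdb -C tank | grep ashift        # ashift=9 means 512B sectors, 12 means 4k
	zfs get volblocksize tank/myvol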

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Permanently using hot spare?

2011-05-05 Thread Ray Van Dolson
Have a failed drive on a ZFS pool (three RAIDZ2 vdevs, one hot spare).
The hot spare kicked in and all is well.

Is it possible to just make that hot spare disk -- already silvered
into the pool -- a permanent part of the pool?  We could then throw
in a new disk and mark it as a spare and avoid what would seem to be an
unnecessary resilver (twice, once when the spare is brought in and
again when we replace the failed disk).

This document[1] seems to make it sound like it can be done, but I'm
not really seeing how... 

Can I "add" the spare disk to the pool when it's already in use?
Probably not...

Note this is on Solaris 10 U9.

Thanks,
Ray

[1] http://dlc.sun.com/osol/docs/content/ZFSADMIN/gayrd.html#gcvcw 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-05 Thread Brandon High
On Thu, May 5, 2011 at 11:17 AM, Giovanni Tirloni  wrote:
> What I find curious is that it only happens with incrementals. Full
> sends go as fast as possible (monitored with mbuffer). I was just wondering
> if other people have seen it, if there is a bug (b111 is quite old), etc.

I missed that you were using b111 earlier. That's probably a large
part of the problem. There were a lot of performance and reliability
improvements between b111 and b134, and there have been more between
b134 and b148 (OI) or b151 (S11 Express).

Updating the host you're receiving on to something more recent may fix
the performance problem you're seeing.

Fragmentation shouldn't be too great of an issue if the pool you're
writing to is relatively empty. There were changes made to zpool
metaslab allocation post-b111 that might improve performance for pools
between 70% and 96% full. This could also be why the full sends
perform better than incremental sends.
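
Checking how full the receiving pool is only takes a second (a sketch, "tank"
being whatever pool you receive into):
	zpool list tank
	zpool get capacity tank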

-B

-- 
Brandon High : bh...@freaks.com
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-05 Thread Paul Kraus
On Thu, May 5, 2011 at 2:17 PM, Giovanni Tirloni  wrote:

> What I find curious is that it only happens with incrementals. Full
> sends go as fast as possible (monitored with mbuffer). I was just wondering
> if other people have seen it, if there is a bug (b111 is quite old), etc.

I have been using zfs send / recv via ssh and a WAN connection to
replicate about 20 TB of data. One initial Full followed by an
Incremental every 4 hours. This has been going on for over a year and
I have not had any reliability issues. I started at Solaris 10U6, then
10U8, and now 10U9.

I did run into a bug early on where, if the ssh failed, the zfs
recv would hang, but that was fixed ages ago.

-- 
{1-2-3-4-5-6-7-}
Paul Kraus
-> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ )
-> Sound Coordinator, Schenectady Light Opera Company (
http://www.sloctheater.org/ )
-> Technical Advisor, RPI Players
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Quick zfs send -i performance questions

2011-05-05 Thread Giovanni Tirloni
On Wed, May 4, 2011 at 9:04 PM, Brandon High  wrote:

> On Wed, May 4, 2011 at 2:25 PM, Giovanni Tirloni 
> wrote:
> >   The problem we've started seeing is that a zfs send -i is taking hours
> to
> > send a very small amount of data (eg. 20GB in 6 hours) while a zfs send
> full
> > transfer everything faster than the incremental (40-70MB/s). Sometimes we
> > just give up on sending the incremental and send a full altogether.
>
> Does the send complete faster if you just pipe to /dev/null? I've
> observed that if recv stalls, it'll pause the send, and the two go
> back and forth stepping on each other's toes. Unfortunately, send and
> recv tend to pause with each individual snapshot they are working on.
>
> Putting something like mbuffer
> (http://www.maier-komor.de/mbuffer.html) in the middle can help smooth
> it out and speed things up tremendously. It prevents the send from
> pausing when the recv stalls, and allows the recv to continue working
> when the send is stalled. You will have to fiddle with the buffer size
> and other options to tune it for your use.
>


We've done various tests piping it to /dev/null and then transferring the
files to the destination. What seems to stall is the recv because it doesn't
complete (through mbuffer, ssh, locally, etc). The zfs send always completes
at the same rate.

Mbuffer is being used but doesn't seem to help. When things start to stall,
the in / out buffers will quickly fill up and nothing will be sent. Probably
because the mbuffer on the other side can't receive any more data until the
zfs recv gives it some air to breathe.
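
For reference, the kind of pipeline being discussed looks roughly like this (a
sketch; host, dataset, and snapshot names are made up, and the -s/-m values
need tuning as suggested above):
	zfs send -i tank/fs@snap1 tank/fs@snap2 \
	  | mbuffer -s 128k -m 1G \
	  | ssh recvhost 'mbuffer -s 128k -m 1G | zfs recv -F backup/fs'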

What I find curious is that it only happens with incrementals. Full
sends go as fast as possible (monitored with mbuffer). I was just wondering
if other people have seen it, if there is a bug (b111 is quite old), etc.

-- 
Giovanni Tirloni
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Garrett D'Amore
We have customers using dedup with lots of vm images... in one extreme case 
they are getting dedup ratios of over 200:1! 

You don't need dedup or sparse files for zero filling.  Simple zle compression 
will eliminate those for you far more efficiently and without needing massive 
amounts of ram.

Our customers have the ability to access our systems engineers to design the 
solution for their needs.  If you are serious about doing this stuff right, 
work with someone like Nexenta that can engineer a complete solution instead of 
trying to figure out which of us on this forum are quacks and which are cracks. 
 :)

Tim Cook  wrote:

>On Wed, May 4, 2011 at 10:23 PM, Edward Ned Harvey <
>opensolarisisdeadlongliveopensola...@nedharvey.com> wrote:
>
>> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
>> >
>> > Are any of you out there using dedupe ZFS file systems to store VMware
>> > VMDK (or any VM tech. really)?  Curious what recordsize you use and
>> > what your hardware specs / experiences have been.
>>
>> Generally speaking, dedup doesn't work on VM images.  (Same is true for ZFS
>> or netapp or anything else.)  Because the VM images are all going to have
>> their own filesystems internally with whatever blocksize is relevant to the
>> guest OS.  If the virtual blocks in the VM don't align with the ZFS (or
>> whatever FS) host blocks...  Then even when you write duplicated data
>> inside
>> the guest, the host won't see it as a duplicated block.
>>
>> There are some situations where dedup may help on VM images...  For example
>> if you're not using sparse files and you have a zero-filed disk...  But in
>> that case, you should probably just use a sparse file instead...  Or ...
>>  If
>> you have a "golden" image that you're copying all over the place ... but in
>> that case, you should probably just use clones instead...
>>
>> Or if you're intimately familiar with both the guest & host filesystems,
>> and
>> you choose blocksizes carefully to make them align.  But that seems
>> complicated and likely to fail.
>>
>>
>>
>That's patently false.  VM images are the absolute best use-case for dedup
>outside of backup workloads.  I'm not sure who told you/where you got the
>idea that VM images are not ripe for dedup, but it's wrong.
>
>--Tim
>
>___
>zfs-discuss mailing list
>zfs-discuss@opensolaris.org
>http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Karl Wagner
So there's an ARC entry referencing each individual DDT entry in the L2ARC?!
I had made the assumption that DDT entries would be grouped into at least
minimum-block-sized groups (8k?), which would have led to a much more
reasonable ARC requirement.

seems like a bad design to me, which leads to dedup only being usable by those 
prepared to spend a LOT of dosh... which may as well go into more storage (I 
know there are other benefits too, but that's my opinion)
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Edward Ned Harvey  wrote:

> From: Erik Trimble [mailto:erik.trim...@oracle.com]
>
> Using the standard c_max value of 80%, remember that this is 80% of the
> TOTAL system RAM, including that RAM normally dedicated to other
> purposes. So long as the total amount of RAM you expect to dedicate to
> ARC usage (for all ZFS uses, not just dedup) is less than 4 times that
> of all other RAM consumption, you don't need to "overprovision".

Correct, usually you don't need to overprovision for the sake of ensuring
enough ram available for OS and processes. But you do need to overprovision
25% if you want to increase the size of your usable ARC without reducing the
amount of ARC you currently have in the system being used to cache other
files etc.

> Any entry that is migrated back from L2ARC into ARC is considered "stale"
> data in the L2ARC, and thus, is no longer tracked in the ARC's reference
> table for L2ARC.

Good news. I didn't know that. I thought the L2ARC was still valid, even if
something was pulled back into ARC.

So there are two useful models:
(a) The upper bound: The whole DDT is in ARC, and the whole L2ARC is filled
with average-size blocks.
or
(b) The lower bound: The whole DDT is in L2ARC, and all the rest of the L2ARC
is filled with average-size blocks. ARC requirements are based only on L2ARC
references.

The actual usage will be something between (a) and (b)... And the actual is
probably closer to (b)

In my test system:
(a) (upper bound)
On my test system I guess the OS and processes consume 1G. (I'm making that up
without any reason.)
On my test system I guess I need 8G in the system to get reasonable performance
without dedup or L2ARC. (Again, I'm just making that up.)
I need 7G for DDT and
I have 748982 average-size blocks in L2ARC, which means 131820832 bytes = 125M
or 0.1G for L2ARC
I really just need to plan for 7.1G ARC usage
Multiply by 5/4 and it means I need 8.875G system ram
My system needs to be built with at least 8G + 8.875G = 16.875G.

(b) (lower bound)
On my test system I guess the OS and processes consume 1G. (I'm making that up
without any reason.)
On my test system I guess I need 8G in the system to get reasonable performance
without dedup or L2ARC. (Again, I'm just making that up.)
I need 0G for DDT (because it's in L2ARC) and
I need 3.4G ARC to hold all the L2ARC references, including the DDT in L2ARC
So I really just need to plan for 3.4G ARC for my L2ARC references.
Multiply by 5/4 and it means I need 4.25G system ram
My system needs to be built with at least 8G + 4.25G = 12.25G.

Thank you for your input, Erik. Previously I would have only been comfortable
with 24G in this system, because I was calculating a need for significantly
higher than 16G. But now, what we're calling the upper bound is just *slightly*
higher than 16G, while the lower bound and most likely actual figure is
significantly lower than 16G. So in this system, I would be comfortable running
with 16G. But I would be even more comfortable running with 24G.   ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Joerg Moellenkamp



> I assume you're talking about a situation where there is an initial VM image,
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?
>
> When I said dedup wasn't good for VM's, what I'm talking about is:  If there is
> data inside the VM which is cloned...  For example if somebody logs into the
> guest OS and then does a "cp" operation...  Then dedup of the host is unlikely
> to be able to recognize that data as cloned data inside the virtual disk.

I have the same opinion. When talking with customers about the usage of dedup
and cloning, the answer is simple: when you know that duplicates will occur but
don't know when, use dedup; when you know that duplicates occur and that they
are there from the beginning, use cloning.


Thus VM images cry out for cloning. I'm not a fan of dedup for VMs. I heard
the argument once, "but what about VM patching". Aside from the problem of
detecting the clones, I wouldn't patch a VM, but patch the master image and
regenerate the clones, especially for a general patching session (just saving
a gig because there is a patch on 2 or 3 of 100 servers isn't worth the effort
of spending a lot of memory for dedup). For a simple reason: patching each VM
on its own is likely to increase VM sprawl. So all I save is some iron, but
I'm not simplifying administration. However, this needs good administrative
processes.


You can use dedup for VMs, but I'm not sure anyone should ...


> Is this a zfs discussion list, or a nexenta sales & promotion list?

Well ... I have an opinion on how he sees that ... however it's just my own ;)

--
ORACLE
Joerg Moellenkamp | Sales Consultant
Phone: +49 40 251523-460 | Mobile: +49 172 8318433
Oracle Hardware Presales - Nord

ORACLE Deutschland B.V.&   Co. KG | Nagelsweg 55 | 20097 Hamburg

ORACLE Deutschland B.V.&   Co. KG
Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Rijnzathe 6, 3454PV De Meern, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Jürgen Kunz, Marcel van de Molen, Alexander van der Ven

Oracle is committed to developing practices and products that help protect the 
environment

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Garrett D'Amore
On Thu, 2011-05-05 at 09:02 -0400, Edward Ned Harvey wrote:
> > From: Garrett D'Amore [mailto:garr...@nexenta.com]
> > 
> > We have customers using dedup with lots of vm images... in one extreme
> > case they are getting dedup ratios of over 200:1!
> 
> I assume you're talking about a situation where there is an initial VM image, 
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?

No.  Obviously if you can clone, it's better.  But sometimes you can't do
this even with v12n, and we have this situation at customer sites today.
(I have always said, zfs clone is far easier, far more proven, and far
more efficient, *if* you can control the "ancestral" relationship to
take advantage of the clone.)  For example, one area where cloning can't
help is with patches and updates.  In some instances these can get quite
large, and across 1000's of VMs the space required can be considerable.

> 
> When I said dedup wasn't good for VM's, what I'm talking about is:  If there 
> is data inside the VM which is cloned...  For example if somebody logs into 
> the guest OS and then does a "cp" operation...  Then dedup of the host is 
> unlikely to be able to recognize that data as cloned data inside the virtual 
> disk.

I disagree.  I believe that within the VMDKs data is aligned nicely,
since these are disk images.

At any rate, we are seeing real (and large) dedup ratios in the field
when used with v12n.  In fact, this is the killer app for dedup.
 
> 
> > Our customers have the ability to access our systems engineers to design the
> > solution for their needs.  If you are serious about doing this stuff right, 
> > work
> > with someone like Nexenta that can engineer a complete solution instead of
> > trying to figure out which of us on this forum are quacks and which are
> > cracks.  :)
> 
> Is this a zfs discussion list, or a nexenta sales & promotion list?

My point here was that there is a lot of half-baked advice being
given... the idea that you should only use dedup if you have a bunch of
zeros on your disk images is absolutely and totally nuts for example.
It doesn't match real world experience, and it doesn't match the theory
either.

And sometimes real-world experience trumps the theory.  I've been shown
on numerous occasions that ideas that I thought were half-baked turned
out to be very effective in the field, and vice versa.  (I'm a
developer, not a systems engineer.  Fortunately I have a very close
working relationship with a couple of awesome systems engineers.)

Folks come here looking for advice.  I think the advice that if you're
contemplating these kinds of solutions, you should get someone with some
real world experience solving these kinds of problems every day, is very
sound advice.  Trying to pull out the truths from the myths I see stated
here nearly every day is going to be difficult for the average reader
here, I think.

- Garrett


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Constantin Gonzalez

Hi,

On 05/ 5/11 03:02 PM, Edward Ned Harvey wrote:

>> From: Garrett D'Amore [mailto:garr...@nexenta.com]
>>
>> We have customers using dedup with lots of vm images... in one extreme
>> case they are getting dedup ratios of over 200:1!
>
> I assume you're talking about a situation where there is an initial VM image,
> and then to clone the machine, the customers copy the VM, correct?
> If that is correct, have you considered ZFS cloning instead?
>
> When I said dedup wasn't good for VM's, what I'm talking about is:  If there
> is data inside the VM which is cloned...  For example if somebody logs into
> the guest OS and then does a "cp" operation...  Then dedup of the host is
> unlikely to be able to recognize that data as cloned data inside the virtual
> disk.


ZFS cloning and ZFS dedup are solving two problems that are related, but
different:

- Through Cloning, a lot of space can be saved in situations where it is
  known beforehand that data is going to be used multiple times from multiple
  different "views". Virtualization is a perfect example of this.

- Through Dedup, space can be saved in situations where the duplicate nature
  of data is not known, or not known beforehand. Again, in virtualization
  scenarios, this could be common modifications to VM images that are
  performed multiple times, but not anticipated, such as extra software,
  OS patches, or simply many users saving the same files to their local
  desktops.

To go back to the "cp" example: If someone logs into a VM that is backed by
ZFS with dedup enabled, then copies a file, the extra space that the file will
take will be minimal. The act of copying the file will break down into a
series of blocks that will be recognized as duplicate blocks.

This is completely independent of the clone nature of the underlying VM's
backing store.

But I agree that the biggest savings are to be expected from cloning first,
as they typically translate into n GB (for the base image) x # of users,
which is a _lot_.

Dedup is still the icing on the cake for all those data blocks that were
unforeseen. And that can be a lot, too, as everyone who has seen cluttered
desktops full of downloaded files can probably confirm.


Cheers,
   Constantin


--

Constantin Gonzalez Schmitz, Sales Consultant,
Oracle Hardware Presales Germany
Phone: +49 89 460 08 25 91  | Mobile: +49 172 834 90 30
Blog: http://constantin.glez.de/ | Twitter: zalez

ORACLE Deutschland B.V. & Co. KG, Sonnenallee 1, 85551 Kirchheim-Heimstetten

ORACLE Deutschland B.V. & Co. KG
Hauptverwaltung: Riesstraße 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Jürgen Kunz, Marcel van de Molen, Alexander van der Ven

Oracle is committed to developing practices and products that help protect the
environment
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Edward Ned Harvey
> From: Erik Trimble [mailto:erik.trim...@oracle.com]
> 
> Using the standard c_max value of 80%, remember that this is 80% of the
> TOTAL system RAM, including that RAM normally dedicated to other
> purposes.  So long as the total amount of RAM you expect to dedicate to
> ARC usage (for all ZFS uses, not just dedup) is less than 4 times that
> of all other RAM consumption, you don't need to "overprovision".

Correct, usually you don't need to overprovision for the sake of ensuring
enough ram available for OS and processes.  But you do need to overprovision
25% if you want to increase the size of your usable ARC without reducing the
amount of ARC you currently have in the system being used to cache other
files etc.


> Any
> entry that is migrated back from L2ARC into ARC is considered "stale"
> data in the L2ARC, and thus, is no longer tracked in the ARC's reference
> table for L2ARC.

Good news.  I didn't know that.  I thought the L2ARC was still valid, even
if something was pulled back into ARC.

So there are two useful models:
(a) The upper bound:  The whole DDT is in ARC, and the whole L2ARC is filled
with average-size blocks.
or
(b) The lower bound:  The whole DDT is in L2ARC, and all the rest of the
L2ARC is filled with average-size blocks.  ARC requirements are based only
on L2ARC references.

The actual usage will be something between (a) and (b)...  And the actual is
probably closer to (b)

In my test system:
(a)  (upper bound)
On my test system I guess the OS and processes consume 1G.  (I'm making that
up without any reason.)
On my test system I guess I need 8G in the system to get reasonable
performance without dedup or L2ARC.  (Again, I'm just making that up.)
I need 7G for DDT and 
I have 748982 average-size blocks in L2ARC, which means 131820832 bytes =
125M or 0.1G for L2ARC
I really just need to plan for 7.1G ARC usage
Multiply by 5/4 and it means I need 8.875G system ram
My system needs to be built with at least 8G + 8.875G = 16.875G.

(b)  (lower bound)
On my test system I guess the OS and processes consume 1G.  (I'm making that
up without any reason.)
On my test system I guess I need 8G in the system to get reasonable
performance without dedup or L2ARC.  (Again, I'm just making that up.)
I need 0G for DDT  (because it's in L2ARC) and 
I need 3.4G ARC to hold all the L2ARC references, including the DDT in L2ARC
So I really just need to plan for 3.4G ARC for my L2ARC references.
Multiply by 5/4 and it means I need 4.25G system ram
My system needs to be built with at least 8G + 4.25G = 12.25G.
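
The arithmetic above is easy to redo for other systems (a sketch; plug in your
own baseline-RAM, DDT, and L2ARC-reference figures):
	# upper bound: (DDT + L2ARC reference headers) * 5/4 + baseline RAM
	echo "scale=3; (7 + 0.1) * 5/4 + 8" | bc    # 16.875
	# lower bound: (ARC headers for L2ARC, incl. DDT) * 5/4 + baseline RAM
	echo "scale=3; 3.4 * 5/4 + 8" | bc          # 12.250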

Thank you for your input, Erik.  Previously I would have only been
comfortable with 24G in this system, because I was calculating a need for
significantly higher than 16G.  But now, what we're calling the upper bound
is just *slightly* higher than 16G, while the lower bound and most likely
actual figure is significantly lower than 16G.  So in this system, I would
be comfortable running with 16G.  But I would be even more comfortable
running with 24G.   ;-)

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Deduplication Memory Requirements

2011-05-05 Thread Edward Ned Harvey
> From: Garrett D'Amore [mailto:garr...@nexenta.com]
> 
> We have customers using dedup with lots of vm images... in one extreme
> case they are getting dedup ratios of over 200:1!

I assume you're talking about a situation where there is an initial VM image, 
and then to clone the machine, the customers copy the VM, correct?
If that is correct, have you considered ZFS cloning instead?

When I said dedup wasn't good for VM's, what I'm talking about is:  If there is 
data inside the VM which is cloned...  For example if somebody logs into the 
guest OS and then does a "cp" operation...  Then dedup of the host is unlikely 
to be able to recognize that data as cloned data inside the virtual disk.


> Our customers have the ability to access our systems engineers to design the
> solution for their needs.  If you are serious about doing this stuff right, 
> work
> with someone like Nexenta that can engineer a complete solution instead of
> trying to figure out which of us on this forum are quacks and which are
> cracks.  :)

Is this a zfs discussion list, or a nexenta sales & promotion list?

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-05-05 Thread Joerg Schilling
Ian Collins  wrote:

> >> *ufsrestore works fine on ZFS filesystems (although I haven't tried it
> >> with any POSIX ACLs on the original ufs filesystem, which would probably
> >> simply get lost).
> > star -copy -no-fsync is typically 30% faster than ufsdump | ufsrestore.
> >
> Does it preserve ACLs?

Star supports ACLs from the withdrawn POSIX draft.

Star could already support ZFS ACLs if Sun had offered a correctly working
ACL support library when they introduced ZFS ACLs. Unfortunately it took some
time until this lib was fixed, and since then I have had other projects that
took my time. ZFS ACLs are not forgotten, however.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Faster copy from UFS to ZFS

2011-05-05 Thread Joerg Schilling
Erik Trimble  wrote:

> rsync is indeed slower than star; so far as I can tell, this is due 
> almost exclusively to the fact that rsync needs to build an in-memory 
> table of all work being done *before* it starts to copy. After that, it 
> copies at about the same rate as star (my observations). I'd have to 
> look at the code, but rsync appears to internally buffer a significant
> amount (due to its expected network use pattern), which helps for ZFS
> copying.  The one thing I'm not sure of is whether rsync uses a socket, 
> pipe, or semaphore method when doing same-host copying. I presume socket 
> (which would slightly slow it down vs star).

The reason why star is faster than any other copy method is based on the fact 
that star is not implemented like historical tar or cpio implementations.

Since around 1990, star forks into two processes unless you forbid this by an 
option. In the normal modes, one of them is the "archive process" that just 
reads or writes from/to the archive file or tape, the other is the tar process 
that understands the archive content and deals with the filesystem (the 
direction of the filesystem operation depends on whether it is in extract or
create mode).

Between both processes, there is a large FIFO of shared memory that is used to 
share the data. If the FIFO has much free space, star will read files in one 
single chunk into the FIFO; this is another reason for its speed.

Another advantage of star is that it reads every directory in one large chunk
and thus allows the OS to optimize at this point. BTW: An OS that floods
(and probably overflows) the stat/vnode cache in such a case may cause an
unneeded slowdown.

In copy mode, star starts two archive processes and a FIFO between them.

The create process tries to keep the FIFO as full as possible, and as it makes
sense to use a FIFO size of up to approx. half of the real system memory, this
FIFO may be really huge, so it will even be able to keep modern tapes streaming
for at least 30-60 seconds. Ufsdump only allows a small number of 126 kB buffers
(I believe it is 6 buffers) and thus ufsdump | ufsrestore is tightly coupled,
while star allows both creation and extraction of the internal virtual archive
to run nearly independently of each other. This way, star does not need to wait
every time extraction slows down, but just fills the FIFO instead.
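
For the archives, a typical star copy invocation looks roughly like this (a
sketch; directory names are made up, and the exact option spelling should be
checked against star(1) on your build):
	star -copy -p -acl -sparse -no-fsync fs=2g -C /ufs_src . /zfs_dst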

Before SEEK_HOLE/SEEK_DATA existed, the only place where ufsdump was faster
than star was sparse files. This is why I talked with Jeff Bonwick in
September 2004 to find a useful interface for user space programs (in particular
star) that do not read the filesystem at block level (like ufsdump) but cleanly
in the documented POSIX way.

Since SEEK_HOLE/SEEK_DATA have been introduced, there is not a single known
case where star is not at least 30% faster than ufsdump. BTW: ufsdump is another
implementation that first sits and collects all filenames before it starts to 
read file content.


> That said, rsync is really the only solution if you have a partial or 
> interrupted copy.  It's also really the best method to do verification.

Star offers another method to continue interrupted extracts or copies:

Star sets the time stamp of an incomplete file to 0 (1.1.1970 GMT). As star
does not overwrite files unless they are newer in the archive, star can
skip the other files in extract mode and continue with the missing files or
with the file(s) that have the time stamp 0.

Jörg

-- 
 EMail:jo...@schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
   j...@cs.tu-berlin.de(uni)  
   joerg.schill...@fokus.fraunhofer.de (work) Blog: 
http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss