Re: [zfs-discuss] DDT sync?

2011-06-01 Thread Frank Van Damme
2011/6/1 Edward Ned Harvey opensolarisisdeadlongliveopensola...@nedharvey.com:
 (2)  The above is pretty much the best you can do, if your server is going
 to be a normal server, handling both reads & writes.  Because the data and
 the meta_data are both stored in the ARC, the data has a tendency to push
 the meta_data out.  But in a special use case - Suppose you only care about
 write performance and saving disk space.  For example, suppose you're the
 destination server of a backup policy.  You only do writes, so you don't
 care about keeping data in cache.  You want to enable dedup to save cost on
 backup disks.  You only care about keeping meta_data in ARC.  If you set
 primarycache=metadata ... I'll go test this now.  The hypothesis is that
 my arc_meta_used should actually climb up to the arc_meta_limit before I
 start hitting any disk reads, so my write performance with/without dedup
 should be pretty much equal up to that point.  I'm sacrificing the potential
 read benefit of caching data in ARC, in order to hopefully gain write
 performance - So write performance can be just as good with dedup enabled or
 disabled.  In fact, if there's much duplicate data, the dedup write
 performance in this case should be significantly better than without dedup.

I guess this is pretty much why I have primarycache=metadata and
set zfs:zfs_arc_meta_limit=0x1
set zfs:zfs_arc_min=0xC000
in /etc/system.

And the ARC size on this box tends to drop far below arc_min after a
few days, notwithstanding the fact that it's supposed to be a hard
minimum.
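
(For the curious, this is how I watch it - a sketch, using the stock
kstat/mdb tools on a Solaris-derived build:)

  # sample total ARC size and metadata usage every 10 seconds
  kstat -p zfs:0:arcstats:size zfs:0:arcstats:arc_meta_used 10
  # or a one-shot summary (shows size, c_min, arc_meta_limit, etc.)
  echo "::arc" | mdb -k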

I call for an arc_data_max setting :)

-- 
Frank Van Damme
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


Re: [zfs-discuss] DDT sync?

2011-05-31 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 So here's what I'm going to do.  With arc_meta_limit at 7680M, of which
 100M was consumed naturally, that leaves me 7580 to play with.  Call it
 7500M.  Divide by 412 bytes, it means I'll hit a brick wall when I reach a
 little over 19M blocks.  Which means if I set my recordsize to 32K, I'll
 hit that limit around 582G disk space consumed.  That is my hypothesis,
 and now beginning the test.

Well, this is interesting.  With 7580MB theoretically available for DDT in
ARC, the expectation was that 19M DDT entries would finally max out the ARC
and then I'd jump off a performance cliff and start seeing a bunch of pool
reads killing my write performance.

In reality, what I saw was:  
* Up to a million blocks, the performance difference with/without dedup was
basically negligible.  Write time with dedup = 1x write time without dedup.
* After a million, the dedup write time consistently reached 2x longer than
the native write time.  This happened when my ARC became full of user data
(not metadata).
* As the # of unique blocks in pool increased, gradually, the dedup write
time deviated from the non-dedup write time.  2x, 3x, 4x.  I got a
consistent 4x longer write time with dedup enabled, after the pool reached
22.5M blocks.
* And then it jumped off a cliff.  When I got to 24M blocks, that was the
last datapoint I was able to collect: 28x slower write with dedup (4966 sec to
write 3G, as compared to 178sec), and for the first time, a nonzero rm time.
All the way up till now, even with dedup, the rm time was zero.  But now it
was 72sec.  
* I waited another 6 hours, and never got another data point.  So I found
the limit where the pool becomes unusably slow.  

At a cursory look, you might say this supported the hypothesis.  You might
say 24M compared to 19M, that's not too far off.  This could be accounted
for by using the 376-byte size of ddt_entry_t, instead of the 412-byte size
apparently measured... This would adjust the hypothesis to 21.1M blocks.

But I don't think that's quite fair.  Because my arc_meta_used never got
above 5,159M.  And I never saw the massive read overload that was predicted
to be the cause of failure.  In fact, starting from 0.4M to 0.5M blocks
(early, early, early on), from that point onward I always had 40-50 reads
for every 250 writes.  Right to the bitter end.  And my ARC is full of user
data, not metadata.

So the conclusions I'm drawing are:

(1)  If you don't tweak arc_meta_limit, and you want to enable dedup, you're
toast.  But if you do tweak arc_meta_limit, you might reasonably expect
dedup to perform 3x to 4x slower on unique data...  And based on results
that I haven't talked about yet here, dedup performs 3x to 4x faster on
duplicate data.  So if you have 50% or higher duplicate data (dedup ratio 2x
or higher) and you have plenty of memory and tweak it, then your performance
with dedup could be comparable, or even faster than running without dedup.
Of course, depending on your data patterns and usage patterns.  YMMV.
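
For concreteness, the tweak in question is the /etc/system tunable already
mentioned in this thread; a sketch (0x1E0000000 is my 7680M expressed in
bytes - size it to your own RAM):

  * raise the ARC metadata cap so the DDT can stay resident
  set zfs:zfs_arc_meta_limit=0x1E0000000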

(2)  The above is pretty much the best you can do, if your server is going
to be a normal server, handling both reads & writes.  Because the data and
the meta_data are both stored in the ARC, the data has a tendency to push
the meta_data out.  But in a special use case - Suppose you only care about
write performance and saving disk space.  For example, suppose you're the
destination server of a backup policy.  You only do writes, so you don't
care about keeping data in cache.  You want to enable dedup to save cost on
backup disks.  You only care about keeping meta_data in ARC.  If you set
primarycache=metadata ... I'll go test this now.  The hypothesis is that
my arc_meta_used should actually climb up to the arc_meta_limit before I
start hitting any disk reads, so my write performance with/without dedup
should be pretty much equal up to that point.  I'm sacrificing the potential
read benefit of caching data in ARC, in order to hopefully gain write
performance - So write performance can be just as good with dedup enabled or
disabled.  In fact, if there's much duplicate data, the dedup write
performance in this case should be significantly better than without dedup.
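
(The setup for that test is just one property; a sketch, where "tank" is a
placeholder pool name:)

  # cache only metadata in ARC - file data will never be cached
  zfs set primarycache=metadata tank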



Re: [zfs-discuss] DDT sync?

2011-05-29 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 (1) I'll push the recordsize back up to 128k, and then repeat this test
 with something slightly smaller than 128k.  Say, 120k.

Good news.  :-)  Changing the recordsize made a world of difference.  I've
had this benchmark running for the last 2-3 days now, and I'm up to 3.4M
blocks in the pool (as reported by zdb -D) ...  Over this period, my
arc_meta_used climbed from initial 100M to 1500M.  Call it 412 bytes per
unique block in the pool.  Not too far from what was previously said (376
bytes based on sizeof ddt_entry_t).  

I am repeating a cycle: time writing some unique data with dedup off, then
time rm'ing the file; then time writing the same data again with dedup on
(no verify), then time rm'ing the file; and finally, time writing it again,
leaving it on disk just to hog more DDT entries.  Then move on to the next
unique file.
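
For anyone who wants to reproduce it, the cycle is roughly the following
sketch (untested as written; "tank", the staging path and the sizes are
placeholders for my actual setup):

  i=1
  while true ; do
    # 3G of unique, never-repeating data per iteration
    dd if=/dev/urandom of=/tmp/unique.$i bs=128k count=24576

    zfs set dedup=off tank
    time cp /tmp/unique.$i /tank/testfile ; time rm /tank/testfile

    zfs set dedup=on tank    # checksum dedup, no verify
    time cp /tmp/unique.$i /tank/testfile ; time rm /tank/testfile

    # write it once more and keep it, to permanently hog DDT entries
    time cp /tmp/unique.$i /tank/file.$i

    rm /tmp/unique.$i
    i=`expr $i + 1`
  done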

Each file is about 3G in size, on a single SATA disk.  No records are ever
repeated - every block in the pool is unique.  Presently I'm up to about
500G written of a 1T drive, and the performance difference with/without
dedup is 100sec vs 75sec to write the same 3G of data.

Currently my arc_meta_limit is 7680M (I tweaked it) which theoretically
means I should be able to continue this up to around 19M blocks or so.
Clearly much more than my little 1T drive can handle on a 128K recordsize.
But I don't just want to see it succeed - I want to predict where it will
fail, and confirm the hypothesis with a measured result.

So here's what I'm going to do.  With arc_meta_limit at 7680M, of which 100M
was consumed naturally, that leaves me 7580 to play with.  Call it 7500M.
Divide by 412 bytes, it means I'll hit a brick wall when I reach a little
over 19M blocks.  Which means if I set my recordsize to 32K, I'll hit that
limit around 582G disk space consumed.  That is my hypothesis, and now
beginning the test.
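
Spelled out, so anyone can check the arithmetic:

  7500M / 412 bytes per entry  =  7500 * 2^20 / 412  ~= 19.1M DDT entries
  19.1M blocks * 32K per block =  19.1M * 32 * 2^10  ~= 582G consumed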



Re: [zfs-discuss] DDT sync?

2011-05-27 Thread Frank Van Damme
On 26-05-11 13:38, Edward Ned Harvey wrote:
 Perhaps a property could be
 set, which would store the DDT exclusively on that device.

Oh yes please, let me put my DDT on an SSD.

But what if you lose it (the vdev), would there be a way to reconstruct
the DDT (which you need to be able to delete old, deduplicated files)?
Let me guess - this requires tracing down all blocks and depends on an
infamous feature called BPR? ;)

 Both the necessity to read  write the primary storage pool...  That's
 very hurtful.  And even with infinite ram, it's going to be
 unavoidable for things like destroying snapshots, or anything at all
 you ever want to do after a reboot.  

Indeed. But then again, ZFS also doesn't (yet?) keep its L2ARC cache
between reboots. Once it does, you could flush out the entire ARC to
L2ARC before reboot.

-- 
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


Re: [zfs-discuss] DDT sync?

2011-05-27 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Frank Van Damme
 
  On 26-05-11 13:38, Edward Ned Harvey wrote:
  Perhaps a property could be
  set, which would store the DDT exclusively on that device.
 
 Oh yes please, let me put my DDT on an SSD.
 
  But what if you lose it (the vdev), would there be a way to reconstruct
 the DDT (which you need to be able to delete old, deduplicated files)?
 Let me guess - this requires tracing down all blocks and depends on an
 infamous feature called BPR? ;)

How is that different from putting your DDT on a hard drive, which is what
we currently do?



Re: [zfs-discuss] DDT sync?

2011-05-27 Thread Jim Klimov
  From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
  boun...@opensolaris.org] On Behalf Of Frank Van Damme
  
  On 26-05-11 13:38, Edward Ned Harvey wrote:
  But what if you lose it (the vdev), would there be a way to 
 reconstruct the DDT (which you need to be able to delete old, 
 deduplicated files)?
  Let me guess - this requires tracing down all blocks and 
  depends on an infamous feature called BPR? ;)
 
 How is that different from putting your DDT on a hard drive, 
 which is what we currently do?

I think you two might be talking about somewhat different ideas
of implementing such DDT storage.
 
One approach might be like we have now: the DDT blocks are
spread in your pool consisting of several top-level vdevs and
are redundantly protected by ZFS raidz or mirroring. If one of
such top-level vdevs is lost, the whole pool is faulted or dead.
 
Another approach might be more like a dedicated extra device
(or mirror/raidz of devices) like L2ARC or rather ZIL (more
analogies below) - this task would need write-oriented media
like SLC SSDs with a large capacity, and the throttling of the
L2ARC hardware link plus the potential unreliability of MLCs
might make DDT storage a bad neighbor for L2ARC SSDs.
 
Since ZILs are usually treated as write-only devices with a
low capacity requirement (i.e. 2-4GB might be more than
enough), dedicating the rest of even a 20GB SSD to the
DDT may be a good investment overall.
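
The closest you can get today is slicing such an SSD by hand - a sketch
(device/slice names hypothetical, and note this only buys you the slog
plus L2ARC; nothing will pin the DDT there):

  # s0 = small slog slice (2-4GB), s1 = the rest as L2ARC
  zpool add tank log c2t0d0s0
  zpool add tank cache c2t0d0s1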
 
If the ZIL device (mirror) fails, you might need to roll back
your pool a few transactions, detach the ZIL and fall
back to using HDD blocks for the ZIL.
 
Since zdb -s can seemingly construct a DDT from scratch,
and since for reads you still have many references to a single
on-disk block (DDT is not used for reads, right?) - you can
reconstruct the DDT for either in-pool or external storage.
That might take some downtime, true. But if coupled with
offline dedup (as discussed in another thread) running in the
background, maybe not.
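
(The simulation I mean is presumably zdb's dedup simulation - a sketch,
assuming the uppercase -S flag is the right one:)

  # walk every block and print the DDT that dedup would have produced
  zdb -S poolname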
 
One thing to think of, though: what would we do when the
dedicated DDT storage overflows - write the extra entries
into the HDD pool like we do now? 
(BTW, what do we do with a dedicated ZIL device - flush the 
TXG early?)
 
//Jim
 


Re: [zfs-discuss] DDT sync?

2011-05-27 Thread Richard Elling
On May 27, 2011, at 6:20 AM, Jim Klimov wrote:

   From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
   boun...@opensolaris.org] On Behalf Of Frank Van Damme
   
   On 26-05-11 13:38, Edward Ned Harvey wrote:
   But what if you lose it (the vdev), would there be a way to 
  reconstruct the DDT (which you need to be able to delete old, 
  deduplicated files)?
   Let me guess - this requires tracing down all blocks and 
   depends on an infamous feature called BPR? ;)
  
  How is that different from putting your DDT on a hard drive, 
  which is what we currently do?
 I think you two might be talking about somewhat different ideas
 of implementing such DDT storage.
  
 One approach might be like we have now: the DDT blocks are
 spread in your pool consisting of several top-level vdevs and
 are redundantly protected by ZFS raidz or mirroring. If one of
 such top-level vdevs is lost, the whole pool is faulted or dead.
  
 Another approach might be more like a dedicated extra device
 (or mirror/raidz of devices) like L2ARC or rather ZIL (more
 analogies below) - this task would need write-oriented media
 like SLC SSDs with a large capacity, and the throttling of the
 L2ARC hardware link plus the potential unreliability of MLCs
 might make DDT storage a bad neighbor for L2ARC SSDs.

I filed an RFE for this about 2 years ago... I would send a URL but
Oracle shut down the OpenSolaris bugs database interface and
left what remains mostly useless :-(

 Since ZILs are usually treated as write-only devices with a
 low capacity requirement (i.e. 2-4GB might be more than
 enough), dedicating the rest of even a 20GB SSD to the
 DDT may be a good investment overall.
  
 If the ZIL device (mirror) fails, you might need to roll back
 your pool a few transactions, detach the ZIL and fall
 back to using HDD blocks for the ZIL.
  
 Since zdb -s can seemingly construct a DDT from scratch,
 and since for reads you still have many references to a single
 on-disk block (DDT is not used for reads, right?) - you can
 reconstruct the DDT for either in-pool or external storage.

Nope. If you lose the DDT, then you lose any or all deduped data.
Today, the DDT is treated like metadata, which means there are always
at least 2 copies in the pool.

 That might take some downtime, true. But if coupled with
 offline dedup (as discussed in another thread) running in the
 background, maybe not.
  
 One thing to think of, though: what would we do when the
 dedicated DDT storage overflows - write the extra entries
 into the HDD pool like we do now?

Other designs which use fixed-sized DDT areas suffer from that design
limitation -- once it fills, they no longer dedup.
 

 (BTW, what do we do with a dedicated ZIL device - flush the
 TXG early?)

No, just write into the pool. Frankly, I don't see this as a problem for
real-world machines.
 -- richard



Re: [zfs-discuss] DDT sync?

2011-05-26 Thread Edward Ned Harvey
 From: Daniel Carosone [mailto:d...@geek.com.au]
 Sent: Wednesday, May 25, 2011 10:10 PM
 
 These are additional
 iops that dedup creates, not ones that it substitutes for others in
 roughly equal number.

Hey ZFS developers - Of course there are many ways to possibly address these
issues.  Tweaking ARC prioritization and the like...  Has anybody considered
the possibility of making an option to always keep DDT on a specific vdev?
Presumably a nonvolatile mirror with very fast iops.  It is likely a lot of
people already have cache devices present...  Perhaps a property could be
set, which would store the DDT exclusively on that device.  Naturally there
are implications - you would need to recommend mirroring the device, which
you can't do, so maybe we're talking about slicing the cache device...  As I
said, a lot of ways to address the issue.

Both the necessity to read & write the primary storage pool...  That's very
hurtful.  And even with infinite ram, it's going to be unavoidable for
things like destroying snapshots, or anything at all you ever want to do
after a reboot.



Re: [zfs-discuss] DDT sync?

2011-05-26 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
  
 Both the necessity to read & write the primary storage pool...  That's
 very hurtful.

Actually, I'm seeing two different modes of degradation:
(1) Previously described.  When I run into arc_meta_limit, in a pool of
approx 1.0M to 1.5M unique blocks, I suffer ~50 reads for every new unique
write.  Countermeasure was easy.  Increase arc_meta_limit.

(2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a
test file requires 10m38s to write and 2m54s to delete, but with dedup
disabled it only requires 0m40s to write and 0m13s to delete exactly the
same file.  So ... 13x performance degradation.  

zpool iostat is indicating the disks are fully utilized doing writes.  No
reads.  During this time, it is clear the only bottleneck is write iops.
There are still oodles of free mem.  I am not near arc_meta_limit, nor c_max.
The cpu is 99% idle.  It is write iops limited.  Period.
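
(All of the above comes from the stock tools; for reproducibility,
something like - "tank" being a placeholder pool name:)

  # per-vdev ops and bandwidth, 5-second samples - all writes, no reads
  zpool iostat -v tank 5
  # ARC totals: size vs c_max, arc_meta_used vs arc_meta_limit
  echo "::arc" | mdb -k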

Assuming DDT maintenance is the only disk write overhead that dedup adds, I
can only conclude that with dedup enabled, and a couple million unique
blocks in the pool, the DDT must require substantial maintenance.  In my
case, something like 12 DDT writes for every 1 actual intended new unique
file block write.

For the heck of it, since this machine has no other purpose at the present
time, I plan to do two more tests.  And I'm open to suggestions if anyone
can think of anything else useful to measure: 

(1) I'm currently using a recordsize of 512B, because the intended purpose
of this test has been to rapidly generate a high number of new unique
blocks.  Now just to eliminate the possibility that I'm shooting myself in
the foot by systematically generating a worst case scenario, I'll try to
systematically generate a best-case scenario.  I'll push the recordsize back
up to 128k, and then repeat this test with something slightly smaller than
Say, 120k. That way there should be plenty of room available for any write
aggregation the system may be trying to perform.

(2) For the heck of it, why not.  Disable ZIL and confirm that nothing
changes.  (Understanding so far is that all these writes are async, and
therefore ZIL should not be a factor.  Nice to confirm this belief.)
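
For reference, the knobs for these two tests - a sketch, with "tank" a
placeholder, and noting that builds without the sync property need the
zil_disable tunable instead:

  # test (1): best-case recordsize
  zfs set recordsize=128k tank
  # test (2): take the ZIL out of the picture (newer builds)
  zfs set sync=disabled tank
  # on older builds: set zfs:zil_disable=1 in /etc/system, and reboot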



Re: [zfs-discuss] DDT sync?

2011-05-26 Thread Daniel Carosone
On Thu, May 26, 2011 at 07:38:05AM -0400, Edward Ned Harvey wrote:
  From: Daniel Carosone [mailto:d...@geek.com.au]
  Sent: Wednesday, May 25, 2011 10:10 PM
  
  These are additional
  iops that dedup creates, not ones that it substitutes for others in
  roughly equal number.
 
 Hey ZFS developers - Of course there are many ways to possibly address these
 issues.  Tweaking ARC prioritization and the like...  Has anybody considered
 the possibility of making an option to always keep DDT on a specific vdev?
 Presumably a nonvolatile mirror with very fast iops.  It is likely a lot of
 people already have cache devices present...  Perhaps a property could be
 set, which would store the DDT exclusively on that device.  Naturally there
 are implications - you would need to recommend mirroring the device, which
 you can't do, so maybe we're talking about slicing the cache device...  As I
 said, a lot of ways to address the issue.

I think l2arc persistence will just about cover that nicely, perhaps
in combination with some smarter auto-tuning for arc percentages with
large DDT. 

The writes are async, and aren't so much a problem in themselves other
than that they can get in the way of other more important things.  The
best thing you can do with them is spread them as widely as possible,
rather than bottlenecking specific devices/channels/etc. 

If you have a capacity shortfall overall, either you make the other
things faster in preference (zil, nv write cache, more arc for reads)
or you make the whole pool faster for iops (different layout, more
spindles) or you limit dedup usage within your capacity.

Another thing that can happen is that you have enough other sync
writes going on that DDT writes lag behind and delay the txg close. 
In this case, the same solutions above apply, as does judicious use of
sync=disabled to allow more of the writes to be async.

 Both the necessity to read & write the primary storage pool...  That's very
 hurtful.  And even with infinite ram, it's going to be unavoidable for
 things like destroying snapshots, or anything at all you ever want to do
 after a reboot.

Yeah, again, persistent l2arc helps the post-reboot case.  With
infinite ram, I'm not sure I'd have much use for dedup :)

--
Dan.



Re: [zfs-discuss] DDT sync?

2011-05-26 Thread Daniel Carosone
On Thu, May 26, 2011 at 10:25:04AM -0400, Edward Ned Harvey wrote:
 (2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a
 test file requires 10m38s to write and 2m54s to delete, but with dedup
 disabled it only requires 0m40s to write and 0m13s to delete exactly the
 same file.  So ... 13x performance degradation.  
 
 zpool iostat is indicating the disks are fully utilized doing writes.  No
 reads.  During this time, it is clear the only bottleneck is write iops.
 There are still oodles of free mem.  I am not near arc_meta_limit, nor c_max.
 The cpu is 99% idle.  It is write iops limited.  Period.

Ok.

 Assuming DDT maintenance is the only disk write overhead that dedup adds, I
 can only conclude that with dedup enabled, and a couple million unique
 blocks in the pool, the DDT must require substantial maintenance.  In my
 case, something like 12 DDT writes for every 1 actual intended new unique
 file block write.

Where did that number come from?  Are there actually 13x as many IOs, or is
that just extrapolated from elapsed time?  It won't be anything like a
linear extrapolation, especially if the heads are thrashing.

Note that DDT blocks have their own allocation metadata to be updated
as well.

Try to get a number for actual total IOs and scaling factor.
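
Something like this would give real counts rather than an elapsed-time
extrapolation (a sketch; needs root, and counts all disk I/O on the box,
so keep the system otherwise idle):

  # count physical I/Os per device, split into reads vs writes
  dtrace -n 'io:::start {
      @[args[1]->dev_statname,
        args[0]->b_flags & B_READ ? "read" : "write"] = count();
  }'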

 For the heck of it, since this machine has no other purpose at the present
 time, I plan to do two more tests.  And I'm open to suggestions if anyone
 can think of anything else useful to measure: 
 
 (1) I'm currently using a recordsize of 512B, because the intended purpose
 of this test has been to rapidly generate a high number of new unique
 blocks.  Now just to eliminate the possibility that I'm shooting myself in
 the foot by systematically generating a worst case scenario, I'll try to
 systematically generate a best-case scenario.  I'll push the recordsize back
 up to 128k, and then repeat this test with something slightly smaller than 128k.
 Say, 120k. That way there should be plenty of room available for any write
 aggregation the system may be trying to perform.
 
 (2) For the heck of it, why not.  Disable ZIL and confirm that nothing
 changes.  (Understanding so far is that all these writes are async, and
 therefore ZIL should not be a factor.  Nice to confirm this belief.)

Good tests. See how the IO expansion factor changes with block size.

(3) Experiment with the maximum number of allowed outstanding concurrent
I/Os per disk (I forget the specific tunable OTTOMH).  If the load
really is ~100% async write, this might well be a case where raising
that figure lets the disk firmware maximise throughput without causing
the latency impact that can happen otherwise (and leads to
recommendations to shorten the limit in general cases).

(4) See if changing the txg sync interval to something (much) longer
helps. Multiple DDT entries can live in the same block, and a longer
interval may allow coalescing of these writes.
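
OTTOMH again, so treat the names as approximate: I believe the knobs for
(3) and (4) are these (/etc/system syntax, values purely illustrative):

  * (3) allow more I/Os outstanding per vdev
  set zfs:zfs_vdev_max_pending=35
  * (4) stretch the txg sync interval, in seconds
  set zfs:zfs_txg_timeout=30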

--
Dan.



Re: [zfs-discuss] DDT sync?

2011-05-25 Thread Matthew Ahrens
On Wed, May 25, 2011 at 2:23 PM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 I've finally returned to this dedup testing project, trying to get a handle
 on why performance is so terrible.  At the moment I'm re-running tests and
 monitoring memory_throttle_count, to see if maybe that's what's causing the
 limit.  But while that's in progress and I'm still thinking...



 I assume the DDT tree must be stored on disk, in the regular pool, and each
 entry is stored independently from each other entry, right?  So whenever
 you're performing new unique writes, that means you're creating new entries
 in the tree, and every so often the tree will need to rebalance itself.  By
 any chance, are DDT entry creations treated as sync writes?  If so, that
 could be hurting me.  For every new unique block written, there might be a
 significant amount of small random writes taking place that are necessary to
 support the actual data write.  Anyone have any knowledge to share along
 these lines?


The DDT is a ZAP object, so it is an on-disk hashtable, free of O(log(n))
rebalancing operations.  It is written asynchronously, from syncing context.
That said, for each block written (unique or not), the DDT must be updated,
which means reading and then writing the block that contains that dedup
table entry, and the indirect blocks to get to it.  With a reasonably large
DDT, I would expect about 1 write to the DDT for every block written to the
pool (or written but actually dedup'd).

--matt


Re: [zfs-discuss] DDT sync?

2011-05-25 Thread Edward Ned Harvey
 From: Matthew Ahrens [mailto:mahr...@delphix.com]
 Sent: Wednesday, May 25, 2011 6:50 PM
 
 The DDT is a ZAP object, so it is an on-disk hashtable, free of O(log(n))
 rebalancing operations.  It is written asynchronously, from syncing
 context.  That said, for each block written (unique or not), the DDT must
 be updated, which means reading and then writing the block that contains
 that dedup table entry, and the indirect blocks to get to it.  With a
 reasonably large DDT, I would expect about 1 write to the DDT for every
 block written to the pool (or written but actually dedup'd).

So ... If the DDT were already cached completely in ARC, and I write a new
unique block to a file, ideally I would hope (after write buffering because
all of this will be async) that one write will be completed to disk - It
would be the aggregate of the new block plus the new DDT entry, but because
of write aggregation it should literally be a single seek+latency penalty. 

Most likely in reality, additional writes will be necessary, to update the
parent block pointers or parent DDT branches and so forth, but hopefully
that's all managed well and kept to a minimum.  So maybe a single new write
ultimately yields a dozen times the disk access time...

I'm homing in on this, but so far what I'm seeing is ... zpool iostat
indicates 1000 reads taking place for every 20 writes.  This is on a
literally 100% idle pool, where the only activity in the system is me
performing this write benchmark.  The only logical explanation I see for
this behavior is to conclude the DDT must not be cached in ARC.  So every
write yields a flurry of random reads...  50 or so...
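
One way to cross-check that is zdb's DDT summary, which reports entry
counts and the table's size both on disk and in core - a sketch, "tank"
being a placeholder:

  # prints per-table entry counts, sizes, and the full DDT histogram
  zdb -DD tank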

Anyway, like I said, still exploring this.  No conclusions drawn yet.



Re: [zfs-discuss] DDT sync?

2011-05-25 Thread Daniel Carosone
On Wed, May 25, 2011 at 03:50:09PM -0700, Matthew Ahrens wrote:
  That said, for each block written (unique or not), the DDT must be updated,
 which means reading and then writing the block that contains that dedup
 table entry, and the indirect blocks to get to it.  With a reasonably large
 DDT, I would expect about 1 write to the DDT for every block written to the
 pool (or written but actually dedup'd).

That, right there, illustrates exactly why some people are
disappointed wrt performance expectations from dedup.

To paraphrase, and in general: 

 * for write, dedup may save bandwidth but will not save write iops.
 * dedup may amplify iops with more metadata reads 
 * dedup may turn larger sequential io into smaller random io patterns 
 * many systems will be iops bound before they are bandwidth or space
   bound (and l2arc only mitigates read iops)
 * any iops benefit will only come on later reads of dedup'd data, so
   is heavily dependent on access pattern.

Assessing whether these amortised costs are worth it for you can be
complex, especially when the above is not clearly understood.

To me, the thing that makes dedup most expensive in iops is the writes
for update when a file (or snapshot) is deleted.  These are additional
iops that dedup creates, not ones that it substitutes for others in
roughly equal number.  

This load is easily forgotten in a cursory analysis, and yet is always
there in a steady state with rolling auto-snapshots.  As I've written
before, I've had some success managing this load using deferred deletes
and snapshot holds, either to spread the load or to shift it to
otherwise-quiet times, as the case demanded.  I'd rather not have to. :-)
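
The mechanics, for anyone who hasn't used them - a sketch, with tag and
snapshot names as placeholders:

  # a hold makes the snapshot undestroyable until released
  zfs hold cleanup-later tank/fs@2011-05-25
  # -d defers the destroy until the last hold is released
  zfs destroy -d tank/fs@2011-05-25
  # ...later, at a quiet time, let the deferred destroy proceed
  zfs release cleanup-later tank/fs@2011-05-25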

--
Dan.
