Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-20 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 New problem:
 
 I'm following all the advice I summarized into the OP of this thread, and
 testing on a test system.  (A laptop).  And it's just not working.  I am
 jumping into the dedup performance abyss far, far earlier than
predicted...

(resending this message, because it doesn't seem to have been delivered the
first time.  If this is a repeat, please ignore.)

Now I'm repeating all these tests on a system that more closely resembles a
server.  This is a workstation with 6 core processor, 16G ram, and a single
1TB hard disk.

In the default configuration, arc_meta_limit is 3837MB.  And as I increase
the number of unique blocks in the data pool, it is perfectly clear that
performance jumps off a cliff when arc_meta_used starts to reach that level,
which is approx 880,000 to 1,030,000 unique blocks.  FWIW, this means,
without evil tuning, a 16G server is only sufficient to run dedup on approx
33GB to 125GB unique data without severe performance degradation.  I'm
calling severe degradation anything that's an order of magnitude or worse.
(That's 40K average block size * 880,000 unique blocks, and 128K average
block size * 1,030,000 unique blocks.)

So clearly this needs to be addressed, if dedup is going to be super-awesome
moving forward.

But I didn't quit there.

So then I tweak the arc_meta_limit.  Set to 7680MB.  And repeat the test.
This time, the edge of the cliff is not so clearly defined, something like
1,480,000 to 1,620,000 blocks.  But the problem is - arc_meta_used never
even comes close to 7680MB.  At all times, I still have at LEAST 2G unused
free mem.

I have 16G physical mem, but at all times, I always have at least 2G free.
my arcstats:c_max is 15G.  But my arc size never exceeds 8.7G
my arc_meta_limit is 7680 MB, but my arc_meta_used never exceeds 3647 MB.

So what's the holdup?
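
For anyone who wants to watch the same counters while reproducing this, here is a minimal monitoring sketch.  It only assumes the kstat statistics and the mdb ::arc dcmd already shown elsewhere in this thread, and it needs root for mdb:

# Poll the ARC counters once a minute while a test runs (run as root).
while true
do
    date
    kstat -p zfs:0:arcstats:size zfs:0:arcstats:c zfs:0:arcstats:c_max
    echo ::arc | mdb -k | egrep 'meta_used|meta_limit|meta_max'
    sleep 60
done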

All of the above is, of course, just a summary.  If you want complete
overwhelming details, here they are:
http://dl.dropbox.com/u/543241/dedup%20tests/readme.txt

http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c
http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh
http://dl.dropbox.com/u/543241/dedup%20tests/parse.py
http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh

http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-1st-pass.txt
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-1st-pass-parsed.xlsx

http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-2nd-pass.txt
http://dl.dropbox.com/u/543241/dedup%20tests/work%20workstation/runtest-output-2nd-pass-parsed.xlsx





Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-11 Thread Frank Van Damme
Op 10-05-11 06:56, Edward Ned Harvey schreef:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

 BTW, here's how to tune it:

 echo arc_meta_limit/Z 0x30000000 | sudo mdb -kw

 echo ::arc | sudo mdb -k | grep meta_limit
 arc_meta_limit=   768 MB
 
 Well ... I don't know what to think yet.  I've been reading these numbers
 for like an hour, finding interesting things here and there, but nothing to
 really solidly point my finger at.
 
 The one thing I know for sure...  The free mem drops at an unnatural rate.
 Initially the free mem disappears at a rate approx 2x faster than the sum of
 file size and metadata combined.  Meaning the system could be caching the
 entire file and all the metadata, and that would only explain half of the
 memory disappearance.

I'm seeing similar things. Yesterday I first rebooted with set
zfs:zfs_arc_meta_limit=0x100000000 (that's 4 GiB) set in /etc/system and
monitored while the box was doing its regular job (taking backups).
zfs_arc_min is also set to 4 GiB. What I noticed is that shortly after
the reboot, the arc started filling up rapidly, mostly with metadata. It
shot up to:

arc_meta_max  =  3130 MB

afterwards, the number for arc_meta_used steadily dropped. Some 12 hours
ago, I started deleting files, it has deleted about 600 files since
then. Now at the moment the arc size stays right at the minimum of 2
GiB, of which metadata fluctuates around 1650 MB.

This is the output of the getmemstats.sh script you posted.

Memory: 6135M phys mem, 539M free mem, 6144M total swap, 6144M free swap
zfs:0:arcstats:c            2147483648   = 2 GiB target size
zfs:0:arcstats:c_max        5350862848   = 5 GiB
zfs:0:arcstats:c_min        2147483648   = 2 GiB
zfs:0:arcstats:data_size    829660160    = 791 MiB
zfs:0:arcstats:hdr_size     93396336     = 89 MiB
zfs:0:arcstats:other_size   411215168    = 392 MiB
zfs:0:arcstats:size         1741492896   = 1661 MiB
arc_meta_used =  1626 MB
arc_meta_limit=  4096 MB
arc_meta_max  =  3130 MB

I get way more cache misses than I'd like:

Time      read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz  c
10:01:13  3K    380   10     166   7    214   15   259   7    1G     2G
10:02:13  2K    340   16     37    2    302   46   323   16   1G     2G
10:03:13  2K    368   18     47    3    321   46   347   17   1G     2G
10:04:13  1K    348   25     44    4    303   63   335   24   1G     2G
10:05:13  2K    420   15     87    4    332   36   383   14   1G     2G
10:06:13  3K    489   16     132   6    357   35   427   14   1G     2G
10:07:13  2K    405   15     49    2    355   39   401   15   1G     2G
10:08:13  2K    366   13     40    2    326   37   366   13   1G     2G
10:09:13  1K    364   20     18    1    345   58   363   20   1G     2G
10:10:13  4K    370   8      59    2    311   21   369   8    1G     2G
10:11:13  4K    351   8      57    2    294   21   350   8    1G     2G
10:12:13  3K    378   10     59    2    319   26   372   10   1G     2G
10:13:13  3K    393   11     53    2    339   28   393   11   1G     2G
10:14:13  2K    403   13     40    2    363   35   402   13   1G     2G
10:15:13  3K    365   11     48    2    317   30   365   11   1G     2G
10:16:13  2K    374   15     40    2    334   40   374   15   1G     2G
10:17:13  3K    385   12     43    2    341   28   383   12   1G     2G
10:18:13  4K    343   8      64    2    279   19   343   8    1G     2G
10:19:13  3K    391   10     59    2    332   23   391   10   1G     2G


So, one explanation I can think of is that the rest of the memory are
l2arc pointers, supposing they are not actually counted in the arc
memory usage totals (AFAIK l2arc pointers are considered to be part of
arc). Then again my l2arc is still growing (slowly) and I'm only caching
metadata at the moment, so you'd think it'd shrink if there's no more
room for l2arc pointers. Besides I'm getting very little reads from ssd:

 capacity operationsbandwidth
pool  alloc   free   read  write   read  write
  -  -  -  -  -  -
backups   5.49T  1.57T415121  3.13M  1.58M
  raidz1  5.49T  1.57T415121  3.13M  1.58M
c0t0d0s1  -  -170 16  2.47M   551K
c0t1d0s1  -  -171 16  2.46M   550K
c0t2d0s1  -  -170 16  2.53M   552K
c0t3d0s1  -  -170 16  2.44M   550K
cache -  -  -  -  -  -
  c1t5d0  63.4G  48.4G 20  0  2.45M  0
  -  -  -  -  -  -

(typical statistic over 1 minute)


I might try the windows solution and reboot the machine to free up
memory and let it fill the cache all over again and see if I get more
cache hits... hmmm...
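
One way to check the l2arc-pointer theory directly rather than by inference: the ARC exposes its header overhead, including the headers that only track L2ARC-resident buffers, as plain kstats (no mdb needed).  A quick sketch, assuming those counters are present on this build:

# How much ARC is spent on buffer headers, and how much of that is for
# buffers that live only on the L2ARC device?
kstat -p zfs::arcstats:hdr_size
kstat -p zfs::arcstats:l2_hdr_size
# If l2_hdr_size is small, the missing memory is not L2ARC pointers.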

 I set the 

Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 So now I'll change meta_max and
 see if it helps...

Oh, know what?  Nevermind.
I just looked at the source, and it seems arc_meta_max is just a gauge for
you to use, so you can know what's the highest arc_meta_used has ever
reached.  So the most useful thing for you to do would be to set this to 0
to reset the counter.  And then you can start watching it over time. 



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-09 Thread Frank Van Damme
Op 09-05-11 14:36, Edward Ned Harvey schreef:
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

 So now I'll change meta_max and
 see if it helps...
 
 Oh, know what?  Nevermind.
 I just looked at the source, and it seems arc_meta_max is just a gauge for
 you to use, so you can know what's the highest arc_meta_used has ever
 reached.  So the most useful thing for you to do would be to set this to 0
 to reset the counter.  And then you can start watching it over time. 

Ok good to know - but that confuses me even more since in my previous
post my arc_meta_used was bigger than my arc_meta_limit (by about 50%)
and now since I doubled _limit, _used only shrank by a couple megs.

I'd really like to find some way to tell this machine CACHE MORE
METADATA, DAMNIT! :-)

-- 
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Frank Van Damme
 
 in my previous
 post my arc_meta_used was bigger than my arc_meta_limit (by about 50%)

I have the same thing.  But as I sit here and run more and more extensive
tests on it ... it seems like arc_meta_limit is sort of a soft limit.  Or it
only checks periodically or something like that.  Because although I
sometimes see size > limit, and I definitely see max > limit ...  When I do
bigger and bigger more intensive stuff, the size never grows much more than
limit.  It always gets knocked back down within a few seconds...



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-09 Thread Frank Van Damme
Op 09-05-11 15:42, Edward Ned Harvey schreef:
  in my previous
  post my arc_meta_used was bigger than my arc_meta_limit (by about 50%)
 I have the same thing.  But as I sit here and run more and more extensive
 tests on it ... it seems like arc_meta_limit is sort of a soft limit.  Or it
 only checks periodically or something like that.  Because although I
 sometimes see size > limit, and I definitely see max > limit ...  When I do
 bigger and bigger more intensive stuff, the size never grows much more than
 limit.  It always gets knocked back down within a few seconds...


I found a script called arc_summary.pl and look what it says.


ARC Size:
 Current Size: 1734 MB (arcsize)
 Target Size (Adaptive):   1387 MB (c)
 Min Size (Hard Limit):637 MB (zfs_arc_min)
 Max Size (Hard Limit):5102 MB (zfs_arc_max)



c =  1512 MB
c_min =   637 MB
c_max =  5102 MB
size  =  1736 MB
...
arc_meta_used =  1735 MB
arc_meta_limit=  2550 MB
arc_meta_max  =  1832 MB

There are a few seconds between running the script and echo ::arc | mdb -k,
but it seems that it just doesn't use more arc than 1734 or so MB, and
that nearly all of it is used for metadata. (I set primarycache=metadata
to my data fs, so I deem it logical). So the goal seems shifted to
trying to enlarge the arc size (what's it doing with the other memory???
I have close to no processes running.)


-- 
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-09 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 BTW, here's how to tune it:
 
 echo arc_meta_limit/Z 0x30000000 | sudo mdb -kw
 
 echo ::arc | sudo mdb -k | grep meta_limit
 arc_meta_limit=   768 MB

Well ... I don't know what to think yet.  I've been reading these numbers
for like an hour, finding interesting things here and there, but nothing to
really solidly point my finger at.

The one thing I know for sure...  The free mem drops at an unnatural rate.
Initially the free mem disappears at a rate approx 2x faster than the sum of
file size and metadata combined.  Meaning the system could be caching the
entire file and all the metadata, and that would only explain half of the
memory disappearance.

I set the arc_meta_limit to 768 as mentioned above.  I ran all these tests,
and here are the results:
(sorry it's extremely verbose)
http://dl.dropbox.com/u/543241/dedup%20tests/runtest-output.xlsx

BTW, here are all the scripts etc that I used to produce those results:
http://dl.dropbox.com/u/543241/dedup%20tests/datagenerate.c
http://dl.dropbox.com/u/543241/dedup%20tests/getmemstats.sh
http://dl.dropbox.com/u/543241/dedup%20tests/runtest.sh




Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Edward Ned Harvey
 From: Erik Trimble [mailto:erik.trim...@oracle.com]
 
 (1) I'm assuming you run your script repeatedly in the same pool,
 without deleting the pool. If that is the case, that means that a run of
 X+1 should dedup completely with the run of X.  E.g. a run with 120,000
 blocks will dedup the first 110,000 blocks with the prior run of 110,000.

I rm the file in between each run.  So if I'm not mistaken, no dedup happens
on consecutive runs based on previous runs.


 (2) can you NOT enable verify ?  Verify *requires* a disk read before
 writing for any potential dedup-able block. 

Every block is unique.  There is never anything to verify because there is
never a checksum match.

Why would I test dedup on non-dedupable data?  You can see it's a test.  In
any pool where you want to enable dedup, you're going to have a number of
dedupable blocks, and a number of non-dedupable blocks.  The memory
requirement is based on number of allocated blocks in the pool.  So I want
to establish an upper and lower bound for dedup performance.  I am running
some tests on entirely duplicate data to see how fast it goes, and also
running the described test on entirely non-duplicate data...  With enough
ram and without enough ram...  As verification that we know how to predict
the lower bound.

So far, I'm failing to predict the lower bound, which is why I've come here
to talk about it.

I've done a bunch of tests with dedup=verify or dedup=sha256.  Results the
same.  But I didn't do that for this particular test.  I'll run with just
sha256 if you would still like me to after what I just said.


 (3) fflush is NOT the same as fsync.  If you're running the script in a
 loop, it's entirely possible that ZFS hasn't completely committed things
 to disk yet, 

Oh.  Well I'll change that - but - I actually sat here and watched the HDD
light, so even though I did that wrong, I can say the hard drive finished
and became idle in between each run.  (I stuck sleep statements in between
each run specifically so I could watch the HDD light.)


  i=0
  while [ $i -lt 80 ];
  do
      j=$[100000 + ( $i * 10000 )]
      ./run_your_script $j
      sync
      sleep 10
      i=$[$i+1]
  done

Oh, yeah.  That's what I did, minus the sync command.  I'll make sure to
include that next time.  And I used time ~/datagenerator

Incidentally, does fsync() and sync return instantly or wait?  Cuz time
sync might produce 0 sec every time even if there were something waiting to
be flushed to disk.
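
One way to take the guesswork out of that, as a rough sketch (the 400000 is only an example size):

# Time the data generation and the flush together, so a sync that returns
# early cannot hide writes that are still pending.
/usr/bin/time sh -c '~/datagenerator 400000; sync'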



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Edward Ned Harvey
 From: Garrett D'Amore [mailto:garr...@nexenta.com]
 
 Just another data point.  The ddt is considered metadata, and by default the
 arc will not allow more than 1/4 of it to be used for metadata.   Are you 
 still
 sure it fits?

That's interesting.  Is it tunable?  That could certainly start to explain why 
my arc size arcstats:c never grew to any size I thought seemed reasonable...  
And in fact it grew larger when I had dedup disabled.  Smaller when dedup was 
enabled.  Weird, I thought.

Seems like a really important factor to mention in this summary.



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Toby Thain
On 08/05/11 10:31 AM, Edward Ned Harvey wrote:
...
 Incidentally, does fsync() and sync return instantly or wait?  Cuz time
 sync might produce 0 sec every time even if there were something waiting to
 be flushed to disk.

The semantics need to be synchronous. Anything else would be a horrible bug.

--Toby

 



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey

 That could certainly start to explain why my
 arc size arcstats:c never grew to any size I thought seemed reasonable...


Also now that I'm looking closer at arcstats, it seems arcstats:size might
be the appropriate measure, not arcstats:c



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Garrett D'Amore
It is tunable, I don't remember the exact tunable name... Arc_metadata_limit or 
some such.

  -- Garrett D'Amore

On May 8, 2011, at 7:37 AM, Edward Ned Harvey 
opensolarisisdeadlongliveopensola...@nedharvey.com wrote:

 From: Garrett D'Amore [mailto:garr...@nexenta.com]
 
 Just another data point.  The ddt is considered metadata, and by default the
 arc will not allow more than 1/4 of it to be used for metadata.   Are you 
 still
 sure it fits?
 
 That's interesting.  Is it tunable?  That could certainly start to explain 
 why my arc size arcstats:c never grew to any size I thought seemed 
 reasonable...  And in fact it grew larger when I had dedup disabled.  Smaller 
 when dedup was enabled.  Weird, I thought.
 
 Seems like a really important factor to mention in this summary.


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Edward Ned Harvey
 From: Garrett D'Amore [mailto:garr...@nexenta.com]
 
 It is tunable, I don't remember the exact tunable name...
Arc_metadata_limit
 or some such.

There it is:
echo ::arc | sudo mdb -k | grep meta_limit
arc_meta_limit=   286 MB

Looking at my chart earlier in this discussion, it seems like this might not
be the cause of the problem.  In my absolute largest test that I ran, my
supposed (calculated) DDT size was 287MB, so this performance abyss was
definitely happening at sizes smaller than the arc_meta_limit.

But I'll go tune and test with this knowledge, just to be sure.



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
 But I'll go tune and test with this knowledge, just to be sure.

BTW, here's how to tune it:

echo arc_meta_limit/Z 0x30000000 | sudo mdb -kw

echo ::arc | sudo mdb -k | grep meta_limit
arc_meta_limit=   768 MB
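
If you want the same tune to survive a reboot, the equivalent /etc/system setting should be the zfs_arc_meta_limit tunable Frank uses elsewhere in this thread (sketch; 0x30000000 is the same 768 MB, adjust to taste):

# append to /etc/system and reboot:
set zfs:zfs_arc_meta_limit=0x30000000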



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-08 Thread Andrew Gabriel

Toby Thain wrote:

On 08/05/11 10:31 AM, Edward Ned Harvey wrote:
  

...
Incidentally, does fsync() and sync return instantly or wait?  Cuz time
sync might produce 0 sec every time even if there were something waiting to
be flushed to disk.



The semantics need to be synchronous. Anything else would be a horrible bug.
  


sync(2) is not required to be synchronous.
I believe that for ZFS it is synchronous, but for most other 
filesystems, it isn't (although a second sync will block until the 
actions resulting from a previous sync have completed).


fsync(3C) is synchronous.

--
Andrew Gabriel


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-07 Thread Edward Ned Harvey
New problem:

I'm following all the advice I summarized into the OP of this thread, and
testing on a test system.  (A laptop).  And it's just not working.  I am
jumping into the dedup performance abyss far, far earlier than predicted...


My test system is a laptop with 1.5G ram, c_min =150M, c_max =1.2G
I have just a single sata 7.2krpm hard drive, no SSD.
Before I start, I have 1G free ram (according to top.)  
According to everything we've been talking about, I expect roughly 1G
divided by 376 bytes = 2855696 (2.8M) blocks in my pool before I start
running out of ram to hold the DDT and performance degrades.

I create a pool.  Enable dedup.  Set recordsize=512
I write a program that will very quickly generate unique non-dedupable data:
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int i;
    int numblocks=atoi(argv[1]);
    // Note: Expect one command-line argument integer.
    FILE *outfile;
    outfile=fopen("junk.file","w");
    for (i=0; i<numblocks ; i++)
        fprintf(outfile,"%512d",i);
    fflush(outfile);
    fclose(outfile);
}

Disable dedup.  Run with a small numblocks.   For example:  time
~/datagenerator 100
Enable dedup and repeat.
They both complete instantly.

Repeat with a higher numblocks...  1000, 10000, 100000...
Repeat until you find the point where performance with dedup is
significantly different from performance without dedup. 
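
In script form, the loop described above looks roughly like this.  It is only a sketch: the dataset name tank/test is a placeholder, and it assumes junk.file lands in the current directory, which is how the generator above is written.

# For each size, time a run with dedup off and a run with dedup=verify,
# removing the file and syncing in between so runs stay independent.
for n in 1000 10000 100000 200000 400000
do
    zfs set dedup=off tank/test
    /usr/bin/time ~/datagenerator $n ; sync
    rm junk.file ; sync
    zfs set dedup=verify tank/test
    /usr/bin/time ~/datagenerator $n ; sync
    rm junk.file ; sync
done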

See below.  Right around 400,000 blocks, dedup is suddenly an order of
magnitude slower than without dedup.

Times to create the file:
numblocks   dedup=off   dedup=verify   DDT size   File size
100000      2.5sec      1.2sec         36 MB      49 MB
110000      1.4sec      1.3sec         39 MB      54 MB
120000      1.4sec      1.5sec         43 MB      59 MB
130000      1.5sec      1.8sec         47 MB      63 MB
140000      1.6sec      1.6sec         50 MB      68 MB
150000      4.8sec      7.0sec         54 MB      73 MB
160000      4.8sec      7.6sec         57 MB      78 MB
170000      2.1sec      2.1sec         61 MB      83 MB
180000      5.2sec      5.6sec         65 MB      88 MB
190000      6.0sec      10.1sec        68 MB      93 MB
200000      4.7sec      2.6sec         72 MB      98 MB
210000      6.8sec      6.7sec         75 MB      103 MB
220000      6.2sec      18.0sec        79 MB      107 MB
230000      6.5sec      16.7sec        82 MB      112 MB
240000      8.8sec      10.4sec        86 MB      117 MB
250000      8.2sec      17.0sec        90 MB      122 MB
260000      8.4sec      17.5sec        93 MB      127 MB
270000      6.8sec      19.2sec        97 MB      132 MB
280000      13.1sec     16.5sec        100 MB     137 MB
290000      9.4sec      73.1sec        104 MB     142 MB
300000      8.5sec      7.7sec         108 MB     146 MB
310000      8.5sec      7.7sec         111 MB     151 MB
320000      8.6sec      11.9sec        115 MB     156 MB
330000      9.3sec      33.5sec        118 MB     161 MB
340000      8.3sec      54.3sec        122 MB     166 MB
350000      8.3sec      50.0sec        126 MB     171 MB
360000      9.3sec      109.0sec       129 MB     176 MB
370000      9.5sec      12.5sec        133 MB     181 MB
380000      10.1sec     28.6sec        136 MB     186 MB
390000      10.2sec     14.6sec        140 MB     190 MB
400000      10.7sec     136.7sec       143 MB     195 MB
410000      11.4sec     116.6sec       147 MB     200 MB
420000      11.5sec     220.9sec       151 MB     205 MB
430000      11.7sec     151.3sec       154 MB     210 MB
440000      12.7sec     144.7sec       158 MB     215 MB
450000      12.0sec     202.1sec       161 MB     220 MB
460000      13.9sec     134.7sec       165 MB     225 MB
470000      12.2sec     127.6sec       169 MB     229 MB
480000      13.1sec     122.7sec       172 MB     234 MB
490000      13.1sec     106.3sec       176 MB     239 MB
500000      15.8sec     174.6sec       179 MB     244 MB
550000      14.2sec     216.6sec       197 MB     269 MB
600000      15.6sec     294.2sec       215 MB     293 MB
650000      16.7sec     332.8sec       233 MB     317 MB
700000      19.0sec     269.6sec       251 MB     342 MB
750000      20.1sec     472.0sec       269 MB

Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-07 Thread Edward Ned Harvey
 See below.  Right around 400,000 blocks, dedup is suddenly an order of
 magnitude slower than without dedup.
 
 400000  10.7sec  136.7sec  143 MB  195 MB
 800000  21.0sec  465.6sec  287 MB  391 MB

The interesting thing is - In all these cases, the complete DDT and the
complete data file itself should fit entirely in ARC comfortably.  So it
makes no sense for performance to be so terrible at this level.

So I need to start figuring out exactly what's going on.  Unfortunately I
don't know how to do that very well.  I'm looking for advice from anyone -
how to poke around and see how much memory is being consumed for what
purposes.  I know how to lookup c_min and c and c_max...  But that didn't do
me much good.  The actual value for c barely changes at all over time...
Even when I rm the file, c does not change immediately.

All the other metrics from kstat ... have less than obvious names ... so I
don't know what to look for...



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-07 Thread Erik Trimble

On 5/7/2011 6:47 AM, Edward Ned Harvey wrote:

See below.  Right around 400,000 blocks, dedup is suddenly an order of
magnitude slower than without dedup.

400000  10.7sec  136.7sec  143 MB  195 MB

800000  21.0sec  465.6sec  287 MB  391 MB

The interesting thing is - In all these cases, the complete DDT and the
complete data file itself should fit entirely in ARC comfortably.  So it
makes no sense for performance to be so terrible at this level.

So I need to start figuring out exactly what's going on.  Unfortunately I
don't know how to do that very well.  I'm looking for advice from anyone -
how to poke around and see how much memory is being consumed for what
purposes.  I know how to lookup c_min and c and c_max...  But that didn't do
me much good.  The actual value for c barely changes at all over time...
Even when I rm the file, c does not change immediately.

All the other metrics from kstat ... have less than obvious names ... so I
don't know what to look for...



Some minor issues that might affect the above:

(1) I'm assuming you run your script repeatedly in the same pool, 
without deleting the pool. If that is the case, that means that a run of 
X+1 should dedup completely with the run of X.  E.g. a run with 120,000
blocks will dedup the first 110,000 blocks with the prior run of 110,000.


(2) can you NOT enable verify ?  Verify *requires* a disk read before 
writing for any potential dedup-able block. If case #1 above applies, 
then by turning on dedup, you *rapidly* increase the amount of disk I/O 
you require on each subsequent run.  E.g. the run of 100,000 requires no
disk I/O due to verify, but the run of 110,000 requires 100,000 I/O
requests, while the run of 120,000 requires 110,000 requests, etc.  This
will skew your results as the ARC buffering of file info changes over time.


(3) fflush is NOT the same as fsync.  If you're running the script in a 
loop, it's entirely possible that ZFS hasn't completely committed things 
to disk yet, which means that you get I/O requests to flush out the ARC 
write buffer in the middle of your runs.   Honestly, I'd do the 
following for benchmarking:


i=0
while [ $i -lt 80 ];
do
    j=$[100000 + ( $i * 10000 )]
    ./run_your_script $j
    sync
    sleep 10
    i=$[$i+1]
done



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Frank Van Damme
Op 06-05-11 05:44, Richard Elling schreef:
 As the size of the data grows, the need to have the whole DDT in RAM or L2ARC
 decreases. With one notable exception, destroying a dataset or snapshot 
 requires
 the DDT entries for the destroyed blocks to be updated. This is why people can
 go for months or years and not see a problem, until they try to destroy a 
 dataset.

So what you are saying is you with your ram-starved system, don't even
try to start using snapshots on that system. Right?

-- 
No part of this copyright message may be reproduced, read or seen,
dead or alive or by any means, including but not limited to telepathy
without the benevolence of the author.


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Casper . Dik

Op 06-05-11 05:44, Richard Elling schreef:
 As the size of the data grows, the need to have the whole DDT in RAM or L2ARC
 decreases. With one notable exception, destroying a dataset or snapshot 
 requires
 the DDT entries for the destroyed blocks to be updated. This is why people 
 can
 go for months or years and not see a problem, until they try to destroy a 
 dataset.

So what you are saying is you with your ram-starved system, don't even
try to start using snapshots on that system. Right?


I think it's more like don't use dedup when you don't have RAM.

(It is not possible to not use snapshots in Solaris; they are used for
everything)

Casper



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Erik Trimble

On 5/6/2011 1:37 AM, casper@oracle.com wrote:

Op 06-05-11 05:44, Richard Elling schreef:

As the size of the data grows, the need to have the whole DDT in RAM or L2ARC
decreases. With one notable exception, destroying a dataset or snapshot requires
the DDT entries for the destroyed blocks to be updated. This is why people can
go for months or years and not see a problem, until they try to destroy a 
dataset.

So what you are saying is you with your ram-starved system, don't even
try to start using snapshots on that system. Right?


I think it's more like don't use dedup when you don't have RAM.

(It is not possible to not use snapshots in Solaris; they are used for
everything)

Casper

Casper and Richard are correct - RAM starvation seriously impacts 
snapshot or dataset deletion when a pool has dedup enabled.  The reason 
behind this is that ZFS needs to scan the entire DDT to check to see if 
it can actually delete each block in the to-be-deleted snapshot/dataset, 
or if it just needs to update the dedup reference count. If it can't 
store the entire DDT in either the ARC or L2ARC, it will be forced to do 
considerable I/O to disk, as it brings in the appropriate DDT entry.   
Worst case for insufficient ARC/L2ARC space can increase deletion times 
by many orders of magnitude. E.g. days, weeks, or even months to do a 
deletion.



If dedup isn't enabled, snapshot and data deletion is very light on RAM 
requirements, and generally won't need to do much (if any) disk I/O.  
Such deletion should take milliseconds to a minute or so.




--

Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Tomas Ögren
On 06 May, 2011 - Erik Trimble sent me these 1,8K bytes:

 If dedup isn't enabled, snapshot and data deletion is very light on RAM  
 requirements, and generally won't need to do much (if any) disk I/O.   
 Such deletion should take milliseconds to a minute or so.

.. or hours. We've had problems on an old raidz2 that a recursive
snapshot creation over ~800 filesystems could take quite some time, up
until the sata-scsi disk box ate the pool. Now we're using raid10 on a
scsi box, and it takes 3-15 minutes or so, during which sync writes (NFS)
are almost unusable. Using 2 fast usb sticks as l2arc, waiting for a
Vertex2EX and a Vertex3 to arrive for ZIL/L2ARC testing. IO to the
filesystems is quite low (50 writes, 500k data per sec average), but
snapshot times goes waay up during backups.

/Tomas
-- 
Tomas Ögren, st...@acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se


Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Edward Ned Harvey
 From: Richard Elling [mailto:richard.ell...@gmail.com]
 
  --- To calculate size of DDT ---
   zdb -S poolname
Look at total blocks allocated.  It is rounded, and uses a suffix like K,
M, G but it's in decimal (powers of 10) notation, so you have to remember
that...  So I prefer the zdb -D method below, but this works too.  Total
blocks allocated * mem requirement per DDT entry, and you have the mem
requirement to hold whole DDT in ram.


   zdb -DD poolname
This just gives you the -S output, and the -D output all in one go.  So I
recommend using -DD, and base your calculations on #duplicate and #unique,
as mentioned below.  Consider the histogram to be informational.

   zdb -D poolname
It gives you a number of duplicate, and a number of unique blocks.  Add them
to get the total number of blocks.  Multiply by the mem requirement per DDT
entry, and you have the mem requirement to hold the whole DDT in ram.
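
Turning those block counts into bytes is then just multiplication.  A quick sketch using the per-entry sizes from the original summary (re-measure them with ::sizeof on your own build; ksh/bash arithmetic):

# Rough DDT memory math from the zdb -D block counts.
DDT_ENTRY=376      # bytes, ::sizeof ddt_entry_t on the test system
ARC_HDR=176        # bytes, ::sizeof arc_buf_hdr_t on the test system
BLOCKS=19707611    # duplicate + unique blocks reported by zdb -D
echo "DDT held in ARC:                 $((BLOCKS * DDT_ENTRY / 1048576)) MB"
echo "ARC headers if DDT is in L2ARC:  $((BLOCKS * ARC_HDR / 1048576)) MB"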



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Edward Ned Harvey
 From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
 boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
 
  zdb -DD poolname
 This just gives you the -S output, and the -D output all in one go.  So I

Sorry, zdb -DD only works for pools that are already dedup'd.
If you want to get a measurement for a pool that is not already dedup'd, you
have to use -S



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Yaverot
One of the quoted participants is Richard Elling, the other is Edward Ned 
Harvey, but my quoting was screwed up enough that I don't know which is which.  
Apologies.

 zdb -DD poolname
 This just gives you the -S output, and the -D output all in one go.  So I

Sorry, zdb -DD only works for pools that are already dedup'd.
If you want to get a measurement for a pool that is not already dedup'd, you 
have to use -S

And since zdb -S runs for 2 hours and dumps core (without results), the correct 
answer remains:
zdb -bb poolname | grep 'bp count'
as was given in the summary.

The theoretical output of zdb -S may be superior if you have a version that 
works, but I haven't seen anyone mention onlist which version(s) it is, or 
if/how it can be obtained; short of recompiling it yourself.



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Richard Elling
On May 6, 2011, at 3:24 AM, Erik Trimble erik.trim...@oracle.com wrote:

 On 5/6/2011 1:37 AM, casper@oracle.com wrote:
 Op 06-05-11 05:44, Richard Elling schreef:
 As the size of the data grows, the need to have the whole DDT in RAM or 
 L2ARC
 decreases. With one notable exception, destroying a dataset or snapshot 
 requires
 the DDT entries for the destroyed blocks to be updated. This is why people 
 can
 go for months or years and not see a problem, until they try to destroy a 
 dataset.
 So what you are saying is you with your ram-starved system, don't even
 try to start using snapshots on that system. Right?
 
 I think it's more like don't use dedup when you don't have RAM.
 
 (It is not possible to not use snapshots in Solaris; they are used for
 everything)

:-)

 
 Casper
 
 Casper and Richard are correct - RAM starvation seriously impacts snapshot or 
 dataset deletion when a pool has dedup enabled.  The reason behind this is 
 that ZFS needs to scan the entire DDT to check to see if it can actually 
 delete each block in the to-be-deleted snapshot/dataset, or if it just needs 
 to update the dedup reference count.

AIUI, the issue is not that the DDT is scanned; it is an AVL tree for a reason. 
The issue is that each reference update means that one, small bit of data is 
changed. If the reference is not already in ARC, then a small, probably random 
read is needed. If you have a typical consumer disk, especially a green disk, 
and have not tuned zfs_vdev_max_pending, then that itty bitty read can easily 
take more than 100 milliseconds(!) Consider that you can have thousands or 
millions of reference updates to do during a zfs destroy, and the math gets 
ugly. This is why fast SSDs make good dedup candidates.
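
For reference, that queue-depth tunable can be inspected and lowered on a live kernel with mdb, in the same style as the other mdb -kw tunes in this thread (sketch; the value 10 is only an example):

# Check, then lower, the per-vdev I/O queue depth Richard mentions.
echo zfs_vdev_max_pending/D | mdb -k
echo zfs_vdev_max_pending/W0t10 | mdb -kw
# or persistently, in /etc/system:  set zfs:zfs_vdev_max_pending=10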

 If it can't store the entire DDT in either the ARC or L2ARC, it will be 
 forced to do considerable I/O to disk, as it brings in the appropriate DDT 
 entry.   Worst case for insufficient ARC/L2ARC space can increase deletion 
 times by many orders of magnitude. E.g. days, weeks, or even months to do a 
 deletion.

I've never seen months, but I have seen days, especially for low-perf disks.

 
 If dedup isn't enabled, snapshot and data deletion is very light on RAM 
 requirements, and generally won't need to do much (if any) disk I/O.  Such 
 deletion should take milliseconds to a minute or so.

Yes, perhaps a bit longer for recursive destruction, but everyone here knows 
recursion is evil, right? :-)
 -- richard



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-06 Thread Erik Trimble

On 5/6/2011 5:46 PM, Richard Elling wrote:

On May 6, 2011, at 3:24 AM, Erik Trimble erik.trim...@oracle.com wrote:


Casper and Richard are correct - RAM starvation seriously impacts snapshot or 
dataset deletion when a pool has dedup enabled.  The reason behind this is that 
ZFS needs to scan the entire DDT to check to see if it can actually delete each 
block in the to-be-deleted snapshot/dataset, or if it just needs to update the 
dedup reference count.

AIUI, the issue is not that the DDT is scanned; it is an AVL tree for a reason. The issue 
is that each reference update means that one, small bit of data is changed. If the 
reference is not already in ARC, then a small, probably random read is needed. If you 
have a typical consumer disk, especially a green disk, and have not tuned 
zfs_vdev_max_pending, then that itty bitty read can easily take more than 100 
milliseconds(!) Consider that you can have thousands or millions of reference updates to 
do during a zfs destroy, and the math gets ugly. This is why fast SSDs make good dedup 
candidates.

Just out of curiosity - I'm assuming that a delete works like this:

(1) find list of blocks associated with file to be deleted
(2) using the DDT, find out if any other files are using those blocks
(3) delete/update any metadata associated with the file (dirents, 
ACLs, etc.)

(4) for each block in the file
(4a) if the DDT indicates there ARE other files using this 
block, update the DDT entry to change the refcount
(4b) if the DDT indicates there AREN'T any other files, move 
the physical block to the free list, and delete the DDT entry



In a bulk delete scenario (not just snapshot deletion), I'd presume #1 
above almost always causes a Random I/O request to disk, as all the 
relevant metadata for every (to be deleted) file is unlikely to be 
stored in ARC.  If you can't fit the DDT in ARC/L2ARC, #2 above would 
require you to pull in the remainder of the DDT info from disk, right?  
#3 and #4 can be batched up, so they don't hurt that much.


Is that a (roughly) correct deletion methodology? Or can someone give a 
more accurate view of what's actually going on?





If it can't store the entire DDT in either the ARC or L2ARC, it will be forced 
to do considerable I/O to disk, as it brings in the appropriate DDT entry.   
Worst case for insufficient ARC/L2ARC space can increase deletion times by many 
orders of magnitude. E.g. days, weeks, or even months to do a deletion.

I've never seen months, but I have seen days, especially for low-perf disks.
I've seen an estimate of 5 weeks for removing a snapshot on a 1TB dedup 
pool made up of 1 disk.


Not an optimal set up.

:-)


If dedup isn't enabled, snapshot and data deletion is very light on RAM 
requirements, and generally won't need to do much (if any) disk I/O.  Such 
deletion should take milliseconds to a minute or so.

Yes, perhaps a bit longer for recursive destruction, but everyone here knows 
recursion is evil, right? :-)
  -- richard
You, my friend, have obviously never worshipped at the Temple of the 
Lambda Calculus, nor been exposed to the Holy Writ that is Structure and 
Interpretation of Computer Programs 
(http://mitpress.mit.edu/sicp/full-text/book/book.html).


I sentence you to a semester of 6.001 problem sets, written by Prof 
Sussman sometime in the 1980s.


(yes, I went to MIT.)

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Edward Ned Harvey
 From: Erik Trimble [mailto:erik.trim...@oracle.com]
 
 Using the standard c_max value of 80%, remember that this is 80% of the
 TOTAL system RAM, including that RAM normally dedicated to other
 purposes.  So long as the total amount of RAM you expect to dedicate to
 ARC usage (for all ZFS uses, not just dedup) is less than 4 times that
 of all other RAM consumption, you don't need to overprovision.

Correct, usually you don't need to overprovision for the sake of ensuring
enough ram available for OS and processes.  But you do need to overprovision
25% if you want to increase the size of your usable ARC without reducing the
amount of ARC you currently have in the system being used to cache other
files etc.


 Any
 entry that is migrated back from L2ARC into ARC is considered stale
 data in the L2ARC, and thus, is no longer tracked in the ARC's reference
 table for L2ARC.

Good news.  I didn't know that.  I thought the L2ARC was still valid, even
if something was pulled back into ARC.

So there are two useful models:
(a) The upper bound:  The whole DDT is in ARC, and the whole L2ARC is filled
with average-size blocks.
or
(b) The lower bound:  The whole DDT is in L2ARC, and all the rest of the
L2ARC is filled with average-size blocks.  ARC requirements are based only
on L2ARC references.

The actual usage will be something between (a) and (b)...  And the actual is
probably closer to (b)

In my test system:
(a)  (upper bound)
On my test system I guess the OS and processes consume 1G.  (I'm making that
up without any reason.)
On my test system I guess I need 8G in the system to get reasonable
performance without dedup or L2ARC.  (Again, I'm just making that up.)
I need 7G for DDT and 
I have 748982 average-size blocks in L2ARC, which means 131820832 bytes =
125M or 0.1G for L2ARC
I really just need to plan for 7.1G ARC usage
Multiply by 5/4 and it means I need 8.875G system ram
My system needs to be built with at least 8G + 8.875G = 16.875G.

(b)  (lower bound)
On my test system I guess the OS and processes consume 1G.  (I'm making that
up without any reason.)
On my test system I guess I need 8G in the system to get reasonable
performance without dedup or L2ARC.  (Again, I'm just making that up.)
I need 0G for DDT  (because it's in L2ARC) and 
I need 3.4G ARC to hold all the L2ARC references, including the DDT in L2ARC
So I really just need to plan for 3.4G ARC for my L2ARC references.
Multiply by 5/4 and it means I need 4.25G system ram
My system needs to be built with at least 8G + 4.25G = 12.25G.
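
As arithmetic, the two bounds work out like this (sketch; the inputs are just the guesses above):

# Upper bound: whole DDT in ARC.  Lower bound: DDT pushed out to L2ARC,
# so ARC only holds the L2ARC reference headers.
awk 'BEGIN {
    base = 8;                  # GB of ARC wanted anyway, without dedup/L2ARC
    ddt = 7; l2ref_hi = 0.1;   # upper-bound ARC needs
    l2ref_lo = 3.4;            # lower-bound ARC need
    printf "upper bound: %.3f GB\n", base + 5.0/4 * (ddt + l2ref_hi);
    printf "lower bound: %.3f GB\n", base + 5.0/4 * l2ref_lo;
}'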

Thank you for your input, Erik.  Previously I would have only been
comfortable with 24G in this system, because I was calculating a need for
significantly higher than 16G.  But now, what we're calling the upper bound
is just *slightly* higher than 16G, while the lower bound and most likely
actual figure is significantly lower than 16G.  So in this system, I would
be comfortable running with 16G.  But I would be even more comfortable
running with 24G.   ;-)



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Karl Wagner
so there's an ARC entry referencing each individual DDT entry in the L2ARC?! I 
had made the assumption that DDT entries would be grouped into at least minimum 
block-sized groups (8k?), which would have led to a much more reasonable ARC 
requirement.

seems like a bad design to me, which leads to dedup only being usable by those 
prepared to spend a LOT of dosh... which may as well go into more storage (I 
know there are other benefits too, but that's my opinion)
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.




Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Edward Ned Harvey
 From: Karl Wagner [mailto:k...@mouse-hole.com]
 
 so there's an ARC entry referencing each individual DDT entry in the L2ARC?!
 I had made the assumption that DDT entries would be grouped into at least
 minimum block sized groups (8k?), which would have lead to a much more
 reasonable ARC requirement.
 
 seems like a bad design to me, which leads to dedup only being usable by
 those prepared to spend a LOT of dosh... which may as well go into more
 storage (I know there are other benefits too, but that's my opinion)

The whole point of the DDT is that it needs to be structured, and really fast 
searchable.  So no, you're not going to consolidate it into an unstructured 
memory block as you said.  You pay the memory consumption price for the sake of 
performance.  Yes it consumes a lot of ram, but don't call it a bad design.  
It's just a different design than what you expected, because what you expected 
would hurt performance while consuming less ram.

And we're not talking crazy dollars here.  So your emphasis on a LOT of dosh 
seems exaggerated.  I just spec'd out a system where upgrading from 12 to 24G 
of ram to enable dedup effectively doubled the storage capacity of the system, 
and that upgrade cost the same as one of the disks.  (This is a 12-disk 
system.)   So it was actually a 6x cost reducer, at least.  It all depends on 
how much mileage you get out of the dedup.  Your mileage may vary.



Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-05 Thread Richard Elling
On May 4, 2011, at 7:56 PM, Edward Ned Harvey wrote:

 This is a summary of a much longer discussion Dedup and L2ARC memory
 requirements (again)
 Sorry even this summary is long.  But the results vary enormously based on
 individual usage, so any rule of thumb metric that has been bouncing
 around on the internet is simply not sufficient.  You need to go into this
 level of detail to get an estimate that's worth the napkin or bathroom
 tissue it's scribbled on.
 
 This is how to (reasonably) accurately estimate the hypothetical ram
 requirements to hold the complete data deduplication tables (DDT) and L2ARC
 references in ram.  Please note both the DDT and L2ARC references can be
 evicted from memory according to system policy, whenever the system decides
 some other data is more valuable to keep.  So following this guide does not
 guarantee that the whole DDT will remain in ARC or L2ARC.  But it's a good
 start.

As the size of the data grows, the need to have the whole DDT in RAM or L2ARC
decreases. With one notable exception, destroying a dataset or snapshot requires
the DDT entries for the destroyed blocks to be updated. This is why people can
go for months or years and not see a problem, until they try to destroy a 
dataset.

 
 I am using a solaris 11 express x86 test system for my example numbers
 below.  
 
 --- To calculate size of DDT ---
 
 Each entry in the DDT is a fixed size, which varies by platform.  You can
 find it with the command:
   echo ::sizeof ddt_entry_t | mdb -k
 This will return a hex value, that you probably want to convert to decimal.
 On my test system, it is 0x178 which is 376 bytes
 
 There is one DDT entry per non-dedup'd (unique) block in the zpool.

The workloads which are nicely dedupable tend to not have unique blocks.
So this is another way of saying, if your workload isn't dedupable, don't 
bother
with deduplication. For years now we have been trying to convey this message.
One way to help convey the message is...

  Be
 aware that you cannot reliably estimate #blocks by counting #files.  You can
 find the number of total blocks including dedup'd blocks in your pool with
 this command:
   zdb -bb poolname | grep 'bp count'

Ugh. A better method is to simulate dedup on existing data:
zdb -S poolname
or measure dedup efficacy
zdb -DD poolname
which offer similar tabular analysis

 Note:  This command will run a long time and is IO intensive.  On my systems
 where a scrub runs for 8-9 hours, this zdb command ran for about 90 minutes.
 On my test system, the result is 44145049 (44.1M) total blocks.
 
 To estimate the number of non-dedup'd (unique) blocks (assuming average size
 of dedup'd blocks = average size of blocks in the whole pool), use:
   zpool list
 Find the dedup ratio.  In my test system, it is 2.24x.  Divide the total
 blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

Or just count the unique and non-unique blocks with:
zdb -D poolname

 
 In my test system:
   44145049 total blocks / 2.24 dedup ratio = 19707611 (19.7M) approx
 non-dedup'd (unique) blocks
 
 Then multiply by the size of a DDT entry.
   19707611 * 376 = 7410061796 bytes = 7G total DDT size

A minor gripe about zdb -D output is that it doesn't do the math.

 
 --- To calculate size of ARC/L2ARC references ---
 
 Each reference to a L2ARC entry requires an entry in ARC (ram).  This is
 another fixed size, which varies by platform.  You can find it with the
 command:
   echo ::sizeof arc_buf_hdr_t | mdb -k
 On my test system, it is 0xb0 which is 176 bytes

Better yet, without need for mdb privilege, measure the current L2ARC header
size in use. Normal user accounts can:
kstat -p zfs::arcstats:hdr_size
kstat -p zfs::arcstats:l2_hdr_size

arcstat will allow you to easily track this over time.

 
 We need to know the average block size in the pool, to estimate the number
 of blocks that will fit into L2ARC.  Find the amount of space ALLOC in the
 pool:
   zpool list
 Divide by the number of non-dedup'd (unique) blocks in the pool, to find the
 average block size.  In my test system:
   790G / 19707611 = 42K average block size
 
 Remember:  If your L2ARC were only caching average size blocks, then the
 payload ratio of L2ARC vs ARC would be excellent.  In my test system, every
 42K L2ARC would require 176bytes ARC (a ratio of 244x).  This would result
 in a negligible ARC memory consumption.  But since your DDT can be pushed
 out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC memory
 consumption.  In my test system every 376bytes DDT entry in L2ARC consumes
 176bytes ARC (a ratio of 2.1x).  Yes, it is approximately possible to have
 the complete DDT present in ARC and L2ARC, thus consuming tons of ram.

This is a good thing for those cases when you need to quickly reference large
numbers of DDT entries.

 
 Remember disk mfgrs use base-10.  So my 32G SSD 

Re: [zfs-discuss] Summary: Dedup and L2ARC memory requirements

2011-05-04 Thread Erik Trimble

Good summary, Ned.  A couple of minor corrections.

On 5/4/2011 7:56 PM, Edward Ned Harvey wrote:

This is a summary of a much longer discussion Dedup and L2ARC memory
requirements (again)
Sorry even this summary is long.  But the results vary enormously based on
individual usage, so any rule of thumb metric that has been bouncing
around on the internet is simply not sufficient.  You need to go into this
level of detail to get an estimate that's worth the napkin or bathroom
tissue it's scribbled on.

This is how to (reasonably) accurately estimate the hypothetical ram
requirements to hold the complete data deduplication tables (DDT) and L2ARC
references in ram.  Please note both the DDT and L2ARC references can be
evicted from memory according to system policy, whenever the system decides
some other data is more valuable to keep.  So following this guide does not
guarantee that the whole DDT will remain in ARC or L2ARC.  But it's a good
start.

I am using a solaris 11 express x86 test system for my example numbers
below.

--- To calculate size of DDT ---

Each entry in the DDT is a fixed size, which varies by platform.  You can
find it with the command:
echo ::sizeof ddt_entry_t | mdb -k
This will return a hex value, that you probably want to convert to decimal.
On my test system, it is 0x178 which is 376 bytes

There is one DDT entry per non-dedup'd (unique) block in the zpool.  Be
aware that you cannot reliably estimate #blocks by counting #files.  You can
find the number of total blocks including dedup'd blocks in your pool with
this command:
zdb -bb poolname | grep 'bp count'
Note:  This command will run a long time and is IO intensive.  On my systems
where a scrub runs for 8-9 hours, this zdb command ran for about 90 minutes.
On my test system, the result is 44145049 (44.1M) total blocks.

To estimate the number of non-dedup'd (unique) blocks (assuming average size
of dedup'd blocks = average size of blocks in the whole pool), use:
zpool list
Find the dedup ratio.  In my test system, it is 2.24x.  Divide the total
blocks by the dedup ratio to find the number of non-dedup'd (unique) blocks.

In my test system:
44145049 total blocks / 2.24 dedup ratio = 19707611 (19.7M) approx
non-dedup'd (unique) blocks

Then multiply by the size of a DDT entry.
19707611 * 376 = 7410061796 bytes = 7G total DDT size
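
The three measurements above can be gathered in one place; a sketch only, with the pool name tank as a placeholder (and zdb -bb is just as slow run this way):

# Inputs for the DDT size estimate.
echo ::sizeof ddt_entry_t | mdb -k     # bytes per DDT entry (hex)
zdb -bb tank | grep 'bp count'         # total blocks, including dedup'd ones
zpool get dedupratio tank              # divide total blocks by this
# unique blocks = bp count / dedup ratio; DDT bytes = unique blocks * entry size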

--- To calculate size of ARC/L2ARC references ---

Each reference to a L2ARC entry requires an entry in ARC (ram).  This is
another fixed size, which varies by platform.  You can find it with the
command:
echo ::sizeof arc_buf_hdr_t | mdb -k
On my test system, it is 0xb0 which is 176 bytes

We need to know the average block size in the pool, to estimate the number
of blocks that will fit into L2ARC.  Find the amount of space ALLOC in the
pool:
zpool list
Divide by the number of non-dedup'd (unique) blocks in the pool, to find the
average block size.  In my test system:
790G / 19707611 = 42K average block size

Remember:  If your L2ARC were only caching average size blocks, then the
payload ratio of L2ARC vs ARC would be excellent.  In my test system, every
42K L2ARC would require 176bytes ARC (a ratio of 244x).  This would result
in a negligible ARC memory consumption.  But since your DDT can be pushed
out of ARC into L2ARC, you get a really bad ratio of L2ARC vs ARC memory
consumption.  In my test system every 376bytes DDT entry in L2ARC consumes
176bytes ARC (a ratio of 2.1x).  Yes, it is approximately possible to have
the complete DDT present in ARC and L2ARC, thus consuming tons of ram.

Remember disk mfgrs use base-10.  So my 32G SSD is only 30G base-2.
(32,000,000,000 / 1024/1024/1024)

So I have 30G L2ARC, and the first 7G may be consumed by DDT.  This leaves
23G remaining to be used for average-sized blocks.
The ARC consumed to reference the DDT in L2ARC is 176/376 * DDT size. In my
test system this is 176/376 * 7G = 3.3G

Take the remaining size of your L2ARC, divide by average block size, to get
the number of average size blocks the L2ARC can hold.  In my test system:
23G / 42K = 574220 average-size blocks in L2ARC
Multiply by the ARC size of a L2ARC reference.  On my test system:
574220 * 176 = 101062753 bytes = 96MB ARC consumed to reference the
average-size blocks in L2ARC

So the total ARC consumption to hold L2ARC references in my test system is
3.3G + 96M ~= 3.4G
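
The same calculation as a few lines of arithmetic, using the numbers above (sketch; ksh/bash, sizes in bytes, pool-specific values are this example's, not yours):

# ARC overhead for L2ARC references: DDT entries first, then data blocks.
HDR=176                                    # ::sizeof arc_buf_hdr_t
DDT=7410061796                             # DDT size from the previous section
L2SIZE=$((30 * 1024 * 1024 * 1024))        # usable L2ARC (30 GiB)
AVGBLK=$((42 * 1024))                      # 42K average block size
REST=$((L2SIZE - 7 * 1024 * 1024 * 1024))  # L2ARC left after the DDT
echo "ARC for DDT entries in L2ARC:  $((DDT / 376 * HDR / 1048576)) MB"
echo "ARC for remaining blocks:      $((REST / AVGBLK * HDR / 1048576)) MB"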

--- To calculate total ram needed ---

And finally - The max size the ARC is allowed to grow, is a constant that
varies by platform.  On my system, it is 80% of system ram.  You can find
this value using the command:
kstat -p zfs::arcstats:c_max
Divide by your total system memory to find the ratio.
Assuming the ratio is 4/5, it means you need to buy 5/4 the amount of
calculated ram to satisfy all your requirements.
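
To find that ratio on a given box (sketch; ksh/bash):

# c_max versus physical memory gives the factor to gross up by.
CMAX=$(kstat -p zfs::arcstats:c_max | awk '{print $2}')
PHYS=$(prtconf | awk '/^Memory size/ {print $3 * 1048576}')
echo "$CMAX $PHYS" | awk '{printf "c_max is %.0f%% of RAM; buy RAM = ARC need * %.2f\n", 100*$1/$2, $2/$1}'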

Using the standard c_max value of 80%, remember that this is 80% of the 
TOTAL