[the dog jumped on the keyboard and wiped out my first reply, second attempt...]

On Apr 27, 2011, at 9:26 PM, Edward Ned Harvey wrote:

>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Neil Perrin
>> 
>> No, that's not true. The DDT is just like any other ZFS metadata and can be
>> split over the ARC,
>> cache device (L2ARC) and the main pool devices. An infrequently referenced
>> DDT block will get
>> evicted from the ARC to the L2ARC then evicted from the L2ARC.
> 
> When somebody has their "baseline" system, and they're thinking about adding
> dedup and/or cache, I'd like to understand the effect of not having enough
> ram.  Obviously the impact will be performance, but precisely...

"precisely" only works when you know precisely what your data looks like.
For most folks, that is unknown in advance.

slow disks + small RAM = bad recipe for dedup

> At bootup, I presume the arc & l2arc are all empty.  So all the DDT entries
> reside in pool.  

Yes

> As the system reads things (anything, files etc) from pool,
> it will populate arc, and follow fill rate policies to populate the l2arc
> over time.  Every entry in l2arc requires 200 bytes of arc, regardless of
> what type of entry it is.  (A DDT entry in l2arc consumes just as much arc
> memory as any other type of l2arc entry.)

Approximately 200 bytes; this is subject to change.

>  (Ummm...  What's the point of
> that?  Aren't DDT entries 270 bytes and ARC references 200 bytes?  

DDT entries vary in size. More references means more bytes needed.
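If you already have deduped data in a pool, zdb will show you the real numbers rather than a guess: entry counts, on-disk and in-core sizes, and a refcount histogram ('tank' is just a placeholder pool name):

# zdb -D tank
# zdb -DD tank

The -DD form adds the histogram; beware that walking a large DDT can take a while.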

> Seems
> like a very questionable benefit to allow DDT entries to get evicted into
> L2ARC.)  So the ram consumption caused by the presence of l2arc will
> initially be zero after bootup, and it will grow over time as the l2arc
> populates, up to a maximum which is determined linearly as 200 bytes * the
> number of entries that can fit in the l2arc.  Of course that number varies
> based on the size of each entry and size of l2arc, but at least you can
> estimate and establish upper and lower bounds.

Yes, this is simple enough to toss into a spreadsheet.
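A rough sketch, assuming a 100 GB cache device and an average 64 KB record size (both numbers are made up, plug in your own):

# echo $(( (100 * 1024*1024*1024 / (64*1024)) * 200 / 1024 / 1024 ))
312

That's roughly 312 MB of ARC consumed just by L2ARC headers; halve the record size and it doubles.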

> So that's how the l2arc consumes system memory in arc.  The penalty of
> insufficient ram, in conjunction with enabled L2ARC, is insufficient arc
> availability for other purposes - Maybe the whole arc is consumed by l2arc
> entries, and so the arc doesn't have any room for other stuff like commonly
> used files.  

I've never witnessed such a condition and doubt that it would happen.
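The ARC keeps a counter of exactly how much memory the L2ARC headers are using, so you can check rather than guess (assuming your build exports this kstat):

# kstat -p zfs::arcstats:l2_hdr_size

Compare that against arcstats:size and decide for yourself whether the headers are crowding anything out.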

> Worse yet, your arc consumption could be so large, that
> PROCESSES don't fit in ram anymore.  In this case, your processes get pushed
> out to swap space, which is really bad.

This will not happen. The ARC will be asked to shrink when other memory
consumers demand memory. The lower bound of the ARC size is c_min:

# kstat -p zfs::arcstats:c_min
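For context, you can compare it against the current size and targets:

# kstat -p zfs::arcstats:size
# kstat -p zfs::arcstats:c
# kstat -p zfs::arcstats:c_max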

> Correct me if I'm wrong, but the dedup sha256 checksum happens in addition
> to (not instead of) the fletcher2 integrity checksum.  

You are wrong, as others have pointed out. When dedup is enabled, sha256
becomes the block checksum for that dataset; it replaces the default
checksum rather than running alongside it. Documented in the zfs(1M) man page.
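For example, to see what a dataset is actually using, or to switch on the paranoid
variant that does a byte-for-byte compare on checksum matches ('tank/fs' is a placeholder):

# zfs get checksum,dedup tank/fs
# zfs set dedup=verify tank/fs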

> So after bootup,
> while the system is reading a bunch of data from the pool, all those reads
> are not populating the arc/l2arc with DDT entries.  Reads are just
> populating the arc and l2arc with other stuff.

L2ARC is populated by a thread that watches the soon-to-be-evicted list.
If the flow through the ARC is much greater than the throttle of the L2ARC
feed thread, then the data just won't make it into the L2ARC. The throttle
is boosted while the ARC is still warming up, so the L2ARC can fill faster,
but it backs off once the ARC is warm and other consumers need the bandwidth.
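You can watch the feed thread at work through the same arcstats (the throttle
itself is the l2arc_write_max/l2arc_write_boost pair of tunables, if memory serves):

# kstat -p zfs::arcstats:l2_feeds
# kstat -p zfs::arcstats:l2_write_bytes
# kstat -p zfs::arcstats:l2_size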

> DDT entries don't get into the arc/l2arc until something tries to do a
> write.  When performing a write, dedup calculates the checksum of the block
> to be written, and then it needs to figure out if that's a duplicate of
> another block that's already on disk somewhere.  So (I guess this part)
> there's probably a tree-structure

AVL trees

> (I'll use the subdirectories and files
> analogy even though I'm certain that's not technically correct) on disk.
> You need to find the DDT entry, if it exists, for the block whose checksum
> is 1234ABCD.  So you start by looking under the 1 directory, and from there
> look for the 2 subdirectory, and then the 3 subdirectory, [...etc...] If you
> encounter "not found" at any step, then the DDT entry doesn't already exist
> and you decide to create a new one.  But if you get all the way down to the
> C subdirectory and it contains a file named "D,"  then you have found a
> possible dedup hit - the checksum matched another block that's already on
> disk.  Now the DDT entry is stored in ARC just like anything else you read
> from disk.

http://en.wikipedia.org/wiki/AVL_tree

> So the point is - Whenever you do a write, and the calculated DDT is not
> already in ARC/L2ARC, the system will actually perform several small reads
> looking for the DDT entry before it finally knows that the DDT entry
> actually exists.  So the penalty of performing a write, with dedup enabled,
> and the relevant DDT entry not already in ARC/L2ARC is a very large penalty.

"very" is a relative term, but also keep in mind that these writes are async.
Writes to the ZIL are not deduped (obviously). So if it takes a few extra 
milliseconds
to get everything flushed in the txg it is not a big deal.
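If you want to see that behavior, watch the pool while a dedup-heavy workload
writes; the deduped data goes out in bursts at each txg sync rather than
per-write ('tank' is a placeholder):

# zpool iostat -v tank 1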

> What originated as a single write quickly became several small reads plus a
> write, due to the fact the necessary DDT entry was not already available.
> 
> The penalty of insufficient ram, in conjunction with dedup, is terrible
> write performance.

The biggest penalty is for slow disks. People with small RAM machines and big,
slow disks are expecting Santa Claus and the Tooth Fairy to fulfill their dedup
desires.
 -- richard

