On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:
> There are a lot of conflicting references on the Internet, so I'd
> really like to solicit actual experts (ZFS developers or people who
> have physical evidence) to weigh in on this...
> After searching around, the reference that seemed most useful was
> Erik's post here:
> http://opensolaris.org/jive/thread.jspa?threadID=131296
> Unfortunately it looks like there's an arithmetic error (1TB of 4k
> blocks means 268 million blocks, not 1 billion). Also, IMHO it seems
> important to make the distinction that #files != #blocks. Due to the
> existence of larger files, there will sometimes be more than one block
> per file; and if I'm not mistaken, thanks to write aggregation, there
> will sometimes be more than one file per block. YMMV. Average block
> size could be anywhere between 1 byte and 128k, assuming the default
> recordsize. (BTW, recordsize seems to be a zfs property, not a zpool
> property. So how can you know or configure the block size for
> something like a zvol iSCSI target?)
I said 2^30, but it's actually 2^28, which is roughly a quarter
billion. I should have been more exact. And yes, the file != block
distinction is important to note.
zvols have an analogous property, volblocksize (rather than
recordsize). And zvols tend to be sticklers about all blocks being
/exactly/ that size, unlike filesystems, which use recordsize as a
*maximum* block size.
Min block size is 512 bytes.
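Since volblocksize is fixed at creation time, you pick it up front; for
example (the pool and zvol names here are just placeholders):

    # create a 100GB zvol whose blocks are all exactly 8k
    zfs create -V 100G -o volblocksize=8k tank/myvol

    # check it afterwards (it's read-only once the zvol exists)
    zfs get volblocksize tank/myvol

That's also the answer for the iSCSI question above: the ZFS block size
backing a zvol iSCSI target is whatever volblocksize the zvol was
created with.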
> (BTW, is there any way to get a measurement of number of blocks
> consumed per zpool? Per vdev? Per zfs filesystem?) The calculations
> below are based on assumption of 4KB blocks adding up to a known total
> data consumption. The actual thing that matters is the number of
> blocks consumed, so the conclusions drawn will vary enormously when
> people actually have average block sizes != 4KB.
You need to use zdb to see the current block usage for a filesystem.
I'd have to look up the exact CLI usage, as I don't know it off the top
of my head.
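From memory, it's something along these lines ('tank' is a placeholder
pool name; double-check the zdb man page before relying on the exact
flags):

    # per-pool block statistics (counts and sizes of allocated blocks)
    zdb -b tank

    # dedup statistics plus the DDT histogram, if dedup is enabled
    zdb -DD tank

    # simulate the DDT (and the dedup ratio) for a non-deduped pool
    zdb -S tank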
> And one more comment: based on what's below, it seems that the DDT
> gets stored on the cache device and also in RAM. Is that correct?
> What if you didn't have a cache device? Shouldn't it *always* be in
> RAM? And doesn't the cache device get wiped every time you reboot?
> It seems to me like putting the DDT on the cache device would be
> harmful... Is that really how it is?
Nope. The DDT is stored in only one place: the cache device if present,
/or/ RAM otherwise (technically, the ARC, but that's in RAM). If a
cache device is present, the DDT is stored there, BUT RAM must also
hold a basic lookup table for the DDT (yeah, I know, a lookup table for
a lookup table).
My minor corrections here: the rule of thumb is 270 bytes per DDT
entry, plus 200 bytes of ARC for every L2ARC entry. Since the DDT is
stored on the cache device, the DDT itself doesn't consume any ARC
space beyond those 200-byte per-entry references.
E.g.: I have 1TB of 4k blocks that are to be deduped, and it turns out
that I have about a 5:1 dedup ratio. I'd also like to see how much ARC
space I eat up by using a 160GB L2ARC to store my DDT. (A shell sketch
of the full arithmetic follows step 3 below.)
(1) How many entries are there in the DDT?
1TB of 4k blocks means there are 268 million blocks. However, at a
5:1 dedup ratio, I'm only actually storing 20% of that, so I have about
54 million unique blocks. Thus, I need a DDT of about 270 bytes * 54
million =~ 14GB in size.
(2) How much ARC space does this DDT take up?
The 54 million entries in my DDT take up about 200 bytes * 54
million =~ 10GB of ARC space, so I need 10GB of RAM dedicated just to
storing the references to the DDT entries held in the L2ARC.
(3) How much space do I have left on the L2ARC device, and how many
blocks can that hold?
Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the
device, which, assuming I'm still using 4k blocks, means I can cache
about 37 million 4k blocks, or roughly two-thirds of my unique
(deduped) blocks. This extra cache of blocks in the L2ARC would eat up
another 200 bytes * 37 million =~ 7.5GB of ARC space.
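For the record, here's the whole calculation as a bash sketch. The
inputs are just the assumptions from this example (binary TB for the
data, decimal GB for the cache device), so the results wiggle a little
against the rounded figures above:

    #!/bin/bash
    # assumptions from the worked example above
    logical=$(( 1 << 40 ))           # 1TB of logical (pre-dedup) data
    bs=4096                          # 4k average block size
    ratio=5                          # 5:1 dedup ratio
    l2arc=$(( 160 * 10**9 ))         # 160GB cache device

    blocks=$(( logical / bs ))       # 268435456 -> ~268 million blocks
    unique=$(( blocks / ratio ))     # ~54 million unique blocks stored

    ddt=$(( unique * 270 ))          # DDT size on the L2ARC: ~14GB
    arc_ddt=$(( unique * 200 ))      # ARC references to the DDT: ~10GB

    left=$(( l2arc - ddt ))          # ~146GB of L2ARC left for data
    cacheable=$(( left / bs ))       # ~35-37 million 4k blocks, depending
                                     # on GB-vs-GiB rounding
    arc_data=$(( cacheable * 200 ))  # ~7GB more ARC for those references

    echo "DDT: $(( ddt / 10**9 ))GB"
    echo "ARC for DDT refs: $(( arc_ddt / 10**9 ))GB"
    echo "ARC for remaining L2ARC refs: $(( arc_data / 10**9 ))GB"

This prints 14GB, 10GB, and 7GB respectively.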
Thus, for the aforementioned dedup scenario, I'd better spec the
machine with (whatever base RAM the OS, ordinary ZFS caching, and the
applications require) plus at least a 14GB L2ARC device for the DDT,
plus 10GB more of RAM for the DDT's L2ARC references, plus 1GB of RAM
for every 20GB of additional space in the L2ARC cache beyond that used
by the DDT.
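One-line sanity check on that last ratio (4k blocks, decimal GB):

    # 20GB of extra L2ARC / 4k per block * 200 bytes of ARC per entry
    echo $(( 20 * 10**9 / 4096 * 200 ))   # 976562400 bytes, i.e. ~1GB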
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss