On 4/25/2011 8:20 AM, Edward Ned Harvey wrote:

There are a lot of conflicting references on the Internet, so I'd really like to solicit actual experts (ZFS developers or people who have physical evidence) to weigh in on this...

After searching around, the reference I found to be the most seemingly useful was Erik's post here:

http://opensolaris.org/jive/thread.jspa?threadID=131296

Unfortunately it looks like there's an arithmetic error (1TB of 4k blocks means 268 million blocks, not 1 billion). Also, IMHO it seems important to make the distinction that #files != #blocks. Due to the existence of larger files, there will sometimes be more than one block per file; and if I'm not mistaken, thanks to write aggregation, there will sometimes be more than one file per block. YMMV. Average block size could be anywhere between 1 byte and 128k assuming default recordsize. (BTW, recordsize seems to be a zfs property, not a zpool property. So how can you know or configure the blocksize for something like a zvol iscsi target?)

I said 2^30; the correct number is 2^28, which is roughly a quarter billion. I should have been more exact. And the file != block distinction is important to note.

zvols have the analogous volblocksize property. And zvols tend to be sticklers about all blocks being /exactly/ that size, unlike filesystems, which treat recordsize as a *maximum* block size.

Min block size is 512 bytes.
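
To answer the blocksize question above: filesystems expose recordsize, and zvols expose volblocksize, which is fixed at creation time. A minimal sketch (pool and dataset names are placeholders; check the zfs man page for your release):

    zfs get recordsize tank/myfs                        # per-filesystem *maximum* block size
    zfs create -V 100G -o volblocksize=4k tank/myvol    # fix the zvol block size at creation
    zfs get volblocksize tank/myvol                     # verify; volblocksize can't be changed later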


(BTW, is there any way to get a measurement of the number of blocks consumed per zpool? Per vdev? Per zfs filesystem?) The calculations below are based on the assumption of 4KB blocks adding up to a known total data consumption. What actually matters is the number of blocks consumed, so the conclusions drawn will vary enormously when people's actual average block size is != 4KB.


You need to use zdb to see what the current block usage is for a filesystem. I'd have to look up the exact CLI usage, as I don't know it off the top of my head.
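
From memory, the relevant invocations are something like the following ('tank' is a placeholder pool name; double-check against the zdb man page, and note the output format changes between releases):

    zdb -b tank      # traverse the pool, report block counts and sizes
    zdb -bb tank     # same, broken down per object type
    zdb -DD tank     # DDT statistics/histogram for a pool with dedup enabled
    zdb -S tank      # simulate dedup on a non-deduped pool to estimate the DDT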

And one more comment: Based on what's below, it seems that the DDT gets stored on the cache device and also in RAM. Is that correct? What if you didn't have a cache device? Shouldn't it *always* be in RAM? And doesn't the cache device get wiped every time you reboot? It seems to me like putting the DDT on the cache device would be harmful... Is that really how it is?

Nope. The DDT is stored in only one place: the cache device if present, /or/ RAM otherwise (technically, the ARC, but that lives in RAM). If a cache device is present, the DDT is stored there, BUT RAM must also hold a basic lookup table for the DDT (yes, I know, a lookup table for a lookup table).


My minor corrections here:

The rule-of-thumb is 270 bytes/DDT entry, and 200 bytes of ARC for every L2ARC entry, since the DDT is stored on the cache device.

The DDT itself doesn't consume any ARC space if it is stored on an L2ARC cache device.

E.g.: I have 1TB of 4k blocks to be deduped, and it turns out that I get about a 5:1 dedup ratio. I'd also like to see how much ARC I eat up by using a 160GB L2ARC to store my DDT on.

(1) How many entries are there in the DDT?

1TB of 4k blocks means there are 268 million blocks. However, at a 5:1 dedup ratio, I'm only actually storing 20% of that, so I have about 54 million unique blocks. Thus, I need a DDT of about 270 bytes * 54 million =~ 14GB in size.
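
As a quick sanity check on that arithmetic, a shell sketch under the same assumptions (4k blocks, 5:1 dedup, 270 bytes per DDT entry; bash syntax):

    TOTAL_BLOCKS=$(( (1024 ** 4) / 4096 ))   # 1TB of 4k blocks = 268,435,456 blocks
    UNIQUE_BLOCKS=$(( TOTAL_BLOCKS / 5 ))    # 5:1 dedup -> ~54 million unique blocks
    DDT_BYTES=$(( UNIQUE_BLOCKS * 270 ))     # ~14.5 billion bytes, i.e. ~14GB of DDT
    echo "$TOTAL_BLOCKS total, $UNIQUE_BLOCKS unique, $DDT_BYTES bytes of DDT"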

(2) How much ARC space does this DDT take up?
The 54 million entries in my DDT take up about 200 bytes * 54 million =~ 10G of ARC space, so I need 10G of RAM dedicated just to storing the references to the DDT in the L2ARC.
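
The same sketch continued for the ARC side (200 bytes of ARC per L2ARC entry):

    UNIQUE_BLOCKS=$(( (1024 ** 4) / 4096 / 5 ))   # ~54 million unique blocks, as above
    DDT_ARC_BYTES=$(( UNIQUE_BLOCKS * 200 ))      # ~10.7 billion bytes, i.e. ~10G of ARC
    echo "$DDT_ARC_BYTES bytes of ARC just to reference the DDT in the L2ARC"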


(3) How much space do I have left on the L2ARC device, and how many blocks can that hold? Well, I have 160GB - 14GB (DDT) = 146GB of cache space left on the device, which, assuming I'm still using 4k blocks, means I can cache about 37 million 4k blocks, or about 66% of my total data. This extra cache of blocks in the L2ARC would eat up 200 bytes * 37 million =~ 7.5GB of ARC entries.
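
And the leftover-L2ARC part of the sketch (this treats GB as 2^30, which is why the results land a little above the 37 million / 7.5GB figures in the text):

    L2ARC_LEFT=$(( (160 - 14) * 1024 ** 3 ))      # ~146GB of L2ARC left after the DDT
    CACHEABLE=$(( L2ARC_LEFT / 4096 ))            # ~38 million 4k blocks can be cached
    EXTRA_ARC=$(( CACHEABLE * 200 ))              # ~7.6GB of ARC for those L2ARC headers
    echo "$CACHEABLE cacheable blocks, $EXTRA_ARC bytes of extra ARC"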

Thus, for the dedup scenario above, I'd better spec it with (on top of whatever base RAM the OS, ordinary ZFS caching, and applications need) at least a 14G L2ARC device for the DDT, plus 10G of RAM for the DDT's L2ARC references, plus 1GB of RAM for every 20GB of additional L2ARC space beyond that used by the DDT (200 bytes of ARC per 4KB block works out to roughly 1GB per 20GB of L2ARC).



--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
