On Nov 16, 2014, at 3:41 PM, Andrew Kinney <[email protected]> wrote:

> Richard Elling via illumos-discuss wrote:
>> On Nov 14, 2014, at 5:44 PM, Andrew Kinney <[email protected]> wrote:
>>
>>> Richard Elling via illumos-discuss wrote:
>>>>> Is there a known reason why I'm seeing double writes to the slog? Am I
>>>>> alone, or are others also seeing the same data amplification for sync
>>>>> writes with a slog?
>>>> You are seeing allocations, which are not the same thing as writes.
>>>> zpool iostat is not the best tool for understanding performance, for
>>>> this and other reasons. What do you measure at the device itself?
>>>
>>> Fair enough. What would be the best way to measure the quantity of data
>>> actually written to the device?
>>
>> Most commonly used is:
>>   iostat -x
>> or
>>   iostat -xn
>
> test command:
> dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.346287 s, 303 MB/s
>
> For the interval in which the test was done (no other activity on the
> pool), "iostat -Inx 30" shows:
>
>     r/i      w/i   kr/i      kw/i  wait  actv  wsvc_t  asvc_t  %w  %b  device
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  rpool
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c4t0d0
>     0.0   1600.0    0.0  108800.0   0.0   0.1     0.0     1.0   0   1  c5t0d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c5t1d0
>     0.0    884.0    0.0  102702.0   0.0   0.0     0.0     0.8   0   2  c1t5000CCA05C68D505d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA05C681EB9d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA03B1007E5d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA03B10C085d0
>     0.0   2480.0    0.0  211502.0   4.7   0.1    56.8     0.9   2   3  testpool
>
> Note that this is the total for the interval, not per second.
> ---------------------------------------------------------
>
> test command:
> dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.2314 s, 453 MB/s
>
> After adding a second identical slog (c5t1d0, not mirrored) and repeating
> the test, "iostat -Inx 30" shows:
>
>     r/i      w/i   kr/i      kw/i  wait  actv  wsvc_t  asvc_t  %w  %b  device
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  rpool
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c4t0d0
>     0.0    800.0    0.0   54400.0   0.0   0.0     0.0     0.5   0   0  c5t0d0
>     0.0    800.0    0.0   54400.0   0.0   0.0     0.0     0.5   0   0  c5t1d0
>     0.0    896.0    0.0  102736.5   0.0   0.0     0.0     0.9   0   3  c1t5000CCA05C68D505d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA05C681EB9d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA03B1007E5d0
>     0.0      0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA03B10C085d0
>     0.0   2492.0    0.0  211536.5   4.5   0.1    53.8     0.7   2   3  testpool
>
> ---------------------------------------------------------
>
> test command:
> dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=4K oflag=sync count=25600
> 25600+0 records in
> 25600+0 records out
> 104857600 bytes (105 MB) copied, 2.64324 s, 39.7 MB/s
>
> In the degenerate case of 4KiB sync writes, we do get twice the data
> written to the slogs, but only because of checksums:
>
>     r/i       w/i   kr/i      kw/i  wait  actv  wsvc_t  asvc_t  %w  %b  device
>     0.0       0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  rpool
>     0.0       0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c4t0d0
>     0.0   12800.0    0.0  102400.0   0.0   0.0     0.0     0.0   0   1  c5t0d0
>     0.0   12800.0    0.0  102400.0   0.0   0.0     0.0     0.0   0   1  c5t1d0
>     0.0     942.0    0.0  103493.0   0.0   0.0     0.0     0.7   0   2  c1t5000CCA05C68D505d0
>     0.0       0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA05C681EB9d0
>     0.0       0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA03B1007E5d0
>     0.0       0.0    0.0       0.0   0.0   0.0     0.0     0.0   0   0  c1t5000CCA03B10C085d0
>     0.0   26538.0    0.0  308293.0   4.3   0.1     4.9     0.1   2   6  testpool
>
> ---------------------------------------------------------
>
> Some interesting data here:
>
> - for the original test, the log device (c5t0d0) takes 1600 68KiB writes
>   (64KiB data, 4KiB checksum?) totaling 106.25MiB

A slog write includes one or more blocks and a log chain block that
contains, amongst other things, the checksum of the data. After a crash, we
know that we've hit the end of the chain when the checksum doesn't match.
This extra data is 4KB because the smallest amount of data written to a ZIL
is 4KB.
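To make that arithmetic concrete, here is a minimal back-of-the-envelope
model, not illumos code: it treats every log block as "payload plus a 4KB
chain/checksum block" as described above. The 64KiB log-block size assumed
for the bs=1M test is inferred from the 1600 writes in the iostat output;
all names and constants are illustrative.

/*
 * Model each ZIL block as payload + 4KB chain/checksum block and compare
 * the result with the iostat numbers above.  Sketch only, not illumos code.
 */
#include <stdio.h>

#define KIB	1024ULL
#define CHAIN	(4 * KIB)	/* per-block chain/checksum overhead */

static void
model(const char *name, unsigned long long blksz, unsigned long long nblocks)
{
	unsigned long long payload = blksz * nblocks;
	unsigned long long written = (blksz + CHAIN) * nblocks;

	printf("%-6s %5llu blocks of %3lluKiB: %6.2f MiB payload, "
	    "%6.2f MiB written to slog(s)\n", name, nblocks, blksz / KIB,
	    (double)payload / (KIB * KIB), (double)written / (KIB * KIB));
}

int
main(void)
{
	model("bs=1M", 64 * KIB, 1600);		/* 100.00 MiB -> 106.25 MiB */
	model("bs=4K", 4 * KIB, 25600);		/* 100.00 MiB -> 200.00 MiB */
	return (0);
}

The 106.25MiB matches the kw/i for c5t0d0 in the first test, and the 200MiB
matches the combined kw/i of the two slogs in the 4KiB test.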
> - the data vdev (c1t5000CCA05C68D505d0) takes fewer, bigger writes
>   (~116KiB average block size) totaling 100.29MiB
> - clearly, it isn't doubling the data written to the slog, though
>   checksums create huge overhead on small writes
> - the slog is much less efficient than the data vdev because of the
>   smaller IO, no IO aggregation, and extra checksums

Normally there is aggregation, but since you don't see it, either
aggregation doesn't affect your load or your load is not sufficient to
cause aggregation. In particular, we don't expect a single-threaded dd
workload to benefit from ZIL aggregation.

> While this does assuage my initial concern about writing the data to the
> slog twice, it does raise a couple of questions:
>
> 1. Why is 200MiB allocated in the slog for 100MiB of data? Shouldn't it
>    be bounded to data + checksums?

No. The space is pre-allocated so we don't have to wait on aggregation in
the critical path, so the aggregation size is a guess. These guesses are
divided into zil_block_buckets. By default, for most illumos-based distros,
there is a 36KB bucket (32KB + 4KB) which, in theory, fits NFS workloads
(though it really doesn't). The next biggest bucket is the maximum size, or
132KB (again, for most distros). So for your 64KB blocks, we expect ZIL
allocations to be 132KB, unless we know more about the distro.
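A rough sketch of that bucket logic, to show why a 64KB sync write lands in
the 132KB bucket. This is not the zil.c source: the 36KB and 132KB values
come from the paragraph above, while the 4KB low bucket and the function
names are made up for illustration; the real array lives in zil.c (link in
the next answer) and its exact values vary by distro.

/* Sketch of the bucket-based ZIL allocation guess; not illumos source. */
#include <stdio.h>
#include <stdint.h>

#define ZIL_CHAIN	4096		/* log chain/checksum block */

static const uint64_t size_buckets[] = {
	4096,			/* small records (assumed) */
	32 * 1024 + 4096,	/* 36KB: 32KB + chain, aimed at NFS */
	128 * 1024 + 4096,	/* 132KB: maximum log block size */
};
#define NBUCKETS	(sizeof (size_buckets) / sizeof (size_buckets[0]))

/* Return the pre-allocated log block size for nbytes of pending ZIL data. */
static uint64_t
alloc_guess(uint64_t nbytes)
{
	for (size_t i = 0; i < NBUCKETS; i++) {
		if (nbytes + ZIL_CHAIN <= size_buckets[i])
			return (size_buckets[i]);
	}
	return (size_buckets[NBUCKETS - 1]);	/* clamp to the maximum */
}

int
main(void)
{
	/* 64KB + 4KB chain overshoots the 36KB bucket -> 132KB allocation. */
	printf("64KB record -> %llu bytes allocated\n",
	    (unsigned long long)alloc_guess(64 * 1024));
	/* 1600 such allocations ~= 206MiB, i.e. the ~200MiB zpool iostat showed. */
	return (0);
}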
> 2. Can/should we change the 64KiB max block size for the slog to better
>    use high-bandwidth slog devices?

If you'd like to experiment, you can change the zil_block_buckets array.
This can be done on a live system using mdb for illumos-based distros. See
the code around:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zil.c#897

> We expect most of our synchronous writes will be under 64KiB, but these
> particular devices really hit their stride at 1MiB blocks, so it would be
> nice if we could make the maximum IO size to the slog 1MiB. For the rare
> synchronous IO above 64KiB, it would improve performance and efficiency.
>
> Finally, I've read repeatedly that slog devices are always queue depth 1,
> and I understand why. That said, with two slogs and two synchronous
> writers to the pool, do we get qd=1 + qd=1 from the slogs and writers
> operating in parallel? With N slogs and N synchronous writers, do we get
> N*(qd=1) of total capability? Is there an upper bound to that scaling mode
> (presuming it works that way) other than CPU time?

AFAIK, there is no fixed upper bound, but there might be a practical limit
lurking. I'm not aware of anyone trying to identify such limits.
 -- richard

> Sincerely,
> Andrew Kinney
