Richard Elling via illumos-discuss wrote:
On Nov 14, 2014, at 5:44 PM, Andrew Kinney <[email protected]> wrote:

Is there a known reason why I'm seeing double writes to the slog? Am I alone or 
are others also seeing the same data amplification for sync writes with a slog?
You are seeing allocations, not the same thing as writes. zpool iostat is not the best tool
for understanding performance, for this and other reasons. What do you measure at the
device itself?

Fair enough. What would be the best way to measure the quantity of data 
actually written to the device?

Most commonly used is:
        iostat -x
or
        iostat -xn
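
For a device-level cross-check alongside iostat, the DTrace io provider should also be able to sum the bytes actually issued to each device during a test window; a minimal sketch (aggregating by dev_statname):

        # sum bytes issued to each device until Ctrl-C
        dtrace -n 'io:::start { @bytes[args[1]->dev_statname] = sum(args[0]->b_bcount); }'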


test command:
dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.346287 s, 303 MB/s

For the interval in which the test was done (no other activity on the pool), "iostat -Inx 30" shows:

    r/i    w/i   kr/i   kw/i wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0 1600.0    0.0 108800.0  0.0  0.1    0.0    1.0   0   1 c5t0d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c5t1d0
    0.0  884.0    0.0 102702.0  0.0  0.0    0.0    0.8   0   2 c1t5000CCA05C68D505d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA05C681EB9d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B1007E5d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B10C085d0
    0.0 2480.0    0.0 211502.0  4.7  0.1   56.8    0.9   2   3 testpool

Note that this is total for the interval, not per second.
---------------------------------------------------------

test command:
dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=1M oflag=sync count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 0.2314 s, 453 MB/s

After adding a second identical slog (c5t1d0, not mirrored) and repeating the test, "iostat -Inx 30" shows:

    r/i    w/i   kr/i   kw/i wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0  800.0    0.0 54400.0  0.0  0.0    0.0    0.5   0   0 c5t0d0
    0.0  800.0    0.0 54400.0  0.0  0.0    0.0    0.5   0   0 c5t1d0
    0.0  896.0    0.0 102736.5  0.0  0.0    0.0    0.9   0   3 c1t5000CCA05C68D505d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA05C681EB9d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B1007E5d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B10C085d0
    0.0 2492.0    0.0 211536.5  4.5  0.1   53.8    0.7   2   3 testpool

---------------------------------------------------------

test command
dd if=/testpool/randomfile.deleteme of=/testpool/newrandom.deleteme bs=4K oflag=sync count=25600
25600+0 records in
25600+0 records out
104857600 bytes (105 MB) copied, 2.64324 s, 39.7 MB/s

In the degenerate case of 4KiB sync writes, we do get twice the data written to the slogs, but apparently only because of per-record overhead (checksums/headers and padding, as far as I can tell):

    r/i    w/i   kr/i   kw/i wait actv wsvc_t asvc_t  %w  %b device
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 rpool
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c4t0d0
    0.0 12800.0    0.0 102400.0  0.0  0.0    0.0    0.0   0   1 c5t0d0
    0.0 12800.0    0.0 102400.0  0.0  0.0    0.0    0.0   0   1 c5t1d0
    0.0  942.0    0.0 103493.0  0.0  0.0    0.0    0.7   0   2 c1t5000CCA05C68D505d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA05C681EB9d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B1007E5d0
    0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0 c1t5000CCA03B10C085d0
    0.0 26538.0    0.0 308293.0  4.3  0.1    4.9    0.1   2   6 testpool
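
Working the numbers from that run (my reading of the iostat output above):

        102400 KiB / 12800 writes = 8 KiB written per 4 KiB record
        2 slogs x 102400 KiB = 204800 KiB ~= 200 MiB written for 100 MiB of data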

---------------------------------------------------------

Some interesting data here:

- for the original test, the log device (c5t0d0) takes 1600 68KiB writes (64KiB data, 4KiB checksum?) totaling 106.25MiB
- the data vdev (c1t5000CCA05C68D505d0) takes fewer, bigger writes (~116KiB average block size) totaling 100.29MiB
- clearly, it isn't doubling the data written to the slog, though checksums create huge overhead on small writes
- the slog is much less efficient than the data vdev because of the smaller IO, no IO aggregation, and extra checksums
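
For reference, the arithmetic behind those figures:

        slog c5t0d0:   108800 KiB / 1600 writes  = 68 KiB per write   (108800 KiB ~= 106.25 MiB)
        data vdev:     102702 KiB /  884 writes ~= 116 KiB per write  (102702 KiB ~= 100.29 MiB)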


While this does assuage my initial concern about writing the data to the slog twice, it does raise a couple questions:

1. Why is 200MiB allocated in the slog for 100MiB of data? Shouldn't it be bounded to data + checksums?
2. Can/should we change the 64KiB max block size for the slog to better use high-bandwidth slog devices?


We expect most of our synchronous writes will be under 64KiB, but these particular devices really hit their stride at 1MiB blocks, so it would be nice if we could raise the maximum IO size to the slog to 1MiB. For the rare synchronous IO above 64KiB, that would improve both performance and efficiency.
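
If the governing limit turns out to be a ZFS tunable (I'm assuming something like zfs_immediate_write_sz, though I haven't confirmed it is the knob that caps the log write size rather than the indirect-write threshold), inspecting and bumping it would presumably be the usual illumos routine:

        # read the current value (64-bit decimal); assumes the symbol exists in the running kernel
        echo 'zfs_immediate_write_sz/E' | mdb -k

        # a persistent change would go in /etc/system, e.g. (hypothetical 1MiB value):
        #   set zfs:zfs_immediate_write_sz = 0x100000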

Finally, I've read repeatedly that slog devices are always queue depth 1, and I understand why. That said, with two slogs and two synchronous writers to the pool, do we get qd=1 + qd=1, with the slogs and writers operating in parallel? With N slogs and N synchronous writers, do we get N*(qd=1) of total capability? Is there an upper bound to that scaling (presuming it works that way) other than CPU time?
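
If nobody knows offhand, I can try to measure it: run multiple sync writers in parallel and watch per-slog actv/throughput from another terminal. A rough sketch (output filenames are placeholders):

        # two concurrent sync writers against the pool
        dd if=/testpool/randomfile.deleteme of=/testpool/new1.deleteme bs=1M oflag=sync count=100 &
        dd if=/testpool/randomfile.deleteme of=/testpool/new2.deleteme bs=1M oflag=sync count=100 &
        wait
        # meanwhile, in another shell: iostat -xn 1 | egrep 'c5t0d0|c5t1d0'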

Sincerely,
Andrew Kinney



