Richard Elling via illumos-discuss wrote:
>> - the data vdev (c1t5000CCA05C68D505d0) takes fewer bigger writes
>>   (~116KiB average block size) totaling 100.29MiB
>> - clearly, it isn't doubling the data written to the slog, though
>>   checksums create huge overhead on small writes
>> - the slog is much less efficient than the data vdev because of the
>>   smaller IO, no IO aggregation, and extra checksums
> Normally, there is aggregation, but since you don't see it, either the
> aggregation doesn't affect your load or your load is not sufficient to
> cause aggregation. In particular, we don't expect a single-threaded dd
> workload to benefit from ZIL aggregation.
With 32 writers, fio showed that aggregation definitely does happen once
the workload gains some parallelism.
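
For anyone reading along later: the actual test was fio, but the shape of
the load is roughly what a sketch like the one below would generate. It is
illustrative only; the path, thread count, and write counts are made up
rather than taken from our test. (Build with -lpthread and point it at a
scratch dataset.)

/*
 * Illustrative only -- the real test used fio. A minimal sketch of the
 * kind of parallel synchronous-write load that gives the ZIL something
 * to aggregate: N threads each issuing O_DSYNC writes to their own file
 * on the pool. The path and counts below are hypothetical.
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	NTHREADS	32		/* like the 32 fio writers above */
#define	BLOCKSIZE	(64 * 1024)	/* 64KiB writes, as in the test */
#define	NWRITES		1024		/* writes per thread (arbitrary) */

static void *
writer(void *arg)
{
	char path[64];
	char *buf;
	int fd, i;

	/* hypothetical dataset mountpoint */
	(void) snprintf(path, sizeof (path), "/pool/synctest/w%ld",
	    (long)(uintptr_t)arg);
	if ((buf = malloc(BLOCKSIZE)) == NULL)
		return (NULL);
	(void) memset(buf, 0xab, BLOCKSIZE);

	/* O_DSYNC makes every write synchronous, so each one goes through
	 * the ZIL (and thus the slog) before the write returns. */
	if ((fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644)) == -1) {
		perror("open");
		free(buf);
		return (NULL);
	}
	for (i = 0; i < NWRITES; i++)
		(void) pwrite(fd, buf, BLOCKSIZE, (off_t)i * BLOCKSIZE);
	(void) close(fd);
	free(buf);
	return (NULL);
}

int
main(void)
{
	pthread_t tids[NTHREADS];
	long t;

	/* Many writers in flight at once is what lets ZIL commits batch
	 * multiple records together; a single-threaded dd never does. */
	for (t = 0; t < NTHREADS; t++)
		(void) pthread_create(&tids[t], NULL, writer, (void *)t);
	for (t = 0; t < NTHREADS; t++)
		(void) pthread_join(tids[t], NULL);
	return (0);
}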
>> 1. Why is 200MiB allocated in the slog for 100MiB of data? Shouldn't
>> it be bounded to data + checksums?
> No. The space is pre-allocated so we don't have to wait on aggregation
> in the critical path. So the aggregation size is a guess. These guesses
> are divided into zil_block_buckets. By default, for most illumos-based
> distros, there is a 36KB bucket for (32KB + 4KB) which, in theory, fits
> NFS workloads (though it really doesn't). The next biggest bucket is max
> size, or 132KB (again, for most distros). So for your 64KB blocks, we
> expect ZIL allocations to be 132KB, unless we know more about the
> distro.
In light of your comment and a brief read of the zil.c code, I think I
understand this better now. Thank you.
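
To make sure the math works out the way I think it does, here is a
simplified model of the bucket behavior as I understand it. It is not the
actual zil.c code (the link Richard gives below has the real thing), and
the bucket sizes are just the ones Richard describes above (36KB and a
~132KB max), which vary by distro.

#include <stdint.h>
#include <stdio.h>

/* Approximate bucket sizes from the discussion above, not from source. */
static const uint64_t zil_buckets_model[] = {
	36 * 1024,	/* 32KB + 4KB, the NFS-sized bucket */
	132 * 1024,	/* max log block size on most distros */
};

/* Pick the smallest bucket that fits a record of 'need' bytes. */
static uint64_t
zil_alloc_size_model(uint64_t need)
{
	for (size_t i = 0;
	    i < sizeof (zil_buckets_model) / sizeof (zil_buckets_model[0]);
	    i++) {
		if (need <= zil_buckets_model[i])
			return (zil_buckets_model[i]);
	}
	return (zil_buckets_model[1]);	/* clamp to the max bucket */
}

int
main(void)
{
	/*
	 * 100MiB written as 64KiB records is 1600 records. Each 64KiB
	 * record overflows the 36KB bucket, so each one gets a 132KB log
	 * block: 1600 * 132KiB ~= 206MiB, which matches the ~200MiB of
	 * slog allocations I asked about for ~100MiB of data.
	 */
	uint64_t records = (100ULL * 1024 * 1024) / (64 * 1024);
	uint64_t alloc = records * zil_alloc_size_model(64 * 1024);

	(void) printf("%llu records -> %.1f MiB allocated in the slog\n",
	    (unsigned long long)records, (double)alloc / (1024.0 * 1024.0));
	return (0);
}

Running that prints about 206MiB, which lines up with the roughly 2x
allocation I was asking about.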
>> 2. Can/should we change the 64KiB max block size for the slog to
>> better use high bandwidth slog devices?
> If you'd like to experiment, you can change the zil_block_buckets array.
> This can be done on a live system using mdb for illumos-based distros.
> See the code around:
> http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zil.c#897
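
As I read it, the experiment amounts to substituting different sizes into
that table, whether by patching the live values with mdb as Richard
describes or by rebuilding from source. Just to make the idea concrete, a
hypothetical variant might look like the sketch below; the 68KB entry
sized to our 64KiB records is my own invention, not something from zil.c
or from Richard's suggestion.

#include <stdint.h>

/*
 * Not the illumos source -- a hypothetical sketch of the kind of bucket
 * table one might experiment with. The 68KB (64KB + 4KB) entry is made
 * up to match our 64KiB records; the other values follow Richard's
 * description above and vary by distro.
 */
uint64_t zil_block_buckets_experiment[] = {
	36 * 1024,	/* 32KB + 4KB: the stock NFS-sized bucket */
	68 * 1024,	/* hypothetical: 64KB + 4KB, sized to our records */
	132 * 1024,	/* the max-size bucket on most distros */
};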
With more writer threads via fio, I was able to get ~91% of the
manufacturer's throughput spec and ~79% of the manufacturer's IOPS spec
out of the slog devices. The remaining gap can probably be chalked up to
sync commands, checksums, and related latency.
>> Finally, I've read repeatedly that slog devices are always queue depth
>> 1 and I understand why. That said, with two slogs and two synchronous
>> writers to the pool, do we get qd=1 + qd=1 from slogs and writers
>> operating in parallel? With N slogs and N synchronous writers, do we
>> get N*(qd1) for total capability? Is there an upper bound to that
>> scaling mode (presuming it works that way) other than CPU time?
> AFAIK, there is no fixed upper bound, but there might be a practical
> limit lurking. I'm not aware of anyone trying to identify such limits.
With the additional writer threads via fio, I saw that aggregate 4KiB
random write IOPS to the slog devices increased only about 6% when adding
a second slog. However, with 1MiB random writes, there was a 91% increase
in throughput when adding a second slog, which exceeds my expectations.
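
For my own notes, one way to rationalize that difference is that 4KiB
commits are dominated by fixed per-commit latency, while 1MiB commits are
dominated by data transfer, so the extra device bandwidth mostly helps the
latter. The back-of-envelope sketch below uses made-up device numbers, not
our hardware's specs, and is a rough model rather than a measurement.

#include <stdio.h>

int
main(void)
{
	/* Hypothetical per-device numbers, purely for illustration. */
	double write_bw_mib_s = 1000.0;	/* streaming write bandwidth */
	double commit_lat_us = 60.0;	/* fixed per-commit cost (sync/flush) */

	/* 1MiB records: transfer time dominates, so throughput is
	 * bandwidth-bound and a second slog can nearly double it. */
	double xfer_1m_us = (1.0 / write_bw_mib_s) * 1e6;
	double per_op_1m_us = commit_lat_us + xfer_1m_us;
	(void) printf("1MiB: %.0f%% of each commit is data transfer\n",
	    100.0 * xfer_1m_us / per_op_1m_us);

	/* 4KiB records: the fixed commit cost dominates, so the extra
	 * bandwidth from a second slog barely moves the needle. */
	double xfer_4k_us = ((4.0 / 1024.0) / write_bw_mib_s) * 1e6;
	double per_op_4k_us = commit_lat_us + xfer_4k_us;
	(void) printf("4KiB: %.0f%% of each commit is data transfer\n",
	    100.0 * xfer_4k_us / per_op_4k_us);
	return (0);
}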
We'll probably scale by adding another storage box long before we
encounter a need for a third slog device, so anything beyond two slogs
was more a point of curiosity than of practical use.
The moral of the story is that slogs scale well under high-concurrency
loads and exhibit some odd performance characteristics under
low-concurrency loads.
The allocations are still a bit hinky under low-concurrency loads, but
understanding that the allocation size is just a guess, and that it
probably gets better under high-concurrency loads, I'm not too worried
about it. We'll just account for that in how much slog space we allocate.
Overall, I'm going to chalk this up in the win column.
Thanks for helping me to understand what I was seeing.
Sincerely,
Andrew Kinney