-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/151/#review434
-----------------------------------------------------------

Ship it!


Ship It!

- Richard Elling


On Dec. 30, 2014, 11 p.m., Matthew Ahrens wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/151/
> -----------------------------------------------------------
> 
> (Updated Dec. 30, 2014, 11 p.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List, Christopher Siden and 
> Prakash Surya.
> 
> 
> Bugs: 5497
>     https://www.illumos.org/projects/illumos-gate//issues/5497
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> -------
> 
> 5497 lock contention on arcs_mtx
> Reviewed by: George Wilson <george.wil...@delphix.com>
> Reviewed by: Matthew Ahrens <mahr...@delphix.com>
> 
> Original author: Prakash Surya
> 
> 
> This patch attempts to reduce lock contention on the current arc_state_t
> mutexes. These mutexes are used liberally to protect the LRU lists
> within the ARC (e.g. ARC_mru, ARC_mfu, etc.). The granularity at
> which these locks are acquired has been shown to greatly affect the
> performance of highly concurrent, cached workloads.
> 
> To illustrate this performance impact, I used fio to run a benchmark
> on a system with 32GB of RAM and varied the number of CPUs on the system
> from 1 to 32, testing each power-of-two CPU count. The workload
> remained constant for each test, and consisted of 32 threads each randomly
> reading from unique files, consisting of 16GB worth of data in total (each
> file was 512MB in size). The ZFS filesystem was configured to have a
> recordsize of 8K, and the fio threads were configured to use a block size
> of 8K.
> 
> Without this patch, when running with 32 CPUs, fio reported 388.6 thousand
> IOPS during a 60-second run of the workload, which equates to about
> 3036.2 MB/s of aggregate read bandwidth. With this patch applied (again,
> using 32 CPUs) fio reported 1266.8 thousand IOPS during a 60-second run,
> which equates to about 9896.6 MB/s of aggregate read bandwidth. That's a
> 225% increase in the aggregate read bandwidth when using 32 CPUs.
> 
> As I mentioned above, I also ran the same workload while varying the
> number of CPUs on the system, ranging from 1 CPU to 32 CPUs. The table
> below shows the aggregate bandwidth results from those tests when running
> on master, and the results when running with this patch applied:
> 
> +----------------------+
> |  patched             |
> +--------+-------------+
> | # CPUS | Bandwidth   |
> +--------+-------------+
> |      1 | 833163KB/s  |
> |      2 | 1520.2MB/s  |
> |      4 | 2985.3MB/s  |
> |      8 | 5913.2MB/s  |
> |     16 | 8938.9MB/s  |
> |     32 | 9896.6MB/s  |
> +--------+-------------+
>  
> +----------------------+
> |   master             |
> +--------+-------------+
> | # CPUS | Bandwidth   |
> +--------+-------------+
> |      1 | 840893KB/s  |
> |      2 | 1559.5MB/s  |
> |      4 | 3022.1MB/s  |
> |      8 | 5104.9MB/s  |
> |     16 | 3854.3MB/s  |
> |     32 | 3036.2MB/s  |
> +--------+-------------+
> 
> As you can see, the aggregate bandwidth scales much better when adding
> more CPUs with this patch applied than it does without it.
> 
> To achieve this, the single LRU list and mutex used for each of the
> arc states (e.g. ARC_mru, ARC_mfu, etc.) have been split up into multiple
> LRU lists, each with its own mutex. So, for the above test, 32
> "sublists" were used for each arc state; one sublist per CPU on the
> system. The multilist_t object was introduced to make this transition
> from one LRU to many LRUs.
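>
> For illustration, a rough sketch of the shape such a structure could take
> is given below. The field names and the CPU-based index are assumptions
> made for this sketch; the actual multilist.h introduced by this patch may
> differ.
>
> typedef struct multilist_sublist {
>     kmutex_t    mls_lock;   /* protects only this sublist */
>     list_t      mls_list;   /* LRU-ordered list of objects */
> } multilist_sublist_t;
>
> typedef struct multilist {
>     uint64_t            ml_num_sublists;    /* e.g. one per CPU */
>     multilist_sublist_t *ml_sublists;       /* array of sublists */
> } multilist_t;
>
> /*
>  * Inserting only takes the lock of the sublist the object maps to, so
>  * concurrent inserts from different CPUs no longer serialize on a
>  * single per-state mutex.
>  */
> void
> multilist_insert(multilist_t *ml, void *obj)
> {
>     /*
>      * Index by current CPU (CPU_SEQID is assumed here); a hash of the
>      * object would work as well.
>      */
>     uint64_t idx = CPU_SEQID % ml->ml_num_sublists;
>     multilist_sublist_t *mls = &ml->ml_sublists[idx];
>
>     mutex_enter(&mls->mls_lock);
>     list_insert_head(&mls->mls_list, obj);
>     mutex_exit(&mls->mls_lock);
> }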
> 
> In addition, the logic performing reclamation from the ARC was reworked.
> Now, only the arc_reclaim_thread will call the arc_evict() and
> arc_evict_ghost() functions, whereas arc_evict() could previously be
> called from arc_get_data_buf() (although it can still be called from
> dsl_pool_close() and "zinject -a"). This was done to prevent multiple
> threads from contending on the arc state locks while all trying to
> reclaim at the same time. A secondary reason for this change is that
> it subjectively makes the code easier to understand, as it removes some
> edge cases (e.g. the "recycle" case, and mru vs. mfu decision in
> arc_get_data_buf()).
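>
> A minimal sketch of the resulting hand-off, assuming a condition-variable
> handshake between arc_get_data_buf() and the reclaim thread (the lock, cv,
> and arc_is_overflowing() names below are illustrative, not necessarily the
> identifiers used in the patch):
>
> /*
>  * Instead of evicting inline, a thread that needs buffer space nudges
>  * the reclaim thread and, if the ARC has grown too far past its target,
>  * waits for the reclaim thread to make progress.
>  */
> static void
> arc_wait_for_eviction_sketch(void)
> {
>     mutex_enter(&arc_reclaim_lock);
>     if (arc_is_overflowing()) {
>         /* wake the reclaim thread ... */
>         cv_signal(&arc_reclaim_thread_cv);
>         /* ... and wait for it to free up some space */
>         cv_wait(&arc_reclaim_waiters_cv, &arc_reclaim_lock);
>     }
>     mutex_exit(&arc_reclaim_lock);
> }
>
> On the other side, the reclaim thread would evict from the arc states and
> then cv_broadcast() any waiters once enough space has been freed.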
> 
> One concern that this change introduces is "fairness" in the reclaim
> logic. Now that each arc state is composed of multiple independent but
> related LRUs, it's more difficult to determine which object is the
> "oldest" across all of the LRUs composing each arc state. Thus, it is
> possible for a buffer that isn't strictly the "oldest" in the set of
> LRUs to be evicted out of order (e.g. before the "oldest" buffer). I've
> reasoned this concern away by assuming (and measuring during testing)
> that buffers will be inserted into each arc state's LRU in an even
> distribution across all sublists; thus, simply evicting the "oldest"
> buffer(s) in one randomly selected sublist should be "good enough".
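>
> In other words, eviction picks a random starting sublist and takes a
> bounded batch from the tail of each sublist in turn, roughly like the
> sketch below (evict_batch_from_sublist() is a placeholder helper, and
> zfs_arc_evict_batch_limit is the tunable described in the notes further
> down):
>
> static uint64_t
> arc_evict_state_sketch(multilist_t *ml, uint64_t bytes_wanted)
> {
>     uint64_t evicted = 0;
>     uint64_t num = ml->ml_num_sublists;
>     uint64_t start = spa_get_random(num);   /* random starting sublist */
>     uint64_t i;
>
>     for (i = 0; i < num && evicted < bytes_wanted; i++) {
>         uint64_t idx = (start + i) % num;
>
>         /*
>          * Evict at most zfs_arc_evict_batch_limit headers from the
>          * tail (oldest end) of sublist idx, then move on, so no one
>          * sublist is drained much further than the others.
>          */
>         evicted += evict_batch_from_sublist(ml, idx,
>             bytes_wanted - evicted);
>     }
>     return (evicted);
> }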
> 
> To work around a potential 3-way deadlock involving the multilist sublist
> locks and the l2ad_mtx, the l2ad_mtx locking had to be slightly reworked
> (we need to grab the sublist lock prior to the l2ad_mtx) in
> l2arc_write_buffers. To accommodate these locking changes, the reclaim
> logic surrounding l2arc buffers needed to change as well, due to a couple
> of regressions introduced. The locking changes are straightforward, so I'll
> refrain from discussing them here (though a rough sketch of the new lock
> ordering follows the list below); the changes to the reclaim logic are
> more noteworthy:
> 
> 1. We need to be careful not to evict a header to the arc_l2c_only
>    state while it's concurrently being written to the l2arc device. Since
>    the buffer isn't written out with the hash lock held, we must
>    explicitly account for this in the reclaim logic. While not a common
>    case, it is possible; when it does happen, the buffer is simply
>    skipped during eviction.
>  
> 2. Now that the l2ad_mtx isn't held throughout the whole execution of
>    l2arc_write_buffers, we must be careful to place a header realloc'ed
>    to the arc_l2c_only state back into the l2ad's buflist in the same
>    location. Previously, the new header would be inserted into the
>    head of the buflist, which would allow the header to be re-written
>    out in l2arc_write_buffers (even though it no longer has a b_l1hdr).
>  
> 3. And finally, if we encounter a hash lock collision in
>    l2arc_write_done(), we wait for the lock to become uncontended and
>    retry. Unlike the previous behavior, we can't skip the header, leaving
>    its ARC_FLAG_L2_WRITING bit set, since that would prevent the buffer
>    from being evicted from the ghost list until it was evicted from
>    the L2ARC device.
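>
> As mentioned above, the reworked l2arc_write_buffers() takes each sublist
> lock before the l2ad_mtx. A rough sketch of that ordering follows (most of
> the real logic is elided; multilist_sublist_lock()/unlock() are the sublist
> accessors introduced by this patch, though their exact signatures are
> assumed here):
>
> for (idx = 0; idx < num_sublists; idx++) {
>     multilist_sublist_t *mls = multilist_sublist_lock(ml, idx);
>
>     mutex_enter(&dev->l2ad_mtx);    /* sublist lock first ... */
>     /* ... walk this sublist, issue L2ARC writes, update the buflist ... */
>     mutex_exit(&dev->l2ad_mtx);     /* ... l2ad_mtx second */
>
>     multilist_sublist_unlock(mls);
> }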
> 
> The last big change is that the call to arc_do_user_evicts() was removed
> from arc_reclaim_thread() and split out into a completely separate thread.
> This was done to avoid a deadlock introduced by the change to
> arc_get_data_buf(). When arc_get_data_buf() sleeps waiting for the
> eviction thread to evict buffers, it does so holding the hash lock. The
> eviction function is carefully crafted so as not to sleep on any arc
> header's hash lock (e.g. by using mutex_tryenter). Unfortunately, the
> callbacks in arc_do_user_evicts() are not so careful, so we must execute
> them in a context where they cannot block the reclaim thread while it
> waits to acquire a header's hash lock.
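>
> The pattern on the eviction side looks roughly like this
> (arc_evict_hdr_sketch() is a made-up name, but mutex_tryenter() and the
> skip-on-contention behavior are what the paragraph above refers to):
>
> static boolean_t
> arc_evict_hdr_sketch(arc_buf_hdr_t *hdr)
> {
>     kmutex_t *hash_lock = HDR_LOCK(hdr);
>
>     /*
>      * Never sleep on the hash lock here: its holder may itself be
>      * blocked in arc_get_data_buf() waiting on the reclaim thread.
>      */
>     if (!mutex_tryenter(hash_lock))
>         return (B_FALSE);   /* contended; skip this header for now */
>
>     /* ... evict hdr ... */
>
>     mutex_exit(hash_lock);
>     return (B_TRUE);
> }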
> 
> A few miscellaneous notes:
> 
>  * A couple of new static dtrace probes were introduced to help measure
> the evenness of a given multilist structure: 'multilist-insert' and
> 'multilist-remove' (see the probe sketch after this list).
> 
>  * The tunable zfs_arc_num_sublists_per_state was added to allow a user
> to configure the number of sublists to use for each arc state.
> 
>  * The tunable zfs_arc_overflow_shift was added to configure how far
> past arc_c we're allowed to grow, before arc_get_data_buf() will
> sleep and wait for the reclaim thread to free up some space.
> 
>  * The tunable arc_evict_iterations has been removed and replaced with
> zfs_arc_evict_batch_limit, which defines the number of headers to
> evict from each sublist before moving on to another sublist.
> 
>  * A couple new arcstats were added to track the new L2ARC reclaim
> logic that's been added. These stats are "evict_l2_skip" and
> "l2_writes_lock_retry".
> 
>  * A new kstat was added, "evict_not_enough", to track the number of
> times arc_evict_state() is not able to meet its target number of
> bytes to evict from a given arc state.
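>
> For the probes mentioned above, the placement would look something like
> the following (the argument list is an assumption made for illustration;
> SDT turns the double underscore into a dash in the probe name):
>
> /* fired on every insert; a matching multilist__remove probe exists too */
> DTRACE_PROBE3(multilist__insert, multilist_t *, ml,
>     uint64_t, sublist_idx, void *, obj);
>
> Aggregating on the sublist index argument with dtrace then gives a quick
> picture of how evenly insertions are spread across the sublists.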
> 
> 
> Diffs
> -----
> 
>   usr/src/uts/common/fs/zfs/zio_inject.c 
> 991a0a34ff6a3364a9f553bbb2e9701a1d2579be 
>   usr/src/uts/common/fs/zfs/sys/multilist.h PRE-CREATION 
>   usr/src/uts/common/fs/zfs/sys/arc.h 
> 4e9e8e202a7e7ede8acb50975665fe6e769bbb2c 
>   usr/src/uts/common/fs/zfs/multilist.c PRE-CREATION 
>   usr/src/uts/common/fs/zfs/dsl_pool.c 
> 39a797c27958b924d41309d7c671ba48fc5d848d 
>   usr/src/uts/common/fs/zfs/arc.c 012dd83cd7c3edcbeeb8254bde074356b2e0324d 
>   usr/src/uts/common/Makefile.files 2534527201caf23d9018712e4c437a2272b9ca61 
>   usr/src/lib/libzpool/common/sys/zfs_context.h 
> 5a19542b8f78ea9b9ddaf158e35728a3f516cb43 
> 
> Diff: https://reviews.csiden.org/r/151/diff/
> 
> 
> Testing
> -------
> 
> zfs test suite
> ztest
> 
> internal link: http://jenkins/job/zfs-precommit/1477/
> 
> In addition to the normal regression tests using git-zfs-precommit, I ran
> this patch through numerous hours of "zloop" and "zone" using a couple of
> basic VMs.
> 
> Both "zloop" and "zone" were run for well over 24 hours.
> 
> I also ran the zfs-test suite manually in an infinite loop in another
> two VMs for well over 24 hours. I didn't verify all of the logs
> to see whether the failed tests are all "known" failures, but I didn't hit
> any panics, assertions, or deadlocks either (which is what this testing
> was primarily meant to catch).
> 
> ========================================================================
> This patch was manually tested using a couple of FIO workloads. First,
> the workload used to generate the performance numbers quoted in the
> description is given here:
> 
> [global]
> group_reporting=1
> fallocate=none
> ioengine=psync
> numjobs=32
> iodepth=1
> bs=8k
> filesize=512m
> size=512m
> randrepeat=0
> use_os_rand=1
>  
> [32threads]
> rw=randread
> time_based
> runtime=1m
> directory=/tank/fish
> 
> 
> ========================================================================
> Additionally, a branch with this patch applied was stress tested using
> the following FIO workload to try and catch any remaining issues that may
> have been missed. This workload was run for well over 24 hours while also
> periodically running "zinject -a". It was also run on a pool with and
> without an L2ARC cache device.
> 
> [global]
> group_reporting=1
> fallocate=none
> ioengine=psync
> iodepth=1
> bs=8k
> randrepeat=0
> use_os_rand=1
> norandommap
> loops=1048576
> runtime=24h
> time_based
> directory=/tank/fish
> numjobs=32
> filesize=512m
> size=512m
>  
> [write]
> rw=write
>  
> [randwrite]
> rw=randwrite
>  
> [read]
> rw=read
>  
> [randread]
> rw=randread
>  
> [rw,readwrite]
> rw=readwrite
>  
> [randrw]
> rw=randrw
> 
> ========================================================================
> Another long-running stress test that I ran over a weekend was a slight
> variation on the fio test posted just above. The difference was that
> instead of running a single FIO instance which spawned 192 threads, I used
> 6 independent FIO tests, each of which used 32 threads. Each FIO test ran one
> of the unique "rw" types (e.g. one 32-thread test ran "write", another ran
> "randwrite", another ran "read", etc.) for 30 minutes. In addition, each
> test ran in its own independent pool, and after the 30-minute FIO test the
> pool would be exported, reimported, and the fio test would run again. So,
> the flow of the test was:
> 
> while true; do
>     zpool export $POOL
>     zpool import $POOL
>     fio $FILE
> done
> 
> This test ran for about 18 hours (I can't remember the exact time I started)
> and no deadlocks, ASSERTs, or other anomalous behavior was noticed. Also,
> I intentionally crashed the system with an NMI and ran '::findleaks' after
> the ~18 hours of run time; nothing of interest came up:
> 
> > ::findleaks
> CACHE             LEAKED           BUFCTL CALLER
> ffffff090042b888       1 ffffff096c9a6e68 au_pathdup+0x62
> ffffff090042c448       1 ffffff0a528bfb00 au_pathdup+0x62
> ffffff0900430008       2 ffffff0967354820 impl_acc_hdl_alloc+0x3d
> ffffff090042b008       2 ffffff0966338c30 impl_acc_hdl_alloc+0x52
> ffffff090042e008       2 ffffff0967355ca8 impl_acc_hdl_alloc+0x6b
> ------------------------------------------------------------------------
>            Total       8 buffers, 1080 bytes
> 
> 
> Thanks,
> 
> Matthew Ahrens
> 
>

_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer
