-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/151/#review429
-----------------------------------------------------------



usr/src/uts/common/fs/zfs/arc.c
<https://reviews.csiden.org/r/151/#comment358>

    missing space?



usr/src/uts/common/fs/zfs/arc.c
<https://reviews.csiden.org/r/151/#comment359>

    Is there science behind the 2x target size limit? Or is this just a 
reasonable guess?
    
    It is ok to be a guess, but if so, what measurement can we add to turn it 
into science?



usr/src/uts/common/fs/zfs/arc.c
<https://reviews.csiden.org/r/151/#comment360>

    It seems to me that we want to bump an arcstat when we are overflowing. If 
we are consistently in this situation, then we need to balance the system 
elsewhere, perhaps outside of ZFS.
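
    For what it's worth, a minimal sketch of what that could look like,
    following the existing ARCSTAT_BUMP() pattern in arc.c (the stat name
    "arcstat_overflow" and the arc_is_overflowing() helper are my
    placeholders, not names taken from the patch):

        /* in arc_get_data_buf(), when the overflow condition is hit */
        if (arc_is_overflowing()) {
                ARCSTAT_BUMP(arcstat_overflow);  /* hypothetical kstat */
                /* ... then wait for the reclaim thread to catch up ... */
        }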


- Richard Elling


On Dec. 30, 2014, 11 p.m., Matthew Ahrens wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/151/
> -----------------------------------------------------------
> 
> (Updated Dec. 30, 2014, 11 p.m.)
> 
> 
> Review request for OpenZFS Developer Mailing List, Christopher Siden and 
> Prakash Surya.
> 
> 
> Bugs: 5497
>     https://www.illumos.org/projects/illumos-gate//issues/5497
> 
> 
> Repository: illumos-gate
> 
> 
> Description
> -------
> 
> 5497 lock contention on arcs_mtx
> Reviewed by: George Wilson <george.wil...@delphix.com>
> Reviewed by: Matthew Ahrens <mahr...@delphix.com>
> 
> Original author: Prakash Surya
> 
> 
> This patch attempts to reduce lock contention on the current arc_state_t
> mutexes. These mutexes are used liberally to protect a number of LRU
> lists within the ARC (e.g. ARC_mru, ARC_mfu, etc.). The granularity at
> which these locks are acquired has been shown to greatly affect the
> performance of highly concurrent, cached workloads.
> 
> To illustrate this performance impact, I used fio to run a benchmark
> on a system with 32GB of RAM and varied the number of CPUs on the system
> from 1 to 32 CPUs, testing each power of two number of CPUs. The workload
> remained constant for each test: 32 threads, each randomly reading from
> a unique file, totaling 16GB worth of data (each file was 512MB in
> size). The ZFS filesystem was configured to have a
> recordsize of 8K, and the fio threads were configured to use a block size
> of 8K.
> 
> Without this patch, when running with 32 CPUs, fio reported 388.6 thousand
> IOPs during a 60 second run of the workload, which equates to about
> 3036.2 MB/s of aggregate read bandwidth. With this patch applied (again,
> using 32 CPUs) fio reported 1266.8 thousand IOPs during a 60 second run,
> which equates to about 9896.6 MB/s of aggregate read bandwidth. That's a
> 225% increase in the aggregate read bandwidth when using 32 CPUs.
> 
> As I mentioned above, I also ran the same workload while varying the
> number of CPUs on the system, ranging from 1 CPU to 32 CPUs. The table
> below shows the aggregate bandwidth results from those tests, first with
> this patch applied and then when running on master:
> 
> +----------------------+
> |  patched             |
> +--------+-------------+
> | # CPUS | Bandwidth   |
> +--------+-------------+
> |      1 | 833163KB/s  |
> |      2 | 1520.2MB/s  |
> |      4 | 2985.3MB/s  |
> |      8 | 5913.2MB/s  |
> |     16 | 8938.9MB/s  |
> |     32 | 9896.6MB/s  |
> +--------+-------------+
>  
> +----------------------+
> |   master             |
> +--------+-------------+
> | # CPUS | Bandwidth   |
> +--------+-------------+
> |      1 | 840893KB/s  |
> |      2 | 1559.5MB/s  |
> |      4 | 3022.1MB/s  |
> |      8 | 5104.9MB/s  |
> |     16 | 3854.3MB/s  |
> |     32 | 3036.2MB/s  |
> +--------+-------------+
> 
> As you can see, the aggregate bandwidth scales much better as CPUs are
> added with this patch applied than it does without it.
> 
> To achieve this, the single LRU and mutex that were used for each of the
> arc states (e.g. ARC_mru, ARC_mfu, etc.) have been split up into multiple
> LRU lists, each with its own mutex. So, for the above test, 32
> "sublists" were used for each arc state; one sublist per CPU on the
> system. The multilist_t object was introduced to manage this transition
> from one LRU to many LRUs.
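>
> To make the shape of this concrete, here is a minimal sketch (the field
> names are illustrative only; the real layout is in multilist.h):
>
> typedef struct multilist_sublist {
>         kmutex_t        mls_lock;       /* protects only this sublist */
>         list_t          mls_list;       /* one LRU's worth of headers */
> } multilist_sublist_t;
>
> typedef struct multilist {
>         multilist_sublist_t     *ml_sublists;    /* ml_num_sublists entries */
>         uint64_t                ml_num_sublists; /* e.g. one per CPU */
>         /* a buffer's sublist is chosen by hashing its identity */
> } multilist_t;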
> 
> In addition, the logic performing reclamation from the ARC was reworked.
> Now, only the arc_reclaim_thread will call the arc_evict() and
> arc_evict_ghost() functions, whereas arc_evict() could previously be
> called from arc_get_data_buf() (although, it can still be called from
> dsl_pool_close() and "zinject -a"). This was done to try and prevent
> multiple threads simultaneously contending on the arc state locks, all
> trying to reclaim simultaneously. A secondary reason for this change is
> it subjectively makes the code easier to understand as it removes some
> edge cases (e.g. the "recycle" case, and mru vs. mfu decision in
> arc_get_data_buf()).
> 
> One concern that this change introduces is "fairness" in the reclaim
> logic. Now that each arc state is composed of multiple independent but
> related LRUs, it's more difficult to determine which object is the
> "oldest" across all of the LRUs composing each arc state. Thus, it is
> possible for a buffer that isn't strictly the "oldest" in the set of
> LRUs to be evicted out of order (e.g. before the "oldest" buffer). I've
> reasoned this concern away by assuming (and measuring during testing)
> that buffers will be inserted into each arc state's LRUs in an even
> distribution across all sublists; thus, simply evicting the "oldest"
> buffer(s) in one randomly selected sublist should be "good enough".
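>
> In code terms, eviction from a state then amounts to something like the
> following sketch (simplified, and the sublist-selection helpers shown
> here are assumptions about the interface rather than quotes of it):
>
>         /* pick a random sublist to spread eviction pressure evenly */
>         int idx = multilist_get_random_index(ml);
>         multilist_sublist_t *mls = multilist_sublist_lock(ml, idx);
>         /* walk from the tail (the oldest end) of this sublist's LRU,
>          * evicting headers until the byte target is met */
>         multilist_sublist_unlock(mls);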
> 
> To work around a potential 3-way deadlock involving the multilist sublist
> locks and the l2ad_mtx, the l2ad_mtx locking had to be slightly reworked
> (we need to grab the sublist lock prior to the l2ad_mtx) in
> l2arc_write_buffers. To accommodate these locking changes, the reclaim
> logic surrounding l2arc buffers needed to change as well, to address a
> couple of regressions the locking changes introduced. The locking changes
> are straightforward, so I'll refrain from discussing them in detail here
> (a small ordering sketch follows the list below), but the changes to the
> reclaim logic are more noteworthy:
> 
> 1. We need to be careful not to evict a header to the arc_l2c_only
>    state while that header is concurrently being written to the l2arc
>    device. Since the buffer isn't written out with the hash lock held,
>    we must explicitly account for this in the reclaim logic. While not
>    a common case, it is possible. If it does happen, the buffer will
>    simply be skipped during eviction.
>  
> 2. Now that the l2ad_mtx isn't held throughout the whole execution of
>    l2arc_write_header, we must be careful to place a header realloc'ed
>    to the arc_l2c_only state back into the l2ad's buflist in the same
>    location. Previously, the new header would be inserted into the
>    head of the buflist, which would allow the header to be re-written
>    out in l2arc_write_buffers (even though it no longer has a b_l1hdr).
>  
> 3. And finally, if we encounter a hash lock collision in
>    l2arc_write_done(), we wait for the lock to become uncontended and
>    retry. Unlike the previous behavior, we can't skip the header, leaving
>    its ARC_FLAG_L2_WRITING bit set, since that would prevent the buffer
>    from being evicted from the ghost list until it was evicted from
>    the L2ARC device.
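>
> For reference, a minimal sketch of the reworked lock ordering in
> l2arc_write_buffers() (inferred from the description above, and
> simplified):
>
>         /* 1st: the multilist sublist lock */
>         mls = multilist_sublist_lock(ml, idx);
>         /* 2nd: the per-device l2ad_mtx */
>         mutex_enter(&dev->l2ad_mtx);
>         /* ... write out eligible buffers from this sublist ... */
>         mutex_exit(&dev->l2ad_mtx);
>         multilist_sublist_unlock(mls);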
> 
> The last big change is that the call to arc_do_user_evicts() was
> removed from arc_reclaim_thread() and split out into a completely
> separate thread. This was due to a deadlock that would otherwise be
> introduced by the change to arc_get_data_buf(). When arc_get_data_buf()
> sleeps waiting for the eviction thread to evict buffers, it does so
> holding the hash lock. The eviction function is carefully crafted so as
> not to sleep on any arc header's hash lock (e.g. using mutex_tryenter).
> Unfortunately, the callbacks in arc_do_user_evicts() are not so careful,
> so we must execute them in such a way that they cannot block the reclaim
> thread waiting to acquire a header's hash lock.
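>
> The "careful" part of that evict path boils down to this pattern (a
> sketch only; HDR_LOCK() and arcstat_mutex_miss follow existing arc.c
> naming conventions):
>
>         kmutex_t *hash_lock = HDR_LOCK(hdr);
>         if (!mutex_tryenter(hash_lock)) {
>                 /* contended: skip this header rather than sleep on it */
>                 ARCSTAT_BUMP(arcstat_mutex_miss);
>                 continue;
>         }
>         /* ... safe to evict; drop hash_lock when finished ... */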
> 
> A few miscellaneous notes:
> 
>  * A couple of new static DTrace probes were introduced to help measure
> the evenness of a given multilist structure: 'multilist-insert' and
> 'multilist-remove'.
> 
>  * The tunable zfs_arc_num_sublists_per_state was added to allow a user
> to configure the number of sublists to use for each arc state.
> 
>  * The tunable zfs_arc_overflow_shift was added to configure how far
> past arc_c we're allowed to grow before arc_get_data_buf() will
> sleep and wait for the reclaim thread to free up some space (see the
> sketch after this list).
> 
>  * The tunable arc_evict_iterations has been removed and replaced with
> zfs_arc_evict_batch_limit, which defines the number of headers to
> evict from each sublist before moving on to another sublist.
> 
>  * A couple new arcstats were added to track the new L2ARC reclaim
> logic that's been added. These stats are "evict_l2_skip" and
> "l2_writes_lock_retry".
> 
>  * A new kstat was added, "evict_not_enough", to track the number of
> times arc_evict_state() is not able to meet its target number of
> bytes to evict from a given arc state.
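>
> To make the overflow tunable concrete (the helper's exact shape here is
> my assumption; only the tunable's meaning comes from the notes above):
>
> static boolean_t
> arc_is_overflowing(void)
> {
>         /* allow arc_size to exceed arc_c by (arc_c >> shift) bytes */
>         uint64_t overflow = MAX(SPA_MAXBLOCKSIZE,
>             arc_c >> zfs_arc_overflow_shift);
>         return (arc_size >= arc_c + overflow);
> }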
> 
> 
> Diffs
> -----
> 
>   usr/src/uts/common/fs/zfs/zio_inject.c 
> 991a0a34ff6a3364a9f553bbb2e9701a1d2579be 
>   usr/src/uts/common/fs/zfs/sys/multilist.h PRE-CREATION 
>   usr/src/uts/common/fs/zfs/sys/arc.h 
> 4e9e8e202a7e7ede8acb50975665fe6e769bbb2c 
>   usr/src/uts/common/fs/zfs/multilist.c PRE-CREATION 
>   usr/src/uts/common/fs/zfs/dsl_pool.c 
> 39a797c27958b924d41309d7c671ba48fc5d848d 
>   usr/src/uts/common/fs/zfs/arc.c 012dd83cd7c3edcbeeb8254bde074356b2e0324d 
>   usr/src/uts/common/Makefile.files 2534527201caf23d9018712e4c437a2272b9ca61 
>   usr/src/lib/libzpool/common/sys/zfs_context.h 
> 5a19542b8f78ea9b9ddaf158e35728a3f516cb43 
> 
> Diff: https://reviews.csiden.org/r/151/diff/
> 
> 
> Testing
> -------
> 
> zfs test suite
> ztest
> 
> internal link: http://jenkins/job/zfs-precommit/1477/
> 
> In addition to the normal regression tests using git-zfs-precommit, I ran
> this patch through numerous hours of "zloop" and "zone" using a couple of
> basic VMs.
> 
> Both "zloop" and "zone" were run for well over 24 hours.
> 
> I also ran the zfs-test suite manually in an infinite loop in another
> two VMs for well over 24 hours. I didn't verify all of the logs
> to see if the failed tests are all "known" failures, but I didn't hit
> any panics, assertions, or deadlocks either (which was my motivation:
> to see whether I'd hit any panics or deadlocks).
> 
> ========================================================================
> This patch was manually tested using a couple of FIO workloads. First,
> the workload used to generate the performance numbers quoted in the
> description is given here:
> 
> [global]
> group_reporting=1
> fallocate=none
> ioengine=psync
> numjobs=32
> iodepth=1
> bs=8k
> filesize=512m
> size=512m
> randrepeat=0
> use_os_rand=1
>  
> [32threads]
> rw=randread
> time_based
> runtime=1m
> directory=/tank/fish
> 
> 
> ========================================================================
> Additionally, a branch with this patch applied was stress tested using
> the following FIO workload to try and catch any remaining issues that may
> have been missed. This workload was run for well over 24 hours while also
> periodically running "zinject -a". It was also run on a pool with and
> without an L2ARC cache device.
> 
> [global]
> group_reporting=1
> fallocate=none
> ioengine=psync
> iodepth=1
> bs=8k
> randrepeat=0
> use_os_rand=1
> norandommap
> loops=1048576
> runtime=24h
> time_based
> directory=/tank/fish
> numjobs=32
> filesize=512m
> size=512m
>  
> [write]
> rw=write
>  
> [randwrite]
> rw=randwrite
>  
> [read]
> rw=read
>  
> [randread]
> rw=randread
>  
> [rw,readwrite]
> rw=readwrite
>  
> [randrw]
> rw=randrw
> 
> ========================================================================
> Another long-running stress test that I ran over a weekend was a slight
> variation on the fio test I posted just above. The difference was that
> instead of running a single FIO instance which spawned 192 threads, I used
> 6 independent FIO tests, each of which used 32 threads. Each FIO test ran one
> of the unique "rw" types (e.g. one 32-thread test ran "write", another ran
> "randwrite", another ran "read", etc.) for 30 minutes. In addition, each
> test ran in its own independent pool, and after the 30-minute FIO test the
> pool would be exported, reimported, and the fio test would run again. So,
> the flow of the test was:
> 
> while true; do
>     zpool export $POOL
>     zpool import $POOL
>     fio $FILE
> done
> 
> This test ran for about 18 hours (I can't remember the exact time I started)
> and no deadlocks, ASSERTs, or other malformed behavior were noticed. Also,
> I intentionally crashed the system with an NMI and ran '::findleaks' after
> the ~18 hours of run time; nothing of interest came up:
> 
> > ::findleaks
> CACHE             LEAKED           BUFCTL CALLER
> ffffff090042b888       1 ffffff096c9a6e68 au_pathdup+0x62
> ffffff090042c448       1 ffffff0a528bfb00 au_pathdup+0x62
> ffffff0900430008       2 ffffff0967354820 impl_acc_hdl_alloc+0x3d
> ffffff090042b008       2 ffffff0966338c30 impl_acc_hdl_alloc+0x52
> ffffff090042e008       2 ffffff0967355ca8 impl_acc_hdl_alloc+0x6b
> ------------------------------------------------------------------------
>            Total       8 buffers, 1080 bytes
> 
> 
> Thanks,
> 
> Matthew Ahrens
> 
>
