-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.csiden.org/r/151/#review434
-----------------------------------------------------------
Ship it!

Ship It!

- Richard Elling


On Dec. 30, 2014, 11 p.m., Matthew Ahrens wrote:

> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.csiden.org/r/151/
> -----------------------------------------------------------
>
> (Updated Dec. 30, 2014, 11 p.m.)
>
>
> Review request for OpenZFS Developer Mailing List, Christopher Siden and
> Prakash Surya.
>
>
> Bugs: 5497
>     https://www.illumos.org/projects/illumos-gate//issues/5497
>
>
> Repository: illumos-gate
>
>
> Description
> -------
>
> 5497 lock contention on arcs_mtx
> Reviewed by: George Wilson <george.wil...@delphix.com>
> Reviewed by: Matthew Ahrens <mahr...@delphix.com>
>
> Original author: Prakash Surya
>
>
> This patch attempts to reduce lock contention on the current arc_state_t
> mutexes. These mutexes are used liberally to protect the LRU lists
> within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at which
> these locks are acquired has been shown to greatly affect the
> performance of highly concurrent, cached workloads.
>
> To illustrate this performance impact, I used fio to run a benchmark on
> a system with 32GB of RAM, varying the number of CPUs on the system
> from 1 to 32 and testing each power-of-two count. The workload remained
> constant for each test and consisted of 32 threads, each randomly
> reading from a unique file, with 16GB of data in total (each file was
> 512MB in size). The ZFS filesystem was configured with a recordsize of
> 8K, and the fio threads were configured to use a block size of 8K.
>
> Without this patch, when running with 32 CPUs, fio reported 388.6
> thousand IOPS during a 60 second run of the workload, which equates to
> about 3036.2 MB/s of aggregate read bandwidth.
> With this patch applied (again, using 32 CPUs), fio reported 1266.8
> thousand IOPS during a 60 second run, which equates to about 9896.6
> MB/s of aggregate read bandwidth. That's a 225% increase in the
> aggregate read bandwidth when using 32 CPUs.
>
> As I mentioned above, I also ran the same workload while varying the
> number of CPUs on the system, ranging from 1 CPU to 32 CPUs. The tables
> below show the aggregate bandwidth results from those tests when
> running with this patch applied, and when running on master:
>
> +----------------------+
> |       patched        |
> +--------+-------------+
> | # CPUS |  Bandwidth  |
> +--------+-------------+
> |      1 |  833163KB/s |
> |      2 |  1520.2MB/s |
> |      4 |  2985.3MB/s |
> |      8 |  5913.2MB/s |
> |     16 |  8938.9MB/s |
> |     32 |  9896.6MB/s |
> +--------+-------------+
>
> +----------------------+
> |        master        |
> +--------+-------------+
> | # CPUS |  Bandwidth  |
> +--------+-------------+
> |      1 |  840893KB/s |
> |      2 |  1559.5MB/s |
> |      4 |  3022.1MB/s |
> |      8 |  5104.9MB/s |
> |     16 |  3854.3MB/s |
> |     32 |  3036.2MB/s |
> +--------+-------------+
>
> As you can see, the aggregate bandwidth scales much better when adding
> more CPUs with this patch applied than it does without it.
>
> To achieve this, the single LRU and mutex used for each of the arc
> states (e.g. ARC_mru, ARC_mfu, etc) has been split up into multiple
> LRU lists, each with its own mutex. So, for the above test, 32
> "sublists" were used for each arc state; one sublist per CPU on the
> system. The multilist_t object was introduced to make this transition
> from one LRU to many LRUs.
>
> In addition, the logic performing reclamation from the ARC was
> reworked. Now, only the arc_reclaim_thread will call the arc_evict()
> and arc_evict_ghost() functions, whereas arc_evict() could previously
> be called from arc_get_data_buf() (although it can still be called
> from dsl_pool_close() and "zinject -a").
> This was done to prevent
> multiple threads from contending on the arc state locks while all
> trying to reclaim at once. A secondary reason for this change is that
> it subjectively makes the code easier to understand, as it removes
> some edge cases (e.g. the "recycle" case, and the mru vs. mfu decision
> in arc_get_data_buf()).
>
> One concern that this change introduces is "fairness" in the reclaim
> logic. Now that each arc state is composed of multiple independent but
> related LRUs, it's more difficult to determine which object is the
> "oldest" across all of the LRUs composing each arc state. Thus, it is
> possible for a buffer that isn't strictly the "oldest" in the set of
> LRUs to be evicted out of order (e.g. before the "oldest" buffer).
> I've reasoned this concern away by assuming (and measuring during
> testing) that buffers will be inserted into each arc state's LRU in an
> even distribution across all sublists; thus, simply evicting the
> "oldest" buffer(s) in one randomly selected sublist should be "good
> enough".
>
> To work around a potential 3-way deadlock involving the multilist
> sublist locks and the l2ad_mtx, the l2ad_mtx locking had to be
> slightly reworked in l2arc_write_buffers (we need to grab the sublist
> lock prior to the l2ad_mtx). To accommodate these locking changes, the
> reclaim logic surrounding l2arc buffers needed to change as well, due
> to a couple of regressions that were introduced. The locking changes
> are straightforward, so I'll refrain from discussing them here, but
> the changes to the reclaim logic are more noteworthy:
>
> 1. We need to be careful not to evict a header to the arc_l2c_only
>    state that's concurrently being written to the l2arc device. Since
>    the buffer isn't written out with the hash lock held, we must
>    explicitly account for this in the reclaim logic. While not a
>    common case, it is possible. Now, if this does happen, the buffer
>    will be skipped during eviction.
>
> 2. Now that the l2ad_mtx isn't held throughout the whole execution of
>    l2arc_write_buffers, we must be careful to place a header
>    realloc'ed to the arc_l2c_only state back into the l2ad's buflist
>    in the same location. Previously, the new header would be inserted
>    into the head of the buflist, which would allow the header to be
>    re-written out in l2arc_write_buffers (even though it no longer has
>    a b_l1hdr).
>
> 3. And finally, if we encounter a hash lock collision in
>    l2arc_write_done(), we wait for the lock to become uncontended and
>    retry. Unlike the previous behavior, we can't skip the header,
>    leaving its ARC_FLAG_L2_WRITING bit set, since that would prevent
>    the buffer from being evicted from the ghost list until it was
>    evicted from the L2ARC device.
>
> The last big change introduced was that the call to
> arc_do_user_evicts() was removed from arc_reclaim_thread() and split
> out into a completely separate thread. This was due to a deadlock
> being introduced as a result of the change to arc_get_data_buf().
> When arc_get_data_buf() sleeps waiting for the eviction thread to
> evict buffers, it does so holding the hash lock. The eviction function
> is carefully crafted so as not to sleep on any arc header's hash lock
> (e.g. by using mutex_tryenter). Unfortunately, the callbacks in
> arc_do_user_evicts() are not so careful, so we must execute them in
> such a way that they cannot block the reclaim thread waiting to
> acquire a header's hash lock.
>
> A few miscellaneous notes:
>
> * A couple of new static dtrace probes were introduced to help measure
>   the evenness of a given multilist structure: 'multilist-insert' and
>   'multilist-remove'.
>
> * The tunable zfs_arc_num_sublists_per_state was added to allow a user
>   to configure the number of sublists to use for each arc state.
>
> * The tunable zfs_arc_overflow_shift was added to configure how far
>   past arc_c we're allowed to grow before arc_get_data_buf() will
>   sleep and wait for the reclaim thread to free up some space.
>
> * The tunable arc_evict_iterations has been removed and replaced with
>   zfs_arc_evict_batch_limit, which defines the number of headers to
>   evict from each sublist before moving on to another sublist.
>
> * A couple of new arcstats were added to track the new L2ARC reclaim
>   logic: "evict_l2_skip" and "l2_writes_lock_retry".
>
> * A new kstat was added, "evict_not_enough", to track the number of
>   times arc_evict_state() is not able to meet its target number of
>   bytes to evict from a given arc state.
>
>
> Diffs
> -----
>
>   usr/src/uts/common/fs/zfs/zio_inject.c
>     991a0a34ff6a3364a9f553bbb2e9701a1d2579be
>   usr/src/uts/common/fs/zfs/sys/multilist.h PRE-CREATION
>   usr/src/uts/common/fs/zfs/sys/arc.h
>     4e9e8e202a7e7ede8acb50975665fe6e769bbb2c
>   usr/src/uts/common/fs/zfs/multilist.c PRE-CREATION
>   usr/src/uts/common/fs/zfs/dsl_pool.c
>     39a797c27958b924d41309d7c671ba48fc5d848d
>   usr/src/uts/common/fs/zfs/arc.c
>     012dd83cd7c3edcbeeb8254bde074356b2e0324d
>   usr/src/uts/common/Makefile.files
>     2534527201caf23d9018712e4c437a2272b9ca61
>   usr/src/lib/libzpool/common/sys/zfs_context.h
>     5a19542b8f78ea9b9ddaf158e35728a3f516cb43
>
> Diff: https://reviews.csiden.org/r/151/diff/
>
>
> Testing
> -------
>
> zfs test suite
> ztest
>
> internal link: http://jenkins/job/zfs-precommit/1477/
>
> In addition to the normal regression tests using git-zfs-precommit, I
> ran this patch through numerous hours of "zloop" and "zone" using a
> couple of basic VMs.
>
> Both "zloop" and "zone" were run for well over 24 hours.
>
> I also ran the zfs-test suite manually in an infinite loop in another
> two VMs for well over 24 hours.
> I didn't verify all of the logs
> to see if the failed tests are all "known" failures, but I didn't hit
> any panics, assertions, or deadlocks either (which was my motivation:
> to see if I'd hit any panics or deadlocks).
>
> ========================================================================
> This patch was manually tested using a couple of FIO workloads. First,
> the workload used to generate the performance numbers quoted in the
> description is given here:
>
>     [global]
>     group_reporting=1
>     fallocate=none
>     ioengine=psync
>     numjobs=32
>     iodepth=1
>     bs=8k
>     filesize=512m
>     size=512m
>     randrepeat=0
>     use_os_rand=1
>
>     [32threads]
>     rw=randread
>     time_based
>     runtime=1m
>     directory=/tank/fish
>
> ========================================================================
> Additionally, a branch with this patch applied was stress tested using
> the following FIO workload to try and catch any remaining issues that
> may have been missed. This workload was run for well over 24 hours
> while also periodically running "zinject -a". It was also run on a
> pool with and without an L2ARC cache device.
>
>     [global]
>     group_reporting=1
>     fallocate=none
>     ioengine=psync
>     iodepth=1
>     bs=8k
>     randrepeat=0
>     use_os_rand=1
>     norandommap
>     loops=1048576
>     runtime=24h
>     time_based
>     directory=/tank/fish
>     numjobs=32
>     filesize=512m
>     size=512m
>
>     [write]
>     rw=write
>
>     [randwrite]
>     rw=randwrite
>
>     [read]
>     rw=read
>
>     [randread]
>     rw=randread
>
>     [rw,readwrite]
>     rw=readwrite
>
>     [randrw]
>     rw=randrw
>
> ========================================================================
> Another long-running stress test that I ran over a weekend was a
> slight variation on the fio test posted just above. The difference was
> that instead of running a single FIO instance which spawned 192
> threads, I used 6 independent FIO tests, each of which used 32
> threads. Each FIO test ran one of the unique "rw" types (e.g.
> one 32-thread test ran "write", another ran "randwrite", another ran
> "read", etc.) for 30 minutes. In addition, each test ran in its own
> independent pool, and after each 30 minute FIO test the pool would be
> exported, reimported, and the fio test would run again. So, the flow
> of the test was:
>
>     while true; do
>         zpool export $POOL
>         zpool import $POOL
>         fio $FILE
>     done
>
> This test ran for about 18 hours (I can't remember the exact time I
> started), and no deadlocks, ASSERTs, or other malformed behavior was
> noticed. Also, after the ~18 hours of run time, I intentionally
> crashed the system with an NMI and ran '::findleaks'; nothing of
> interest came up:
>
> > ::findleaks
> CACHE             LEAKED  BUFCTL            CALLER
> ffffff090042b888       1  ffffff096c9a6e68  au_pathdup+0x62
> ffffff090042c448       1  ffffff0a528bfb00  au_pathdup+0x62
> ffffff0900430008       2  ffffff0967354820  impl_acc_hdl_alloc+0x3d
> ffffff090042b008       2  ffffff0966338c30  impl_acc_hdl_alloc+0x52
> ffffff090042e008       2  ffffff0967355ca8  impl_acc_hdl_alloc+0x6b
> ------------------------------------------------------------------------
>            Total       8  buffers, 1080 bytes
>
>
> Thanks,
>
> Matthew Ahrens
>
_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer