Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Christoph Hellwig writes: > Hi Ying, > >> Any update to this regression? > > Not really. We've optimized everything we could in XFS without > dropping the architecture that we really want to move to. Now we're > waiting for some MM behavior to be fixed that this uncovered. But > in the end we will probably be stuck with a slight regression in this > artificial workload. I see. Thanks for the update. Please keep me posted. Best Regards, Huang, Ying
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Ying, > Any update to this regression? Not really. We've optimized everything we could in XFS without dropping the architecture that we really want to move to. Now we're waiting for some MM behavior to be fixed that this uncovered. But in the end we will probably be stuck with a slight regression in this artificial workload.
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi, Christoph, "Huang, Ying" writes: > Hi, Christoph, > > "Huang, Ying" writes: > >> Christoph Hellwig writes: >> >>> Snipping the long contest: >>> >>> I think there are three observations here: >>> >>> (1) removing the mark_page_accessed (which is the only significant >>> change in the parent commit) hurts the >>> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. >>> I'd still rather stick to the filemap version and let the >>> VM people sort it out. How do the numbers for this test >>> look for XFS vs say ext4 and btrfs? >>> (2) lots of additional spinlock contention in the new case. A quick >>> check shows that I fat-fingered my rewrite so that we do >>> the xfs_inode_set_eofblocks_tag call now for the pure lookup >>> case, and pretty much all new cycles come from that. >>> (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and >>> we're already doing way to many even without my little bug above. >>> >>> So I've force pushed a new version of the iomap-fixes branch with >>> (2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a >>> lot less expensive slotted in before that. Would be good to see >>> the numbers with that. >> >> For the original reported regression, the test result is as follow, >> >> = >> compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase: >> >> gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7 >> >> commit: >> f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit) >> 68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit) >> 99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch) >> bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch) >> >> f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b >> bf4dc6e4ecc2a3d042029319bc >> -- -- >> -- >> %stddev %change %stddev %change %stddev >> %change %stddev >> \ |\ |\ >>|\ >> 484435 ± 0% -13.3% 420004 ± 0% -17.0% 402250 ± 0% >> -15.6% 408998 ± 0% aim7.jobs-per-min > > It appears the original reported regression hasn't bee resolved by your > commit. Could you take a look at the test results and the perf data? Any update to this regression? 
Best Regards, Huang, Ying >> >> And the perf data is as follow, >> >> "perf-profile.func.cycles-pp.intel_idle": 20.25, >> "perf-profile.func.cycles-pp.memset_erms": 11.72, >> "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37, >> "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49, >> "perf-profile.func.cycles-pp.block_write_end": 1.77, >> "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63, >> "perf-profile.func.cycles-pp.unlock_page": 1.58, >> "perf-profile.func.cycles-pp.___might_sleep": 1.56, >> "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33, >> "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23, >> "perf-profile.func.cycles-pp.up_write": 1.21, >> "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18, >> "perf-profile.func.cycles-pp.down_write": 1.06, >> "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94, >> "perf-profile.func.cycles-pp.generic_write_end": 0.92, >> "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91, >> "perf-profile.func.cycles-pp._raw_spin_lock": 0.81, >> "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79, >> "perf-profile.func.cycles-pp.__might_sleep": 0.79, >> "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7, >> "perf-profile.func.cycles-pp.__list_del_entry": 0.7, >> "perf-profile.func.cycles-pp.vfs_write": 0.69, >> "perf-profile.func.cycles-pp.drop_buffers": 0.68, >> "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67, >> "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67, >> >> Best Regards, >> Huang, Ying >> ___ >> LKP mailing list >> l...@lists.01.org >> https://lists.01.org/mailman/listinfo/lkp
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Mel Gorman writes: > On Fri, Sep 02, 2016 at 09:32:58AM +1000, Dave Chinner wrote: >> On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote: >> > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: >> > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote: >> > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: >> > > > > > Yes, we could try to batch the locking like DaveC already suggested >> > > > > > (ie we could move the locking to the caller, and then make >> > > > > > shrink_page_list() just try to keep the lock held for a few pages if >> > > > > > the mapping doesn't change), and that might result in fewer crazy >> > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong >> > > > > > kind of workaround. >> > > > > > >> > > > > >> > > > > Even if such batching was implemented, it would be very specific to the >> > > > > case of a single large file filling LRUs on multiple nodes. >> > > > > >> > > > >> > > > The latest Jason Bourne movie was sufficiently bad that I spent time >> > > > thinking how the tree_lock could be batched during reclaim. It's not >> > > > straight-forward but this prototype did not blow up on UMA and may be >> > > > worth considering if Dave can test either approach has a positive >> > > > impact. >> > > >> > > SO, I just did a couple of tests. I'll call the two patches "sleepy" >> > > for the contention backoff patch and "bourney" for the Jason Bourne >> > > inspired batching patch. This is an average of 3 runs, overwriting >> > > a 47GB file on a machine with 16GB RAM: >> > > >> > > IO throughput wall time __pv_queued_spin_lock_slowpath >> > > vanilla 470MB/s 1m42s 25-30% >> > > sleepy 295MB/s 2m43s <1% >> > > bourney 425MB/s 1m53s 25-30% >> > > >> > >> > This is another blunt-force patch that >> >> Sorry for taking so long to get back to this - had a bunch of other >> stuff to do (e.g. XFS metadata CRCs have found their first compiler >> bug) and haven't had time to test this. >> > > No problem. Thanks for getting back to me. > >> The blunt force approach seems to work ok: >> > > Ok, good to know. Unfortunately I found that it's not a universal win. For > the swapping-to-fast-storage case (simulated with ramdisk), the batching is > a bigger gain *except* in the single threaded case. Stalling kswapd in the > "blunt force approach" severely regressed a streaming anonymous reader > for all thread counts so it's not the right answer. > > I'm working on a series during spare time that tries to balance all the > issues for both swapcache and filecache on different workloads but right > now, the complexity is high and it's still "win some, lose some". > > As an aside for the LKP people using ramdisk for swap -- ramdisk considers > itself to be rotational storage. It takes the paths that are optimised to > minimise seeks but it's quite slow. When tree_lock contention is reduced, > the workload is dominated by scan_swap_map. It's a one-line fix and I have > a patch for it but it only really matters if ramdisk is being used as a > simulator for swapping to fast storage. We (LKP people) use drivers/nvdimm/pmem.c instead of drivers/block/brd.c as the ramdisk, which reports itself as non-rotational storage. We also have a series to optimize other locks in the swap path, for example by batching swap space allocation and freeing. If your batching optimization for removing pages from the swap cache can be merged, that will help us a lot! Best Regards, Huang, Ying
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Sep 02, 2016 at 09:32:58AM +1000, Dave Chinner wrote: > On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote: > > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: > > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote: > > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > > > > > > Yes, we could try to batch the locking like DaveC already suggested > > > > > > (ie we could move the locking to the caller, and then make > > > > > > shrink_page_list() just try to keep the lock held for a few pages if > > > > > > the mapping doesn't change), and that might result in fewer crazy > > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong > > > > > > kind of workaround. > > > > > > > > > > > > > > > > Even if such batching was implemented, it would be very specific to > > > > > the > > > > > case of a single large file filling LRUs on multiple nodes. > > > > > > > > > > > > > The latest Jason Bourne movie was sufficiently bad that I spent time > > > > thinking how the tree_lock could be batched during reclaim. It's not > > > > straight-forward but this prototype did not blow up on UMA and may be > > > > worth considering if Dave can test either approach has a positive > > > > impact. > > > > > > SO, I just did a couple of tests. I'll call the two patches "sleepy" > > > for the contention backoff patch and "bourney" for the Jason Bourne > > > inspired batching patch. This is an average of 3 runs, overwriting > > > a 47GB file on a machine with 16GB RAM: > > > > > > IO throughput wall time __pv_queued_spin_lock_slowpath > > > vanilla 470MB/s 1m42s 25-30% > > > sleepy295MB/s 2m43s <1% > > > bourney 425MB/s 1m53s 25-30% > > > > > > > This is another blunt-force patch that > > Sorry for taking so long to get back to this - had a bunch of other > stuff to do (e.g. XFS metadata CRCs have found their first compiler > bug) and haven't had to time test this. > No problem. Thanks for getting back to me. > The blunt force approach seems to work ok: > Ok, good to know. Unfortunately I found that it's not a universal win. For the swapping-to-fast-storage case (simulated with ramdisk), the batching is a bigger gain *except* in the single threaded case. Stalling kswap in the "blunt force approach" severely regressed a streaming anonymous reader for all thread counts so it's not the right answer. I'm working on a series during spare time that tries to balance all the issues for either swapcache and filecache on different workloads but right now, the complexity is high and it's still "win some, lose some". As an aside for the LKP people using ramdisk for swap -- ramdisk considers itself to be rotational storage. It takes the paths that are optimised to minimise seeks but it's quite slow. When tree_lock contention is reduced, workload is dominated by scan_swap_map. It's a one-line fix and I have a patch for it but it only really matters if ramdisk is being used as a simulator for swapping to fast storage. -- Mel Gorman SUSE Labs
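For reference, the one-line fix Mel alludes to most plausibly amounts to marking the ramdisk's request queue as non-rotational, so that swapon() sets SWP_SOLIDSTATE and swap allocation takes the per-CPU cluster path instead of the seek-minimising walk in scan_swap_map(). A minimal sketch, assuming the 4.8-era drivers/block/brd.c and block API; this is a guess at the shape of the patch, not Mel's actual change:

	/*
	 * Sketch only: in brd_alloc(), after brd->brd_queue has been allocated
	 * and the other queue limits are set, advertise the ramdisk as
	 * non-rotational storage.
	 */
	queue_flag_set_unlocked(QUEUE_FLAG_NONROT, brd->brd_queue);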
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 19, 2016 at 04:08:34PM +0100, Mel Gorman wrote: > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: > > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote: > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > > > > > Yes, we could try to batch the locking like DaveC already suggested > > > > > (ie we could move the locking to the caller, and then make > > > > > shrink_page_list() just try to keep the lock held for a few pages if > > > > > the mapping doesn't change), and that might result in fewer crazy > > > > > cacheline ping-pongs overall. But that feels like exactly the wrong > > > > > kind of workaround. > > > > > > > > > > > > > Even if such batching was implemented, it would be very specific to the > > > > case of a single large file filling LRUs on multiple nodes. > > > > > > > > > > The latest Jason Bourne movie was sufficiently bad that I spent time > > > thinking how the tree_lock could be batched during reclaim. It's not > > > straight-forward but this prototype did not blow up on UMA and may be > > > worth considering if Dave can test either approach has a positive impact. > > > > SO, I just did a couple of tests. I'll call the two patches "sleepy" > > for the contention backoff patch and "bourney" for the Jason Bourne > > inspired batching patch. This is an average of 3 runs, overwriting > > a 47GB file on a machine with 16GB RAM: > > > > IO throughput wall time __pv_queued_spin_lock_slowpath > > vanilla 470MB/s 1m42s 25-30% > > sleepy 295MB/s 2m43s <1% > > bourney 425MB/s 1m53s 25-30% > > > > This is another blunt-force patch that

Sorry for taking so long to get back to this - had a bunch of other stuff to do (e.g. XFS metadata CRCs have found their first compiler bug) and haven't had time to test this.

The blunt force approach seems to work ok:

                IO throughput   wall time   __pv_queued_spin_lock_slowpath
vanilla         470MB/s         1m42s       25-30%
sleepy          295MB/s         2m43s       <1%
bourney         425MB/s         1m53s       25-30%
blunt           470MB/s         1m41s       ~2%

Performance is pretty much the same as the vanilla kernel - maybe a little bit faster if we consider median rather than mean results.
A snapshot profile from 'perf top -U' looks like:

  11.31%  [kernel]  [k] copy_user_generic_string
   3.59%  [kernel]  [k] get_page_from_freelist
   3.22%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.80%  [kernel]  [k] __block_commit_write.isra.29
   2.14%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   1.99%  [kernel]  [k] _raw_spin_lock
   1.98%  [kernel]  [k] wake_all_kswapds
   1.92%  [kernel]  [k] _raw_spin_lock_irqsave
   1.90%  [kernel]  [k] node_dirty_ok
   1.69%  [kernel]  [k] __wake_up_bit
   1.57%  [kernel]  [k] ___might_sleep
   1.49%  [kernel]  [k] __might_sleep
   1.24%  [kernel]  [k] __radix_tree_lookup
   1.18%  [kernel]  [k] kmem_cache_alloc
   1.13%  [kernel]  [k] update_fast_ctr
   1.11%  [kernel]  [k] radix_tree_tag_set
   1.08%  [kernel]  [k] clear_page_dirty_for_io
   1.06%  [kernel]  [k] down_write
   1.06%  [kernel]  [k] up_write
   1.01%  [kernel]  [k] unlock_page
   0.99%  [kernel]  [k] xfs_log_commit_cil
   0.97%  [kernel]  [k] __inc_node_state
   0.95%  [kernel]  [k] __memset
   0.89%  [kernel]  [k] xfs_do_writepage
   0.89%  [kernel]  [k] __list_del_entry
   0.87%  [kernel]  [k] __vfs_write
   0.85%  [kernel]  [k] xfs_inode_item_format
   0.84%  [kernel]  [k] shrink_page_list
   0.82%  [kernel]  [k] kmem_cache_free
   0.79%  [kernel]  [k] radix_tree_tag_clear
   0.78%  [kernel]  [k] _raw_spin_lock_irq
   0.77%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.76%  [kernel]  [k] node_page_state
   0.72%  [kernel]  [k] xfs_count_page_state
   0.68%  [kernel]  [k] xfs_file_aio_write_checks
   0.65%  [kernel]  [k] wakeup_kswapd

There's still a lot of time in locking, but it's no longer obviously being spent spinning on contended locks. We seem to be spending a lot of time trying to wake kswapds now - the context switch rate of the workload is only 400-500/s, so there aren't a lot of sleeps and wakeups actually occurring. Regardless, throughput and locking behaviour seem to be a lot better than with the other patches... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 24, 2016 at 08:40:37AM -0700, Huang, Ying wrote: > Mel Gorman writes: > > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > >> > Yes, we could try to batch the locking like DaveC already suggested > >> > (ie we could move the locking to the caller, and then make > >> > shrink_page_list() just try to keep the lock held for a few pages if > >> > the mapping doesn't change), and that might result in fewer crazy > >> > cacheline ping-pongs overall. But that feels like exactly the wrong > >> > kind of workaround. > >> > > >> > >> Even if such batching was implemented, it would be very specific to the > >> case of a single large file filling LRUs on multiple nodes. > >> > > > > The latest Jason Bourne movie was sufficiently bad that I spent time > > thinking how the tree_lock could be batched during reclaim. It's not > > straight-forward but this prototype did not blow up on UMA and may be > > worth considering if Dave can test either approach has a positive impact. > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 374d95d04178..926110219cd9 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct > > address_space *mapping, > > return PAGE_CLEAN; > > } > > We found this patch helps much for swap out performance, where there are > usually only one mapping for all swap pages. Yeah, I expected it would be an unconditional win on swapping. I just did not concentrate on it very much as it was not the problem at hand. > In our 16 processes > sequential swap write test case for a ramdisk on a Xeon E5 v3 machine, > the swap out throughput improved 40.4%, from ~0.97GB/s to ~1.36GB/s. Ok, so main benefit would be for ultra-fast storage. I doubt it's noticable on slow disks. > What's your plan for this patch? If it can be merged soon, that will be > great! > Until this mail, no plan. I'm still waiting to hear if Dave's test case has improved with the latest prototype for reducing contention. > I found some issues in the original patch to work with swap cache. Below > is my fixes to make it work for swap cache. > Thanks for the fix. I'm going offline today for a few days but I added a todo item to finish this patch at some point. I won't be rushing it but it'll get done eventually. -- Mel Gorman SUSE Labs
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi, Mel, Mel Gorman writes: > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: >> > Yes, we could try to batch the locking like DaveC already suggested >> > (ie we could move the locking to the caller, and then make >> > shrink_page_list() just try to keep the lock held for a few pages if >> > the mapping doesn't change), and that might result in fewer crazy >> > cacheline ping-pongs overall. But that feels like exactly the wrong >> > kind of workaround. >> > >> >> Even if such batching was implemented, it would be very specific to the >> case of a single large file filling LRUs on multiple nodes. >> > > The latest Jason Bourne movie was sufficiently bad that I spent time > thinking how the tree_lock could be batched during reclaim. It's not > straight-forward but this prototype did not blow up on UMA and may be > worth considering if Dave can test either approach has a positive impact. > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 374d95d04178..926110219cd9 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct > address_space *mapping, > return PAGE_CLEAN; > } We found this patch helps much for swap out performance, where there are usually only one mapping for all swap pages. In our 16 processes sequential swap write test case for a ramdisk on a Xeon E5 v3 machine, the swap out throughput improved 40.4%, from ~0.97GB/s to ~1.36GB/s. What's your plan for this patch? If it can be merged soon, that will be great! I found some issues in the original patch to work with swap cache. Below is my fixes to make it work for swap cache. Best Regards, Huang, Ying > diff --git a/mm/vmscan.c b/mm/vmscan.c index ac5fbff..dcaf295 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -623,22 +623,28 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, static void finalise_remove_mapping(struct list_head *swapcache, struct list_head *filecache, + struct list_head *free_pages, void (*freepage)(struct page *)) { struct page *page; while (!list_empty(swapcache)) { - swp_entry_t swap = { .val = page_private(page) }; + swp_entry_t swap; page = lru_to_page(swapcache); list_del(&page->lru); + swap.val = page_private(page); swapcache_free(swap); set_page_private(page, 0); + if (free_pages) + list_add(&page->lru, free_pages); } while (!list_empty(filecache)) { - page = lru_to_page(swapcache); + page = lru_to_page(filecache); list_del(&page->lru); freepage(page); + if (free_pages) + list_add(&page->lru, free_pages); } } @@ -649,7 +655,8 @@ static void finalise_remove_mapping(struct list_head *swapcache, static int __remove_mapping_page(struct address_space *mapping, struct page *page, bool reclaimed, struct list_head *swapcache, -struct list_head *filecache) +struct list_head *filecache, +struct list_head *free_pages) { BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); @@ -722,6 +729,8 @@ static int __remove_mapping_page(struct address_space *mapping, __delete_from_page_cache(page, shadow); if (freepage) list_add(&page->lru, filecache); + else if (free_pages) + list_add(&page->lru, free_pages); } return 1; @@ -747,7 +756,7 @@ int remove_mapping(struct address_space *mapping, struct page *page) spin_lock_irqsave(&mapping->tree_lock, flags); freepage = mapping->a_ops->freepage; - if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache)) { + if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache, NULL)) { /* * Unfreezing the refcount with 1 rather than 2 effectively * drops 
the pagecache ref for us without requiring another @@ -757,7 +766,7 @@ int remove_mapping(struct address_space *mapping, struct page *page) ret = 1; } spin_unlock_irqrestore(&mapping->tree_lock, flags); - finalise_remove_mapping(&swapcache, &filecache, freepage); + finalise_remove_mapping(&swapcache, &filecache, NULL, freepage); return ret; } @@ -776,29 +785,28 @@ static void remove_mapping_list(struct list_head *mapping_list, page = lru_to_page(mapping_list); list_del(&page->lru); -
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi, Christoph, "Huang, Ying" writes: > Christoph Hellwig writes: > >> Snipping the long contest: >> >> I think there are three observations here: >> >> (1) removing the mark_page_accessed (which is the only significant >> change in the parent commit) hurts the >> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. >> I'd still rather stick to the filemap version and let the >> VM people sort it out. How do the numbers for this test >> look for XFS vs say ext4 and btrfs? >> (2) lots of additional spinlock contention in the new case. A quick >> check shows that I fat-fingered my rewrite so that we do >> the xfs_inode_set_eofblocks_tag call now for the pure lookup >> case, and pretty much all new cycles come from that. >> (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and >> we're already doing way too many even without my little bug above. >> >> So I've force pushed a new version of the iomap-fixes branch with >> (2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag a >> lot less expensive slotted in before that. Would be good to see >> the numbers with that. > > For the original reported regression, the test result is as follows, > > = > compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase: > > gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7 > > commit: > f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit) > 68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit) > 99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch) > bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch)
>
> f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b bf4dc6e4ecc2a3d042029319bc
> ---------------- -------------------------- -------------------------- --------------------------
>          %stddev      %change    %stddev      %change    %stddev      %change    %stddev
>              \            |           \            |           \            |           \
>     484435 ±  0%      -13.3%     420004 ±  0%      -17.0%     402250 ±  0%      -15.6%     408998 ±  0%  aim7.jobs-per-min

It appears the original reported regression hasn't been resolved by your commit. Could you take a look at the test results and the perf data?
Best Regards, Huang, Ying > > And the perf data is as follow, > > "perf-profile.func.cycles-pp.intel_idle": 20.25, > "perf-profile.func.cycles-pp.memset_erms": 11.72, > "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37, > "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49, > "perf-profile.func.cycles-pp.block_write_end": 1.77, > "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63, > "perf-profile.func.cycles-pp.unlock_page": 1.58, > "perf-profile.func.cycles-pp.___might_sleep": 1.56, > "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33, > "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23, > "perf-profile.func.cycles-pp.up_write": 1.21, > "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18, > "perf-profile.func.cycles-pp.down_write": 1.06, > "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94, > "perf-profile.func.cycles-pp.generic_write_end": 0.92, > "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91, > "perf-profile.func.cycles-pp._raw_spin_lock": 0.81, > "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79, > "perf-profile.func.cycles-pp.__might_sleep": 0.79, > "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7, > "perf-profile.func.cycles-pp.__list_del_entry": 0.7, > "perf-profile.func.cycles-pp.vfs_write": 0.69, > "perf-profile.func.cycles-pp.drop_buffers": 0.68, > "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67, > "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67, > > Best Regards, > Huang, Ying > ___ > LKP mailing list > l...@lists.01.org > https://lists.01.org/mailman/listinfo/lkp
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sat, Aug 20, 2016 at 09:48:39AM +1000, Dave Chinner wrote: > On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote: > > On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote: > > > It *could* be as simple/stupid as just saying "let's allocate the page > > > cache for new pages from the current node" - and if the process that > > > dirties pages just stays around on one single node, that might already > > > be sufficient. > > > > > > So just for testing purposes, you could try changing that > > > > > > return alloc_pages(gfp, 0); > > > > > > in __page_cache_alloc() into something like > > > > > > return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), > > > gfp, 0); > > > > > > or something. > > > > > > > The test would be interesting but I believe that keeping heavy writers > > on one node will force them to stall early on dirty balancing even if > > there is plenty of free memory on other nodes. > > Well, it depends on the speed of the storage. The higher the speed > of the storage, the less we care about stalling on dirty pages > during reclaim. i.e. faster storage == shorter stalls. We really > should stop thinking we need to optimise reclaim purely for the > benefit of slow disks. 500MB/s write speed with latencies of a > under a couple of milliseconds is common hardware these days. pcie > based storage (e.g. m2, nvme) is rapidly becoming commonplace and > they can easily do 1-2GB/s write speeds. > I partially agree. I've been of the opinion for a long time that dirty_time would be desirable and limit the amount of dirty data by microseconds required to sync the data and pick a default like 5 seconds. It's non-trivial as the write speed of all BDIs would have to be estimated and on rotary storage the estimate would be unreliable. A short-term practical idea would be to distribute pages for writing only when the dirty limit is almost reached on a given node. For fast storage, the distribution may never happen. Neither idea would actually impact the current problem though unless it was combined with discarding clean cache agressively if the underlying storage is fast. Hence, it would still be nice if the contention problem could be mitigated. Did that last patch help any? -- Mel Gorman SUSE Labs
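To make the "dirty_time" idea concrete, the core of it is converting a time budget into a per-BDI dirty byte budget from an estimated write bandwidth. A rough illustrative sketch follows; the helper and parameter names are hypothetical, and the bandwidth figure is assumed to come from the kernel's existing per-BDI writeback bandwidth estimation (which, as noted above, is unreliable on rotational storage):

	#include <linux/math64.h>

	/*
	 * Hypothetical helper: how many bytes are we willing to have dirty
	 * against this BDI if we want them writable back within target_msecs?
	 */
	static u64 bdi_dirty_budget_bytes(u64 bw_bytes_per_sec,
					  unsigned int target_msecs)
	{
		/* dirty budget = estimated bandwidth * allowed sync time */
		return div_u64(bw_bytes_per_sec * target_msecs, 1000);
	}

With the 5 second default mentioned above and a 500MB/s device, that caps dirty data at roughly 2.5GB regardless of how much RAM the machine has.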
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 19, 2016 at 4:48 PM, Dave Chinner wrote: > > Well, it depends on the speed of the storage. The higher the speed > of the storage, the less we care about stalling on dirty pages > during reclaim Actually, that's largely true independently of the speed of the storage, I feel. On really fast storage, you might as well push it out, and buffering lots of dirty memory is pointless. And on really slow storage, buffering lots of dirty memory is absolutely *horrible* from a latency standpoint. So I don't think this is about fast-vs-slow disks. I think a lot of our "let's aggressively buffer dirty data" is entirely historical. When you had 16MB of RAM in a workstation, aggressively using half of it for writeback caches meant that you could do things like untar source trees without waiting for IO. But when you have 16GB of RAM in a workstation, and terabytes of RAM in multi-node big machines, it's kind of silly to talk about "percentages of memory available" for dirty data. I think it's likely silly to even see "one node worth of memory" as being some limiter. So I think we should try to avoid stalling on dirty pages during reclaim by simply aiming to have fewer dirty pages in the first place. Not because the stall is shorter on a fast disk, but because we just shouldn't use that much memory for dirty data. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 19, 2016 at 11:49:46AM +0100, Mel Gorman wrote: > On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote: > > It *could* be as simple/stupid as just saying "let's allocate the page > > cache for new pages from the current node" - and if the process that > > dirties pages just stays around on one single node, that might already > > be sufficient. > > > > So just for testing purposes, you could try changing that > > > > return alloc_pages(gfp, 0); > > > > in __page_cache_alloc() into something like > > > > return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, > > 0); > > > > or something. > > > > The test would be interesting but I believe that keeping heavy writers > on one node will force them to stall early on dirty balancing even if > there is plenty of free memory on other nodes. Well, it depends on the speed of the storage. The higher the speed of the storage, the less we care about stalling on dirty pages during reclaim. i.e. faster storage == shorter stalls. We really should stop thinking we need to optimise reclaim purely for the benefit of slow disks. 500MB/s write speed with latencies of under a couple of milliseconds is common hardware these days. pcie based storage (e.g. m2, nvme) is rapidly becoming commonplace and they can easily do 1-2GB/s write speeds. The fast storage devices that are arriving need to be treated more like a fast network device (e.g. a pci-e 4x nvme SSD has the throughput of 2x10GbE devices). We have to consider if buffering streaming data in the page cache for any longer than it takes to get the data to userspace or to disk is worth the cost of reclaiming it from the page cache. Really, the question that needs to be answered is this: if we can pull data from the storage at similar speeds and latencies as we can from the page cache, then *why are we caching that data*? We've already made that "don't cache for fast storage" decision in the case of pmem - the DAX IO path is slowly moving towards making full use of the mapping infrastructure for all its tracking requirements. pcie based storage is a bit slower than pmem, but the principle is the same - the storage is sufficiently fast that caching only really makes sense for data that is really hot... I think the underlying principle here is that the faster the backing device, the less we should cache and buffer the device in the OS. I suspect a good initial approximation of "stickiness" for the page cache would be the speed of writeback as measured by the BDI underlying the mapping. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote: > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > > > > Yes, we could try to batch the locking like DaveC already suggested > > > > (ie we could move the locking to the caller, and then make > > > > shrink_page_list() just try to keep the lock held for a few pages if > > > > the mapping doesn't change), and that might result in fewer crazy > > > > cacheline ping-pongs overall. But that feels like exactly the wrong > > > > kind of workaround. > > > > > > > > > > Even if such batching was implemented, it would be very specific to the > > > case of a single large file filling LRUs on multiple nodes. > > > > > > > The latest Jason Bourne movie was sufficiently bad that I spent time > > thinking how the tree_lock could be batched during reclaim. It's not > > straight-forward but this prototype did not blow up on UMA and may be > > worth considering if Dave can test either approach has a positive impact. > > SO, I just did a couple of tests. I'll call the two patches "sleepy" > for the contention backoff patch and "bourney" for the Jason Bourne > inspired batching patch. This is an average of 3 runs, overwriting > a 47GB file on a machine with 16GB RAM: > > IO throughput wall time __pv_queued_spin_lock_slowpath > vanilla 470MB/s 1m42s 25-30% > sleepy295MB/s 2m43s <1% > bourney 425MB/s 1m53s 25-30% > This is another blunt-force patch that a) stalls all but one kswapd instance when treelock contention occurs b) marks a pgdat congested when tree_lock contention is encountered which may cause direct reclaimers to wait_iff_congested until kswapd finishes balancing the node I tested this on a KVM instance running on a 4-socket box. The vCPUs were bound to pCPUs and the memory nodes in the KVM mapped to physical memory nodes. Without the patch 3% of kswapd cycles were spent on locking. With the patch, the cycle count was 0.23% xfs_io contention was reduced from 0.63% to 0.39% which is not perfect. It can be reduced by stalling all kswapd instances but then xfs_io direct reclaims and throughput drops. diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index d572b78b65e1..f6d3e886f405 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -532,6 +532,7 @@ enum pgdat_flags { * many pages under writeback */ PGDAT_RECLAIM_LOCKED, /* prevents concurrent reclaim */ + PGDAT_CONTENDED,/* kswapd contending on tree_lock */ }; static inline unsigned long zone_end_pfn(const struct zone *zone) diff --git a/mm/vmscan.c b/mm/vmscan.c index 374d95d04178..64ca2148755c 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -621,19 +621,43 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, return PAGE_CLEAN; } +static atomic_t kswapd_contended = ATOMIC_INIT(0); + /* * Same as remove_mapping, but if the page is removed from the mapping, it * gets returned with a refcount of 0. 
*/ static int __remove_mapping(struct address_space *mapping, struct page *page, - bool reclaimed) + bool reclaimed, unsigned long *nr_contended) { unsigned long flags; BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); - spin_lock_irqsave(&mapping->tree_lock, flags); + if (!nr_contended || !current_is_kswapd()) + spin_lock_irqsave(&mapping->tree_lock, flags); + else { + /* Account for trylock contentions in kswapd */ + if (!spin_trylock_irqsave(&mapping->tree_lock, flags)) { + pg_data_t *pgdat = page_pgdat(page); + int nr_kswapd; + + /* Account for contended pages and contended kswapds */ + (*nr_contended)++; + if (!test_and_set_bit(PGDAT_CONTENDED, &pgdat->flags)) + nr_kswapd = atomic_inc_return(&kswapd_contended); + else + nr_kswapd = atomic_read(&kswapd_contended); + BUG_ON(nr_kswapd > nr_online_nodes || nr_kswapd < 0); + + /* Stall kswapd if multiple kswapds are contending */ + if (nr_kswapd > 1) + congestion_wait(BLK_RW_ASYNC, HZ/10); + + spin_lock_irqsave(&mapping->tree_lock, flags); + } + } /* * The non racy check for a busy page. * @@ -719,7 +743,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, */ int remove_mapping(struct address_space *mapping, struct page *page) {
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 03:25:40PM -0700, Linus Torvalds wrote: > >> In fact, looking at the __page_cache_alloc(), we already have that > >> "spread pages out" logic. I'm assuming Dave doesn't actually have that > >> bit set (I don't think it's the default), but I'm also envisioning > >> that maybe we could extend on that notion, and try to spread out > >> allocations in general, but keep page allocations from one particular > >> mapping within one node. > > > > CONFIG_CPUSETS=y > > > > But I don't have any cpusets configured (unless systemd is doing > > something wacky under the covers) so the page spread bit should not > > be set. > > Yeah, but even when it's not set we just do a generic alloc_pages(), > which is just going to fill up all nodes. Not perhaps quite as "spread > out", but there's obviously no attempt to try to be node-aware either. > There is a slight difference. Reads should fill the nodes in turn but dirty pages (__GFP_WRITE) get distributed to balance the number of dirty pages on each node to avoid hitting dirty balance limits prematurely. Yesterday I tried a patch that avoids distributing to remote nodes close to the high watermark to avoid waking remote kswapd instances. It added a lot of overhead to the fast path (3%) which hurts every writer but did not reduce contention enough on the special case of writing a single large file. As an aside, the dirty distribution check itself is very expensive so I prototyped something that does the expensive calculations on a vmstat update. Not sure if it'll work but it's a side issue. > So _if_ we come up with some reasonable way to say "let's keep the > pages of this mapping together", we could try to do it in that > numa-aware __page_cache_alloc(). > > It *could* be as simple/stupid as just saying "let's allocate the page > cache for new pages from the current node" - and if the process that > dirties pages just stays around on one single node, that might already > be sufficient. > > So just for testing purposes, you could try changing that > > return alloc_pages(gfp, 0); > > in __page_cache_alloc() into something like > > return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 0); > > or something. > The test would be interesting but I believe that keeping heavy writers on one node will force them to stall early on dirty balancing even if there is plenty of free memory on other nodes. -- Mel Gorman SUSE Labs
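For context, the "dirty distribution check" is the per-node dirty limit consulted by node_dirty_ok() on every __GFP_WRITE page cache allocation (it shows up at ~1.9% in Dave's perf top output earlier in the thread). One plausible shape of the prototype Mel describes, with hypothetical field and function names and no claim that this is his actual patch, is to precompute the headroom from the periodic vmstat updater so the allocation path only reads already-summed counters:

	/*
	 * Sketch only: pgdat->cached_dirty_limit is a hypothetical field,
	 * refreshed from vmstat_update() rather than recomputed on every
	 * allocation.  Counter names follow the 4.8 node-lru statistics.
	 */
	static bool node_dirty_ok_cached(struct pglist_data *pgdat)
	{
		unsigned long limit = READ_ONCE(pgdat->cached_dirty_limit);
		unsigned long dirty = node_page_state(pgdat, NR_FILE_DIRTY) +
				      node_page_state(pgdat, NR_UNSTABLE_NFS) +
				      node_page_state(pgdat, NR_WRITEBACK);

		return dirty <= limit;
	}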
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu 18-08-16 15:25:40, Linus Torvalds wrote: [...] > So just for testing purposes, you could try changing that > > return alloc_pages(gfp, 0); > > in __page_cache_alloc() into something like > > return alloc_pages_node(cpu_to_node(raw_smp_processor_id()), gfp, 0); That would break mempolicies AFAICS. Anyway, I might be missing something (the mempolicy code has a strange sense of aesthetics) but the normal case without any explicit mempolicy should use default_policy, which is MPOL_PREFERRED with MPOL_F_LOCAL and therefore means numa_node_id(), i.e. the local node. So the above two should do the same thing unless I am missing something. -- Michal Hocko SUSE Labs
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 10:55:01AM -0700, Linus Torvalds wrote: > On Thu, Aug 18, 2016 at 6:24 AM, Mel Gorman > wrote: > > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: > >> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode. > >> > > > > That is a terrifying "fix" for this problem. It just happens to work > > because there is no spillover to other nodes so only one kswapd instance > > is potentially active. > > Well, it may be a terrifying fix, but it does bring up an intriguing > notion: maybe what we should think about is to make the actual page > cache allocations be more "node-sticky" for a particular mapping? Not > some hard node binding, but if we were to make a single mapping *tend* > to allocate pages primarily within the same node, that would have the > kind of secondary afvantage that it would avoid the cross-node mapping > locking. For streaming or use-once IO it makes a lot of sense to restrict the locality of the page cache. The faster the IO device, the less dirty page buffering we need to maintain full device bandwidth. And the larger the machine the greater the effect of global page cache pollution on the other appplications is. > Think of it as a gentler "guiding" fix to the spinlock contention > issue than a hard hammer. > > And trying to (at least initially) keep the allocations of one > particular file to one particular node sounds like it could have other > locality advantages too. > > In fact, looking at the __page_cache_alloc(), we already have that > "spread pages out" logic. I'm assuming Dave doesn't actually have that > bit set (I don't think it's the default), but I'm also envisioning > that maybe we could extend on that notion, and try to spread out > allocations in general, but keep page allocations from one particular > mapping within one node. CONFIG_CPUSETS=y But I don't have any cpusets configured (unless systemd is doing something wacky under the covers) so the page spread bit should not be set. > The fact that zone_reclaim_mode really improves on Dave's numbers > *that* dramatically does seem to imply that there is something to be > said for this. > > We do *not* want to limit the whole page cache to a particular node - > that sounds very unreasonable in general. But limiting any particular > file mapping (by default - I'm sure there are things like databases > that just want their one DB file to take over all of memory) to a > single node sounds much less unreasonable. > > What do you guys think? Worth exploring? The problem is that whenever we turn this sort of behaviour on, some benchmark regresses because it no longer holds it's working set in the page cache, leading to the change being immediately reverted. Enterprise java benchmarks ring a bell, for some reason. Hence my comment above about needing it to be tied into specific "use-once-only" page cache behaviours. I know we have working set estimation, fadvise modes and things like readahead that help track sequential and use-once access patterns, but I'm not sure how we can tie that all together Cheers, Dave. -- Dave Chinner da...@fromorbit.com
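The use-once hinting Dave mentions already exists at the fadvise level; whether the VM can infer it automatically is the open question. As a reference point, this is roughly what a well-behaved streaming writer can do today to keep a large sequential write from polluting the page cache. It is a self-contained userspace example, not a kernel change; the file name, chunk size, and total size are arbitrary:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	#define CHUNK	(64UL << 20)	/* 64MB per write */
	#define CHUNKS	16		/* 1GB in total */

	int main(int argc, char **argv)
	{
		char *buf;
		off_t off = 0;
		int fd, i;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
		if (fd < 0 || posix_memalign((void **)&buf, 4096, CHUNK))
			return 1;
		memset(buf, 0xab, CHUNK);

		for (i = 0; i < CHUNKS; i++) {
			if (pwrite(fd, buf, CHUNK, off) != (ssize_t)CHUNK)
				return 1;
			/* kick off writeback for the chunk we just wrote... */
			sync_file_range(fd, off, CHUNK, SYNC_FILE_RANGE_WRITE);
			/* ...and drop the (hopefully clean) previous chunk */
			if (off)
				posix_fadvise(fd, off - CHUNK, CHUNK,
					      POSIX_FADV_DONTNEED);
			off += CHUNK;
		}
		close(fd);
		return 0;
	}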
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 2:19 PM, Dave Chinner wrote: > > For streaming or use-once IO it makes a lot of sense to restrict the > locality of the page cache. The faster the IO device, the less dirty > page buffering we need to maintain full device bandwidth. And the > larger the machine the greater the effect of global page cache > pollution on the other appplications is. Yes. But I agree with you that it might be very hard to actually get something that does a good job automagically. >> In fact, looking at the __page_cache_alloc(), we already have that >> "spread pages out" logic. I'm assuming Dave doesn't actually have that >> bit set (I don't think it's the default), but I'm also envisioning >> that maybe we could extend on that notion, and try to spread out >> allocations in general, but keep page allocations from one particular >> mapping within one node. > > CONFIG_CPUSETS=y > > But I don't have any cpusets configured (unless systemd is doing > something wacky under the covers) so the page spread bit should not > be set. Yeah, but even when it's not set we just do a generic alloc_pages(), which is just going to fill up all nodes. Not perhaps quite as "spread out", but there's obviously no attempt to try to be node-aware either. So _if_ we come up with some reasonable way to say "let's keep the pages of this mapping together", we could try to do it in that numa-aware __page_cache_alloc(). It *could* be as simple/stupid as just saying "let's allocate the page cache for new pages from the current node" - and if the process that dirties pages just stays around on one single node, that might already be sufficient. So just for testing purposes, you could try changing that return alloc_pages(gfp, 0); in __page_cache_alloc() into something like return alloc_pages_node(cpu_to_node(raw_smp_processor_id())), gfp, 0); or something. >> The fact that zone_reclaim_mode really improves on Dave's numbers >> *that* dramatically does seem to imply that there is something to be >> said for this. >> >> We do *not* want to limit the whole page cache to a particular node - >> that sounds very unreasonable in general. But limiting any particular >> file mapping (by default - I'm sure there are things like databases >> that just want their one DB file to take over all of memory) to a >> single node sounds much less unreasonable. >> >> What do you guys think? Worth exploring? > > The problem is that whenever we turn this sort of behaviour on, some > benchmark regresses because it no longer holds it's working set in > the page cache, leading to the change being immediately reverted. > Enterprise java benchmarks ring a bell, for some reason. Yeah. It might be ok if we limit the new behavior to just new pages that get allocated for writing, which is where we want to limit the page cache more anyway (we already have all those dirty limits etc). But from a testing standpoint, you can probably try the above "alloc_pages_node()" hack and see if it even makes a difference. It might not work, and the dirtier might be moving around too much etc. Linus
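Spelled out, the testing hack is a one-line change to the NUMA variant of __page_cache_alloc() in mm/filemap.c (the snippet quoted above carries a stray closing parenthesis; the corrected form is below). This is a sketch of the experiment only, against the 4.8-era source, and it deliberately ignores mempolicies and cpuset page spreading, as Michal points out elsewhere in the thread:

	/*
	 * Testing hack only: allocate new page cache pages on the node of
	 * the CPU doing the allocation, ignoring mempolicies and cpuset
	 * page spreading.
	 */
	struct page *__page_cache_alloc(gfp_t gfp)
	{
		return alloc_pages_node(cpu_to_node(raw_smp_processor_id()),
					gfp, 0);
	}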
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 6:24 AM, Mel Gorman wrote: > On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: >> FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode. >> > > That is a terrifying "fix" for this problem. It just happens to work > because there is no spillover to other nodes so only one kswapd instance > is potentially active. Well, it may be a terrifying fix, but it does bring up an intriguing notion: maybe what we should think about is to make the actual page cache allocations be more "node-sticky" for a particular mapping? Not some hard node binding, but if we were to make a single mapping *tend* to allocate pages primarily within the same node, that would have the kind of secondary afvantage that it would avoid the cross-node mapping locking. Think of it as a gentler "guiding" fix to the spinlock contention issue than a hard hammer. And trying to (at least initially) keep the allocations of one particular file to one particular node sounds like it could have other locality advantages too. In fact, looking at the __page_cache_alloc(), we already have that "spread pages out" logic. I'm assuming Dave doesn't actually have that bit set (I don't think it's the default), but I'm also envisioning that maybe we could extend on that notion, and try to spread out allocations in general, but keep page allocations from one particular mapping within one node. The fact that zone_reclaim_mode really improves on Dave's numbers *that* dramatically does seem to imply that there is something to be said for this. We do *not* want to limit the whole page cache to a particular node - that sounds very unreasonable in general. But limiting any particular file mapping (by default - I'm sure there are things like databases that just want their one DB file to take over all of memory) to a single node sounds much less unreasonable. What do you guys think? Worth exploring? Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 05:11:11PM +1000, Dave Chinner wrote: > On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote: > > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > > > > Yes, we could try to batch the locking like DaveC already suggested > > > > (ie we could move the locking to the caller, and then make > > > > shrink_page_list() just try to keep the lock held for a few pages if > > > > the mapping doesn't change), and that might result in fewer crazy > > > > cacheline ping-pongs overall. But that feels like exactly the wrong > > > > kind of workaround. > > > > > > > > > > Even if such batching was implemented, it would be very specific to the > > > case of a single large file filling LRUs on multiple nodes. > > > > > > > The latest Jason Bourne movie was sufficiently bad that I spent time > > thinking how the tree_lock could be batched during reclaim. It's not > > straight-forward but this prototype did not blow up on UMA and may be > > worth considering if Dave can test either approach has a positive impact. > > SO, I just did a couple of tests. I'll call the two patches "sleepy" > for the contention backoff patch and "bourney" for the Jason Bourne > inspired batching patch. This is an average of 3 runs, overwriting > a 47GB file on a machine with 16GB RAM: > > IO throughput wall time __pv_queued_spin_lock_slowpath > vanilla 470MB/s 1m42s 25-30% > sleepy 295MB/s 2m43s <1% > bourney 425MB/s 1m53s 25-30% >

Thanks. I updated the tests today and reran them trying to reproduce what you saw but I'm simply not seeing it on bare metal with a spinning disk.

xfsio Throughput
                      4.8.0-rc2              4.8.0-rc2              4.8.0-rc2
                        vanilla                 sleepy                bourney
Min      tput    147.4450 (  0.00%)    147.2580 (  0.13%)    147.3900 (  0.04%)
Hmean    tput    147.5853 (  0.00%)    147.5101 (  0.05%)    147.6121 ( -0.02%)
Stddev   tput      0.1041 (  0.00%)      0.1785 (-71.47%)      0.2036 (-95.63%)
CoeffVar tput      0.0705 (  0.00%)      0.1210 (-71.56%)      0.1379 (-95.59%)
Max      tput    147.6940 (  0.00%)    147.6420 (  0.04%)    147.8820 ( -0.13%)

I'm currently setting up a KVM instance that may fare better. Due to quirks of where machines are, I have to set up the KVM instance on real NUMA hardware but maybe that'll make the problem even more obvious. > The overall CPU usage of sleepy was much lower than the others, but > it was also much slower. Too much sleeping and not enough reclaim > work being done, I think. > Looks like it. On my initial test, there was barely any sleeping. > As for bourney, it's not immediately clear as to why it's nearly as > bad as the movie. At worst I would have expected it to have not > noticable impact, but maybe we are delaying freeing of pages too > long and so stalling allocation of new pages? It also doesn't do > much to reduce contention, especially considering the reduction in > throughput. > > On a hunch that the batch list isn't all one mapping, I sorted it. > Patch is below if you're curious. > The fact that sorting makes such a difference makes me think that it's the wrong direction. It's far too specific to this test case and does nothing to throttle a reclaimer. It's also fairly complex and I expected that normal users of remove_mapping such as truncation would take a hit. The hit of bouncing the lock around just hurts too much. > FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode. > That is a terrifying "fix" for this problem. It just happens to work because there is no spillover to other nodes so only one kswapd instance is potentially active. 
> Anyway, I've burnt enough erase cycles on this SSD for today > I'll continue looking at getting KVM up and running and then consider other possibilities for throttling. -- Mel Gorman SUSE Labs
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 18, 2016 at 01:45:17AM +0100, Mel Gorman wrote: > On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > > > Yes, we could try to batch the locking like DaveC already suggested > > > (ie we could move the locking to the caller, and then make > > > shrink_page_list() just try to keep the lock held for a few pages if > > > the mapping doesn't change), and that might result in fewer crazy > > > cacheline ping-pongs overall. But that feels like exactly the wrong > > > kind of workaround. > > > > > > > Even if such batching was implemented, it would be very specific to the > > case of a single large file filling LRUs on multiple nodes. > > > > The latest Jason Bourne movie was sufficiently bad that I spent time > thinking how the tree_lock could be batched during reclaim. It's not > straight-forward but this prototype did not blow up on UMA and may be > worth considering if Dave can test either approach has a positive impact.

SO, I just did a couple of tests. I'll call the two patches "sleepy" for the contention backoff patch and "bourney" for the Jason Bourne inspired batching patch. This is an average of 3 runs, overwriting a 47GB file on a machine with 16GB RAM:

                IO throughput   wall time   __pv_queued_spin_lock_slowpath
vanilla         470MB/s         1m42s       25-30%
sleepy          295MB/s         2m43s       <1%
bourney         425MB/s         1m53s       25-30%

The overall CPU usage of sleepy was much lower than the others, but it was also much slower. Too much sleeping and not enough reclaim work being done, I think. As for bourney, it's not immediately clear as to why it's nearly as bad as the movie. At worst I would have expected it to have no noticeable impact, but maybe we are delaying freeing of pages too long and so stalling allocation of new pages? It also doesn't do much to reduce contention, especially considering the reduction in throughput.

On a hunch that the batch list isn't all one mapping, I sorted it. Patch is below if you're curious.

                IO throughput   wall time   __pv_queued_spin_lock_slowpath
vanilla         470MB/s         1m42s       25-30%
sleepy          295MB/s         2m43s       <1%
bourney         425MB/s         1m53s       25-30%
sorted-bourney  465MB/s         1m43s       20%

The number of reclaim batches (from multiple runs) where the sorting of the lists would have done anything is counted by list swaps (ls) being > 1.

# grep " c " /var/log/syslog |grep -v "ls 1" |wc -l
7429
# grep " c " /var/log/syslog |grep "ls 1" |wc -l
1061767

IOWs in 1.07 million batches of pages reclaimed, only ~0.695% of batches switched to a different mapping tree lock more than once. From those numbers I would not have expected sorting the page list to have any measurable impact on performance. However, performance seems very sensitive to the number of times the mapping tree lock is bounced around.

FWIW, I just remembered about /proc/sys/vm/zone_reclaim_mode.

                IO throughput   wall time   __pv_queued_spin_lock_slowpath
vanilla         470MB/s         1m42s       25-30%
zr=1            470MB/s         1m42s       2-3%

So isolating the page cache usage to a single node maintains performance and shows a significant reduction in pressure on the mapping tree lock. Same as a single node system, I'd guess.

Anyway, I've burnt enough erase cycles on this SSD for today.

-Dave. 
---
 mm/vmscan.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9261102..5cf1bd6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,8 @@
 
 #include "internal.h"
 
+#include <linux/list_sort.h>
+
 #define CREATE_TRACE_POINTS
 #include <trace/events/vmscan.h>
 
@@ -761,6 +763,17 @@ int remove_mapping(struct address_space *mapping, struct page *page)
 	return ret;
 }
 
+static int mapping_cmp(void *priv, struct list_head *a, struct list_head *b)
+{
+	struct address_space *ma = container_of(a, struct page, lru)->mapping;
+	struct address_space *mb = container_of(b, struct page, lru)->mapping;
+
+	if (ma == mb)
+		return 0;
+	if (ma > mb)
+		return 1;
+	return -1;
+}
 
 static void remove_mapping_list(struct list_head *mapping_list,
 				struct list_head *free_pages,
 				struct list_head *ret_pages)
@@ -771,12 +784,17 @@ static void remove_mapping_list(struct list_head *mapping_list,
 	LIST_HEAD(swapcache);
 	LIST_HEAD(filecache);
 	struct page *page;
+	int c = 0, ls = 0;
+
+	list_sort(NULL, mapping_list, mapping_cmp);
 
 	while (!list_empty(mapping_list)) {
+		c++;
 		page = lru_to_page(mapping_list);
 		list_del(&page->lru);
 
 		if (!mapping || page->mapping != mapping) {
+			ls++;
 			if (mapping) {
 				spin_unlock_irqrestore(
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > On Tue, Aug 16, 2016 at 10:47:36AM -0700, Linus Torvalds wrote: > > I've always preferred to see direct reclaim as the primary model for > > reclaim, partly in order to throttle the actual "bad" process, but > > also because "kswapd uses lots of CPU time" is such a nasty thing to > > even begin guessing about. > > > > While I agree that bugs with high CPU usage from kswapd are a pain, > I'm reluctant to move towards direct reclaim being the primary mode. The > stalls can be severe and there is no guarantee that the process punished > is the process responsible. I'm basing this assumption on observations > of severe performance regressions when I accidentally broke kswapd during > the development of node-lru. > > > So I have to admit to liking that "make kswapd sleep a bit if it's > > just looping" logic that got removed in that commit. > > > > It's primarily the direct reclaimer that is affected by that patch. > > > And looking at DaveC's numbers, it really feels like it's not even > > what we do inside the locked region that is the problem. Sure, > > __delete_from_page_cache() (which is most of it) is at 1.86% of CPU > > time (when including all the things it calls), but that still isn't > > all that much. Especially when compared to just: > > > >0.78% [kernel] [k] _raw_spin_unlock_irqrestore > > > > The profile is shocking for such a basic workload. I automated what Dave > described with xfs_io except that the file size is 2*RAM. The filesystem > is sized to be roughly the same size as the file to minimise variances > due to block layout. A call-graph profile collected on bare metal UMA with > numa=fake=4 and paravirt spinlocks showed > > 1.40% 0.16% kswapd1 [kernel.vmlinux][k] > _raw_spin_lock_irqsave > 1.36% 0.16% kswapd2 [kernel.vmlinux][k] > _raw_spin_lock_irqsave > 1.21% 0.12% kswapd0 [kernel.vmlinux][k] > _raw_spin_lock_irqsave > 1.12% 0.13% kswapd3 [kernel.vmlinux][k] > _raw_spin_lock_irqsave > 0.81% 0.45% xfs_io [kernel.vmlinux][k] > _raw_spin_lock_irqsave > > Those contention figures are not great but they are not terrible either. The > vmstats told me there was no direct reclaim activity so either my theory > is wrong or this machine is not reproducing the same problem Dave is seeing. No, that's roughly the same un-normalised CPU percentage I am seeing in spinlock contention. i.e. take way the idle CPU in the profile (probably upwards of 80% if it's a 16p machine), and instead look at that figure as a percentage of total CPU used by the workload. Then you'll that it's 30-40% of the total CPU consumed by the workload. > I have partial results from a 2-socket and 4-socket machine. 2-socket spends > roughtly 1.8% in _raw_spin_lock_irqsave and 4-socket spends roughtly 3%, > both with no direct reclaim. Clearly the problem gets worse the more NUMA > nodes there are but not to the same extent Dave reports. > > I believe potential reasons why I do not see the same problem as Dave are; > > 1. Different memory sizes changing timing > 2. Dave has fast storage and I'm using a spinning disk This particular is using an abused 3 year old SATA SSD that still runs at 500MB/s on sequential writes. This is "cheap desktop" capability these days and is nowhere near what I'd call "fast". > 3. Lock contention problems are magnified inside KVM > > I think 3 is a good possibility if contended locks result in expensive > exiting and reentery of the guest. 
I have a vague recollection that a > spinning vcpu exits the guest but I did not confirm that. I don't think anything like that has been implemented in the pv spinlocks yet. They just spin right now - it's the same lock implementation as the host. Also, Context switch rates measured on the host are not significantly higher than what is measured in the guest, so there doesn't appear to be any extra scheduling on the host side occurring. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
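Dave's point about comparing scheduling activity on the host and in the guest is easy to cross-check; a minimal sketch of one way to do it, assuming perf is available on both sides (the 10-second window is an arbitrary choice):

# Run the same measurement on the host and inside the guest while the
# workload is active, then compare the per-second rates.
perf stat -a -e context-switches,cpu-migrations -- sleep 10

If the host-side rate is not much higher than the guest-side rate, extra host scheduling (vcpu exits being turned into context switches) is unlikely to be the source of the contention.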
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 17, 2016 at 04:49:07PM +0100, Mel Gorman wrote: > > Yes, we could try to batch the locking like DaveC already suggested > > (ie we could move the locking to the caller, and then make > > shrink_page_list() just try to keep the lock held for a few pages if > > the mapping doesn't change), and that might result in fewer crazy > > cacheline ping-pongs overall. But that feels like exactly the wrong > > kind of workaround. > > > > Even if such batching was implemented, it would be very specific to the > case of a single large file filling LRUs on multiple nodes.

The latest Jason Bourne movie was sufficiently bad that I spent time thinking how the tree_lock could be batched during reclaim. It's not straight-forward but this prototype did not blow up on UMA and may be worth considering if Dave can test either approach has a positive impact.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 374d95d04178..926110219cd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -621,19 +621,39 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
 	return PAGE_CLEAN;
 }
 
+static void finalise_remove_mapping(struct list_head *swapcache,
+				    struct list_head *filecache,
+				    void (*freepage)(struct page *))
+{
+	struct page *page;
+
+	while (!list_empty(swapcache)) {
+		swp_entry_t swap = { .val = page_private(page) };
+		page = lru_to_page(swapcache);
+		list_del(&page->lru);
+		swapcache_free(swap);
+		set_page_private(page, 0);
+	}
+
+	while (!list_empty(filecache)) {
+		page = lru_to_page(swapcache);
+		list_del(&page->lru);
+		freepage(page);
+	}
+}
+
 /*
  * Same as remove_mapping, but if the page is removed from the mapping, it
  * gets returned with a refcount of 0.
  */
-static int __remove_mapping(struct address_space *mapping, struct page *page,
-			    bool reclaimed)
+static int __remove_mapping_page(struct address_space *mapping,
+				 struct page *page, bool reclaimed,
+				 struct list_head *swapcache,
+				 struct list_head *filecache)
 {
-	unsigned long flags;
-
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	spin_lock_irqsave(&mapping->tree_lock, flags);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -668,16 +688,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 	}
 
 	if (PageSwapCache(page)) {
-		swp_entry_t swap = { .val = page_private(page) };
+		unsigned long swapval = page_private(page);
+		swp_entry_t swap = { .val = swapval };
 		mem_cgroup_swapout(page, swap);
 		__delete_from_swap_cache(page);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-		swapcache_free(swap);
+		set_page_private(page, swapval);
+		list_add(&page->lru, swapcache);
 	} else {
-		void (*freepage)(struct page *);
 		void *shadow = NULL;
+		void (*freepage)(struct page *);
 
 		freepage = mapping->a_ops->freepage;
+
 		/*
 		 * Remember a shadow entry for reclaimed file cache in
 		 * order to detect refaults, thus thrashing, later on.
@@ -698,16 +720,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
 		    !mapping_exiting(mapping) && !dax_mapping(mapping))
 			shadow = workingset_eviction(mapping, page);
 		__delete_from_page_cache(page, shadow);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
-
-		if (freepage != NULL)
-			freepage(page);
+		if (freepage)
+			list_add(&page->lru, filecache);
 	}
 
 	return 1;
 
 cannot_free:
-	spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	return 0;
 }
 
@@ -719,16 +738,68 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
  */
 int remove_mapping(struct address_space *mapping, struct page *page)
 {
-	if (__remove_mapping(mapping, page, false)) {
+	unsigned long flags;
+	LIST_HEAD(swapcache);
+	LIST_HEAD(filecache);
+	void (*freepage)(struct page *);
+	int ret = 0;
+
+	spin_lock_irqsave(&mapping->tree_lock, flags);
+	freepage = mapping->a_ops->freepage;
+
+	if (__remove_mapping_page(mapping, page, false, &swapcache, &filecache)) {
 		/*
 		 * Unfreezing the refcount with 1 rather than 2 effectively
 		 * drops the pagecache ref for us without requiring another
 		 * atomic operation.
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed 17-08-16 17:48:25, Michal Hocko wrote: [...] > I will try to catch up with the rest of the email thread but from a > quick glance it just feels like we are doing more more work under the > lock. Hmm, so it doesn't seem to be more work in __remove_mapping as pointed out in http://lkml.kernel.org/r/20160816220250.GI16044@dastard As Mel already pointed out the LRU will be basically single mapping for this workload so any subtle change in timing might make a difference. I was looking through 4.6..4.7 and one thing that has changed is the inactive vs. active LRU size ratio. See 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list"). The machine has quite a lot of memory and so the LRUs will be large as well so I guess this could have change the timing somehow, but it feels like a wild guess so I would be careful to blame this commit... -- Michal Hocko SUSE Labs
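Michal's point about the inactive/active file LRU split is straightforward to observe while the workload runs; a minimal sketch, assuming a standard /proc/meminfo (the 5-second interval is arbitrary):

# Watch how the file LRU split evolves as the large write fills memory.
while sleep 5; do
        grep -E '^(Active|Inactive)\(file\)' /proc/meminfo
        echo ---
done

A much larger inactive:active ratio after 59dc76b0d4df would change how quickly pages cycle to the tail of the inactive list, and hence the timing of the __remove_mapping calls that hammer the tree_lock.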
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 07:03:00AM +0200, Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > Make sure you actually use "perf record -e cycles:pp" or something > > that uses PEBS to get real profiles using CPU performance counters. > > Btw., 'perf record -e cycles:pp' is the default now for modern versions > of perf tooling (on most x86 systems) - if you do 'perf record' it will > just use the most precise profiling mode available on that particular > CPU model. > > If unsure you can check the event that was used, via: > > triton:~> perf report --stdio 2>&1 | grep '# Samples' > # Samples: 27K of event 'cycles:pp' Problem here is that Dave is using a KVM thingy. Getting hardware counters in a guest is somewhat tricky but doable, but PEBS does not virtualize.
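For profiling inside a guest where PEBS is not available, the fallback is simply the non-precise cycles event; a minimal sketch (system-wide sampling for 30 seconds, both choices arbitrary):

# PEBS does not virtualize, so request plain cycles explicitly in the guest.
perf record -a -g -e cycles -- sleep 30

# Verify which event the samples were actually taken with.
perf report --stdio 2>&1 | grep '# Samples'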
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Tue, Aug 16, 2016 at 10:47:36AM -0700, Linus Torvalds wrote: > I've always preferred to see direct reclaim as the primary model for > reclaim, partly in order to throttle the actual "bad" process, but > also because "kswapd uses lots of CPU time" is such a nasty thing to > even begin guessing about. > While I agree that bugs with high CPU usage from kswapd are a pain, I'm reluctant to move towards direct reclaim being the primary mode. The stalls can be severe and there is no guarantee that the process punished is the process responsible. I'm basing this assumption on observations of severe performance regressions when I accidentally broke kswapd during the development of node-lru. > So I have to admit to liking that "make kswapd sleep a bit if it's > just looping" logic that got removed in that commit. > It's primarily the direct reclaimer that is affected by that patch. > And looking at DaveC's numbers, it really feels like it's not even > what we do inside the locked region that is the problem. Sure, > __delete_from_page_cache() (which is most of it) is at 1.86% of CPU > time (when including all the things it calls), but that still isn't > all that much. Especially when compared to just: > >0.78% [kernel] [k] _raw_spin_unlock_irqrestore > The profile is shocking for such a basic workload. I automated what Dave described with xfs_io except that the file size is 2*RAM. The filesystem is sized to be roughly the same size as the file to minimise variances due to block layout. A call-graph profile collected on bare metal UMA with numa=fake=4 and paravirt spinlocks showed 1.40% 0.16% kswapd1 [kernel.vmlinux][k] _raw_spin_lock_irqsave 1.36% 0.16% kswapd2 [kernel.vmlinux][k] _raw_spin_lock_irqsave 1.21% 0.12% kswapd0 [kernel.vmlinux][k] _raw_spin_lock_irqsave 1.12% 0.13% kswapd3 [kernel.vmlinux][k] _raw_spin_lock_irqsave 0.81% 0.45% xfs_io [kernel.vmlinux][k] _raw_spin_lock_irqsave Those contention figures are not great but they are not terrible either. The vmstats told me there was no direct reclaim activity so either my theory is wrong or this machine is not reproducing the same problem Dave is seeing. I have partial results from a 2-socket and 4-socket machine. 2-socket spends roughtly 1.8% in _raw_spin_lock_irqsave and 4-socket spends roughtly 3%, both with no direct reclaim. Clearly the problem gets worse the more NUMA nodes there are but not to the same extent Dave reports. I believe potential reasons why I do not see the same problem as Dave are; 1. Different memory sizes changing timing 2. Dave has fast storage and I'm using a spinning disk 3. Lock contention problems are magnified inside KVM I think 3 is a good possibility if contended locks result in expensive exiting and reentery of the guest. I have a vague recollection that a spinning vcpu exits the guest but I did not confirm that. I can setup a KVM instance and run the tests but it'll take a few hours and possibly will be pushed out until tomorrow. > So I'm more and more getting the feeling that it's not what we do > inside the lock that is problematic. I started out blaming memcg > accounting or something, but none of the numbers seem to back that up. > So it's primarily really just the fact that kswapd is simply hammering > on that lock way too much. > Agreed. > So yeah, I'm blaming kswapd itself doing something wrong. It's not a > problem in a single-node environment (since there's only one), but > with multiple nodes it clearly just devolves. 
> > Yes, we could try to batch the locking like DaveC already suggested > (ie we could move the locking to the caller, and then make > shrink_page_list() just try to keep the lock held for a few pages if > the mapping doesn't change), and that might result in fewer crazy > cacheline ping-pongs overall. But that feels like exactly the wrong > kind of workaround. > Even if such batching was implemented, it would be very specific to the case of a single large file filling LRUs on multiple nodes. > I'd much rather re-instate some "if kswapd is just spinning on CPU > time and not actually improving IO parallelism, kswapd should just get > the hell out" logic. > I'm having trouble right now thinking of a good way of identifying when kswapd should give up and force direct reclaim to take a hit. I'd like to pass something else by the wtf-o-meter. I had a prototype patch lying around that replaced a congestion_wait if too many LRU pages were isolated with a waitqueue for an unrelated theoretical problem. It's the bulk of the patch below but can be trivially extended for the case of tree_lock contention. The interesting part is the change to __remove_mapping. It stalls a reclaimer (direct or kswapd) if the lock is contended for either a timeout or a local reclaimer finishing some reclaim. This stalls for a "real" reason
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Tue 16-08-16 10:47:36, Linus Torvalds wrote: > Mel, > thanks for taking a look. Your theory sounds more complete than mine, > and since Dave is able to see the problem with 4.7, it would be nice > to hear about the 4.6 behavior and commit ede37713737 in particular. > > That one seems more likely to affect contention than the zone/node one > I found during the merge window anyway, since it actually removes a > sleep in kswapd during congestion. Hmm, the patch removes a short sleep from wait_iff_congested for kworkers but that cannot affect kswapd context. Then it removes wait_iff_congested from should_reclaim_retry, but that is not a kswapd path and the sleep was added in the same merge window, so it wasn't in 4.6 and it shouldn't make any difference as well. So I am not really sure how it could make any difference. I will try to catch up with the rest of the email thread but from a quick glance it just feels like we are doing more work under the lock. -- Michal Hocko SUSE Labs
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Tue, Aug 16, 2016 at 3:02 PM, Dave Chinner wrote: >> >> What does your profile show for when you actually dig into >> __remove_mapping() itself?, Looking at your flat profile, I'm assuming >> you get > > - 22.26% 0.93% [kernel] [k] __remove_mapping >- 3.86% __remove_mapping > - 18.35% _raw_spin_lock_irqsave >__pv_queued_spin_lock_slowpath > 1.32% __delete_from_page_cache > - 0.92% _raw_spin_unlock_irqrestore >__raw_callee_save___pv_queued_spin_unlock Ok, that's all very consistent with my profiles, except - obviously - for the crazy spinlock thing. One difference is that your unlock has that PV unlock thing - on raw hardware it's just a single store. But I don't think I saw the unlock_slowpath in there. There's nothing really expensive going on there that I can tell. > And the instruction level profile: Yup. The bulk is in the cmpxchg and a cache miss (it just shows up in the instruction after it: you can use "cycles:pp" to get perf to actually try to fix up the blame to the instruction that _causes_ things rather than the instruction following, but in this case it's all trivial). > It's the same code AFAICT, except the pv version jumps straight to > the "queue" case. Yes. Your profile looks perfectly fine. Most of the profile is rigth after the 'pause', which you'd expect. >From a quick look, it seems like only about 2/3rd of the time is actually spent in the "pause" loop, but the control flow is complex enough that maybe I didn't follow it right. The native case is simpler. But since I suspect that it's not so much about the spinlocked region being too costly, but just about locking too damn much), that 2/3rds actually makes sense: it's not that it's necessarily spinning waiting for the lock all that long in any individual case, it's just that the spin_lock code is called so much. So I still kind of just blame kswapd, rather than any new expense. It would be interesting to hear if Mel is right about that kswapd sleeping change between 4.6 and 4.7.. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 06:51:42PM -0700, Linus Torvalds wrote: > Anyway, including the direct reclaim call paths gets > __remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to > 0.26%. But perhaps more importlantly, looking at what __remove_mapping > actually *does* (apart from the spinlock) gives us: > > - inside remove_mapping itself (0.11% on its own - flat cost, no > child accounting) > > 48.50 │ lock cmpxchg %edx,0x1c(%rbx) > > so that's about 0.05% > > - 0.40% __delete_from_page_cache (0.22% > radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup) > > - 0.06% workingset_eviction() > > so I'm not actually seeing anything *new* expensive in there. The > __delete_from_page_cache() overhead may have changed a bit with the > tagged tree changes, but this doesn't look like memcg. > > But we clearly have very different situations. > > What does your profile show for when you actually dig into > __remove_mapping() itself?, Looking at your flat profile, I'm assuming > you get - 22.26% 0.93% [kernel] [k] __remove_mapping - 3.86% __remove_mapping - 18.35% _raw_spin_lock_irqsave __pv_queued_spin_lock_slowpath 1.32% __delete_from_page_cache - 0.92% _raw_spin_unlock_irqrestore __raw_callee_save___pv_queued_spin_unlock And the instruction level profile: . � xor%ecx,%ecx � mov%rax,%r15 0.39 � mov$0x2,%eax � lock cmpxchg %ecx,0x1c(%rbx) 32.56 � cmp$0x2,%eax � � jne12e � mov0x20(%rbx),%rax � lea-0x1(%rax),%rdx 0.39 � test $0x1,%al � cmove %rbx,%rdx � mov(%rdx),%rax 0.39 � test $0x10,%al � � jne127 � mov(%rbx),%rcx � shr$0xf,%rcx � and$0x1,%ecx � � jne14a � mov0x68(%r14),%rax 36.03 � xor%esi,%esi � test %r13b,%r13b � mov0x50(%rax),%rdx 1.16 � � jnee8 0.96 � a9: mov%rbx,%rdi . Indicates most time on the cmpxchg for the page ref followed by the grabbing on the ->freepage op vector: freepage = mapping->a_ops->freepage; > I come back to wondering whether maybe you're hitting some PV-lock problem. > > I know queued_spin_lock_slowpath() is ok. I'm not entirely sure > __pv_queued_spin_lock_slowpath() is. It's the same code AFAICT, except the pv version jumps straight to the "queue" case. > So I'd love to see you try the non-PV case, but I also think it might > be interesting to see what the instruction profile for > __pv_queued_spin_lock_slowpath() itself is. They share a lot of code > (there's some interesting #include games going on to make > queued_spin_lock_slowpath() actually *be* > __pv_queued_spin_lock_slowpath() with some magic hooks), but there > might be issues. 
0.03 � data16 data16 data16 xchg %ax,%ax � push %rbp 0.00 � mov%rsp,%rbp 0.01 � push %r15 � push %r14 � push %r13 0.01 � push %r12 � mov$0x18740,%r12 � push %rbx � mov%rdi,%rbx � sub$0x10,%rsp � add%gs:0x7ef0d0e0(%rip),%r12 � movslq 0xc(%r12),%rax 0.02 � mov%gs:0x7ef0d0db(%rip),%r15d � add$0x1,%r15d � shl$0x12,%r15d � lea0x1(%rax),%edx 0.01 � mov%edx,0xc(%r12) � mov%eax,%edx � shl$0x4,%rax � add%rax,%r12 � shl$0x10,%edx � movq $0x0,(%r12) 0.02 � or %edx,%r15d � mov%gs:0x7ef0d0ad(%rip),%eax 0.00 � movl $0x0,0x8(%r12) 0.01 � mov%eax,0x40(%r12) � movb $0x0,0x44(%r12) � mov(%rdi),%eax 0.88 � test %ax,%ax � � jne8f 0.02 � mov$0x1,%edx � lock cmpxchg %dl,(%rdi) 0.38 � test %al,%al � � je 14a 0.02 � 8f: mov%r15d,%eax � shr$0x10,%eax � xchg %ax,0x2(%rbx) 2.07 � shl$0x10,%eax � test %eax,%eax � � jne171 � movq $0x0,-0x30(%rbp) 0.02 � ac: movzbl 0x44(%r12),%eax 0.97 � mov$0x1,%r13d � mov$0x100,%r14d � cmp$0x2,%al � sete %al � movzbl %al,%eax � mov%rax,-0x38(%rbp) 0.00 � ca: movb $0x0,0x44(%r12) 0.00 � mov$0x8000,%edx � movb $0x1,0x1(%rbx) � � jmpe6 0.04 � db: pause 8.04 � sub$0x1,%edx � � je 229 � e6: movzbl (%rbx),%eax 7.54 � test %al,%al � � jnedb 0.10 � mov%r14d,%eax 0.06 � lock cmpxchg %r13w,(%rbx)
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Mel, thanks for taking a look. Your theory sounds more complete than mine, and since Dave is able to see the problem with 4.7, it would be nice to hear about the 4.6 behavior and commit ede37713737 in particular. That one seems more likely to affect contention than the zone/node one I found during the merge window anyway, since it actually removes a sleep in kswapd during congestion. I've always preferred to see direct reclaim as the primary model for reclaim, partly in order to throttle the actual "bad" process, but also because "kswapd uses lots of CPU time" is such a nasty thing to even begin guessing about. So I have to admit to liking that "make kswapd sleep a bit if it's just looping" logic that got removed in that commit. And looking at DaveC's numbers, it really feels like it's not even what we do inside the locked region that is the problem. Sure, __delete_from_page_cache() (which is most of it) is at 1.86% of CPU time (when including all the things it calls), but that still isn't all that much. Especially when compared to just: 0.78% [kernel] [k] _raw_spin_unlock_irqrestore from his flat profile. That's not some spinning wait, that's just releasing the lock with a single write (and the popf, but while that's an expensive instruction, it's just tens of cpu cycles). So I'm more and more getting the feeling that it's not what we do inside the lock that is problematic. I started out blaming memcg accounting or something, but none of the numbers seem to back that up. So it's primarily really just the fact that kswapd is simply hammering on that lock way too much. So yeah, I'm blaming kswapd itself doing something wrong. It's not a problem in a single-node environment (since there's only one), but with multiple nodes it clearly just devolves. Yes, we could try to batch the locking like DaveC already suggested (ie we could move the locking to the caller, and then make shrink_page_list() just try to keep the lock held for a few pages if the mapping doesn't change), and that might result in fewer crazy cacheline ping-pongs overall. But that feels like exactly the wrong kind of workaround. I'd much rather re-instate some "if kswapd is just spinning on CPU time and not actually improving IO parallelism, kswapd should just get the hell out" logic. Adding Michal Hocko to the participant list too, I think he's one of the gang in this area. Who else should be made aware of this thread? Minchan? Vladimir? [ I'm assuming the new people can look up this thread on lkml. Note to new people: the subject line (and about 75% of the posts) are about an unrelated AIM7 regression, but there's this sub-thread about nasty lock contention on mapping->tree_lock within that bigger context ] Linus On Tue, Aug 16, 2016 at 8:05 AM, Mel Gorman wrote: > > However, historically there have been multiple indirect throttling mechanism > that were branded as congestion control but basically said "I don't know > what's going on so it's nap time". Many of these have been removed over > time and the last major one was ede37713737 ("mm: throttle on IO only when > there are too many dirty and writeback pages"). > > Before that commit, a process that entered direct reclaim and failed to make > progress would sleep before retrying. It's possible that sleep was enough > to reduce contention by temporarily stalling the writer and letting reclaim > make progress. After that commit, it may only do a cond_resched() check > and go back to allocating/reclaiming as quickly as possible. 
This active > writer may be enough to increase contention. If so, it also means it > stops kswapd making forward progress, leading to more direct reclaim and > more contention. > > It's not a perfect theory and assumes; > > 1. The writer is direct reclaiming > 2. The writer was previously failing to __remove_mapping > 3. The writer calling congestion_wait due to __remove_mapping failing >was enough to allow kswapd or writeback to make enough progress to >avoid contention > 4. The writer staying awake allocating and dirtying pages is keeping all >the kswapd instances awake and writeback continually active and >increasing the contention overall. > > If it was possible to trigger this problem in 4.7 then it would also be > worth checking 4.6. If 4.6 is immune, check that before and after commit > ede37713737. > > -- > Mel Gorman > SUSE Labs
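Mel's first assumption, that the writer is direct reclaiming, is one the vmstat counters can confirm or rule out while the test runs; a minimal sketch, noting that the exact counter names vary between kernel versions:

# Counters that stay at zero (or static) while the write runs mean the
# writer is not entering direct reclaim.
grep -E '^(allocstall|pgscan_direct|pgsteal_direct)' /proc/vmstat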
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 04:48:36PM -0700, Linus Torvalds wrote: > On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds > wrote: > > > > None of this code is all that new, which is annoying. This must have > > gone on forever, > > ... ooh. > > Wait, I take that back. > > We actually have some very recent changes that I didn't even think > about that went into this very merge window. > > In particular, I wonder if it's all (or at least partly) due to the > new per-node LRU lists. > > So in shrink_page_list(), when kswapd is encountering a page that is > under page writeback due to page reclaim, it does: > > if (current_is_kswapd() && > PageReclaim(page) && > test_bit(PGDAT_WRITEBACK, &pgdat->flags)) { > nr_immediate++; > goto keep_locked; > I have a limited view of the full topic as I've been in meetings all day and have another 3 hours to go. I'll set time aside tomorrow to look closer but there is a theory at the end of the mail. Node-lru does alter what locks are contended and affects the timing of some issues but this spot feels like a bad fit. That logic controls whether kswapd will stall due to dirty/writeback pages reaching the tail of the LRU too quickly. It can affect lru_lock contention that may be worse with node-lru, particularly on single-node machines but a workload of a streaming writer is unlikely to hit that unless the underlying storage is extremely slow. Another alternation of node-lru potentially affects when buffer heads get stripped but that's also a poor fit. I'm not willing to rule out node-lru because it may be wishful thinking but it feels unlikely. > which basically ignores that page and puts it back on the LRU list. > > But that "is this node under writeback" is new - it now does that per > node, and it *used* to do it per zone (so it _used_ to test "is this > zone under writeback"). > Superficially, a small high zone would affect the timing of when a zone got marked congested and triggered a sleep. Sleeping avoids new pages being allocated/dirties and may reduce contention. However, quick sleeps due to small zones was offset by the fair zone allocation policy and is still offset by GFP_WRITE distributing dirty pages on different zones. The timing of when sleeps occur due to excessive dirty pages at the tail of the LRU should be roughly similar with either zone-lru or node-lru. > All the mapping pages used to be in the same zone, so I think it > effectively single-threaded the kswapd reclaim for one mapping under > reclaim writeback. But in your cases, you have multiple nodes... > > Ok, that's a lot of hand-wavy new-age crystal healing thinking. > > Really, I haven't looked at it more than "this is one thing that has > changed recently, I wonder if it changes the patterns and could > explain much higher spin_lock contention on the mapping->tree_lock". > > I'm adding Mel Gorman and his band of miscreants to the cc, so that > they can tell me that I'm full of shit, and completely missed on what > that zone->node change actually ends up meaning. > > Mel? The issue is that Dave Chinner is seeing some nasty spinlock > contention on "mapping->tree_lock": > Band Of Miscreants may be the new name for the MM track at LSF/MM. In the meantime lets try some hand waving; A single-threaded file write on a 4-node system is going to have 4 kswapd instances, writeback and potentially the writer itself all reclaiming. Given the workload, it's likely that almost all pages have the same mapping. 
As they are contending on __remove_mapping, the pages must be clean when the attempt to reclaim was made and buffers stripped. The throttling mechanisms for kswapd and direct reclaim rely on either too many pages being isolated (unlikely to fire in this case) or too many dirty/writeback pages reaching the end of the LRU. There is not a direct throttling mechanism for excessive lock contention However, historically there have been multiple indirect throttling mechanism that were branded as congestion control but basically said "I don't know what's going on so it's nap time". Many of these have been removed over time and the last major one was ede37713737 ("mm: throttle on IO only when there are too many dirty and writeback pages"). Before that commit, a process that entered direct reclaim and failed to make progress would sleep before retrying. It's possible that sleep was enough to reduce contention by temporarily stalling the writer and letting reclaim make progress. After that commit, it may only do a cond_resched() check and go back to allocating/reclaiming as quickly as possible. This active writer may be enough to increase contention. If so, it also means it stops kswapd making forward progress, leading to more direct reclaim and more contention. It's not a perfect theory and assumes; 1. The writer is direct reclaiming 2. The writer was p
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote: Snipping the long contest: I think there are three observations here: (1) removing the mark_page_accessed (which is the only significant change in the parent commit) hurts the aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. I'd still rather stick to the filemap version and let the VM people sort it out. How do the numbers for this test look for XFS vs say ext4 and btrfs? Here is a basic comparison of the 3 filesystems based on 99091700 (" Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs"). % compare -a -g 99091700659f4df965e138b38b4fa26a29b7eade -d fs xfs ext4 btrfs xfsext4 btrfs testcase/testparams/testbox -- -- --- %stddev change %stddev change %stddev \ |\ |\ 193335 -27% 141400-100% 8 GEO-MEAN aim7.jobs-per-min 267649 ± 3% -51% 130085 aim7/1BRD_48G-disk_cp-3000-performance/ivb44 485217 ± 3% 402%2434088 ± 3% 350%2184471 ± 4% aim7/1BRD_48G-disk_rd-9000-performance/ivb44 360286 -64% 130351 aim7/1BRD_48G-disk_rr-3000-performance/ivb44 338114 -78% 73280 aim7/1BRD_48G-disk_rw-3000-performance/ivb44 60130 ± 5% 361% 277035 aim7/1BRD_48G-disk_src-3000-performance/ivb44 403144 -68% 127584 aim7/1BRD_48G-disk_wrt-3000-performance/ivb44 26327 -60% 10571 aim7/1BRD_48G-sync_disk_rw-600-performance/ivb44 xfsext4 btrfs -- -- 2652 -96%118 -82%468 GEO-MEAN fsmark.files_per_sec 393 ± 4%-6%368 ± 3%10%433 ± 5% fsmark/1x-1t-1BRD_48G-4M-40G-NoSync-performance/ivb44 200 -4%191 -7%185 ± 6% fsmark/1x-1t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44 1583 ± 3% -29% 1130 -31% 1088 fsmark/1x-64t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44 21363 59% 33958 fsmark/8-1SSD-16-9B-48G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4 11033 -17% 9117 fsmark/8-1SSD-4-8K-24G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4 11833 12% 13234 fsmark/8-1SSD-4-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4 xfsext4 btrfs -- -- 2381976-100% 6598 -96% 100973 GEO-MEAN fsmark.app_overhead 564520 ± 7%21% 681192 ± 3%63% 919364 ± 3% fsmark/1x-64t-1BRD_48G-4M-40G-NoSync-performance/ivb44 860074 ± 5% 112%1820590 ± 14%47%1262443 ± 3% fsmark/1x-64t-1BRD_48G-4M-40G-fsyncBeforeClose-performance/ivb44 12232633 -18% 10085199 fsmark/8-1SSD-16-9B-48G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4 3143334 -11%2784178 fsmark/8-1SSD-4-8K-24G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4 4107347 -21%3248210 fsmark/8-1SSD-4-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4 Thanks, Fengguang --- Some less important numbers. xfsext4 btrfs -- -- 1314 222% 4225-100% 2 GEO-MEAN aim7.time.system_time 1491 ± 6% 302% 6004 aim7/1BRD_48G-disk_cp-3000-performance/ivb44 4786 ± 3% -89%502 ± 7% -87%632 ± 7% aim7/1BRD_48G-disk_rd-9000-performance/ivb44 756 689% 5971 aim7/1BRD_48G-disk_rr-3000-performance/ivb44 8911146% 11108 aim7/1BRD_48G-disk_rw-3000-performance/ivb44 940 ± 5%70% 1598 aim7/1BRD_48G-disk_src-3000-performance/ivb44 599 925% 61
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Tue, Aug 16, 2016 at 07:22:40AM +1000, Dave Chinner wrote: On Mon, Aug 15, 2016 at 10:14:55PM +0800, Fengguang Wu wrote: Hi Christoph, On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote: >Snipping the long contest: > >I think there are three observations here: > >(1) removing the mark_page_accessed (which is the only significant >change in the parent commit) hurts the >aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. >I'd still rather stick to the filemap version and let the >VM people sort it out. How do the numbers for this test >look for XFS vs say ext4 and btrfs? >(2) lots of additional spinlock contention in the new case. A quick >check shows that I fat-fingered my rewrite so that we do >the xfs_inode_set_eofblocks_tag call now for the pure lookup >case, and pretty much all new cycles come from that. >(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and >we're already doing way to many even without my little bug above. > >So I've force pushed a new version of the iomap-fixes branch with >(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a >lot less expensive slotted in before that. Would be good to see >the numbers with that. The aim7 1BRD tests finished and there are ups and downs, with overall performance remain flat. 99091700659f4df9 74a242ad94d13436a1644c0b45 bf4dc6e4ecc2a3d042029319bc testcase/testparams/testbox -- -- --- What do these commits refer to, please? They mean nothing without the commit names /me goes searching. Ok: 99091700659 is the top of Linus' tree 74a242ad94d is That's the below one's parent commit, 74a242ad94d ("xfs: make xfs_inode_set_eofblocks_tag cheaper for the common case"). Typically we'll compare a commit with its parent commit, and/or the branch's base commit, which is normally on mainline kernel. bf4dc6e4ecc is the latest in Christoph's tree (because it's mentioned below) %stddev %change %stddev %change %stddev \ |\ | \ 159926 157324 158574 GEO-MEAN aim7.jobs-per-min 70897 5% 74137 4% 73775 aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44 485217 ± 3%492431 477533 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 360451 -19% 292980 -17% 299377 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 So, why does random read go backwards by 20%? The iomap IO path patches we are testing only affect the write path, so this doesn't make a whole lot of sense. 338114 338410 5% 354078 aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 60130 ± 5% 4% 62438 5% 62923 aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44 403144 397790 410648 aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44 And this is the test the original regression was reported for: gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7 And that shows no improvement at all. The orginal regression was: 484435 ± 0% -13.3% 420004 ± 0% aim7.jobs-per-min So it's still 15% down on the orginal performance which, again, doesn't make a whole lot of sense given the improvement in so many other tests I've run Yes, same performance with 4.8-rc1 means the regression is still not back comparing to the original reported first-bad-commit's parent f0c6bcba74ac51cb ("xfs: reorder zeroing and flushing sequence in truncate") which is on 4.7-rc1. 26327 26534 26128 aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44 The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 case by 5%. 
Here are the detailed numbers: aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 Not important at all. We need the results for the disk_wrt regression we are chasing (disk_wrt-3000) so we can see how the code change affected behaviour. Yeah it may not relevant to this case study, however should help evaluate the patch in a more complete way. Here are the detailed numbers for the slowed down case: aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 99091700659f4df9 bf4dc6e4ecc2a3d042029319bc -- %stddev change %stddev \ |\ 360451 -17% 299377aim7.jobs-per-min 12806 481% 74447 aim7.time.involuntary_context_switches . 19377 459%
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 5:19 PM, Dave Chinner wrote: > >> None of this code is all that new, which is annoying. This must have >> gone on forever, > > Yes, it has been. Just worse than I've notice before, probably > because of all the stuff put under the tree lock in the past couple > of years. So this is where a good profile can matter. Particularly if it's all about kswapd, and all the contention is just from __remove_mapping(), what should matter is the "all the stuff" added *there* and absolutely nowhere else. Sadly (well, not for me), in my profiles I have --3.37%--kswapd | --3.36%--shrink_node | |--2.88%--shrink_node_memcg | | | --2.87%--shrink_inactive_list | | | |--2.55%--shrink_page_list | | | | | |--0.84%--__remove_mapping | | | | | | | |--0.37%--__delete_from_page_cache | | | | | | | | | --0.21%--radix_tree_replace_clear_tags | | | | | | | | | --0.12%--__radix_tree_lookup | | | | | | | --0.23%--_raw_spin_lock_irqsave | | | | | | | --0.11%--queued_spin_lock_slowpath | | | which is rather different from your 22% spin-lock overhead. Anyway, including the direct reclaim call paths gets __remove_mapping() a bit higher, and _raw_spin_lock_irqsave climbs to 0.26%. But perhaps more importlantly, looking at what __remove_mapping actually *does* (apart from the spinlock) gives us: - inside remove_mapping itself (0.11% on its own - flat cost, no child accounting) 48.50 │ lock cmpxchg %edx,0x1c(%rbx) so that's about 0.05% - 0.40% __delete_from_page_cache (0.22% radix_tree_replace_clear_tags, 0.13%__radix_tree_lookup) - 0.06% workingset_eviction() so I'm not actually seeing anything *new* expensive in there. The __delete_from_page_cache() overhead may have changed a bit with the tagged tree changes, but this doesn't look like memcg. But we clearly have very different situations. What does your profile show for when you actually dig into __remove_mapping() itself?, Looking at your flat profile, I'm assuming you get 1.31% [kernel] [k] __radix_tree_lookup 1.22% [kernel] [k] radix_tree_tag_set 1.14% [kernel] [k] __remove_mapping which is higher (but part of why my percentages are lower is that I have that "50% CPU used for encryption" on my machine). But I'm not seeing anything I'd attribute to "all the stuff added". For example, originally I would have blamed memcg, but that's not actually in this path at all. I come back to wondering whether maybe you're hitting some PV-lock problem. I know queued_spin_lock_slowpath() is ok. I'm not entirely sure __pv_queued_spin_lock_slowpath() is. So I'd love to see you try the non-PV case, but I also think it might be interesting to see what the instruction profile for __pv_queued_spin_lock_slowpath() itself is. They share a lot of code (there's some interesting #include games going on to make queued_spin_lock_slowpath() actually *be* __pv_queued_spin_lock_slowpath() with some magic hooks), but there might be issues. For example, if you run a virtual 16-core system on a physical machine that then doesn't consistently give 16 cores to the virtual machine, you'll get no end of hiccups. Because as mentioned, we've had bugs ("performance anomalies") there before. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 5:38 PM, Dave Chinner wrote: > > Same in 4.7 (flat profile numbers climbed higher after this > snapshot was taken, as can be seen by the callgraph numbers): Ok, so it's not the zone-vs-node thing. It's just that nobody has looked at that load in recent times. Where "recent" may be years, of course. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 5:17 PM, Dave Chinner wrote: > > Read the code, Linus? I am. It's how I came up with my current pet theory. But I don't actually have enough sane numbers to make it much more than a cute pet theory. It *might* explain why you see tons of kswap time and bad lock contention where it didn't use to exist, but .. I can't recreate the problem, and your old profiles were bad enough that they aren't really worth looking at. > Except they *aren't broken*. They are simply *less accurate* than > they could be. They are so much less accurate that quite frankly, there's no point in looking at them outside of "there is contention on the lock". And considering that the numbers didn't even change when you had spinlock debugging on, it's not the lock itself that causes this, I'm pretty sure. Because when you have normal contention due to the *locking* itself being the problem, it tends to absolutely _explode_ with the debugging spinlocks, because the lock itself becomes much more expensive. Usually super-linearly. But that wasn't the case here. The numbers stayed constant. So yeah, I started looking at bigger behavioral issues, which is why I zeroed in on that zone-vs-node change. But it might be a completely broken theory. For example, if you still have the contention when running plain 4.7, that theory was clearly complete BS. And this is where "less accurate" means that they are almost entirely useless. More detail needed. It might not be in the profiles themselves, of course. There might be other much more informative sources if you can come up with anything... Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 04:48:36PM -0700, Linus Torvalds wrote: > On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds > wrote: > > > > None of this code is all that new, which is annoying. This must have > > gone on forever, > > ... ooh. > > Wait, I take that back. > > We actually have some very recent changes that I didn't even think > about that went into this very merge window. > Mel? The issue is that Dave Chinner is seeing some nasty spinlock > contention on "mapping->tree_lock": > > > 31.18% [kernel] [k] __pv_queued_spin_lock_slowpath > > and one of the main paths is this: > > >- 30.29% kswapd > > - 30.23% shrink_node > > - 30.07% shrink_node_memcg.isra.75 > > - 30.15% shrink_inactive_list > >- 29.49% shrink_page_list > > - 22.79% __remove_mapping > > - 22.27% _raw_spin_lock_irqsave > > __pv_queued_spin_lock_slowpath > > so there's something ridiculously bad going on with a fairly simple benchmark. > > Dave's benchmark is literally just a "write a new 48GB file in > single-page chunks on a 4-node machine". Nothing odd - not rewriting > files, not seeking around, no nothing. > > You can probably recreate it with a silly > > dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile > > although Dave actually had something rather fancier, I think. 16p, 16GB RAM, fake_numa=4. Overwrite a 47GB file on a 48GB filesystem: # mkfs.xfs -f -d size=48g /dev/vdc # mount /dev/vdc /mnt/scratch # xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey Wait for memory to fill and reclaim to kick in, then look at the profile. If you run it a second time, reclaim kicks in straight away. It's not the new code in 4.8 - it reproduces on 4.7 just fine, and probably will reproduce all the way back to when the memcg-aware writeback code was added -Dave. -- Dave Chinner da...@fromorbit.com
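To capture the contention profile for this reproducer, something along these lines works; a sketch only, with the sleep lengths being arbitrary (long enough for memory to fill and reclaim to start before sampling begins):

xfs_io -f -c "pwrite 0 47g" /mnt/scratch/fooey &
sleep 120                                 # wait for memory to fill and reclaim to kick in
perf record -a -g -e cycles -- sleep 30
perf report --stdio | head -50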
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 04:20:55PM -0700, Linus Torvalds wrote: > On Mon, Aug 15, 2016 at 3:42 PM, Dave Chinner wrote: > > > > 31.18% [kernel] [k] __pv_queued_spin_lock_slowpath > >9.90% [kernel] [k] copy_user_generic_string > >3.65% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock > >2.62% [kernel] [k] __block_commit_write.isra.29 > >2.26% [kernel] [k] _raw_spin_lock_irqsave > >1.72% [kernel] [k] _raw_spin_lock > > Ok, this is more like it. > > I'd still like to see it on raw hardware, just to see if we may have a > bug in the PV code. Because that code has been buggy before. I > *thought* we fixed it, but ... > > In fact, you don't even need to do it outside of virtualization, but > with paravirt disabled (so that it runs the native non-pv locking in > the virtual machine). > > >36.60% 0.00% [kernel][k] kswapd > >- 30.29% kswapd > > - 30.23% shrink_node > > - 30.07% shrink_node_memcg.isra.75 > > - 30.15% shrink_inactive_list > >- 29.49% shrink_page_list > > - 22.79% __remove_mapping > > - 22.27% _raw_spin_lock_irqsave > > __pv_queued_spin_lock_slowpath > > How I dislike the way perf shows the call graph data... Just last week > I was talking to Arnaldo about how to better visualize the cost of > spinlocks, because the normal way "perf" shows costs is so nasty. Do not change it - it's the way call graph profiles have been presented for the past 20 years. I hate it when long standing conventions are changed because one person doesn't like them and everyone else has to relearn skills the haven't had to think about for years > What happens is that you see that 36% of CPU time is attributed to > kswapd, and then you can drill down and see where that 36% comes from. > So far so good, and that's what perf does fairly well. > > But then when you find the spinlock, you actually want to go the other > way, and instead ask it to show "who were the callers to this routine > and what were the percentages", so that you can then see whether (for > example) it's just that __remove_mapping() use that contends with > itself, or whether it's contending with the page additions or > whatever.. Um, perf already does that: - 31.55%31.55% [kernel][k] __pv_queued_spin_lock_slowpath - 19.83% ret_from_fork - kthread - 18.55% kswapd shrink_node shrink_node_memcg.isra.75 shrink_inactive_list 1.76% worker_thread process_one_work wb_workfn wb_writeback __writeback_inodes_wb writeback_sb_inodes __writeback_single_inode do_writepages xfs_vm_writepages write_cache_pages xfs_do_writepage + 5.95% __libc_pwrite I have that right here because *it's a view of the profile I've already looked at*. I didn't post it because, well, it's shorter to simply say "contention is from in kswapd". > So what I'd like to see (and this is where it becomes *so* much more > useful to be able to recreate it myself so that I can play with the > perf data several different ways) is to see what the profile looks > like in that spinlocked region. Boot your machine with "fake_numa=4", and play till you heart is content. That's all I do with my test VMs to make them exercise NUMA paths. > None of this code is all that new, which is annoying. This must have > gone on forever, Yes, it has been. Just worse than I've notice before, probably because of all the stuff put under the tree lock in the past couple of years. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
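For reference, the NUMA emulation boot parameter in mainline x86 kernels is spelled numa=fake=<N> (the spelling Mel used for his runs); a quick sketch of confirming the emulated nodes actually came up, assuming numactl is installed in the guest:

cat /proc/cmdline                         # should contain numa=fake=4
numactl --hardware                        # should report 4 nodes
ls -d /sys/devices/system/node/node*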
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 05:15:47PM -0700, Linus Torvalds wrote: > DaveC - does the spinlock contention go away if you just go back to > 4.7? If so, I think it's the new zone thing. But it would be good to > verify - maybe it's something entirely different and it goes back much > further. Same in 4.7 (flat profile numbers climbed higher after this snapshot was taken, as can be seen by the callgraph numbers): 29.47% [kernel] [k] __pv_queued_spin_lock_slowpath 11.59% [kernel] [k] copy_user_generic_string 3.13% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 2.87% [kernel] [k] __block_commit_write.isra.29 2.02% [kernel] [k] _raw_spin_lock_irqsave 1.77% [kernel] [k] get_page_from_freelist 1.36% [kernel] [k] __wake_up_bit 1.31% [kernel] [k] __radix_tree_lookup 1.22% [kernel] [k] radix_tree_tag_set 1.16% [kernel] [k] clear_page_dirty_for_io 1.14% [kernel] [k] __remove_mapping 1.14% [kernel] [k] _raw_spin_lock 1.00% [kernel] [k] zone_dirty_ok 0.95% [kernel] [k] radix_tree_tag_clear 0.90% [kernel] [k] generic_write_end 0.89% [kernel] [k] __delete_from_page_cache 0.87% [kernel] [k] unlock_page 0.86% [kernel] [k] cancel_dirty_page 0.81% [kernel] [k] up_write 0.80% [kernel] [k] ___might_sleep 0.77% [kernel] [k] _raw_spin_unlock_irqrestore 0.75% [kernel] [k] generic_perform_write 0.72% [kernel] [k] xfs_do_writepage 0.69% [kernel] [k] down_write 0.63% [kernel] [k] shrink_page_list 0.63% [kernel] [k] __xfs_get_blocks 0.61% [kernel] [k] __test_set_page_writeback 0.59% [kernel] [k] free_hot_cold_page 0.57% [kernel] [k] write_cache_pages 0.56% [kernel] [k] __radix_tree_create 0.55% [kernel] [k] __list_add 0.53% [kernel] [k] page_mapping 0.53% [kernel] [k] drop_buffers 0.51% [kernel] [k] xfs_vm_releasepage 0.51% [kernel] [k] free_pcppages_bulk 0.50% [kernel] [k] __list_del_entry 38.07%38.07% [kernel][k] __pv_queued_spin_lock_slowpath - 25.52% ret_from_fork - kthread - 24.36% kswapd shrink_zone shrink_zone_memcg.isra.73 shrink_inactive_list - 3.21% worker_thread process_one_work wb_workfn wb_writeback __writeback_inodes_wb writeback_sb_inodes __writeback_single_inode do_writepages xfs_vm_writepages write_cache_pages - 10.06% __libc_pwrite entry_SYSCALL_64_fastpath sys_pwrite64 vfs_write __vfs_write xfs_file_write_iter xfs_file_buffered_aio_write - generic_perform_write - 5.51% xfs_vm_write_begin - 4.94% grab_cache_page_write_begin pagecache_get_page 0.57% __block_write_begin create_page_buffers create_empty_buffers _raw_spin_lock __pv_queued_spin_lock_slowpath - 4.88% xfs_vm_write_end generic_write_end block_write_end __block_commit_write.isra.29 mark_buffer_dirty __set_page_dirty -Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 04:01:00PM -0700, Linus Torvalds wrote: > On Mon, Aug 15, 2016 at 3:22 PM, Dave Chinner wrote: > > > > Right, but that does not make the profile data useless, > > Yes it does. Because it basically hides everything that happens inside > the lock, which is what causes the contention in the first place. Read the code, Linus? > So stop making inane and stupid arguments, Dave. We know what happens inside the lock, and we know exactly how much it is supposed to cost. And it isn't anywhere near as much as the profiles indicate the function that contains the lock is costing. Occam's Razor leads to only one conclusion, like it or not > Your profiles are shit. Deal with it, or accept that nobody is ever > going to bother working on them because your profiles don't give > useful information. > > I see that you actually fixed your profiles, but quite frankly, the > amount of pure unadulterated crap you posted in this email is worth > reacting negatively to. I'm happy to be told that I'm wrong *when I'm wrong*, but you always say "read the code to understand a problem" rather than depending on potentially unreliable tools and debug information that is gathered. Yet when I do that using partial profile information, your reaction is to tell me I am "full of shit" because my information isn't 100% reliable? Really, Linus? > You generally make so much sense that it's shocking to see you then > make these crazy excuses for your completely broken profiles. Except they *aren't broken*. They are simply *less accurate* than they could be. That does not invalidate the profile nor does it mean that the insight it gives us into the functioning of the code is wrong. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 10:22:43AM -0700, Huang, Ying wrote: > Hi, Chinner, > > Dave Chinner writes: > > > On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote: > >> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying wrote: > >> > > >> > Here it is, > >> > >> Thanks. > >> > >> Appended is a munged "after" list, with the "before" values in > >> parenthesis. It actually looks fairly similar. > >> > >> The biggest difference is that we have "mark_page_accessed()" show up > >> after, and not before. There was also a lot of LRU noise in the > >> non-profile data. I wonder if that is the reason here: the old model > >> of using generic_perform_write/block_page_mkwrite didn't mark the > >> pages accessed, and now with iomap_file_buffered_write() they get > >> marked as active and that screws up the LRU list, and makes us not > >> flush out the dirty pages well (because they are seen as active and > >> not good for writeback), and then you get bad memory use. > >> > >> I'm not seeing anything that looks like locking-related. > > > > Not in that profile. I've been doing some local testing inside a > > 4-node fake-numa 16p/16GB RAM VM to see what I can find. > > You run the test in a virtual machine, I think that is why your perf > data looks strange (high value of _raw_spin_unlock_irqrestore). > > To setup KVM to use perf, you may refer to, > > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sect-perf-mon.html > > I haven't tested them. You may Google to find more information. Or the > perf/kvm people can give you more information. Thanks, "-cpu host" on the qemu command line works. I hate magic, undocumented(*) features that are necessary to make basic stuff work. -Dave. (*) yeah, try working this capability even exists out from the qemu/kvm man page. -- Dave Chinner da...@fromorbit.com
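For anyone else reproducing this under KVM, a sketch of the relevant qemu invocation; only -enable-kvm and -cpu host matter for exposing the PMU to the guest, the memory, CPU and disk arguments are placeholders:

qemu-system-x86_64 -enable-kvm -cpu host -smp 16 -m 16384 \
        -drive file=root.img,if=virtio \
        -drive file=scratch.img,if=virtio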
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds wrote: > > But I'll try to see what happens > on my profile, even if I can't recreate the contention itself, just > trying to see what happens inside of that region. Yeah, since I run my machines on encrypted disks, my profile shows 60% kthread, but that's just because 55% is crypto. I only have 5% in kswapd. And the spinlock doesn't even show up for me (but "__delete_from_page_cache()" does, which doesn't look unreasonable). And while the biggest reason the spinlock doesn't show up is likely simply my single-node "everything is on one die", I still think the lower kswapd CPU use might be partly due to the node-vs-zone thing. For me, with just one node, the new test_bit(PGDAT_WRITEBACK, &pgdat->flags)) { ends up being very similar to what we used to have before, ie test_bit(ZONE_WRITEBACK, &zone->flags)) { but on a multi-node machine it would be rather different. So I might never see contention anyway. The basic logic in shrink_swap_list() goes back to commit 283aba9f9e0 ("mm: vmscan: block kswapd if it is encountering pages under writeback") but it has been messed around with a lot (and something else existed there before - we've always had some "throttle kswapd so that it doesn't use insane amounts of CPU time"). DaveC - does the spinlock contention go away if you just go back to 4.7? If so, I think it's the new zone thing. But it would be good to verify - maybe it's something entirely different and it goes back much further. Mel - I may be barking up entirely the wrong tree, but it would be good if you could take a look just in case this is actually it. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 4:20 PM, Linus Torvalds wrote: > > None of this code is all that new, which is annoying. This must have > gone on forever, ... ooh. Wait, I take that back. We actually have some very recent changes that I didn't even think about that went into this very merge window. In particular, I wonder if it's all (or at least partly) due to the new per-node LRU lists. So in shrink_page_list(), when kswapd is encountering a page that is under page writeback due to page reclaim, it does: if (current_is_kswapd() && PageReclaim(page) && test_bit(PGDAT_WRITEBACK, &pgdat->flags)) { nr_immediate++; goto keep_locked; which basically ignores that page and puts it back on the LRU list. But that "is this node under writeback" is new - it now does that per node, and it *used* to do it per zone (so it _used_ to test "is this zone under writeback"). All the mapping pages used to be in the same zone, so I think it effectively single-threaded the kswapd reclaim for one mapping under reclaim writeback. But in your cases, you have multiple nodes... Ok, that's a lot of hand-wavy new-age crystal healing thinking. Really, I haven't looked at it more than "this is one thing that has changed recently, I wonder if it changes the patterns and could explain much higher spin_lock contention on the mapping->tree_lock". I'm adding Mel Gorman and his band of miscreants to the cc, so that they can tell me that I'm full of shit, and completely missed on what that zone->node change actually ends up meaning. Mel? The issue is that Dave Chinner is seeing some nasty spinlock contention on "mapping->tree_lock": > 31.18% [kernel] [k] __pv_queued_spin_lock_slowpath and one of the main paths is this: >- 30.29% kswapd > - 30.23% shrink_node > - 30.07% shrink_node_memcg.isra.75 > - 30.15% shrink_inactive_list >- 29.49% shrink_page_list > - 22.79% __remove_mapping > - 22.27% _raw_spin_lock_irqsave > __pv_queued_spin_lock_slowpath so there's something ridiculously bad going on with a fairly simple benchmark. Dave's benchmark is literally just a "write a new 48GB file in single-page chunks on a 4-node machine". Nothing odd - not rewriting files, not seeking around, no nothing. You can probably recreate it with a silly dd bs=4096 count=$((12*1024*1024)) if=/dev/zero of=bigfile although Dave actually had something rather fancier, I think. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 3:42 PM, Dave Chinner wrote: > > 31.18% [kernel] [k] __pv_queued_spin_lock_slowpath >9.90% [kernel] [k] copy_user_generic_string >3.65% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock >2.62% [kernel] [k] __block_commit_write.isra.29 >2.26% [kernel] [k] _raw_spin_lock_irqsave >1.72% [kernel] [k] _raw_spin_lock Ok, this is more like it. I'd still like to see it on raw hardware, just to see if we may have a bug in the PV code. Because that code has been buggy before. I *thought* we fixed it, but ... In fact, you don't even need to do it outside of virtualization, but with paravirt disabled (so that it runs the native non-pv locking in the virtual machine). >36.60% 0.00% [kernel][k] kswapd >- 30.29% kswapd > - 30.23% shrink_node > - 30.07% shrink_node_memcg.isra.75 > - 30.15% shrink_inactive_list >- 29.49% shrink_page_list > - 22.79% __remove_mapping > - 22.27% _raw_spin_lock_irqsave > __pv_queued_spin_lock_slowpath How I dislike the way perf shows the call graph data... Just last week I was talking to Arnaldo about how to better visualize the cost of spinlocks, because the normal way "perf" shows costs is so nasty. What happens is that you see that 36% of CPU time is attributed to kswapd, and then you can drill down and see where that 36% comes from. So far so good, and that's what perf does fairly well. But then when you find the spinlock, you actually want to go the other way, and instead ask it to show "who were the callers to this routine and what were the percentages", so that you can then see whether (for example) it's just that __remove_mapping() use that contends with itself, or whether it's contending with the page additions or whatever.. And perf makes that unnecessarily much too hard to see. So what I'd like to see (and this is where it becomes *so* much more useful to be able to recreate it myself so that I can play with the perf data several different ways) is to see what the profile looks like in that spinlocked region. Hmm. I guess you could just send me the "perf.data" and "vmlinux" files, and I can look at it that way. But I'll try to see what happens on my profile, even if I can't recreate the contention itself, just trying to see what happens inside of that region. None of this code is all that new, which is annoying. This must have gone on forever, Linus
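The caller-oriented view described above can be had from perf report's call-graph ordering options; a sketch, with the caveat that the exact -g argument spelling varies a little between perf versions:

# Order the call graph by caller, so the report shows who is taking the
# lock rather than what the locked section goes on to call.
perf report --stdio -g graph,0.5,caller

# Optionally restrict the view to the contended symbol.
perf report --stdio --no-children -S _raw_spin_lock_irqsave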
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 3:22 PM, Dave Chinner wrote: > > Right, but that does not make the profile data useless, Yes it does. Because it basically hides everything that happens inside the lock, which is what causes the contention in the first place. So stop making inane and stupid arguments, Dave. Your profiles are shit. Deal with it, or accept that nobody is ever going to bother working on them because your profiles don't give useful information. I see that you actually fixed your profiles, but quite frankly, the amount of pure unadulterated crap you posted in this email is worth reacting negatively to. You generally make so much sense that it's shocking to see you then make these crazy excuses for your completely broken profiles. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Tue, Aug 16, 2016 at 08:22:11AM +1000, Dave Chinner wrote: > On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote: > > On Aug 14, 2016 10:00 PM, "Dave Chinner" wrote: > > > > > > > What does it say if you annotate that _raw_spin_unlock_irqrestore() > > function? > > > > > >¿ > > >¿Disassembly of section load0: > > >¿ > > >¿81e628b0 : > > >¿ nop > > >¿ push %rbp > > >¿ mov%rsp,%rbp > > >¿ movb $0x0,(%rdi) > > >¿ nop > > >¿ mov%rsi,%rdi > > >¿ push %rdi > > >¿ popfq > > > 99.35 ¿ nop > > > > Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols > > are fine, but the profile is not valid. That's an interrupt point, right > > after the popf that enables interiors again. > > > > I don't know why 'perf' isn't working on your machine, but it clearly > > isn't. > > > > Has it ever worked on that machine? > > It's working the same as it's worked since I started using it many > years ago. > > > What cpu is it? > > Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz > > > Are you running in some > > virtualized environment without performance counters, perhaps? > > I've mentioned a couple of times in this thread that I'm testing > inside a VM. It's the same VM I've been running performance tests in > since early 2010. Nobody has complained that the profiles I've > posted are useless before, and not once in all that time have they > been wrong in indicating a spinning lock contention point. > > i.e. In previous cases where I've measured double digit CPU usage > numbers in a spin_unlock variant, it's always been a result of > spinlock contention. And fixing the algorithmic problem that lead to > the spinlock showing up in the profile in the first place has always > substantially improved performance and scalability. > > As such, I'm always going to treat a locking profile like that as > contention because even if it isn't contending *on my machine*, > that amount of work being done under a spinning lock is /way too > much/ and it *will* cause contention problems with larger machines. 
And, so, after helpfully being pointed at the magic kvm "-cpu host"
flag to enable access to the performance counters from the guest
(using "-e cycles", because more precise counters aren't available),
the profile looks like this:

  31.18%  [kernel]  [k] __pv_queued_spin_lock_slowpath
   9.90%  [kernel]  [k] copy_user_generic_string
   3.65%  [kernel]  [k] __raw_callee_save___pv_queued_spin_unlock
   2.62%  [kernel]  [k] __block_commit_write.isra.29
   2.26%  [kernel]  [k] _raw_spin_lock_irqsave
   1.72%  [kernel]  [k] _raw_spin_lock
   1.33%  [kernel]  [k] __wake_up_bit
   1.20%  [kernel]  [k] __radix_tree_lookup
   1.19%  [kernel]  [k] __remove_mapping
   1.12%  [kernel]  [k] __delete_from_page_cache
   0.97%  [kernel]  [k] xfs_do_writepage
   0.91%  [kernel]  [k] get_page_from_freelist
   0.90%  [kernel]  [k] up_write
   0.88%  [kernel]  [k] clear_page_dirty_for_io
   0.83%  [kernel]  [k] radix_tree_tag_set
   0.81%  [kernel]  [k] radix_tree_tag_clear
   0.80%  [kernel]  [k] down_write
   0.78%  [kernel]  [k] _raw_spin_unlock_irqrestore
   0.77%  [kernel]  [k] shrink_page_list
   0.76%  [kernel]  [k] ___might_sleep
   0.76%  [kernel]  [k] unlock_page
   0.74%  [kernel]  [k] __list_del_entry
   0.67%  [kernel]  [k] __add_to_page_cache_locked
   0.65%  [kernel]  [k] node_dirty_ok
   0.61%  [kernel]  [k] __rmqueue
   0.61%  [kernel]  [k] __block_write_begin_int
   0.61%  [kernel]  [k] cancel_dirty_page
   0.61%  [kernel]  [k] __test_set_page_writeback
   0.59%  [kernel]  [k] page_mapping
   0.57%  [kernel]  [k] __list_add
   0.56%  [kernel]  [k] free_pcppages_bulk
   0.54%  [kernel]  [k] _raw_spin_lock_irq
   0.54%  [kernel]  [k] generic_write_end
   0.51%  [kernel]  [k] drop_buffers

The call graph should be familiar by now:

  36.60%  0.00%  [kernel]  [k] kswapd
  - 30.29% kswapd
     - 30.23% shrink_node
        - 30.07% shrink_node_memcg.isra.75
           - 30.15% shrink_inactive_list
              - 29.49% shrink_page_list
                 - 22.79% __remove_mapping
                    - 22.27% _raw_spin_lock_irqsave
                         __pv_queued_spin_lock_slowpath
                    + 1.86% __delete_from_page_cache
                    + 1.27% _raw_spin_unlock_irqrestore
                 + 4.31% try_to_release_page
                 + 1.21% free_hot_cold_page_list
                   0.56% page_evictable
                0.77% isolate_lru_pages.isra.72

That sure looks like spin lock contention to me.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
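To put a shape on where those __remove_mapping() cycles go: the function takes mapping->tree_lock with interrupts disabled once for every page it strips out of the page cache, so a streaming write that pushes kswapd on several nodes against a single file serializes all of them on that one lock. The sketch below is a user-space model of that structure only, not the kernel code: the types, the pthread mutex and the refcount value are stand-ins, and the real function also deals with dirty pages, shadow entries and writeback state that are elided here.

    #include <pthread.h>
    #include <stdbool.h>

    /* Stand-ins for struct address_space and struct page. */
    struct demo_mapping {
            pthread_mutex_t tree_lock;  /* models mapping->tree_lock (IRQs off in the kernel) */
            unsigned long   nr_pages;
    };

    struct demo_page {
            struct demo_mapping *mapping;
            int refcount;
    };

    /* Roughly the shape of __remove_mapping(): one lock round trip per
     * reclaimed page, with all of the page-cache bookkeeping done while
     * the lock is held.  Every reclaim thread working against the same
     * mapping contends here. */
    static bool demo_remove_mapping(struct demo_page *page)
    {
            struct demo_mapping *mapping = page->mapping;
            bool freed = false;

            pthread_mutex_lock(&mapping->tree_lock);
            if (page->refcount == 2) {          /* models page_ref_freeze(page, 2) */
                    page->refcount = 0;
                    mapping->nr_pages--;        /* models __delete_from_page_cache() */
                    freed = true;
            }
            pthread_mutex_unlock(&mapping->tree_lock);

            return freed;
    }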
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 10:12:20PM -0700, Linus Torvalds wrote:
> On Aug 14, 2016 10:00 PM, "Dave Chinner" wrote:
> >
> > > What does it say if you annotate that _raw_spin_unlock_irqrestore()
> > > function?
> >
> > │
> > │ Disassembly of section load0:
> > │
> > │ 81e628b0 :
> > │   nop
> > │   push  %rbp
> > │   mov   %rsp,%rbp
> > │   movb  $0x0,(%rdi)
> > │   nop
> > │   mov   %rsi,%rdi
> > │   push  %rdi
> > │   popfq
> > 99.35 │   nop
>
> Yeah, that's a good disassembly of a non-debug spin unlock, and the symbols
> are fine, but the profile is not valid. That's an interrupt point, right
> after the popf that enables interrupts again.
>
> I don't know why 'perf' isn't working on your machine, but it clearly
> isn't.
>
> Has it ever worked on that machine?

It's working the same as it's worked since I started using it many
years ago.

> What cpu is it?

Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz

> Are you running in some
> virtualized environment without performance counters, perhaps?

I've mentioned a couple of times in this thread that I'm testing
inside a VM. It's the same VM I've been running performance tests in
since early 2010. Nobody has complained that the profiles I've
posted are useless before, and not once in all that time have they
been wrong in indicating a spinning lock contention point.

i.e. In previous cases where I've measured double digit CPU usage
numbers in a spin_unlock variant, it's always been a result of
spinlock contention. And fixing the algorithmic problem that led to
the spinlock showing up in the profile in the first place has always
substantially improved performance and scalability.

As such, I'm always going to treat a locking profile like that as
contention, because even if it isn't contending *on my machine*,
that amount of work being done under a spinning lock is /way too
much/ and it *will* cause contention problems with larger machines.

> It's not actually the unlock that is expensive, and there is no contention
> on the lock (if there had been, the numbers would have been entirely
> different for the debug case, which makes locking an order of magnitude
> more expensive). All the cost of everything that happened while interrupts
> were disabled is just accounted to the instruction after they were enabled
> again.

Right, but that does not make the profile data useless, nor should
you shoot the messenger because they weren't supplied with
information you think should have been in the message. The message
still says that the majority of the overhead is in __remove_mapping(),
and it's an excessive amount of work being done inside the tree_lock
with interrupts disabled.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Mon, Aug 15, 2016 at 10:14:55PM +0800, Fengguang Wu wrote:
> Hi Christoph,
>
> On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote:
> >Snipping the long contest:
> >
> >I think there are three observations here:
> >
> >(1) removing the mark_page_accessed (which is the only significant
> >change in the parent commit) hurts the
> >aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
> >I'd still rather stick to the filemap version and let the
> >VM people sort it out. How do the numbers for this test
> >look for XFS vs say ext4 and btrfs?
> >(2) lots of additional spinlock contention in the new case. A quick
> >check shows that I fat-fingered my rewrite so that we do
> >the xfs_inode_set_eofblocks_tag call now for the pure lookup
> >case, and pretty much all new cycles come from that.
> >(3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
> >we're already doing way to many even without my little bug above.
> >
> >So I've force pushed a new version of the iomap-fixes branch with
> >(2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a
> >lot less expensive slotted in before that. Would be good to see
> >the numbers with that.
>
> The aim7 1BRD tests finished and there are ups and downs, with overall
> performance remain flat.
>
> 99091700659f4df9  74a242ad94d13436a1644c0b45  bf4dc6e4ecc2a3d042029319bc  testcase/testparams/testbox
> ----------------  --------------------------  --------------------------  ---------------------------

What do these commits refer to, please? They mean nothing without the
commit names. /me goes searching. Ok:

99091700659 is the top of Linus' tree
74a242ad94d is
bf4dc6e4ecc is the latest in Christoph's tree (because it's mentioned below)

>      %stddev      %change     %stddev      %change     %stddev
>          \            |           \            |           \
>     159926                   157324                   158574     GEO-MEAN aim7.jobs-per-min
>      70897        5%          74137        4%          73775     aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44
>     485217 ± 3%              492431                   477533     aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44
>     360451      -19%         292980      -17%         299377     aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44

So, why does random read go backwards by 20%? The iomap IO path
patches we are testing only affect the write path, so this doesn't
make a whole lot of sense.

>     338114                   338410        5%         354078     aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
>      60130 ± 5%   4%          62438        5%          62923     aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44
>     403144                   397790                   410648     aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44

And this is the test the original regression was reported for:

gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

And that shows no improvement at all. The original regression was:

    484435 ± 0%  -13.3%  420004 ± 0%  aim7.jobs-per-min

So it's still 15% down on the original performance, which, again,
doesn't make a whole lot of sense given the improvement in so many
other tests I've run.

>      26327                    26534                    26128     aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44
>
> The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write
> path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44
> case by 5%. Here are the detailed numbers:
>
> aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44

Not important at all. We need the results for the disk_wrt regression
we are chasing (disk_wrt-3000) so we can see how the code change
affected behaviour.
> Here are the detailed numbers for the slowed down case:
>
> aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44
>
> 99091700659f4df9  bf4dc6e4ecc2a3d042029319bc
> ----------------  --------------------------
>      %stddev        change     %stddev
>          \             |           \
>     360451          -17%        299377          aim7.jobs-per-min
>      12806          481%         74447          aim7.time.involuntary_context_switches
>      19377          459%        108364          vmstat.system.cs
>        487 ± 89%    3e+04        26448 ± 57%    latency_stats.max.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_buf_read_map.xfs_trans_read_buf_map.xfs_read_agf.xfs_alloc_read_agf.xfs_alloc_fix_freelist.xfs_free_extent_fix_freelist.xfs_free_extent.xfs_trans_free_extent
>       1823 ± 82%    2e+06      1913796 ± 38%    latency_stats.sum.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_buf_read_map.xfs_trans_read_buf_map.xfs_read_agf.xfs_alloc_read_agf.xfs_alloc_fix_freelist.xfs_free_ext
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Christoph Hellwig writes: > Snipping the long contest: > > I think there are three observations here: > > (1) removing the mark_page_accessed (which is the only significant > change in the parent commit) hurts the > aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. > I'd still rather stick to the filemap version and let the > VM people sort it out. How do the numbers for this test > look for XFS vs say ext4 and btrfs? > (2) lots of additional spinlock contention in the new case. A quick > check shows that I fat-fingered my rewrite so that we do > the xfs_inode_set_eofblocks_tag call now for the pure lookup > case, and pretty much all new cycles come from that. > (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and > we're already doing way to many even without my little bug above. > > So I've force pushed a new version of the iomap-fixes branch with > (2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a > lot less expensive slotted in before that. Would be good to see > the numbers with that. For the original reported regression, the test result is as follow, = compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase: gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7 commit: f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 (parent of first bad commit) 68a9f5e7007c1afa2cf6830b690a90d0187c0684 (first bad commit) 99091700659f4df965e138b38b4fa26a29b7eade (base of your fixes branch) bf4dc6e4ecc2a3d042029319bc8cd4204c185610 (head of your fixes branch) f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 99091700659f4df965e138b38b bf4dc6e4ecc2a3d042029319bc -- -- -- %stddev %change %stddev %change %stddev %change %stddev \ |\ |\ |\ 484435 ± 0% -13.3% 420004 ± 0% -17.0% 402250 ± 0% -15.6% 408998 ± 0% aim7.jobs-per-min And the perf data is as follow, "perf-profile.func.cycles-pp.intel_idle": 20.25, "perf-profile.func.cycles-pp.memset_erms": 11.72, "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 8.37, "perf-profile.func.cycles-pp.__block_commit_write.isra.21": 3.49, "perf-profile.func.cycles-pp.block_write_end": 1.77, "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 1.63, "perf-profile.func.cycles-pp.unlock_page": 1.58, "perf-profile.func.cycles-pp.___might_sleep": 1.56, "perf-profile.func.cycles-pp.__block_write_begin_int": 1.33, "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 1.23, "perf-profile.func.cycles-pp.up_write": 1.21, "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.18, "perf-profile.func.cycles-pp.down_write": 1.06, "perf-profile.func.cycles-pp.mark_buffer_dirty": 0.94, "perf-profile.func.cycles-pp.generic_write_end": 0.92, "perf-profile.func.cycles-pp.__radix_tree_lookup": 0.91, "perf-profile.func.cycles-pp._raw_spin_lock": 0.81, "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 0.79, "perf-profile.func.cycles-pp.__might_sleep": 0.79, "perf-profile.func.cycles-pp.xfs_file_iomap_begin_delay.isra.9": 0.7, "perf-profile.func.cycles-pp.__list_del_entry": 0.7, "perf-profile.func.cycles-pp.vfs_write": 0.69, "perf-profile.func.cycles-pp.drop_buffers": 0.68, "perf-profile.func.cycles-pp.xfs_file_write_iter": 0.67, "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.67, Best Regards, Huang, Ying
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi, Chinner, Dave Chinner writes: > On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote: >> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying wrote: >> > >> > Here it is, >> >> Thanks. >> >> Appended is a munged "after" list, with the "before" values in >> parenthesis. It actually looks fairly similar. >> >> The biggest difference is that we have "mark_page_accessed()" show up >> after, and not before. There was also a lot of LRU noise in the >> non-profile data. I wonder if that is the reason here: the old model >> of using generic_perform_write/block_page_mkwrite didn't mark the >> pages accessed, and now with iomap_file_buffered_write() they get >> marked as active and that screws up the LRU list, and makes us not >> flush out the dirty pages well (because they are seen as active and >> not good for writeback), and then you get bad memory use. >> >> I'm not seeing anything that looks like locking-related. > > Not in that profile. I've been doing some local testing inside a > 4-node fake-numa 16p/16GB RAM VM to see what I can find. You run the test in a virtual machine, I think that is why your perf data looks strange (high value of _raw_spin_unlock_irqrestore). To setup KVM to use perf, you may refer to, https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Virtualization_Administration_Guide/sect-perf-mon.html I haven't tested them. You may Google to find more information. Or the perf/kvm people can give you more information. > I'm yet to work out how I can trigger a profile like the one that > was reported (I really need to see the event traces), but in the > mean time I found this > > Doing a large sequential single threaded buffered write using a 4k > buffer (so single page per syscall to make the XFS IO path allocator > behave the same way as in 4.7), I'm seeing a CPU profile that > indicates we have a potential mapping->tree_lock issue: > > # xfs_io -f -c "truncate 0" -c "pwrite 0 47g" /mnt/scratch/fooey > wrote 50465865728/50465865728 bytes at offset 0 > 47.000 GiB, 12320768 ops; 0:01:36.00 (499.418 MiB/sec and 127850.9132 ops/sec) > > > > 24.15% [kernel] [k] _raw_spin_unlock_irqrestore >9.67% [kernel] [k] copy_user_generic_string >5.64% [kernel] [k] _raw_spin_unlock_irq >3.34% [kernel] [k] get_page_from_freelist >2.57% [kernel] [k] mark_page_accessed >2.45% [kernel] [k] do_raw_spin_lock >1.83% [kernel] [k] shrink_page_list >1.70% [kernel] [k] free_hot_cold_page >1.26% [kernel] [k] xfs_do_writepage Best Regards, Huang, Ying
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Christoph, On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote: Snipping the long contest: I think there are three observations here: (1) removing the mark_page_accessed (which is the only significant change in the parent commit) hurts the aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. I'd still rather stick to the filemap version and let the VM people sort it out. How do the numbers for this test look for XFS vs say ext4 and btrfs? (2) lots of additional spinlock contention in the new case. A quick check shows that I fat-fingered my rewrite so that we do the xfs_inode_set_eofblocks_tag call now for the pure lookup case, and pretty much all new cycles come from that. (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and we're already doing way to many even without my little bug above. So I've force pushed a new version of the iomap-fixes branch with (2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a lot less expensive slotted in before that. Would be good to see the numbers with that. The aim7 1BRD tests finished and there are ups and downs, with overall performance remain flat. 99091700659f4df9 74a242ad94d13436a1644c0b45 bf4dc6e4ecc2a3d042029319bc testcase/testparams/testbox -- -- --- %stddev %change %stddev %change %stddev \ |\ |\ 159926 157324 158574GEO-MEAN aim7.jobs-per-min 70897 5% 74137 4% 73775 aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44 485217 ± 3%492431 477533 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 360451 -19% 292980 -17% 299377 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 338114 338410 5% 354078 aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 60130 ± 5% 4% 62438 5% 62923 aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44 403144 397790 410648 aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44 26327 26534 26128 aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb44 The new commit bf4dc6e ("xfs: rewrite and optimize the delalloc write path") improves the aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 case by 5%. Here are the detailed numbers: aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 74a242ad94d13436 bf4dc6e4ecc2a3d042029319bc -- %stddev %change %stddev \ |\ 338410 5% 354078aim7.jobs-per-min 404390 8% 435117 aim7.time.voluntary_context_switches 2502 -4% 2396aim7.time.maximum_resident_set_size 15018 -9% 13701 aim7.time.involuntary_context_switches 900 -11%801aim7.time.system_time 17432 11% 19365vmstat.system.cs 47736 ± 19% -24% 36087 interrupts.CAL:Function_call_interrupts 2129646 31%2790638proc-vmstat.pgalloc_dma32 379503 13% 429384numa-meminfo.node0.Dirty 15018 -9% 13701time.involuntary_context_switches 900 -11%801time.system_time 1560 10% 1716slabinfo.mnt_cache.active_objs 1560 10% 1716slabinfo.mnt_cache.num_objs 61.53 -4 57.45 ± 4% perf-profile.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.call_cpuidle.cpu_startup_entry 61.63 -4 57.55 ± 4% perf-profile.func.cycles-pp.intel_idle 1007188 ± 16% 156%2577911 ± 6% numa-numastat.node0.numa_miss 9662857 ± 4% -13%8420159 ± 3% numa-numastat.node0.numa_foreign 1008220 ± 16% 155%2570630 ± 6% numa-numastat.node1.numa_foreign 9664033 ± 4% -13%8413184 ± 3% numa-numastat.node1.numa_miss 26519887 ± 3%18% 31322674cpuidle.C1-IVT.time 122238 16% 142383cpuidle.C1-IVT.usage 46548 11% 51645cpuidle.C1E-IVT.usage 17253419 13% 19567582cpuidle.C3-IVT.time 86847 13% 98333cpuidle.C3-IVT.usage 482033 ± 12% 108%1000665 ± 8% numa-vmstat.node0.numa_miss 94689 14% 107744 numa-vmstat.node0.nr_zone_write_pending 94677 14% 107718numa-vmstat.node0.nr_dirty 3156643 ± 3
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 12, 2016 at 11:03:33AM -0700, Linus Torvalds wrote: On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner wrote: On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote: I don't recall having ever seen the mapping tree_lock as a contention point before, but it's not like I've tried that load either. So it might be a regression (going back long, I suspect), or just an unusual load that nobody has traditionally tested much. Single-threaded big file write one page at a time, was it? Yup. On a 4 node NUMA system. Ok, I can't see any real contention on my single-node workstation (running ext4 too, so there may be filesystem differences), but I guess that shouldn't surprise me. The cacheline bouncing just isn't expensive enough when it all stays on-die. I can see the tree_lock in my profiles (just not very high), and at least for ext4 the main caller ssems to be __set_page_dirty_nobuffers(). And yes, looking at that, the biggest cost by _far_ inside the spinlock seems to be the accounting. Which doesn't even have to be inside the mapping lock, as far as I can tell, and as far as comments go. So a stupid patch to just move the dirty page accounting to outside the spinlock might help a lot. Does this attached patch help your contention numbers? Hi Linus, The 1BRD tests finished and there are no conclusive changes. The overall aim7 jobs-per-min slightly decreases and the overall fsmark files_per_sec slightly increases, however both are small enough (less than 1%), which are kind of expected numbers. NUMA test results should be available tomorrow. 99091700659f4df9 1b5f2eb4a752e1fa7102f37545 testcase/testparams/testbox -- --- %stddev %change %stddev \ |\ 71443 71286GEO-MEAN aim7.jobs-per-min 972 961 aim7/1BRD_48G-btrfs-creat-clo-4-performance/ivb44 52205 51525 aim7/1BRD_48G-btrfs-disk_cp-1500-performance/ivb44 2184471 ± 4%-6%2051740 ± 3% aim7/1BRD_48G-btrfs-disk_rd-9000-performance/ivb44 47049 46630 ± 3% aim7/1BRD_48G-btrfs-disk_rr-1500-performance/ivb44 24932 -4% 23812 aim7/1BRD_48G-btrfs-disk_rw-1500-performance/ivb44 58845856 aim7/1BRD_48G-btrfs-disk_src-500-performance/ivb44 51430 51286 aim7/1BRD_48G-btrfs-disk_wrt-1500-performance/ivb44 218 220 aim7/1BRD_48G-btrfs-sync_disk_rw-10-performance/ivb44 22777 23199 aim7/1BRD_48G-ext4-creat-clo-1000-performance/ivb44 130085 128991 aim7/1BRD_48G-ext4-disk_cp-3000-performance/ivb44 2434088 ± 3%-8%2232211 ± 4% aim7/1BRD_48G-ext4-disk_rd-9000-performance/ivb44 130351 128977 aim7/1BRD_48G-ext4-disk_rr-3000-performance/ivb44 73280 74044 aim7/1BRD_48G-ext4-disk_rw-3000-performance/ivb44 277035 -3% 268057 aim7/1BRD_48G-ext4-disk_src-3000-performance/ivb44 127584 4% 132639 aim7/1BRD_48G-ext4-disk_wrt-3000-performance/ivb44 10571 10659 aim7/1BRD_48G-ext4-sync_disk_rw-600-performance/ivb44 36924 ± 7% 36327 aim7/1BRD_48G-f2fs-creat-clo-1500-performance/ivb44 117238 119130 aim7/1BRD_48G-f2fs-disk_cp-3000-performance/ivb44 2340512 ± 5% 2352619 ± 10% aim7/1BRD_48G-f2fs-disk_rd-9000-performance/ivb44 107506 ± 9% 7% 114869 aim7/1BRD_48G-f2fs-disk_rr-3000-performance/ivb44 105642 106835 aim7/1BRD_48G-f2fs-disk_rw-3000-performance/ivb44 26900 ± 3% 26442 ± 3% aim7/1BRD_48G-f2fs-disk_src-3000-performance/ivb44 117124 ± 3%117678 aim7/1BRD_48G-f2fs-disk_wrt-3000-performance/ivb44 36893616 aim7/1BRD_48G-f2fs-sync_disk_rw-600-performance/ivb44 70897 72758 aim7/1BRD_48G-xfs-creat-clo-1500-performance/ivb44 267649 ± 3%270867 aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44 485217 ± 3%489403 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 360451 359042 
aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 338114 336838 aim7/1BRD_48G-xfs-disk_rw-3000-performance/ivb44 60130 ± 5% 4% 62663 aim7/1BRD_48G-xfs-disk_src-3000-performance/ivb44 403144 401476 aim7/1BRD_48G-xfs-disk_wrt-3000-performance/ivb44 26327 26513 aim7/1BRD_48G-xfs-sync_disk_rw-600-performance/ivb
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
* Linus Torvalds wrote: > Make sure you actually use "perf record -e cycles:pp" or something > that uses PEBS to get real profiles using CPU performance counters. Btw., 'perf record -e cycles:pp' is the default now for modern versions of perf tooling (on most x86 systems) - if you do 'perf record' it will just use the most precise profiling mode available on that particular CPU model. If unsure you can check the event that was used, via: triton:~> perf report --stdio 2>&1 | grep '# Samples' # Samples: 27K of event 'cycles:pp' Thanks, Ingo
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 07:53:40PM -0700, Linus Torvalds wrote: > On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner wrote: > >> > >> Maybe your symbol table came from a old kernel, and functions moved > >> around enough that the profile attributions ended up bogus. > > > > No, I don't think so. I don't install symbol tables on my test VMs, > > I let /proc/kallsyms do that work for me. From an strace of 'perf > > top -U -g": > > Ok. But something is definitely wrong with your profile. > > What does it say if you annotate that _raw_spin_unlock_irqrestore() function? raw_spin_unlock_irqrestore /proc/kcore ¿ ¿ ¿ ¿Disassembly of section load0: ¿ ¿81e628b0 : ¿ nop ¿ push %rbp ¿ mov%rsp,%rbp ¿ movb $0x0,(%rdi) ¿ nop ¿ mov%rsi,%rdi ¿ push %rdi ¿ popfq 99.35 ¿ nop ¿ decl %gs:0x7e1a9bc7(%rip) 0.65 ¿¿ je 25 ¿ pop%rbp ¿¿ retq ¿25: callq 0x81002000 ¿ pop%rbp ¿¿ retq > I guarantee you that no, it's not spending 41% of time in > spin_unlock_irqrestore. That just isn't a valid profile. There's > something seriously wrong somewhere. > > The fact that you then get the same profile when you turn _off_ > CONFIG_DEBUG_SPINLOCK only proves there is something going on that is > pure garbage. > > I suspect that what you did was to edit the .config file, remove > DEBUG_SPINLOCK, and then do "make oldconfig" again. Yes. > And it got turned on again, No. I'm not that stupid - I checked: $ grep SPINLOCK .config CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y CONFIG_QUEUED_SPINLOCKS=y CONFIG_PARAVIRT_SPINLOCKS=y # CONFIG_DEBUG_SPINLOCK is not set $ > because you have one of the lock debugging > options on that force spinlock debuggin on again: > - DEBUG_WW_MUTEX_SLOWPATH > - DEBUG_LOCK_ALLOC > - PROVE_LOCKING None of which are set: $ grep 'DEBUG\|PROVE' .config |grep -v '#' CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y CONFIG_DEBUG_RODATA=y CONFIG_SLUB_DEBUG=y CONFIG_HAVE_DMA_API_DEBUG=y CONFIG_X86_DEBUGCTLMSR=y CONFIG_PM_DEBUG=y CONFIG_PM_SLEEP_DEBUG=y CONFIG_DEBUG_DEVRES=y CONFIG_PNP_DEBUG_MESSAGES=y CONFIG_XFS_DEBUG=y CONFIG_OCFS2_DEBUG_MASKLOG=y CONFIG_CIFS_DEBUG=y CONFIG_DEBUG_INFO=y CONFIG_DEBUG_FS=y CONFIG_DEBUG_KERNEL=y CONFIG_HAVE_DEBUG_KMEMLEAK=y CONFIG_DEBUG_STACK_USAGE=y CONFIG_HAVE_DEBUG_STACKOVERFLOW=y CONFIG_SCHED_DEBUG=y CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_DEBUG_BUGVERBOSE=y CONFIG_DEBUG_LIST=y CONFIG_FAULT_INJECTION_DEBUG_FS=y CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y CONFIG_DEBUG_BOOT_PARAMS=y $ > [ Light goes on ] > > Oh, no, I can see another possibility: you're not doing proper CPU > profiles, you're doing some timer-irq profile, and the reason you get > 41% on spin_unlock_irq_restore() is that that is where the interrupts > are enabled again. > > Timer-interrupt based profiles are not useful either. I've just been using whatever perf defaults to. Defaults are supposed to be useful - if they aren't then perf needs to be fixed. perf top reports this by default: Samples: 118K of event 'cpu-clock', Event count (approx.): 793748915 Overhead Shared O Symbol ¿ 34.48% [kernel] [k] _raw_spin_unlock_irqrestore ¿ 7.89% [kernel] [k] copy_user_generic_string ¿ 5.08% [kernel] [k] _raw_spin_unlock_irq ... > Make sure you actually use "perf record -e cycles:pp" or something > that uses PEBS to get real profiles using CPU performance counters. WTF is PEBS? I'm not a CPU nerd, and I certainly don't expect to have to learn all the intricacies of hardware performance counters just to profile the kernel in a correct and sane manner. That's what the *perf defaults* are supposed to do. 
Anyway: `perf top -U -e cycles:pp`:

Samples: 301K of event 'cpu-clock:ppH', Event count (approx.): 69364814
Overhead  Shared O  Symbol
  30.89%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.04%  [kernel]  [k] _raw_spin_unlock_irq
   4.08%  [kernel]  [k] copy_user_generic_string
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 7:28 PM, Dave Chinner wrote: >> >> Maybe your symbol table came from a old kernel, and functions moved >> around enough that the profile attributions ended up bogus. > > No, I don't think so. I don't install symbol tables on my test VMs, > I let /proc/kallsyms do that work for me. From an strace of 'perf > top -U -g": Ok. But something is definitely wrong with your profile. What does it say if you annotate that _raw_spin_unlock_irqrestore() function? I guarantee you that no, it's not spending 41% of time in spin_unlock_irqrestore. That just isn't a valid profile. There's something seriously wrong somewhere. The fact that you then get the same profile when you turn _off_ CONFIG_DEBUG_SPINLOCK only proves there is something going on that is pure garbage. I suspect that what you did was to edit the .config file, remove DEBUG_SPINLOCK, and then do "make oldconfig" again. And it got turned on again, because you have one of the lock debugging options on that force spinlock debuggin on again: - DEBUG_WW_MUTEX_SLOWPATH - DEBUG_LOCK_ALLOC - PROVE_LOCKING all of which would make any profiles entirely pointless. [ Light goes on ] Oh, no, I can see another possibility: you're not doing proper CPU profiles, you're doing some timer-irq profile, and the reason you get 41% on spin_unlock_irq_restore() is that that is where the interrupts are enabled again. Timer-interrupt based profiles are not useful either. Make sure you actually use "perf record -e cycles:pp" or something that uses PEBS to get real profiles using CPU performance counters. Because right now the profile data is worthless. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 06:37:33PM -0700, Linus Torvalds wrote: > On Sun, Aug 14, 2016 at 5:48 PM, Dave Chinner wrote: > >> > >> Does this attached patch help your contention numbers? > > > > No. If anything, it makes it worse. Without the patch, I was > > measuring 36-37% in _raw_spin_unlock_irqrestore. With the patch, it > > is 42-43%. Write throughtput is the same at ~505MB/s. > > Not helping any I can see, but I don't see how it could hurt... > > Did you perhaps test it together with the other patches that improved > xfs performance? If other things improve, then I'd expect the > contention to get worse. > > Not that it matters. Clearly that patch isn't even a stop-gap solution. Tried it with and without. Same result. > > There's a couple of interesting things showing up in the profile: > > > > 41.64% [kernel] [k] _raw_spin_unlock_irqrestore > > Actually, you didn't point this one out, but *this* is the real kicker. > > There's no way a *unlock* should show up that high. It's not spinning. > It's doing a single store and a pushq/popfq sequence. > > Sure, it's going to take a cross-node cachemiss in the presence of > contention, but even then it should never be more expensive than the > locking side - which will *also* do the node changes. > > So there's something really odd in your profile. I don't think that's valid. > > Maybe your symbol table came from a old kernel, and functions moved > around enough that the profile attributions ended up bogus. No, I don't think so. I don't install symbol tables on my test VMs, I let /proc/kallsyms do that work for me. From an strace of 'perf top -U -g": 18916 open("vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/boot/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/boot/vmlinux-4.8.0-rc1-dgc+", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/usr/lib/debug/boot/vmlinux-4.8.0-rc1-dgc+", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/lib/modules/4.8.0-rc1-dgc+/build/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/usr/lib/debug/lib/modules/4.8.0-rc1-dgc+/vmlinux", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/usr/lib/debug/boot/vmlinux-4.8.0-rc1-dgc+.debug", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/root/.debug/.build-id/63/aab665ce90bd81763b90ff2cf103d8e8e823bc", O_RDONLY) = -1 ENOENT (No such file or directory) 18916 open("/sys/kernel/notes", O_RDONLY) = 56 18916 read(56, "\4\0\0\0\24\0\0\0\3\0\0\0", 12) = 12 18916 read(56, "GNU\0", 4) = 4 18916 read(56, "c\252\266e\316\220\275\201v;\220\377,\361\3\330\350\350#\274", 20) = 20 18916 close(56) = 0 18916 open("/root/.debug/[kernel.kcore]/63aab665ce90bd81763b90ff2cf103d8e8e823bc", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = -1 ENOENT (No such file or directory) 18916 open("/proc/kallsyms", O_RDONLY) = 56 18916 fstat(56, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0 18916 read(56, " A irq_stack_uni"..., 1024) = 1024 18916 read(56, "a\nb8c0 A rsp_scratch"..., 1024) = 1024 18916 read(56, "000c6e0 A cmci_storm_state\n0"..., 1024) = 1024 18916 read(56, "00ccd8 A sd_llc_id\n0"..., 1024) = 1024 You can see that perf is pulling the symbol table from the running kernel, so I don't think there's a symbol mismatch here at all. > I suspect it's actually supposed to be _raw_spin_lock_irqrestore() > which is right next to that function. Although I'd actually expect > that if it's lock contention, you should see the contention mostly in > queued_spin_lock_slowpath(). 
> > Unless you have spinlock debugging turned on, in which case your > contention is all from *that*. That's possible, of course. $ grep SPINLOCK .config CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y CONFIG_QUEUED_SPINLOCKS=y CONFIG_PARAVIRT_SPINLOCKS=y CONFIG_DEBUG_SPINLOCK=y $ So, turn off CONFIG_DEBUG_SPINLOCK, and: 41.06% [kernel] [k] _raw_spin_unlock_irqrestore 7.68% [kernel] [k] copy_user_generic_string 4.52% [kernel] [k] _raw_spin_unlock_irq 2.78% [kernel] [k] _raw_spin_lock 2.30% [kernel] [k] get_page_from_freelist 2.21% [kernel] [k] native_queued_spin_lock_slowpath 2.16% [kernel] [k] __slab_free 2.12% [kernel] [k] __block_commit_write.isra.29 1.55% [kernel] [k] __list_add 1.49% [kernel] [k] shrink_page_list 1.23% [kernel] [k] free_hot_cold_page 1.14% [kernel] [k] __wake_up_bit 1.01% [kernel] [k] try_to_release_page 1.00% [kernel] [k] page_evictable 0.90% [kernel] [k] cancel_dirty_page 0.80% [kernel] [k] unlock_page 0.80% [kernel] [k] up_write 0.73% [kernel] [k] ___might_sleep 0.68% [kernel] [k] clear_page_dirty_for_io 0.64% [kernel] [k] __radix_tree_lookup 0.61% [kernel] [k] __block_write_begin_int 0.60% [kernel] [k] xfs_do_writepage 0.59% [kernel] [k] node_dirty_ok 0.55% [kernel] [k] down_writ
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 5:48 PM, Dave Chinner wrote: >> >> Does this attached patch help your contention numbers? > > No. If anything, it makes it worse. Without the patch, I was > measuring 36-37% in _raw_spin_unlock_irqrestore. With the patch, it > is 42-43%. Write throughtput is the same at ~505MB/s. Not helping any I can see, but I don't see how it could hurt... Did you perhaps test it together with the other patches that improved xfs performance? If other things improve, then I'd expect the contention to get worse. Not that it matters. Clearly that patch isn't even a stop-gap solution. > There's a couple of interesting things showing up in the profile: > > 41.64% [kernel] [k] _raw_spin_unlock_irqrestore Actually, you didn't point this one out, but *this* is the real kicker. There's no way a *unlock* should show up that high. It's not spinning. It's doing a single store and a pushq/popfq sequence. Sure, it's going to take a cross-node cachemiss in the presence of contention, but even then it should never be more expensive than the locking side - which will *also* do the node changes. So there's something really odd in your profile. I don't think that's valid. Maybe your symbol table came from a old kernel, and functions moved around enough that the profile attributions ended up bogus. I suspect it's actually supposed to be _raw_spin_lock_irqrestore() which is right next to that function. Although I'd actually expect that if it's lock contention, you should see the contention mostly in queued_spin_lock_slowpath(). Unless you have spinlock debugging turned on, in which case your contention is all from *that*. That's possible, of course. >7.92% [kernel] [k] copy_user_generic_string >5.87% [kernel] [k] _raw_spin_unlock_irq >3.18% [kernel] [k] do_raw_spin_lock >2.51% [kernel] [k] cancel_dirty_page <<< ... > Why are we even calling into cancel_dirty_page() if the page isn't > dirty? xfs_vm_release_page() won't let dirty pages through to > try_to_free_buffers(), so all this is just pure overhead for XFS. See above: there's something screwy with your profile, you should check that first. Maybe it's not actually cancel_dirty_page() but something close-by. (Although I don't see anything closeby normally, so even if the spin_unlock_irq is bogus, I think *that* part may be incorrect. Anyway, the reason you'd get cancel_dirty_page() is either due to truncate, or due to try_to_free_buffers() having dropped the buffers successfully because the filesystem had already written them out, but the page is still marked dirty. > FWIW, this is not under the mapping->tree_lock, but the profile shows > that reclaiming bufferheads is roughly 20% of all the work kswapd is > doing. Well, that may not actually be wrong. That's the most expensive part of reclaiming memory. But please double-check your profile, because something is seriously wrong in it. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 12, 2016 at 11:03:33AM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner wrote:
> > On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote:
> >>
> >> I don't recall having ever seen the mapping tree_lock as a contention
> >> point before, but it's not like I've tried that load either. So it
> >> might be a regression (going back long, I suspect), or just an unusual
> >> load that nobody has traditionally tested much.
> >>
> >> Single-threaded big file write one page at a time, was it?
> >
> > Yup. On a 4 node NUMA system.
>
> Ok, I can't see any real contention on my single-node workstation
> (running ext4 too, so there may be filesystem differences), but I
> guess that shouldn't surprise me. The cacheline bouncing just isn't
> expensive enough when it all stays on-die.
>
> I can see the tree_lock in my profiles (just not very high), and at
> least for ext4 the main caller ssems to be
> __set_page_dirty_nobuffers().
>
> And yes, looking at that, the biggest cost by _far_ inside the
> spinlock seems to be the accounting.
>
> Which doesn't even have to be inside the mapping lock, as far as I can
> tell, and as far as comments go.
>
> So a stupid patch to just move the dirty page accounting to outside
> the spinlock might help a lot.
>
> Does this attached patch help your contention numbers?

No. If anything, it makes it worse. Without the patch, I was
measuring 36-37% in _raw_spin_unlock_irqrestore. With the patch, it
is 42-43%. Write throughput is the same at ~505MB/s.

There's a couple of interesting things showing up in the profile:

  41.64%  [kernel]  [k] _raw_spin_unlock_irqrestore
   7.92%  [kernel]  [k] copy_user_generic_string
   5.87%  [kernel]  [k] _raw_spin_unlock_irq
   3.18%  [kernel]  [k] do_raw_spin_lock
   2.51%  [kernel]  [k] cancel_dirty_page      <<<
   2.35%  [kernel]  [k] get_page_from_freelist
   2.22%  [kernel]  [k] shrink_page_list
   2.04%  [kernel]  [k] __block_commit_write.isra.30
   1.40%  [kernel]  [k] xfs_vm_releasepage
   1.21%  [kernel]  [k] free_hot_cold_page
   1.17%  [kernel]  [k] delay_tsc
   1.12%  [kernel]  [k] __wake_up_bit
   0.92%  [kernel]  [k] __slab_free
   0.91%  [kernel]  [k] clear_page_dirty_for_io
   0.82%  [kernel]  [k] __radix_tree_lookup
   0.76%  [kernel]  [k] node_dirty_ok
   0.68%  [kernel]  [k] page_evictable
   0.63%  [kernel]  [k] do_raw_spin_unlock
   0.62%  [kernel]  [k] up_write
   ....

Why are we even calling into cancel_dirty_page() if the page isn't
dirty? xfs_vm_releasepage() won't let dirty pages through to
try_to_free_buffers(), so all this is just pure overhead for XFS.

FWIW, this is not under the mapping->tree_lock, but the profile shows
that reclaiming bufferheads is roughly 20% of all the work kswapd is
doing. If we take away the mapping->tree_lock contention from the
usage totals, it's actually closer to 50% of the real work kswapd is
doing.
The call graph profile is pretty grim:

  - 41.89%  0.00%  [kernel]  [k] kswapd
     - kswapd
        - 41.84% shrink_node
           - 41.61% shrink_node_memcg.isra.75
              - 41.50% shrink_inactive_list
                 - 40.21% shrink_page_list
                    - 26.47% __remove_mapping
                         26.30% _raw_spin_unlock_irqrestore
                    + 9.03% try_to_release_page
                    - 8.82% try_to_release_page
                       - 8.80% xfs_vm_releasepage
                            7.55% try_to_free_buffers
                    + 1.56% free_hot_cold_page_list
                      0.73% page_evictable
                   0.88% _raw_spin_unlock_irq

I guess now that the iomap code is in and we no longer really depend
on bufferheads in the writeback path, it's time to take the next step
in removing bufferheads from XFS altogether...

> I realize that this has nothing to do with the AIM7 regression (the
> spinlock just isn't high enough in that profile), but your contention
> numbers just aren't right, and updating accounting statistics inside a
> critical spinlock when not needed is just wrong.

Yup, but the above profile shows that the lock contention is mainly
coming from concurrent access in memory reclaim, so I don't think the
accounting has anything to do with it. Perhaps some kind of
per-mapping reclaim batching to reduce the __remove_mapping() locking
overhead is in order here. Especially as this problem will only get
worse on larger NUMA machines.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
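A minimal sketch of what such per-mapping batching could look like is below. It only illustrates the idea Dave floats here, under the assumption that shrink_page_list() could hand over a run of pages that all belong to the same mapping; the types and names are invented for the example and say nothing about how the kernel code would actually be structured.

    #include <pthread.h>
    #include <stddef.h>

    /* Toy stand-ins for struct address_space and struct page. */
    struct demo_mapping {
            pthread_mutex_t tree_lock;
            unsigned long   nr_pages;
    };

    struct demo_page {
            struct demo_mapping *mapping;
    };

    /* Page-cache removal with the tree_lock already held. */
    static void demo_delete_locked(struct demo_page *page)
    {
            page->mapping->nr_pages--;
    }

    /* Batched removal: one lock acquisition (and, in the kernel, one
     * interrupt-disable/enable pair) amortized over every page in the
     * run, instead of one per page as __remove_mapping() does today. */
    static void demo_remove_batch(struct demo_mapping *mapping,
                                  struct demo_page **pages, size_t nr)
    {
            pthread_mutex_lock(&mapping->tree_lock);
            for (size_t i = 0; i < nr; i++)
                    demo_delete_locked(pages[i]);
            pthread_mutex_unlock(&mapping->tree_lock);
    }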
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Christoph, On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote: Snipping the long contest: I think there are three observations here: (1) removing the mark_page_accessed (which is the only significant change in the parent commit) hurts the aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. I'd still rather stick to the filemap version and let the VM people sort it out. How do the numbers for this test look for XFS vs say ext4 and btrfs? We'll be able to compare between filesystems when the tests for Linus' patch finish. (2) lots of additional spinlock contention in the new case. A quick check shows that I fat-fingered my rewrite so that we do the xfs_inode_set_eofblocks_tag call now for the pure lookup case, and pretty much all new cycles come from that. (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and we're already doing way to many even without my little bug above. So I've force pushed a new version of the iomap-fixes branch with (2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a lot less expensive slotted in before that. Would be good to see the numbers with that. I just queued these jobs. The comment-out ones will be submitted as the 2nd stage when the 1st-round quick tests finish. queue=( queue -q vip --repeat-to 3 fs=xfs perf-profile.delay=1 -b hch-vfs/iomap-fixes -k bf4dc6e4ecc2a3d042029319bc8cd4204c185610 -k 74a242ad94d13436a1644c0b4586700e39871491 -k 99091700659f4df965e138b38b4fa26a29b7eade ) "${queue[@]}" -t ivb44 aim7-fs-1brd.yaml "${queue[@]}" -t ivb44 fsmark-generic-1brd.yaml "${queue[@]}" -t ivb43 fsmark-stress-journal-1brd.yaml "${queue[@]}" -t lkp-hsx02 fsmark-generic-brd-raid.yaml "${queue[@]}" -t lkp-hsw-ep4 fsmark-1ssd-nvme-small.yaml #"${queue[@]}" -t ivb43 fsmark-stress-journal-1hdd.yaml #"${queue[@]}" -t ivb44 dd-write-1hdd.yaml fsmark-generic-1hdd.yaml Thanks, Fengguang
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sun, Aug 14, 2016 at 06:17:24PM +0200, Christoph Hellwig wrote: > Snipping the long contest: > > I think there are three observations here: > > (1) removing the mark_page_accessed (which is the only significant > change in the parent commit) hurts the > aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test. > I'd still rather stick to the filemap version and let the > VM people sort it out. How do the numbers for this test > look for XFS vs say ext4 and btrfs? > (2) lots of additional spinlock contention in the new case. A quick > check shows that I fat-fingered my rewrite so that we do > the xfs_inode_set_eofblocks_tag call now for the pure lookup > case, and pretty much all new cycles come from that. > (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and > we're already doing way to many even without my little bug above. > > So I've force pushed a new version of the iomap-fixes branch with > (2) fixed, and also a little patch to xfs_inode_set_eofblocks_tag a > lot less expensive slotted in before that. Would be good to see > the numbers with that. With this new set of fixes, the 1byte write test runs ~30% faster on my test machine (130k writes/s vs 100k writes/s), and the 1k write on the pmem device runs about 10% faster (660MB/s vs 590MB/s). dbench numbers on the pmem device also go through the roof (they didn't show any regression to begin with) - 50% faster at 16 clients on a 16AG filesystem (5700MB/s vs 3800MB/s). The 10Mx4k file create fsmark workload I run (on the sparse 500TB XFS filesystem backed by a pair of SSDs) is giving the highest throughput *and* the lowest std dev I've ever recorded (55014.8+/-1.3e+04 files/s) and that shows in the runtime which also drops from 3m57s to 3m22s. So regardless of what aim7 results we get from these changes, I'll be merging them pending review and further testing... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Snipping the long contest:

I think there are three observations here:

 (1) removing the mark_page_accessed (which is the only significant
     change in the parent commit) hurts the
     aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 test.
     I'd still rather stick to the filemap version and let the
     VM people sort it out. How do the numbers for this test
     look for XFS vs say ext4 and btrfs?
 (2) lots of additional spinlock contention in the new case. A quick
     check shows that I fat-fingered my rewrite so that we do
     the xfs_inode_set_eofblocks_tag call now for the pure lookup
     case, and pretty much all new cycles come from that.
 (3) Boy, are those xfs_inode_set_eofblocks_tag calls expensive, and
     we're already doing way too many even without my little bug above.

So I've force pushed a new version of the iomap-fixes branch with
(2) fixed, and also a little patch to make xfs_inode_set_eofblocks_tag
a lot less expensive slotted in before that. Would be good to see
the numbers with that.
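For what it's worth, making a "set this tag" helper cheap when it is called over and over again usually comes down to remembering, in the in-core object itself, that the tag is already set. The sketch below shows that general idea only; the flag name, the locking and the structure layout are assumptions made up for the example, not the contents of the patch on the iomap-fixes branch.

    #include <pthread.h>

    /* Toy stand-in for an inode that may carry an "EOF blocks" tag in a
     * shared per-AG tree. */
    struct demo_inode {
            unsigned int    flags;
            pthread_mutex_t *tag_lock;  /* protects the shared tag tree */
    };

    #define DEMO_EOFBLOCKS_TAGGED  0x1u

    static void demo_tag_tree_set(struct demo_inode *ip)
    {
            (void)ip;                   /* the actual tree update is elided */
    }

    /* Cheap-on-repeat tagging: only the first call after the tag was
     * cleared pays for the lock and the tree update.  The unlocked flag
     * check is the whole point; real code would need to consider the
     * memory ordering of that read. */
    void demo_set_eofblocks_tag(struct demo_inode *ip)
    {
            if (ip->flags & DEMO_EOFBLOCKS_TAGGED)
                    return;                     /* fast path: already tagged */

            pthread_mutex_lock(ip->tag_lock);
            demo_tag_tree_set(ip);
            ip->flags |= DEMO_EOFBLOCKS_TAGGED;
            pthread_mutex_unlock(ip->tag_lock);
    }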
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Christoph, On Sun, Aug 14, 2016 at 06:51:28AM +0800, Fengguang Wu wrote: Hi Christoph, On Sun, Aug 14, 2016 at 12:15:08AM +0200, Christoph Hellwig wrote: Hi Fengguang, feel free to try this git tree: git://git.infradead.org/users/hch/vfs.git iomap-fixes I just queued some test jobs for it. % queue -q vip -t ivb44 -b hch-vfs/iomap-fixes aim7-fs-1brd.yaml fs=xfs -r3 -k fe9c2c81ed073878768785a985295cbacc349e42 -k ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 99091700659f4df965e138b38b4fa26a29b7eade That job file can be found here: https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/tree/jobs/aim7-fs-1brd.yaml It specifies a matrix of the below atom tests: wfg /c/lkp-tests% split-job jobs/aim7-fs-1brd.yaml -s 'fs: xfs' jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_src-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_rr-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_rw-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_cp-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_wrt-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-sync_disk_rw-600-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-creat-clo-1500-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_rd-9000-performance.yaml I got some results now. The several finished aim7 tests have some performance regressions for commit fe9c2c81 ("xfs: rewrite and optimize the delalloc write path") comparing to its parent commit ca2edab2e and their base mainline commit 990917006 ("Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs"). wfg@inn ~% compare -g aim7 -ai 99091700659f4df965e138b38b4fa26a29b7eade ca2edab2e1d8f30dda874b7f717c2d4664991e9b fe9c2c81ed073878768785a985295cbacc349e42 tests: 4 60 perf-index fe9c2c81ed073878768785a985295cbacc349e42 97 power-index fe9c2c81ed073878768785a985295cbacc349e42 99091700659f4df9 ca2edab2e1d8f30dda874b7f71 fe9c2c81ed073878768785a985 testcase/testparams/testbox -- -- --- %stddev %change %stddev %change %stddev \ |\ |\ 270459 272267 ± 3% -48% 139834 ± 3% aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44 473257 468546 5% 497512 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 360578 -18% 296589 -60% 144974 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 358701 -6% 335712 -40% 216057 GEO-MEAN aim7.jobs-per-min 99091700659f4df9 ca2edab2e1d8f30dda874b7f71 fe9c2c81ed073878768785a985 -- -- 48.48 48.15 36% 65.85 aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44 89.50 89.76 88.75 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 35.78 23% 43.93 76% 63.09 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 53.75 7% 57.48 33% 71.71 GEO-MEAN turbostat.%Busy 99091700659f4df9 ca2edab2e1d8f30dda874b7f71 fe9c2c81ed073878768785a985 -- -- 14391431 36% 1964 aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44 267126742650 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 1057 23% 1303 78% 1883 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 1595 7% 1708 34% 2139 GEO-MEAN turbostat.Avg_MHz 99091700659f4df9 ca2edab2e1d8f30dda874b7f71 fe9c2c81ed073878768785a985 -- -- 167 167 6%177 aim7/1BRD_48G-xfs-disk_cp-3000-performance/ivb44 175 175 176 aim7/1BRD_48G-xfs-disk_rd-9000-performance/ivb44 150 8%162 19%178 aim7/1BRD_48G-xfs-disk_rr-3000-performance/ivb44 164 168 8%177 GEO-MEAN turbostat.PkgWatt 99091700659f4df9 ca2edab2e1d8f30dda874b7f71 fe9c2c81ed073878768785a985 -- -- 10.27 10.43
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sat, Aug 13, 2016 at 02:30:54AM +0200, Christoph Hellwig wrote:
> On Fri, Aug 12, 2016 at 08:02:08PM +1000, Dave Chinner wrote:
> > Which says "no change". Oh well, back to the drawing board...
>
> I don't see how it would change thing much - for all relevant calculations
> we convert to block units first anyway.

There was definitely an off-by-one in the code, which meant for 1-byte
writes it never triggered speculative prealloc, so it was doing the
past-EOF real block check for every write. With it also passing less
than a block size, when the > XFS_ISIZE check passed, 3 out of every 4
want_preallocate checks were landing on an already allocated block,
too, so it was doing 3x as many lookups as needed for 1k writes on a
4k block size filesystem. Amongst other things...

> But the whole xfs_iomap_write_delay is a giant mess anyway. For a usual
> call we do at least four lookups in the extent btree, which seems rather
> costly. Especially given that the low-level xfs_bmap_search_extents
> interface would give us all required information in one single call.

I noticed, though I was looking for a smaller, targeted fix rather
than rewriting the whole thing. Don't get me wrong, I think it needs a
rewrite to be efficient for the iomap infrastructure, just didn't want
to do that as a regression fix if a 1-liner might be sufficient...

> Below is a patch I hacked up this morning to do just that. It passes
> xfstests, but I've not done any real benchmarking with it. If the
> reduced lookup overhead in it doesn't help enough we'll need to some
> sort of look aside cache for the information, but I hope that we
> can avoid that. And yes, it's a rather large patch - but the old
> path was so entangled that I couldn't come up with something lighter.

I'll run some tests on it. If it does solve the regression, I'm going
to hold it back until we get a decent amount of review and test
coverage on it, though...

Cheers,

Dave.
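As an aside, the failure mode described above for 1-byte writes (the write is never treated as extending past EOF, so the speculative preallocation path never triggers) is exactly what an end-inclusive versus end-exclusive mix-up in the EOF comparison produces. The snippet below only illustrates that class of off-by-one; it is not the actual xfs_iomap_write_delay() logic, and the names are made up for the example.

    #include <assert.h>
    #include <stdbool.h>

    /* Buggy: compare the last byte written (end-inclusive) against i_size
     * with '>'.  A 1-byte write starting exactly at EOF has its last byte
     * equal to i_size, so it is not seen as extending the file. */
    static bool extends_eof_buggy(long long offset, long long count, long long isize)
    {
            return offset + count - 1 > isize;  /* off by one */
    }

    /* Intended: end-exclusive comparison. */
    static bool extends_eof_fixed(long long offset, long long count, long long isize)
    {
            return offset + count > isize;
    }

    int main(void)
    {
            /* 1-byte append at EOF: the buggy check misses it, so any
             * preallocation decision keyed off it never happens. */
            assert(!extends_eof_buggy(4096, 1, 4096));
            assert(extends_eof_fixed(4096, 1, 4096));
            return 0;
    }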
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Christoph, On Sun, Aug 14, 2016 at 12:15:08AM +0200, Christoph Hellwig wrote: Hi Fengguang, feel free to try this git tree: git://git.infradead.org/users/hch/vfs.git iomap-fixes I just queued some test jobs for it. % queue -q vip -t ivb44 -b hch-vfs/iomap-fixes aim7-fs-1brd.yaml fs=xfs -r3 -k fe9c2c81ed073878768785a985295cbacc349e42 -k ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 99091700659f4df965e138b38b4fa26a29b7eade That job file can be found here: https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/tree/jobs/aim7-fs-1brd.yaml It specifies a matrix of the below atom tests: wfg /c/lkp-tests% split-job jobs/aim7-fs-1brd.yaml -s 'fs: xfs' jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_src-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_rr-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_rw-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_cp-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_wrt-3000-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-sync_disk_rw-600-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-creat-clo-1500-performance.yaml jobs/aim7-fs-1brd.yaml => ./aim7-fs-1brd-1BRD_48G-xfs-disk_rd-9000-performance.yaml If you see other suitable tests for this patch, feel free to drop me a hint. I'v queued these jobs to the other machines to make them run in parallel. % queue -q vip -t ivb43 -b hch-vfs/iomap-fixes fsmark-stress-journal-1hdd.yaml fsmark-stress-journal-1brd.yaml fs=xfs -r3 -k fe9c2c81ed073878768785a985295cbacc349e42 -k ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 99091700659f4df965e138b38b4fa26a29b7eade % queue -q vip -t ivb44 -b hch-vfs/iomap-fixes fsmark-generic-1brd.yaml dd-write-1hdd.yaml fsmark-generic-1hdd.yaml fs=xfs -r3 -k fe9c2c81ed073878768785a985295cbacc349e42 -k ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 99091700659f4df965e138b38b4fa26a29b7eade % queue -q vip -t lkp-hsx02 -b hch-vfs/iomap-fixes fsmark-generic-brd-raid.yaml fs=xfs -r3 -k fe9c2c81ed073878768785a985[0/1710]349e42 -k ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 99091700659f4df965e138b38b4fa26a29b7eade % queue -q vip -t lkp-hsw-ep4 -b hch-vfs/iomap-fixes fsmark-1ssd-nvme-small.yaml fs=xfs -r3 -k fe9c2c81ed073878768785a985295cbacc349e42 -k ca2edab2e1d8f30dda874b7f717c2d4664991e9b -k 99091700659f4df965e138b38b4fa26a29b7eade Thanks, Fengguang
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Fengguang, feel free to try this git tree: git://git.infradead.org/users/hch/vfs.git iomap-fixes
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Linus, On Fri, Aug 12, 2016 at 11:03:33AM -0700, Linus Torvalds wrote: On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner wrote: On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote: I don't recall having ever seen the mapping tree_lock as a contention point before, but it's not like I've tried that load either. So it might be a regression (going back long, I suspect), or just an unusual load that nobody has traditionally tested much. Single-threaded big file write one page at a time, was it? Yup. On a 4 node NUMA system. Ok, I can't see any real contention on my single-node workstation (running ext4 too, so there may be filesystem differences), but I guess that shouldn't surprise me. The cacheline bouncing just isn't expensive enough when it all stays on-die. I can see the tree_lock in my profiles (just not very high), and at least for ext4 the main caller ssems to be __set_page_dirty_nobuffers(). And yes, looking at that, the biggest cost by _far_ inside the spinlock seems to be the accounting. Which doesn't even have to be inside the mapping lock, as far as I can tell, and as far as comments go. So a stupid patch to just move the dirty page accounting to outside the spinlock might help a lot. Does this attached patch help your contention numbers? Adding a few people who get blamed for account_page_dirtied() and inode_attach_wb() just to make sure that nobody expected the mapping_lock spinlock to be held when calling account_page_dirtied(). I realize that this has nothing to do with the AIM7 regression (the spinlock just isn't high enough in that profile), but your contention numbers just aren't right, and updating accounting statistics inside a critical spinlock when not needed is just wrong. I'm testing this patch on top of 9909170065 ("Merge tag 'nfs-for-4.8-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs"). The BRD (Ram backed block device, drivers/block/brd.c) tests enables pretty fast IO. And the fsmark-generic-brd-raid.yaml on lkp-hsx02 will simulate 8 RAID disks on a 4-node NUMA machine. 
queue -q vip -t ivb44 -b wfg/account_page_dirtied-linus aim7-fs-1brd.yaml -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 99091700659f4df965e138b38b4fa26a29b7eade queue -q vip -t ivb43 -b wfg/account_page_dirtied-linus fsmark-stress-journal-1hdd.yaml fsmark-stress-journal-1brd.yaml -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 99091700659f4df965e138b38b4fa26a29b7eade queue -q vip -t ivb44 -b wfg/account_page_dirtied-linus fsmark-generic-1brd.yaml dd-write-1hdd.yaml fsmark-generic-1hdd.yaml -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 99091700659f4df965e138b38b4fa26a29b7eade queue -q vip -t lkp-hsx02 -b wfg/account_page_dirtied-linus fsmark-generic-brd-raid.yaml -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 99091700659f4df965e138b38b4fa26a29b7eade queue -q vip -t lkp-hsw-ep4 -b wfg/account_page_dirtied-linus fsmark-1ssd-nvme-small.yaml -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 99091700659f4df965e138b38b4fa26a29b7eade queue -q vip -t lkp-hsw-ep4 -b wfg/account_page_dirtied-linus fsmark-1ssd-nvme-small.yaml -R3 -k 1b5f2eb4a752e1fa7102f37545f92e64fabd0cf8 -k 99091700659f4df965e138b38b4fa26a29b7eade Thanks, Fengguang fs/buffer.c | 5 - fs/xfs/xfs_aops.c | 5 - mm/page-writeback.c | 2 +- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 9c8eb9b6db6a..f79a9d241589 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -628,15 +628,18 @@ static void __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { unsigned long flags; + bool account = false; spin_lock_irqsave(&mapping->tree_lock, flags); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(warn && !PageUptodate(page)); - account_page_dirtied(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + account = true; } spin_unlock_irqrestore(&mapping->tree_lock, flags); + if (account) + account_page_dirtied(page, mapping); } /* diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 7575cfc3ad15..59169c36765e 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1490,15 +1490,18 @@ xfs_vm_set_page_dirty( if (newly_dirty) { /* sigh - __set_page_dirty() is static, so copy it here, too */ unsigned long flags; + bool account = false; spin_lock_irqsave(&mapping->tree_lock, flags); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(!PageUptodate(page)); - account_page_dirtied(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Hi Christoph, On Sat, Aug 13, 2016 at 11:48:25PM +0200, Christoph Hellwig wrote: On Sat, Aug 13, 2016 at 02:30:54AM +0200, Christoph Hellwig wrote: Below is a patch I hacked up this morning to do just that. It passes xfstests, but I've not done any real benchmarking with it. If the reduced lookup overhead in it doesn't help enough we'll need to some sort of look aside cache for the information, but I hope that we can avoid that. And yes, it's a rather large patch - but the old path was so entangled that I couldn't come up with something lighter. Hi Fengguang or Xiaolong, any chance to add this thread to a lkp run? Sure. To which base should I apply it? Or if you already pushed the git tree, I'll test your commit directly. Thanks, Fengguang
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Sat, Aug 13, 2016 at 02:30:54AM +0200, Christoph Hellwig wrote: > Below is a patch I hacked up this morning to do just that. It passes > xfstests, but I've not done any real benchmarking with it. If the > reduced lookup overhead in it doesn't help enough we'll need to some > sort of look aside cache for the information, but I hope that we > can avoid that. And yes, it's a rather large patch - but the old > path was so entangled that I couldn't come up with something lighter. Hi Fengguang or Xiaolong, any chance to add this thread to a lkp run? I've played around with Dave's simplied xfs_io run, and while the end result for 1k block size looks pretty similar in terms of execution time and throughput the profiles look much better. For 512 byte or 1 byte tests the tests completes a lot faster too. Here is the perf report output for a 1k block size run, the first item directly related to the block mapping shows up is xfs_file_iomap_begin_delay at .75%. Although I'm a bit worried up up_/down_read showing up so much. While we take a ilock and iolock a lot they should be mostly uncontended for such a single threaded write, so the overhead seems a bit worrisome. (FYI, the tree this was tested on also has the mark_page_accessed and pagefault_disable fixes applied) # To display the perf.data header info, please use --header/--header-only options. # # Samples: 7K of event 'cpu-clock' # Event count (approx.): 190925 # # Overhead Command Shared Object Symbol # . . # 37.71% swapper [kernel.kallsyms] [k] native_safe_halt 9.85% kworker/u8:5 [kernel.kallsyms] [k] __copy_user_nocache 2.83%xfs_io [kernel.kallsyms] [k] copy_user_generic_string 2.33%xfs_io [kernel.kallsyms] [k] __memset 2.23%xfs_io [kernel.kallsyms] [k] __block_commit_write.isra.34 1.73%xfs_io [kernel.kallsyms] [k] down_write 1.64%xfs_io [kernel.kallsyms] [k] up_write 1.39%xfs_io [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 1.23%xfs_io [kernel.kallsyms] [k] entry_SYSCALL_64_fastpath 1.18%xfs_io [kernel.kallsyms] [k] __mark_inode_dirty 1.18%xfs_io [kernel.kallsyms] [k] _raw_spin_lock 1.15% kworker/u8:5 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 1.13%xfs_io [kernel.kallsyms] [k] __block_write_begin_int 1.10%xfs_io [kernel.kallsyms] [k] mark_buffer_dirty 1.07%xfs_io [kernel.kallsyms] [k] __radix_tree_lookup 1.01% kworker/0:2 [kernel.kallsyms] [k] end_buffer_async_write 0.97%xfs_io [kernel.kallsyms] [k] unlock_page 0.92% kworker/0:2 [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore 0.89%xfs_io [kernel.kallsyms] [k] iov_iter_copy_from_user_atomic 0.84%xfs_io [kernel.kallsyms] [k] generic_write_end 0.80%xfs_io [kernel.kallsyms] [k] get_page_from_freelist 0.80%xfs_io [kernel.kallsyms] [k] xfs_perag_put 0.79%xfs_io [kernel.kallsyms] [k] __add_to_page_cache_locked 0.75%xfs_io libc-2.19.so [.] 
__libc_pwrite 0.72%xfs_io [kernel.kallsyms] [k] xfs_file_iomap_begin_delay.isra.5 0.71%xfs_io [kernel.kallsyms] [k] iomap_write_actor 0.67%xfs_io [kernel.kallsyms] [k] pagecache_get_page 0.64%xfs_io [kernel.kallsyms] [k] balance_dirty_pages_ratelimited 0.64%xfs_io [kernel.kallsyms] [k] vfs_write 0.63% kworker/u8:5 [kernel.kallsyms] [k] clear_page_dirty_for_io 0.62%xfs_io [kernel.kallsyms] [k] xfs_file_write_iter 0.60%xfs_io [kernel.kallsyms] [k] __vfs_write 0.55%xfs_io [kernel.kallsyms] [k] page_waitqueue 0.54%xfs_io [kernel.kallsyms] [k] xfs_perag_get 0.52%xfs_io [kernel.kallsyms] [k] __wake_up_bit 0.52%xfs_io [kernel.kallsyms] [k] radix_tree_tag_set 0.50% kworker/u8:5 [kernel.kallsyms] [k] xfs_do_writepage 0.50%xfs_io [kernel.kallsyms] [k] iov_iter_advance 0.47%xfs_io [kernel.kallsyms] [k] kmem_cache_alloc 0.46%xfs_io [kernel.kallsyms] [k] xfs_file_buffered_aio_write 0.46%x
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 12, 2016 at 08:02:08PM +1000, Dave Chinner wrote: > Which says "no change". Oh well, back to the drawing board... I don't see how it would change thing much - for all relevant calculations we convert to block units first anyway. But the whole xfs_iomap_write_delay is a giant mess anyway. For a usual call we do at least four lookups in the extent btree, which seems rather costly. Especially given that the low-level xfs_bmap_search_extents interface would give us all required information in one single call. Below is a patch I hacked up this morning to do just that. It passes xfstests, but I've not done any real benchmarking with it. If the reduced lookup overhead in it doesn't help enough we'll need to some sort of look aside cache for the information, but I hope that we can avoid that. And yes, it's a rather large patch - but the old path was so entangled that I couldn't come up with something lighter. diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index b060bca..614803b 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -1388,7 +1388,7 @@ xfs_bmap_search_multi_extents( * Else, *lastxp will be set to the index of the found * entry; *gotp will contain the entry. */ -STATIC xfs_bmbt_rec_host_t * /* pointer to found extent entry */ +xfs_bmbt_rec_host_t * /* pointer to found extent entry */ xfs_bmap_search_extents( xfs_inode_t *ip,/* incore inode pointer */ xfs_fileoff_t bno,/* block number searched for */ @@ -4074,7 +4074,7 @@ xfs_bmapi_read( return 0; } -STATIC int +int xfs_bmapi_reserve_delalloc( struct xfs_inode*ip, xfs_fileoff_t aoff, @@ -4170,91 +4170,6 @@ out_unreserve_quota: return error; } -/* - * Map file blocks to filesystem blocks, adding delayed allocations as needed. - */ -int -xfs_bmapi_delay( - struct xfs_inode*ip,/* incore inode */ - xfs_fileoff_t bno,/* starting file offs. mapped */ - xfs_filblks_t len,/* length to map in file */ - struct xfs_bmbt_irec*mval, /* output: map values */ - int *nmap, /* i/o: mval size/count */ - int flags) /* XFS_BMAPI_... 
*/ -{ - struct xfs_mount*mp = ip->i_mount; - struct xfs_ifork*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK); - struct xfs_bmbt_irecgot;/* current file extent record */ - struct xfs_bmbt_irecprev; /* previous file extent record */ - xfs_fileoff_t obno; /* old block number (offset) */ - xfs_fileoff_t end;/* end of mapped file region */ - xfs_extnum_tlastx; /* last useful extent number */ - int eof;/* we've hit the end of extents */ - int n = 0; /* current extent index */ - int error = 0; - - ASSERT(*nmap >= 1); - ASSERT(*nmap <= XFS_BMAP_MAX_NMAP); - ASSERT(!(flags & ~XFS_BMAPI_ENTIRE)); - ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL)); - - if (unlikely(XFS_TEST_ERROR( - (XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_EXTENTS && -XFS_IFORK_FORMAT(ip, XFS_DATA_FORK) != XFS_DINODE_FMT_BTREE), -mp, XFS_ERRTAG_BMAPIFORMAT, XFS_RANDOM_BMAPIFORMAT))) { - XFS_ERROR_REPORT("xfs_bmapi_delay", XFS_ERRLEVEL_LOW, mp); - return -EFSCORRUPTED; - } - - if (XFS_FORCED_SHUTDOWN(mp)) - return -EIO; - - XFS_STATS_INC(mp, xs_blk_mapw); - - if (!(ifp->if_flags & XFS_IFEXTENTS)) { - error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK); - if (error) - return error; - } - - xfs_bmap_search_extents(ip, bno, XFS_DATA_FORK, &eof, &lastx, &got, &prev); - end = bno + len; - obno = bno; - - while (bno < end && n < *nmap) { - if (eof || got.br_startoff > bno) { - error = xfs_bmapi_reserve_delalloc(ip, bno, len, &got, - &prev, &lastx, eof); - if (error) { - if (n == 0) { - *nmap = 0; - return error; - } - break; - } - } - - /* set up the extent map to return. */ - xfs_bmapi_trim_map(mval, &got, &bno, len, obno, end, n, flags); - xfs_bmapi_update_map(&mval, &bno, &len, obno, end, &n, flags); - - /* If we're done, stop now. */ - if (bno >= end || n >= *nmap) - break; - - /* Else go on to the next record. */ - prev = got; - if (++
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 8:56 PM, Dave Chinner wrote: > On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote: >> >> I don't recall having ever seen the mapping tree_lock as a contention >> point before, but it's not like I've tried that load either. So it >> might be a regression (going back long, I suspect), or just an unusual >> load that nobody has traditionally tested much. >> >> Single-threaded big file write one page at a time, was it? > > Yup. On a 4 node NUMA system. Ok, I can't see any real contention on my single-node workstation (running ext4 too, so there may be filesystem differences), but I guess that shouldn't surprise me. The cacheline bouncing just isn't expensive enough when it all stays on-die. I can see the tree_lock in my profiles (just not very high), and at least for ext4 the main caller ssems to be __set_page_dirty_nobuffers(). And yes, looking at that, the biggest cost by _far_ inside the spinlock seems to be the accounting. Which doesn't even have to be inside the mapping lock, as far as I can tell, and as far as comments go. So a stupid patch to just move the dirty page accounting to outside the spinlock might help a lot. Does this attached patch help your contention numbers? Adding a few people who get blamed for account_page_dirtied() and inode_attach_wb() just to make sure that nobody expected the mapping_lock spinlock to be held when calling account_page_dirtied(). I realize that this has nothing to do with the AIM7 regression (the spinlock just isn't high enough in that profile), but your contention numbers just aren't right, and updating accounting statistics inside a critical spinlock when not needed is just wrong. Linus fs/buffer.c | 5 - fs/xfs/xfs_aops.c | 5 - mm/page-writeback.c | 2 +- 3 files changed, 9 insertions(+), 3 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 9c8eb9b6db6a..f79a9d241589 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -628,15 +628,18 @@ static void __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { unsigned long flags; + bool account = false; spin_lock_irqsave(&mapping->tree_lock, flags); if (page->mapping) {/* Race with truncate? */ WARN_ON_ONCE(warn && !PageUptodate(page)); - account_page_dirtied(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + account = true; } spin_unlock_irqrestore(&mapping->tree_lock, flags); + if (account) + account_page_dirtied(page, mapping); } /* diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index 7575cfc3ad15..59169c36765e 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1490,15 +1490,18 @@ xfs_vm_set_page_dirty( if (newly_dirty) { /* sigh - __set_page_dirty() is static, so copy it here, too */ unsigned long flags; + bool account = false; spin_lock_irqsave(&mapping->tree_lock, flags); if (page->mapping) {/* Race with truncate? 
*/ WARN_ON_ONCE(!PageUptodate(page)); - account_page_dirtied(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); + account = true; } spin_unlock_irqrestore(&mapping->tree_lock, flags); + if (account) + account_page_dirtied(page, mapping); } unlock_page_memcg(page); if (newly_dirty) diff --git a/mm/page-writeback.c b/mm/page-writeback.c index f4cd7d8005c9..9a6a6b99acfe 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -2517,10 +2517,10 @@ int __set_page_dirty_nobuffers(struct page *page) spin_lock_irqsave(&mapping->tree_lock, flags); BUG_ON(page_mapping(page) != mapping); WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); - account_page_dirtied(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); spin_unlock_irqrestore(&mapping->tree_lock, flags); + account_page_dirtied(page, mapping); unlock_page_memcg(page); if (mapping->host) {
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 12, 2016 at 04:51:24PM +0800, Ye Xiaolong wrote: > On 08/12, Ye Xiaolong wrote: > >On 08/12, Dave Chinner wrote: > > [snip] > > >>lkp-folk: the patch I've just tested it attached below - can you > >>feed that through your test and see if it fixes the regression? > >> > > > >Hi, Dave > > > >I am verifying your fix patch in lkp environment now, will send the > >result once I get it. > > > > Here is the test result. Which says "no change". Oh well, back to the drawing board... Can you send me the aim7 config file and command line you are using for the test? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On 08/12, Ye Xiaolong wrote: >On 08/12, Dave Chinner wrote: [snip] >>lkp-folk: the patch I've just tested is attached below - can you >>feed that through your test and see if it fixes the regression? >> > >Hi, Dave > >I am verifying your fix patch in lkp environment now, will send the >result once I get it. > Here is the test result. commit 636b594f38278080db93f2d67d11d31700924f5d Author: Dave Chinner AuthorDate: Fri Aug 12 14:23:44 2016 +0800 Commit: Xiaolong Ye CommitDate: Fri Aug 12 14:23:44 2016 +0800 When a write occurs that extends the file, we check to see if we need to preallocate more delalloc space. When we do sub-page writes, the new iomap write path passes a sub-block write length to the block mapping code. xfs_iomap_write_delay does not expect to be passed byte counts smaller than one filesystem block, so it ends up checking the BMBT for blocks beyond EOF on every write, regardless of whether we need to or not. This causes a regression in aim7 benchmarks as it is full of sub-page writes. To fix this, clamp the minimum length of a mapping request coming through xfs_file_iomap_begin() to one filesystem block. This ensures we are passing the same length to xfs_iomap_write_delay() as we did when calling through the get_blocks path. This substantially reduces the amount of lookup load being placed on the BMBT during sub-block write loads. Signed-off-by: Dave Chinner --- fs/xfs/xfs_iomap.c | 5 + 1 file changed, 5 insertions(+) diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 620fc91..5eaace0 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -1015,10 +1015,15 @@ xfs_file_iomap_begin( * number pulled out of thin air as a best guess for initial * testing. * +* xfs_iomap_write_delay() only works if the length passed in is +* >= one filesystem block. Hence we need to clamp the minimum +* length we map, too. +* * Note that the values needs to be less than 32-bits wide until * the lower level functions are updated. */ length = min_t(loff_t, length, 1024 * PAGE_SIZE); + length = max_t(loff_t, length, (1 << inode->i_blkbits)); if (xfs_get_extsz_hint(ip)) { /* * xfs_iomap_write_direct() expects the shared lock. 
It f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 636b594f38278080db93f2d67d -- -- %stddev %change %stddev %change %stddev \ |\ |\ 484435 ± 0% -13.3% 420004 ± 0% -14.0% 416777 ± 0% aim7.jobs-per-min 6491 ± 3% +30.8% 8491 ± 0% +35.7% 8806 ± 1% aim7.time.involuntary_context_switches 376 ± 0% +28.4%484 ± 0% +29.6%488 ± 0% aim7.time.system_time 430512 ± 0% -20.1% 343838 ± 0% -19.7% 345708 ± 0% aim7.time.voluntary_context_switches 37.37 ± 0% +15.3% 43.09 ± 0% +16.1% 43.41 ± 0% aim7.time.elapsed_time 37.37 ± 0% +15.3% 43.09 ± 0% +16.1% 43.41 ± 0% aim7.time.elapsed_time.max 155184 ± 1% -2.1% 151864 ± 1% -2.7% 150937 ± 1% aim7.time.minor_page_faults 0 ± 0% +Inf% 215412 ±141% +Inf% 334416 ± 75% latency_stats.sum.wait_on_page_bit.__migration_entry_wait.migration_entry_wait.handle_pte_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault 24772 ± 0% -28.6% 17675 ± 0% -26.7% 18149 ± 2% vmstat.system.cs 26816 ± 8% +10.2% 29542 ± 1% +13.3% 30370 ± 1% interrupts.CAL:Function_call_interrupts 125122 ± 10% -10.7% 111758 ± 12% -11.1% 111223 ± 11% softirqs.SCHED 3906 ± 0% +28.8% 5032 ± 2% +29.1% 5045 ± 1% proc-vmstat.nr_active_file 3444 ± 5% +41.8% 4884 ± 0% +25.0% 4304 ± 11% proc-vmstat.nr_shmem 4092 ± 14% +61.2% 6595 ± 1% +40.0% 5728 ± 15% proc-vmstat.pgactivate 15627 ± 0% +27.7% 19956 ± 1% +27.4% 19902 ± 0% meminfo.Active(file) 16103 ± 3% +14.3% 18405 ± 8% +11.2% 17900 ± 1% meminfo.AnonHugePages 13777 ± 5% +43.1% 19709 ± 0% +25.0% 17220 ± 11% meminfo.Shmem 1724300 ± 27% -40.5%1025538 ± 1% -41.3%1012868 ± 0% sched_debug.cfs_rq:/.load.max 1724300 ± 27% -40.5%1025538 ± 1% -41.3%1012868 ± 0% sched_debug.cpu.load.max 37.37 ± 0% +15.3% 43.09 ± 0% +16.1% 43.41 ± 0% time.elapsed_time 37.37 ± 0% +15.3% 43.09 ± 0% +16.1%
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On 08/12, Dave Chinner wrote: >On Thu, Aug 11, 2016 at 10:02:39PM -0700, Linus Torvalds wrote: >> On Thu, Aug 11, 2016 at 9:16 PM, Dave Chinner wrote: >> > >> > That's why running aim7 as your "does the filesystem scale" >> > benchmark is somewhat irrelevant to scaling applications on high >> > performance systems these days >> >> Yes, don't get me wrong - I'm not at all trying to say that AIM7 is a >> good benchmark. It's just that I think what it happens to test is >> still meaningful, even if it's not necessarily in any way some kind of >> "high performance IO" thing. >> >> There are probably lots of other more important loads, I just reacted >> to Christoph seeming to argue that the AIM7 behavior was _so_ broken >> that we shouldn't even care. It's not _that_ broken, it's just not >> about high-performance IO streaming, it happens to test something else >> entirely. > >Right - I admit that my first reaction once I worked out what the >problem was is exactly what Christoph said. But after looking at it >further, regardless of how crappy the benchmark it, it is a >regression > >> We've actually had AIM7 occasionally find other issues just because >> some of the things it does is so odd. > >*nod* > >> And let's face it, user programs doing odd and not very efficient >> things should be considered par for the course. We're never going to >> get rid of insane user programs, so we might as well fix the >> performance problems even when we say "that's just stupid". > >Yup, that's what I'm doing :/ > >It looks like the underlying cause is that the old block mapping >code only fed filesystem block size lengths into >xfs_iomap_write_delay(), whereas the iomap code is feeding the >(capped) write() length into it. Hence xfs_iomap_write_delay() is >not detecting the need for speculative preallocation correctly on >these sub-block writes. 
The profile looks better for the 1 byte >write - I've combined the old and new for comparison below: > > 4.22% __block_commit_write.isra.30 > 3.80% up_write > 3.74% xfs_bmapi_read > 3.65% ___might_sleep > 3.55% down_write > 3.20% entry_SYSCALL_64_fastpath > 3.02% mark_buffer_dirty > 2.78% __mark_inode_dirty > 2.78% unlock_page > 2.59% xfs_break_layouts > 2.47% xfs_iext_bno_to_ext > 2.38% __block_write_begin_int > 2.22% find_get_entry > 2.17% xfs_file_write_iter > 2.16% __radix_tree_lookup > 2.13% iomap_write_actor > 2.04% xfs_bmap_search_extents > 1.98% __might_sleep > 1.84% xfs_file_buffered_aio_write > 1.76% iomap_apply > 1.71% generic_write_end > 1.68% vfs_write > 1.66% iov_iter_copy_from_user_atomic > 1.56% xfs_bmap_search_multi_extents > 1.55% __vfs_write > 1.52% pagecache_get_page > 1.46% xfs_bmapi_update_map > 1.33% xfs_iunlock > 1.32% xfs_iomap_write_delay > 1.29% xfs_file_iomap_begin > 1.29% do_raw_spin_lock > 1.29% __xfs_bmbt_get_all > 1.21% iov_iter_advance > 1.20% xfs_file_aio_write_checks > 1.14% xfs_ilock > 1.11% balance_dirty_pages_ratelimited > 1.10% xfs_bmapi_trim_map > 1.06% xfs_iomap_eof_want_preallocate > 1.00% xfs_bmapi_delay > >Comparison of common functions: > >OldNew function >4.50% 3.74% xfs_bmapi_read >3.64% 4.22% __block_commit_write.isra.30 >3.55% 2.16% __radix_tree_lookup >3.46% 3.80% up_write >3.43% 3.65% ___might_sleep >3.09% 3.20% entry_SYSCALL_64_fastpath >3.01% 2.47% xfs_iext_bno_to_ext >3.01% 2.22% find_get_entry >2.98% 3.55% down_write >2.71% 3.02% mark_buffer_dirty >2.52% 2.78% __mark_inode_dirty >2.38% 2.78% unlock_page >2.14% 2.59% xfs_break_layouts >2.07% 1.46% xfs_bmapi_update_map >2.06% 2.04% xfs_bmap_search_extents >2.04% 1.32% xfs_iomap_write_delay >2.00% 0.38% generic_write_checks >1.96% 1.56% xfs_bmap_search_multi_extents >1.90% 1.29% __xfs_bmbt_get_all >1.89% 1.11% balance_dirty_pages_ratelimited >1.82% 0.28% wait_for_stable_page >1.76% 2.17% xfs_file_write_iter >1.68% 1.06% xfs_iomap_eof_want_preallocate >1.68% 1.00% xfs_bmapi_delay >1.67% 2.13% iomap_write_actor >1.60% 1.84% xfs_file_buffered_aio_write >1.56% 1.98% __might_sleep >1.48% 1.29% do_raw_spin_lock >1.44% 1.71% generic_write_e
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 10:02:39PM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2016 at 9:16 PM, Dave Chinner wrote: > > > > That's why running aim7 as your "does the filesystem scale" > > benchmark is somewhat irrelevant to scaling applications on high > > performance systems these days > > Yes, don't get me wrong - I'm not at all trying to say that AIM7 is a > good benchmark. It's just that I think what it happens to test is > still meaningful, even if it's not necessarily in any way some kind of > "high performance IO" thing. > > There are probably lots of other more important loads, I just reacted > to Christoph seeming to argue that the AIM7 behavior was _so_ broken > that we shouldn't even care. It's not _that_ broken, it's just not > about high-performance IO streaming, it happens to test something else > entirely. Right - I admit that my first reaction once I worked out what the problem was is exactly what Christoph said. But after looking at it further, regardless of how crappy the benchmark it, it is a regression > We've actually had AIM7 occasionally find other issues just because > some of the things it does is so odd. *nod* > And let's face it, user programs doing odd and not very efficient > things should be considered par for the course. We're never going to > get rid of insane user programs, so we might as well fix the > performance problems even when we say "that's just stupid". Yup, that's what I'm doing :/ It looks like the underlying cause is that the old block mapping code only fed filesystem block size lengths into xfs_iomap_write_delay(), whereas the iomap code is feeding the (capped) write() length into it. Hence xfs_iomap_write_delay() is not detecting the need for speculative preallocation correctly on these sub-block writes. 
The profile looks better for the 1 byte write - I've combined the old and new for comparison below: 4.22% __block_commit_write.isra.30 3.80% up_write 3.74% xfs_bmapi_read 3.65% ___might_sleep 3.55% down_write 3.20% entry_SYSCALL_64_fastpath 3.02% mark_buffer_dirty 2.78% __mark_inode_dirty 2.78% unlock_page 2.59% xfs_break_layouts 2.47% xfs_iext_bno_to_ext 2.38% __block_write_begin_int 2.22% find_get_entry 2.17% xfs_file_write_iter 2.16% __radix_tree_lookup 2.13% iomap_write_actor 2.04% xfs_bmap_search_extents 1.98% __might_sleep 1.84% xfs_file_buffered_aio_write 1.76% iomap_apply 1.71% generic_write_end 1.68% vfs_write 1.66% iov_iter_copy_from_user_atomic 1.56% xfs_bmap_search_multi_extents 1.55% __vfs_write 1.52% pagecache_get_page 1.46% xfs_bmapi_update_map 1.33% xfs_iunlock 1.32% xfs_iomap_write_delay 1.29% xfs_file_iomap_begin 1.29% do_raw_spin_lock 1.29% __xfs_bmbt_get_all 1.21% iov_iter_advance 1.20% xfs_file_aio_write_checks 1.14% xfs_ilock 1.11% balance_dirty_pages_ratelimited 1.10% xfs_bmapi_trim_map 1.06% xfs_iomap_eof_want_preallocate 1.00% xfs_bmapi_delay Comparison of common functions: Old New function 4.50% 3.74% xfs_bmapi_read 3.64% 4.22% __block_commit_write.isra.30 3.55% 2.16% __radix_tree_lookup 3.46% 3.80% up_write 3.43% 3.65% ___might_sleep 3.09% 3.20% entry_SYSCALL_64_fastpath 3.01% 2.47% xfs_iext_bno_to_ext 3.01% 2.22% find_get_entry 2.98% 3.55% down_write 2.71% 3.02% mark_buffer_dirty 2.52% 2.78% __mark_inode_dirty 2.38% 2.78% unlock_page 2.14% 2.59% xfs_break_layouts 2.07% 1.46% xfs_bmapi_update_map 2.06% 2.04% xfs_bmap_search_extents 2.04% 1.32% xfs_iomap_write_delay 2.00% 0.38% generic_write_checks 1.96% 1.56% xfs_bmap_search_multi_extents 1.90% 1.29% __xfs_bmbt_get_all 1.89% 1.11% balance_dirty_pages_ratelimited 1.82% 0.28% wait_for_stable_page 1.76% 2.17% xfs_file_write_iter 1.68% 1.06% xfs_iomap_eof_want_preallocate 1.68% 1.00% xfs_bmapi_delay 1.67% 2.13% iomap_write_actor 1.60% 1.84% xfs_file_buffered_aio_write 1.56% 1.98% __might_sleep 1.48% 1.29% do_raw_spin_lock 1.44% 1.71% generic_write_end 1.41% 1.52% pagecache_get_page 1.38% 1.10% xfs_bm
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 9:16 PM, Dave Chinner wrote: > > That's why running aim7 as your "does the filesystem scale" > benchmark is somewhat irrelevant to scaling applications on high > performance systems these days Yes, don't get me wrong - I'm not at all trying to say that AIM7 is a good benchmark. It's just that I think what it happens to test is still meaningful, even if it's not necessarily in any way some kind of "high performance IO" thing. There are probably lots of other more important loads, I just reacted to Christoph seeming to argue that the AIM7 behavior was _so_ broken that we shouldn't even care. It's not _that_ broken, it's just not about high-performance IO streaming, it happens to test something else entirely. We've actually had AIM7 occasionally find other issues just because some of the things it does is so odd. Iirc it has a fork test that doesn't execve (very unusual - you'd generally use threads if you care about performance), and that has shown issues with our anon_vma scaling before anything else did. I also seem to remember some odd pty open/close/ioctl subtest that showed problems with some of the last remnants of the old BKL (the test probably actually tested something else, but ended up choking on the odd tty things). So in general, I'm not a fan of AIM as a benchmark, but it actually _has_ found lots of real issues because it tends to do things that kernel developers think are insane. And let's face it, user programs doing odd and not very efficient things should be considered par for the course. We're never going to get rid of insane user programs, so we might as well fix the performance problems even when we say "that's just stupid". Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 08:20:53PM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2016 at 7:52 PM, Christoph Hellwig wrote: > > > > I can look at that, but indeed optimizing this patch seems a bit > > stupid. > > The "write less than a full block to the end of the file" is actually > a reasonably common case. > > It may not make for a great filesystem benchmark, but it also isn't > actually insane. People who do logging in user space do this all the > time, for example. And it is *not* stupid in that context. Not at all. > > It's never going to be the *main* thing you do (unless you're AIM), > but I do think it's worth fixing. > > And AIM7 remains one of those odd benchmarks that people use. I'm not > quite sure why, but I really do think that the normal "append smaller > chunks to the end of the file" should absolutely not be dismissed as > stupid. Yes, I agree that there are reasons for making sub-block IO work well (which is why I'm looking to try to fix it), but that doesn't mean the benchmark is sane. aim7 is, technically, a "scalability benchmark". As such, expecting tiny writes to scale to moving large amounts of data is the "stupid" thing it does. If you scale up the amount of data you need to move, then you need to scale up the efficiency of moving that data. Case in point - writing 1GB of data in 1kb chunks to XFS on a local /dev/pmem1 runs at ~600MB/s, whilst moving it in 1MB chunks runs at 1.9GB/s. aim7 doesn't actually stress the scalability of the hardware, because inefficiencies in its implementation prevent it from getting to those limits. That's what aim7 misses - as speeds and capabilities go up, the way code needs to be written to make efficient use of the hardware also changes. e.g. High throughput logging solutions don't write every incoming log event immediately - they aggregate them into larger buffers and then write those, knowing that they can support much higher logging rates by doing this. That's why running aim7 as your "does the filesystem scale" benchmark is somewhat irrelevant to scaling applications on high performance systems these days - users with fast storage will be expecting to see that 1.9GB/s throughput from their app, not 600MB/s. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
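For illustration, a minimal user-space sketch of the aggregation pattern Dave describes above - hypothetical code, not taken from the thread; the log path, flush size and record format are made up:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define FLUSH_SIZE	(1024 * 1024)	/* flush in 1 MiB chunks, not per record */

static struct {
	int	fd;
	size_t	used;
	char	buf[FLUSH_SIZE];
} lb;

static void logbuf_flush(void)
{
	if (lb.used && write(lb.fd, lb.buf, lb.used) < 0)
		abort();		/* a real logger would handle short writes and errors */
	lb.used = 0;
}

static void logbuf_append(const char *rec, size_t len)
{
	if (lb.used + len > FLUSH_SIZE)
		logbuf_flush();		/* one large write instead of many tiny ones */
	memcpy(lb.buf + lb.used, rec, len);
	lb.used += len;
}

int main(void)
{
	lb.fd = open("/tmp/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (lb.fd < 0)
		return 1;
	for (int i = 0; i < 1000000; i++)
		logbuf_append("short log record\n", 17);
	logbuf_flush();
	close(lb.fd);
	return 0;
}

A million ~17-byte records end up as roughly seventeen 1 MiB write() calls instead of a million sub-block appends, which is the kind of gap Dave's 1kb-vs-1MB throughput numbers above illustrate.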
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 07:27:52PM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2016 at 5:54 PM, Dave Chinner wrote: > > > > So, removing mark_page_accessed() made the spinlock contention > > *worse*. > > > > 36.51% [kernel] [k] _raw_spin_unlock_irqrestore > >6.27% [kernel] [k] copy_user_generic_string > >3.73% [kernel] [k] _raw_spin_unlock_irq > >3.55% [kernel] [k] get_page_from_freelist > >1.97% [kernel] [k] do_raw_spin_lock > >1.72% [kernel] [k] __block_commit_write.isra.30 > > I don't recall having ever seen the mapping tree_lock as a contention > point before, but it's not like I've tried that load either. So it > might be a regression (going back long, I suspect), or just an unusual > load that nobody has traditionally tested much. > > Single-threaded big file write one page at a time, was it? Yup. On a 4 node NUMA system. So when memory reclaim kicks in, there's a write process, a writeback kworker and 4 kswapd kthreads all banging on the mapping->tree_lock. There's an awful lot of concurrency happening behind the scenes of that single user process writing to a file... > The mapping tree lock has been around forever (it used to be a rw-lock > long long ago), but I wonder if we might have moved more stuff into it > (memory accounting comes to mind) causing much worse contention or > something. Yeah, there is now a crapton of accounting updated in account_page_dirtied under the tree lock - memcg, writeback, node, zone, task, etc. And there's a *lot* of code that __delete_from_page_cache() can execute under the tree lock. > Hmm. Just for fun, I googled "tree_lock contention". It's shown up > before - back in 2006, and it was you hitting it back then too. Of course! That, however, would have been when I was playing with real big SGI machines, not a tiddly little 16p VM :P Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 7:52 PM, Christoph Hellwig wrote: > > I can look at that, but indeed optimizing this patch seems a bit > stupid. The "write less than a full block to the end of the file" is actually a reasonably common case. It may not make for a great filesystem benchmark, but it also isn't actually insane. People who do logging in user space do this all the time, for example. And it is *not* stupid in that context. Not at all. It's never going to be the *main* thing you do (unless you're AIM), but I do think it's worth fixing. And AIM7 remains one of those odd benchmarks that people use. I'm not quite sure why, but I really do think that the normal "append smaller chunks to the end of the file" should absolutely not be dismissed as stupid. Linus
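A trivial sketch of that common case - hypothetical user-space code, not from the thread - where every append is a sub-block write that extends EOF:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/tmp/app.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	const char rec[] = "tiny log record\n";

	if (fd < 0)
		return 1;
	/* each write() is ~16 bytes, far below a 4k block, and grows the file */
	for (int i = 0; i < 100000; i++)
		if (write(fd, rec, sizeof(rec) - 1) < 0)
			return 1;
	close(fd);
	return 0;
}

Every iteration dirties a partial block at EOF, so the block mapping and delayed allocation paths get exercised on each call - the pattern the regression hits.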
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 12, 2016 at 12:23:29PM +1000, Dave Chinner wrote: > Christoph, maybe there's something we can do to only trigger > speculative prealloc growth checks if the new file size crosses the end of > the currently allocated block at the EOF. That would chop out a fair > chunk of the xfs_bmapi_read calls being done in this workload. I'm > not sure how much effort we should spend optimising this slow path, > though I can look at that, but indeed optimizing this patch seems a bit stupid. The other thing we could do is to optimize xfs_bmapi_read - even if it shouldn't be called this often it seems like it should waste a whole lot less CPU cycles.
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 7:23 PM, Dave Chinner wrote: > > And, as usual, that's the answer. Here's the reproducer: > > # sudo mkfs.xfs -f -m crc=0 /dev/pmem1 > # sudo mount -o noatime /dev/pmem1 /mnt/scratch > # sudo xfs_io -f -c "pwrite 0 512m -b 1" /mnt/scratch/fooey Heh. Ok, so 1 byte or 1kB at a time is pretty much the same thing, yeah. And I guess that also explains why the system call entry showed up so high in the profiles. I'l take another look at tree_lock tomorrow, but it sounds like this particular AIM regression is now effectively a solved (or at least known) issue. Thanks, Linus
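For reference, a stand-alone C equivalent of that xfs_io reproducer - hypothetical code, reusing the mount point from the commands above - doing 512MiB of 1-byte pwrite() calls:

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/scratch/fooey", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	const char c = 'x';

	if (fd < 0)
		return 1;
	/* one system call per byte written, sequentially extending the file */
	for (off_t off = 0; off < 512LL * 1024 * 1024; off++)
		if (pwrite(fd, &c, 1, off) != 1)
			return 1;
	close(fd);
	return 0;
}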
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 5:54 PM, Dave Chinner wrote: > > So, removing mark_page_accessed() made the spinlock contention > *worse*. > > 36.51% [kernel] [k] _raw_spin_unlock_irqrestore >6.27% [kernel] [k] copy_user_generic_string >3.73% [kernel] [k] _raw_spin_unlock_irq >3.55% [kernel] [k] get_page_from_freelist >1.97% [kernel] [k] do_raw_spin_lock >1.72% [kernel] [k] __block_commit_write.isra.30 I don't recall having ever seen the mapping tree_lock as a contention point before, but it's not like I've tried that load either. So it might be a regression (going back long, I suspect), or just an unusual load that nobody has traditionally tested much. Single-threaded big file write one page at a time, was it? The mapping tree lock has been around forever (it used to be a rw-lock long long ago), but I wonder if we might have moved more stuff into it (memory accounting comes to mind) causing much worse contention or something. Hmm. Just for fun, I googled "tree_lock contention". It's shown up before - back in 2006, and it was you hitting it back then too. There was an even older one (related to AIM7, interesting) which was what caused the tree_lock to become a rw-lock back in 2005 (but then Nick Piggin made it a spinlock again in 2008). So it's not unheard of, but it certainly hasn't been a big issue. That's the only obvious ones I found (apart from some btrfs issues, but btrfs has a completely different notion of tree locking, so those are not about the same thing). Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Fri, Aug 12, 2016 at 10:54:42AM +1000, Dave Chinner wrote: > I'm now going to test Christoph's theory that this is an "overwrite > doing lots of block mapping" issue. More on that to follow. Ok, so going back to the profiles, I can say it's not an overwrite issue, because there is delayed allocation showing up in the profile. Lots of it. Which lead me to think "maybe the benchmark is just completely *dumb*". And, as usual, that's the answer. Here's the reproducer: # sudo mkfs.xfs -f -m crc=0 /dev/pmem1 # sudo mount -o noatime /dev/pmem1 /mnt/scratch # sudo xfs_io -f -c "pwrite 0 512m -b 1" /mnt/scratch/fooey And here's the profile: 4.50% [kernel] [k] xfs_bmapi_read 3.64% [kernel] [k] __block_commit_write.isra.30 3.55% [kernel] [k] __radix_tree_lookup 3.46% [kernel] [k] up_write 3.43% [kernel] [k] ___might_sleep 3.09% [kernel] [k] entry_SYSCALL_64_fastpath 3.01% [kernel] [k] xfs_iext_bno_to_ext 3.01% [kernel] [k] find_get_entry 2.98% [kernel] [k] down_write 2.71% [kernel] [k] mark_buffer_dirty 2.52% [kernel] [k] __mark_inode_dirty 2.38% [kernel] [k] unlock_page 2.14% [kernel] [k] xfs_break_layouts 2.07% [kernel] [k] xfs_bmapi_update_map 2.06% [kernel] [k] xfs_bmap_search_extents 2.04% [kernel] [k] xfs_iomap_write_delay 2.00% [kernel] [k] generic_write_checks 1.96% [kernel] [k] xfs_bmap_search_multi_extents 1.90% [kernel] [k] __xfs_bmbt_get_all 1.89% [kernel] [k] balance_dirty_pages_ratelimited 1.82% [kernel] [k] wait_for_stable_page 1.76% [kernel] [k] xfs_file_write_iter 1.68% [kernel] [k] xfs_iomap_eof_want_preallocate 1.68% [kernel] [k] xfs_bmapi_delay 1.67% [kernel] [k] iomap_write_actor 1.60% [kernel] [k] xfs_file_buffered_aio_write 1.56% [kernel] [k] __might_sleep 1.48% [kernel] [k] do_raw_spin_lock 1.44% [kernel] [k] generic_write_end 1.41% [kernel] [k] pagecache_get_page 1.38% [kernel] [k] xfs_bmapi_trim_map 1.21% [kernel] [k] __block_write_begin_int 1.17% [kernel] [k] vfs_write 1.17% [kernel] [k] xfs_file_iomap_begin 1.17% [kernel] [k] xfs_bmbt_get_startoff 1.14% [kernel] [k] iomap_apply 1.08% [kernel] [k] xfs_iunlock 1.08% [kernel] [k] iov_iter_copy_from_user_atomic 0.97% [kernel] [k] xfs_file_aio_write_checks 0.96% [kernel] [k] xfs_ilock . Yeah, I'm doing a sequential write in *1 byte pwrite() calls*. Ok, so the benchmark isn't /quite/ that abysmally stupid. It's still, ah, extremely challenged: if (NBUFSIZE != 1024) { /* enforce known block size */ fprintf(stderr, "NBUFSIZE changed to %d\n", NBUFSIZE); exit(1); } i.e. it's hard coded to do all it's "disk" IO in 1k block sizes. Every read, every write, every file copy, etc are all done with a 1024 byte buffer. There are lots of loops that look like: while (--n) { write(fd, nbuf, sizeof nbuf) } where n is the file size specified in the job file. Those loops are what is generating the profile we see: repeated partial page writes that extend the file. IOWs, the benchmark is doing exactly what we document in the fstat() man page *not to do* as it is will cause inefficient IO patterns: The st_blksize field gives the "preferred" blocksize for efficient filesystem I/O. (Writing to a file in smaller chunks may cause an inefficient read-modify-rewrite.) The smallest we ever set st_blksize to is PAGE_SIZE, so the benchmark is running well known and documented (at least 10 years ago) slow paths through the IO stack. I'm very tempted now simply to say that the aim7 disk benchmark is showing it's age and as such the results are not actually reflective of what typical applications will see. 
Christoph, maybe there's something we can do to only trigger speculative prealloc growth checks if the new file size crosses the end of the currently allocated block at the EOF. That would chop out a fair chunk of the xfs_bmapi_read calls being done in this workload. I'm not sure how much effort we should spend optimising this slow path, though Cheers, Dave. -- Dave Chinner da...@fromorbit.com
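As a minimal sketch of the fstat() advice quoted above - hypothetical code, not part of aim7 - the write buffer is sized from st_blksize instead of being hard-coded to 1024 bytes:

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/scratch/fooey", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0)
		return 1;

	size_t bufsz = st.st_blksize;	/* the preferred I/O size, at least one fs block */
	char *buf = malloc(bufsz);

	if (!buf)
		return 1;
	memset(buf, 'x', bufsz);

	/* write 512MiB in preferred-size chunks rather than 1k ones */
	for (size_t done = 0; done < 512u * 1024 * 1024; done += bufsz)
		if (write(fd, buf, bufsz) < 0)
			return 1;

	free(buf);
	close(fd);
	return 0;
}

Writes of at least st_blksize avoid the per-call read-modify-rewrite the man page warns about, and sidestep the repeated past-EOF block checks the 1k loops above keep running into.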
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote: > On Wed, Aug 10, 2016 at 05:33:20PM -0700, Huang, Ying wrote: > We need to know what is happening that is different - there's a good > chance the mapping trace events will tell us. Huang, can you get > a raw event trace from the test? > > I need to see these events: > > xfs_file* > xfs_iomap* > xfs_get_block* > lkp-folks, can I please get these traces run and sent to me? I don't have the time or patience to try to get aim7 running on my machines - the build is full of hard-coded paths and libraries that aren't provided by modern distros (e.g. it requires a static libaio.a!) and it fails at the configure stage complaining that: configure: error: C compiler cannot create executables Which is a complete load of BS. Hence I can't make progress until I have some way of understanding what the IO pattern is that is generating the profile being measured. So far I'm unable to do that with any of the tools I been trying, hence I need the traces to work out what I'm missing... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 09:55:33AM -0700, Linus Torvalds wrote: > On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig wrote: > > > > The one liner below (not tested yet) to simply remove it should fix that > > up. I also noticed we have a spurious pagefault_disable/enable, I > > need to dig into the history of that first, though. > > Hopefully the pagefault_disable/enable doesn't matter for this case. > > Can we get this one-liner tested with the kernel robot for comparison? > I really think a messed-up LRU list could cause bad IO patterns, and > end up keeping dirty pages around that should be streaming out to disk > and re-used, so causing memory pressure etc for no good reason. > > I think the mapping->tree_lock issue that Dave sees is interesting > too, but the kswapd activity (and the extra locking it causes) could > also be a symptom of the same thing - memory pressure due to just > putting pages in the active file that simply shouldn't be there. So, removing mark_page_accessed() made the spinlock contention *worse*. 36.51% [kernel] [k] _raw_spin_unlock_irqrestore 6.27% [kernel] [k] copy_user_generic_string 3.73% [kernel] [k] _raw_spin_unlock_irq 3.55% [kernel] [k] get_page_from_freelist 1.97% [kernel] [k] do_raw_spin_lock 1.72% [kernel] [k] __block_commit_write.isra.30 1.44% [kernel] [k] __wake_up_bit 1.41% [kernel] [k] shrink_page_list 1.24% [kernel] [k] __radix_tree_lookup 1.03% [kernel] [k] xfs_log_commit_cil 0.99% [kernel] [k] free_hot_cold_page 0.96% [kernel] [k] end_buffer_async_write 0.95% [kernel] [k] delay_tsc 0.94% [kernel] [k] ___might_sleep 0.93% [kernel] [k] kmem_cache_alloc 0.90% [kernel] [k] unlock_page 0.82% [kernel] [k] kmem_cache_free 0.74% [kernel] [k] up_write 0.72% [kernel] [k] node_dirty_ok 0.66% [kernel] [k] clear_page_dirty_for_io 0.65% [kernel] [k] __mark_inode_dirty 0.64% [kernel] [k] __block_write_begin_int 0.58% [kernel] [k] xfs_inode_item_format 0.57% [kernel] [k] __memset 0.57% [kernel] [k] cancel_dirty_page 0.56% [kernel] [k] down_write 0.54% [kernel] [k] page_evictable 0.53% [kernel] [k] page_mapping 0.52% [kernel] [k] __slab_free 0.49% [kernel] [k] xfs_do_writepage 0.49% [kernel] [k] drop_buffers - 41.82%41.82% [kernel][k] _raw_spin_unlock_irqrestore - 35.93% ret_from_fork - kthread - 29.76% kswapd shrink_node shrink_node_memcg.isra.75 shrink_inactive_list shrink_page_list __remove_mapping _raw_spin_unlock_irqrestore - 7.13% worker_thread - process_one_work - 4.40% wb_workfn wb_writeback __writeback_inodes_wb writeback_sb_inodes __writeback_single_inode do_writepages xfs_vm_writepages write_cache_pages xfs_do_writepage - 2.71% xfs_end_io xfs_destroy_ioend end_buffer_async_write end_page_writeback test_clear_page_writeback _raw_spin_unlock_irqrestore + 4.88% __libc_pwrite The kswapd contention has jumped from 20% to 30% of the CPU time in the profiles. I can't see how changing what LRU the page is on will improve the contention problem - at it's sources it's a N:1 problem where the writing process and N kswapd worker threads are all trying to access the same lock concurrently This is not the AIM7 problem we are looking for - what this test demonstrates is a fundamental page cache scalability issue at the design level - the mapping->tree_lock is a global serialisation point I'm now going to test Christoph's theory that this is an "overwrite doing lots of block mapping" issue. More on that to follow. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 3:16 PM, Al Viro wrote: > > Huh? The very first thing it does is > char *kaddr = kmap_atomic(page), *p = kaddr + offset; > > If _that_ does not disable pagefaults, we are very deep in shit. Right you are - it does, even with highmem disabled. Never mind, those pagefault_disabled/enabled() calls are clearly bogus. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 01:35:00PM -0700, Linus Torvalds wrote: > The thing is, iov_iter_copy_from_user_atomic() doesn't itself enforce > non-blocking user accesses, it depends on the caller blocking page > faults. Huh? The very first thing it does is char *kaddr = kmap_atomic(page), *p = kaddr + offset; If _that_ does not disable pagefaults, we are very deep in shit. AFAICS, all instances of that sucker do disable those, including the non-highmem default (it's static inline void *kmap_atomic(struct page *page) { preempt_disable(); pagefault_disable(); return page_address(page); } )
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
I'll need to dig into what AIM7 actually does in this benchmark, which isn't too easy as I'm on a business trip currently, but from the list below it looks like it keeps overwriting and overwriting a file that's already been allocated. This is a pretty stupid workload, but fortunately it should also be able to be optimized by skipping the actual block lookup, which is what the old buffer.c code happens to do.
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 2:16 PM, Huang, Ying wrote: > > Test result is as follow, Thanks. No change. > raw perf data: I redid my munging, with the old (good) percentages in parenthesis: intel_idle: 17.66 (16.88) copy_user_enhanced_fast_string:3.25(3.94) memset_erms: 2.56(3.26) xfs_bmapi_read:2.28 ___might_sleep:2.09(2.33) __block_commit_write.isra.24: 2.07(2.47) xfs_iext_bno_to_ext: 1.79 __block_write_begin_int: 1.74(1.56) up_write: 1.72(1.61) unlock_page: 1.69(1.69) down_write:1.59(1.55) __mark_inode_dirty:1.54(1.88) xfs_bmap_search_extents: 1.33 xfs_iomap_write_delay: 1.23 mark_buffer_dirty: 1.21(1.53) __radix_tree_lookup: 1.2 (1.32) xfs_bmap_search_multi_extents: 1.18 xfs_iomap_eof_want_preallocate.constprop.8:1.17 entry_SYSCALL_64_fastpath: 1.15(1.47) __might_sleep: 1.14(1.26) _raw_spin_lock:0.97(1.17) vfs_write: 0.94(1.14) xfs_bmapi_delay: 0.93 iomap_write_actor: 0.9 pagecache_get_page:0.89(1.03) xfs_file_write_iter: 0.86(1.03) xfs_file_iomap_begin: 0.81 iov_iter_copy_from_user_atomic:0.78(0.87) iomap_apply: 0.77 generic_write_end: 0.74(1.36) xfs_file_buffered_aio_write: 0.72(0.84) find_get_entry:0.69(0.79) __vfs_write: 0.67(0.87) and it's worth noting a few things: - most of the old percentages are bigger, but that's natural: the load used to take longer, and the more efficient (old) case thus has higher percent values. That doesn't mean it was slower, quite the reverse. - the main exception is intel_idle, so we do have more idle time. But the *big* difference is all the functions that didn't use to show up at all, and have no previous percent values: xfs_bmapi_read: 2.28 xfs_iext_bno_to_ext: 1.79 xfs_bmap_search_extents: 1.33 xfs_iomap_write_delay: 1.23 xfs_bmap_search_multi_extents: 1.18 xfs_iomap_eof_want_preallocate.constprop.8: 1.17 xfs_bmapi_delay: 0.93 iomap_write_actor: 0.9 xfs_file_iomap_begin: 0.81 iomap_apply: 0.77 and I think this really can explain the regression. That all adds up to 12% or so of "new overhead". Which is fairly close to the regression. (Ok, that is playing fast and loose with percentages, but I think it migth be "close enough" in practice). So for some reason the new code doesn't do a lot more per-page operations (the unlock_page() etc costs are fairly similar), but it has a *much* m ore expensive footprint in the xfs_bmap/iomap functions. The old code had almost no XFS footprint at all, and didn't need to look up block mappings etc, and worked almost entirely with the vfs caches (so used the block numbers in the buffers etc). And I know that DaveC often complains about vfs overhead, but the fact is, the VFS layer is optimized to hell and back and does really really well. Having to call down to filesystem routines (for block mappings etc) is when performance goes down. I think this is an example of that. And hey, maybe I'm just misreading things, or reading too much into those profiles. But it does look like that commit 68a9f5e7007c1afa2cf6830b690a90d0187c0684 ends up causing more xfs bmap activity. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Christoph Hellwig writes: > On Thu, Aug 11, 2016 at 12:51:31PM -0700, Linus Torvalds wrote: >> Ok. It does seem to also reset the active file page counts back, so >> that part did seem to be related, but yeah, from a performance >> standpoint that was clearly not a major issue. >> >> Let's hope Dave can figure out something based on his numbers, because >> I'm out of ideas. Or maybe it's the pagefault-atomic thing that >> Christoph was looking at. > > I can't really think of any reason why the pagefaul_disable() would > sіgnificantly change performance. Anyway, the patch for the is below > (on top of the previous mark_page_accessed() one), so feel free to > re-run the test with it. It would also be nice to see the profiles > with the two patches applied. > > commit 43106eea246074acc4bb7d12fdb91f58002f52ed > Author: Christoph Hellwig > Date: Thu Aug 11 10:41:40 2016 -0700 > > fs: remove superflous pagefault_disable from iomap_write_actor > > No idea where this really came from, generic_perform_write only briefly > did a pagefaul_disable when trying a different prefault scheme. > > Signed-off-by: Christoph Hellwig > > diff --git a/fs/iomap.c b/fs/iomap.c > index f39c318..74712e2 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -194,9 +194,7 @@ again: > if (mapping_writably_mapped(inode->i_mapping)) > flush_dcache_page(page); > > - pagefault_disable(); > copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); > - pagefault_enable(); > > flush_dcache_page(page); > Test result is as follow, commit e129b86bfacc8bb517b843fca41d0d179de7a4ca Author: Christoph Hellwig Date: Thu Aug 11 10:41:40 2016 -0700 fs: remove superflous pagefault_disable from iomap_write_actor No idea where this really came from, generic_perform_write only briefly did a pagefaul_disable when trying a different prefault scheme. 
Signed-off-by: Christoph Hellwig diff --git a/fs/iomap.c b/fs/iomap.c index f39c318..74712e2 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -194,9 +194,7 @@ again: if (mapping_writably_mapped(inode->i_mapping)) flush_dcache_page(page); - pagefault_disable(); copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes); - pagefault_enable(); flush_dcache_page(page); = compiler/cpufreq_governor/debug-setup/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase: gcc-6/performance/profile/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7 commit: f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506 68a9f5e7007c1afa2cf6830b690a90d0187c0684 e129b86bfacc8bb517b843fca41d0d179de7a4ca f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 e129b86bfacc8bb517b843fca4 -- -- %stddev %change %stddev %change %stddev \ |\ |\ 484435 ± 0% -13.3% 420004 ± 0% -11.6% 428323 ± 0% aim7.jobs-per-min 37.37 ± 0% +15.3% 43.09 ± 0% +13.0% 42.24 ± 0% aim7.time.elapsed_time 37.37 ± 0% +15.3% 43.09 ± 0% +13.0% 42.24 ± 0% aim7.time.elapsed_time.max 6491 ± 3% +30.8% 8491 ± 0% +19.9% 7781 ± 4% aim7.time.involuntary_context_switches 2378 ± 0% +1.1% 2404 ± 0% +9.2% 2598 ± 0% aim7.time.maximum_resident_set_size 376.89 ± 0% +28.4% 484.11 ± 0% +22.8% 462.92 ± 0% aim7.time.system_time 430512 ± 0% -20.1% 343838 ± 0% -17.3% 356032 ± 0% aim7.time.voluntary_context_switches 26816 ± 8% +10.2% 29542 ± 1% +13.5% 30428 ± 0% interrupts.CAL:Function_call_interrupts 1016 ± 8%+527.7% 6381 ± 59%+483.5% 5932 ± 82% latency_stats.sum.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_trans_get_buf_map.xfs_da_get_buf.xfs_dir3_data_init.xfs_dir2_sf_to_block.xfs_dir2_sf_addname.xfs_dir_createname.xfs_create.xfs_generic_create 125122 ± 10% -10.7% 111758 ± 12% +8.2% 135440 ± 3% softirqs.SCHED 410707 ± 12% +5.2% 432155 ± 11% +24.9% 512838 ± 4% softirqs.TIMER 24772 ± 0% -28.6% 17675 ± 0% -24.1% 18813 ± 1% vmstat.system.cs 53477 ± 2% +5.6% 56453 ± 0% +6.1% 56716 ± 0% vmstat.system.in 43469 ± 0% +5.3% 45792 ± 1% +25.0% 54343 ± 16% proc-vmstat.nr_active_anon 3906 ± 0% +28.8% 5032 ± 2% -48.8% 2000 ± 96% proc-vmstat.nr_active_file 919.33 ± 5% +14.8% 1055 ± 8% +17.8% 1083 ± 10% proc-vmstat.nr_dirty 2270 ± 0%
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 1:00 PM, Christoph Hellwig wrote:
>
> I can't really think of any reason why the pagefault_disable() would
> significantly change performance.

No, you're right, we prefault the page anyway.

And quite frankly, looking at it, I think the pagefault_disable/enable is
actually *correct*. The thing is, iov_iter_copy_from_user_atomic() doesn't
itself enforce non-blocking user accesses; it depends on the caller blocking
page faults. And the reason we want to block page faults is to make sure we
don't deadlock on the page we just locked for writing.

So looking at it, I think the pagefault_disable/enable is actually needed,
and it may in fact be a bug that mm/filemap.c doesn't do that. However, see
commit 00a3d660cbac ("Revert "fs: do not prefault sys_write() user buffer
pages"") about why mm/filemap.c doesn't do the pagefault_disable().

But depending on the prefaulting actually guaranteeing that we don't
deadlock sounds like a nasty race in theory (ie somebody does mmap tricks
in another thread in between the pre-faulting and the final copying).

Linus
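[Editor's note: for readers not steeped in this code path, below is a condensed, from-memory sketch of the write loop being argued about, modeled on the ~4.8-era generic_perform_write()/iomap_write_actor(). It is simplified and not the exact kernel source; variable declarations and the byte-count bookkeeping are omitted. It shows why the user copy has to be atomic while the destination page is locked, and where the prefault/retry dance that Linus calls racy sits.]

	do {
		/* prefault the user page(s) so the atomic copy below normally succeeds */
		if (unlikely(iov_iter_fault_in_readable(i, bytes)))
			return -EFAULT;

		status = iomap_write_begin(inode, pos, bytes, flags, &page, iomap);
		if (unlikely(status))
			break;
		/* the destination pagecache page is now locked */

		pagefault_disable();
		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		pagefault_enable();

		status = iomap_write_end(inode, pos, bytes, copied, page); /* unlocks the page */
		if (unlikely(status < 0))
			break;

		if (unlikely(copied == 0)) {
			/*
			 * The prefaulted page went away (reclaimed, or remapped by
			 * another thread) before the atomic copy ran.  Go around
			 * again and prefault it back in, rather than taking a page
			 * fault while holding the page lock and deadlocking.
			 */
			continue;
		}

		/* ... advance the iov_iter, pos and byte counts, then loop ... */
	} while (iov_iter_count(i) && length);

[In the theoretical race Linus mentions, another thread unmaps the prefaulted page between iov_iter_fault_in_readable() and the atomic copy; the copy then returns 0 and the loop retries instead of faulting under the page lock.]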
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 12:51:31PM -0700, Linus Torvalds wrote:
> Ok. It does seem to also reset the active file page counts back, so
> that part did seem to be related, but yeah, from a performance
> standpoint that was clearly not a major issue.
>
> Let's hope Dave can figure out something based on his numbers, because
> I'm out of ideas. Or maybe it's the pagefault-atomic thing that
> Christoph was looking at.

I can't really think of any reason why the pagefault_disable() would
significantly change performance. Anyway, the patch for that is below
(on top of the previous mark_page_accessed() one), so feel free to
re-run the test with it. It would also be nice to see the profiles
with the two patches applied.

commit 43106eea246074acc4bb7d12fdb91f58002f52ed
Author: Christoph Hellwig
Date:   Thu Aug 11 10:41:40 2016 -0700

    fs: remove superfluous pagefault_disable from iomap_write_actor

    No idea where this really came from, generic_perform_write only briefly
    did a pagefault_disable when trying a different prefault scheme.

    Signed-off-by: Christoph Hellwig

diff --git a/fs/iomap.c b/fs/iomap.c
index f39c318..74712e2 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -194,9 +194,7 @@ again:
 	if (mapping_writably_mapped(inode->i_mapping))
 		flush_dcache_page(page);
 
-	pagefault_disable();
 	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
-	pagefault_enable();
 
 	flush_dcache_page(page);
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 10:51 AM, Huang, Ying wrote: >> > > Here is the test result for the debug patch. It appears that the aim7 > score is a little better, but the regression is not recovered. Ok. It does seem to also reset the active file page counts back, so that part did seem to be related, but yeah, from a performance standpoint that was clearly not a major issue. Let's hope Dave can figure out something based on his numbers, because I'm out of ideas. Or maybe it's the pagefault-atomic thing that Christoph was looking at. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
Linus Torvalds writes:
> On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig wrote:
>>
>> The one liner below (not tested yet) to simply remove it should fix that
>> up. I also noticed we have a spurious pagefault_disable/enable, I
>> need to dig into the history of that first, though.
>
> Hopefully the pagefault_disable/enable doesn't matter for this case.
>
> Can we get this one-liner tested with the kernel robot for comparison?
> I really think a messed-up LRU list could cause bad IO patterns, and
> end up keeping dirty pages around that should be streaming out to disk
> and re-used, so causing memory pressure etc for no good reason.
>
> I think the mapping->tree_lock issue that Dave sees is interesting
> too, but the kswapd activity (and the extra locking it causes) could
> also be a symptom of the same thing - memory pressure due to just
> putting pages in the active file that simply shouldn't be there.
>
> So the trivial oneliner _might_ just explain things. It would be
> really nice if the regression turns out to be due to something so
> easily fixed.
>

Here is the test result for the debug patch. It appears that the aim7
score is a little better, but the regression is not recovered.

commit 5c70fdfdf82723e47ac180e36a7638ca06ea19d8
Author: Christoph Hellwig
Date:   Thu Aug 11 17:57:21 2016 +0200

    dbg fix 68a9f5e700: aim7.jobs-per-min -13.6% regression

    On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
    > The biggest difference is that we have "mark_page_accessed()" show up
    > after, and not before. There was also a lot of LRU noise in the
    > non-profile data. I wonder if that is the reason here: the old model
    > of using generic_perform_write/block_page_mkwrite didn't mark the
    > pages accessed, and now with iomap_file_buffered_write() they get
    > marked as active and that screws up the LRU list, and makes us not
    > flush out the dirty pages well (because they are seen as active and
    > not good for writeback), and then you get bad memory use.

    And that's actually a "bug" in the new code - mostly because I failed
    to pick up changes to the core code happening after we 'forked' it, in
    this case commit 2457ae ("mm: non-atomically mark page accessed during
    page cache allocation where possible").

    The one liner below (not tested yet) to simply remove it should fix
    that up. I also noticed we have a spurious pagefault_disable/enable, I
    need to dig into the history of that first, though.

diff --git a/fs/iomap.c b/fs/iomap.c
index 48141b8..f39c318 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -199,7 +199,6 @@ again:
 	pagefault_enable();
 
 	flush_dcache_page(page);
-	mark_page_accessed(page);
 
 	status = iomap_write_end(inode, pos, bytes, copied, page);
 	if (unlikely(status < 0))

=
compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
  gcc-6/performance/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

commit:
  f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506
  68a9f5e7007c1afa2cf6830b690a90d0187c0684
  5c70fdfdf82723e47ac180e36a7638ca06ea19d8

f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 5c70fdfdf82723e47ac180e36a
---------------- -------------------------- --------------------------
       %stddev      %change       %stddev      %change       %stddev
           \            |             \            |             \
    486586 ±  0%     -13.6%     420523 ±  0%     -11.2%     432165 ±  0%  aim7.jobs-per-min
     37.23 ±  0%     +15.5%      43.02 ±  0%     +12.4%      41.87 ±  0%  aim7.time.elapsed_time
     37.23 ±  0%     +15.5%      43.02 ±  0%     +12.4%      41.87 ±  0%  aim7.time.elapsed_time.max
      6424 ±  1%     +31.3%       8437 ±  1%     +24.4%       7992 ±  3%  aim7.time.involuntary_context_switches
      2489 ±  3%      -2.5%       2425 ±  1%      -1.0%       2465 ±  3%  aim7.time.maximum_resident_set_size
    151288 ±  0%      +3.0%     155764 ±  0%      +2.0%     154252 ±  1%  aim7.time.minor_page_faults
    376.31 ±  0%     +28.4%     483.23 ±  0%     +22.5%     460.96 ±  1%  aim7.time.system_time
    429058 ±  0%     -20.1%     343013 ±  0%     -16.7%     357231 ±  0%  aim7.time.voluntary_context_switches
     16014 ±  0%     +28.3%      20545 ±  1%      -4.0%      15369 ±  0%  meminfo.Active(file)
    127154 ±  9%     -15.5%     107424 ± 10%      -4.4%     121505 ± 11%  softirqs.SCHED
     24561 ±  0%     -27.1%      17895 ±  1%     -22.5%      19033 ±  1%  vmstat.system.cs
     47289 ±  0%      +1.4%      47938 ±  0%      +1.1%      47807 ±  0%  vmstat.system.in
      4003 ±  0%     +27.8%       5117 ±  1%      -4.0%       3842 ±  0%  proc-vmstat.nr_active_file
    979.25 ±  0%     +59.1%       1558 ±  5%     -11.8%     864.00 ±  1%  proc-vmstat.pgactivate
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig wrote: > > The one liner below (not tested yet) to simply remove it should fix that > up. I also noticed we have a spurious pagefault_disable/enable, I > need to dig into the history of that first, though. Hopefully the pagefault_disable/enable doesn't matter for this case. Can we get this one-liner tested with the kernel robot for comparison? I really think a messed-up LRU list could cause bad IO patterns, and end up keeping dirty pages around that should be streaming out to disk and re-used, so causing memory pressure etc for no good reason. I think the mapping->tree_lock issue that Dave sees is interesting too, but the kswapd activity (and the extra locking it causes) could also be a symptom of the same thing - memory pressure due to just putting pages in the active file that simply shouldn't be there. So the trivial oneliner _might_ just explain things. It would be really nice if the regression turns out to be due to something so easily fixed. Linus
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.

And that's actually a "bug" in the new code - mostly because I failed
to pick up changes to the core code happening after we 'forked' it, in
this case commit 2457ae ("mm: non-atomically mark page accessed during
page cache allocation where possible").

The one liner below (not tested yet) to simply remove it should fix
that up. I also noticed we have a spurious pagefault_disable/enable, I
need to dig into the history of that first, though.

diff --git a/fs/iomap.c b/fs/iomap.c
index 48141b8..f39c318 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -199,7 +199,6 @@ again:
 	pagefault_enable();
 
 	flush_dcache_page(page);
-	mark_page_accessed(page);
 
 	status = iomap_write_end(inode, pos, bytes, copied, page);
 	if (unlikely(status < 0))
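[Editor's note: as background for the LRU argument above, this is roughly what the removed call does. It is a simplified sketch of the ~4.7-era mm/swap.c logic; pagevec batching, compound-page handling and workingset accounting are left out.]

	/*
	 * Simplified sketch of mark_page_accessed().  Calling this on every
	 * buffered write promotes freshly dirtied pages from the inactive
	 * to the active LRU list, so reclaim and writeback treat them as
	 * "hot" even though they were only written once.
	 */
	void mark_page_accessed(struct page *page)
	{
		if (!PageActive(page) && !PageUnevictable(page) &&
		    PageReferenced(page)) {
			if (PageLRU(page))
				activate_page(page);	/* inactive,referenced -> active */
			ClearPageReferenced(page);
		} else if (!PageReferenced(page)) {
			SetPageReferenced(page);	/* first touch: just mark referenced */
		}
	}

[That promotion on every write is exactly the "marked as active and that screws up the LRU list" effect Linus describes above; the one-liner reverts the iomap path to the behaviour of generic_perform_write(), which no longer does this per-write.]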
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying wrote:
> >
> > Here it is,
>
> Thanks.
>
> Appended is a munged "after" list, with the "before" values in
> parenthesis. It actually looks fairly similar.
>
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.
>
> I'm not seeing anything that looks like locking-related.

Not in that profile. I've been doing some local testing inside a 4-node
fake-numa 16p/16GB RAM VM to see what I can find. I'm yet to work out how
I can trigger a profile like the one that was reported (I really need to
see the event traces), but in the mean time I found this.

Doing a large sequential single threaded buffered write using a 4k buffer
(so single page per syscall to make the XFS IO path allocator behave the
same way as in 4.7), I'm seeing a CPU profile that indicates we have a
potential mapping->tree_lock issue:

# xfs_io -f -c "truncate 0" -c "pwrite 0 47g" /mnt/scratch/fooey
wrote 50465865728/50465865728 bytes at offset 0
47.000 GiB, 12320768 ops; 0:01:36.00 (499.418 MiB/sec and 127850.9132 ops/sec)

  24.15%  [kernel]  [k] _raw_spin_unlock_irqrestore
   9.67%  [kernel]  [k] copy_user_generic_string
   5.64%  [kernel]  [k] _raw_spin_unlock_irq
   3.34%  [kernel]  [k] get_page_from_freelist
   2.57%  [kernel]  [k] mark_page_accessed
   2.45%  [kernel]  [k] do_raw_spin_lock
   1.83%  [kernel]  [k] shrink_page_list
   1.70%  [kernel]  [k] free_hot_cold_page
   1.26%  [kernel]  [k] xfs_do_writepage
   1.21%  [kernel]  [k] __radix_tree_lookup
   1.20%  [kernel]  [k] __wake_up_bit
   0.99%  [kernel]  [k] __block_write_begin_int
   0.95%  [kernel]  [k] find_get_pages_tag
   0.92%  [kernel]  [k] cancel_dirty_page
   0.89%  [kernel]  [k] unlock_page
   0.87%  [kernel]  [k] clear_page_dirty_for_io
   0.85%  [kernel]  [k] xfs_bmap_worst_indlen
   0.84%  [kernel]  [k] xfs_file_buffered_aio_write
   0.81%  [kernel]  [k] delay_tsc
   0.78%  [kernel]  [k] node_dirty_ok
   0.77%  [kernel]  [k] up_write
   0.74%  [kernel]  [k] ___might_sleep
   0.73%  [kernel]  [k] xfs_bmap_add_extent_hole_delay
   0.72%  [kernel]  [k] __fget_light
   0.67%  [kernel]  [k] add_to_page_cache_lru
   0.67%  [kernel]  [k] __slab_free
   0.63%  [kernel]  [k] drop_buffers
   0.59%  [kernel]  [k] down_write
   0.59%  [kernel]  [k] kmem_cache_alloc
   0.58%  [kernel]  [k] iomap_write_actor
   0.53%  [kernel]  [k] page_mapping
   0.52%  [kernel]  [k] entry_SYSCALL_64_fastpath
   0.52%  [kernel]  [k] __mark_inode_dirty
   0.51%  [kernel]  [k] __block_commit_write.isra.30
   0.51%  [kernel]  [k] xfs_file_write_iter
   0.49%  [kernel]  [k] mark_buffer_async_write
   0.47%  [kernel]  [k] balance_dirty_pages_ratelimited
   0.47%  [kernel]  [k] xfs_count_page_state
   0.47%  [kernel]  [k] page_evictable
   0.46%  [kernel]  [k] xfs_vm_releasepage
   0.46%  [kernel]  [k] xfs_iomap_write_delay
   0.46%  [kernel]  [k] do_raw_spin_unlock
   0.44%  [kernel]  [k] xfs_file_iomap_begin
   0.44%  [kernel]  [k] xfs_map_at_offset
   0.42%  [kernel]  [k] xfs_iext_bno_to_ext

There's very little XFS showing up near the top of the profile; it's all
page cache, writeback and spin lock traffic.

This is a dead give-away as to the lock being contended:

   - 33.30%  0.01%  [kernel]  [k] kswapd
      - 4.67% kswapd
         - 4.69% shrink_node
            - 4.77% shrink_node_memcg.isra.75
               - 7.38% shrink_inactive_list
                  - 6.70% shrink_page_list
                     - 20.02% __remove_mapping
                          19.90% _raw_spin_unlock_irqrestore

I don't think that this is the same as what aim7 is triggering as there's
no XFS write() path allocation functions near the top of the profile to
speak of. Still, I don't recall seeing this before...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
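[Editor's note: a brief aside on what that call chain means. __remove_mapping() is where reclaim drops a page out of the page cache, and it does so under mapping->tree_lock, the same per-mapping spinlock the write path takes when inserting pages via add_to_page_cache_lru(). A condensed, hedged sketch based on the ~4.7-era mm/vmscan.c; refcount rechecks, the swap-cache path and error handling are omitted.]

	static int __remove_mapping(struct address_space *mapping, struct page *page,
				    bool reclaimed)
	{
		unsigned long flags;

		spin_lock_irqsave(&mapping->tree_lock, flags);
		/* ... verify the page is still freeable (refcount, dirty checks) ... */
		__delete_from_page_cache(page, NULL);	/* drop from the radix tree */
		spin_unlock_irqrestore(&mapping->tree_lock, flags);

		return 1;
	}

[The unlock path dominating the profile is consistent with heavy contention on that lock: every write() inserting pages and every kswapd pass evicting them serialize on the same mapping->tree_lock for the one file being written.]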
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 10:36:59AM +0800, Ye Xiaolong wrote: > On 08/11, Dave Chinner wrote: > >On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote: > >> I need to see these events: > >> > >>xfs_file* > >>xfs_iomap* > >>xfs_get_block* > >> > >> For both kernels. An example trace from 4.8-rc1 running the command > >> `xfs_io -f -c 'pwrite 0 512k -b 128k' /mnt/scratch/fooey doing an > >> overwrite and extend of the existing file ends up looking like: > >> > >> $ sudo trace-cmd start -e xfs_iomap\* -e xfs_file\* -e xfs_get_blocks\* > >> $ sudo cat /sys/kernel/tracing/trace_pipe > >><...>-2946 [001] 253971.750304: xfs_file_ioctl: dev > >> 253:32 ino 0x84 > >> xfs_io-2946 [001] 253971.750938: xfs_file_buffered_write: > >> dev 253:32 ino 0x84 size 0x4 offset 0x0 count 0x2 > >> xfs_io-2946 [001] 253971.750961: xfs_iomap_found: dev > >> 253:32 ino 0x84 size 0x4 offset 0x0 count 131072 type invalid startoff > >> 0x0 startblock 24 blockcount 0x60 > >> xfs_io-2946 [001] 253971.751114: xfs_file_buffered_write: > >> dev 253:32 ino 0x84 size 0x4 offset 0x2 count 0x2 > >> xfs_io-2946 [001] 253971.751128: xfs_iomap_found: dev > >> 253:32 ino 0x84 size 0x4 offset 0x2 count 131072 type invalid > >> startoff 0x0 startblock 24 blockcount 0x60 > >> xfs_io-2946 [001] 253971.751234: xfs_file_buffered_write: > >> dev 253:32 ino 0x84 size 0x4 offset 0x4 count 0x2 > >> xfs_io-2946 [001] 253971.751236: xfs_iomap_found: dev > >> 253:32 ino 0x84 size 0x4 offset 0x4 count 131072 type invalid > >> startoff 0x0 startblock 24 blockcount 0x60 > >> xfs_io-2946 [001] 253971.751381: xfs_file_buffered_write: > >> dev 253:32 ino 0x84 size 0x4 offset 0x6 count 0x2 > >> xfs_io-2946 [001] 253971.751415: xfs_iomap_prealloc_size: > >> dev 253:32 ino 0x84 prealloc blocks 128 shift 0 m_writeio_blocks 16 > >> xfs_io-2946 [001] 253971.751425: xfs_iomap_alloc: dev > >> 253:32 ino 0x84 size 0x4 offset 0x6 count 131072 type invalid > >> startoff 0x60 startblock -1 blockcount 0x90 > >> > >> That's the output I need for the complete test - you'll need to use > >> a better recording mechanism that this (e.g. trace-cmd record, > >> trace-cmd report) because it will generate a lot of events. Compress > >> the two report files (they'll be large) and send them to me offlist. > > > >Can you also send me the output of xfs_info on the filesystem you > >are testing? > > Hi, Dave > > Here is the xfs_info output: > > # xfs_info /fs/ram0/ > meta-data=/dev/ram0 isize=256agcount=4, agsize=3145728 blks > = sectsz=4096 attr=2, projid32bit=1 > = crc=0finobt=0 > data = bsize=4096 blocks=12582912, imaxpct=25 > = sunit=0 swidth=0 blks > naming =version 2 bsize=4096 ascii-ci=0 ftype=0 > log =internal bsize=4096 blocks=6144, version=2 > = sectsz=4096 sunit=1 blks, lazy-count=1 > realtime =none extsz=4096 blocks=0, rtextents=0 OK, nothing unusual there. One thing that I did just think of - how close to ENOSPC does this test get? i.e. are we hitting the "we're almost out of free space" slow paths on this test? Cheers, dave. > > Thanks, > Xiaolong > > > >Cheers, > > > >Dave. > >-- > >Dave Chinner > >da...@fromorbit.com > >___ > >LKP mailing list > >l...@lists.01.org > >https://lists.01.org/mailman/listinfo/lkp > -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On 08/11, Dave Chinner wrote: >On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote: >> I need to see these events: >> >> xfs_file* >> xfs_iomap* >> xfs_get_block* >> >> For both kernels. An example trace from 4.8-rc1 running the command >> `xfs_io -f -c 'pwrite 0 512k -b 128k' /mnt/scratch/fooey doing an >> overwrite and extend of the existing file ends up looking like: >> >> $ sudo trace-cmd start -e xfs_iomap\* -e xfs_file\* -e xfs_get_blocks\* >> $ sudo cat /sys/kernel/tracing/trace_pipe >><...>-2946 [001] 253971.750304: xfs_file_ioctl: dev 253:32 >> ino 0x84 >> xfs_io-2946 [001] 253971.750938: xfs_file_buffered_write: >> dev 253:32 ino 0x84 size 0x4 offset 0x0 count 0x2 >> xfs_io-2946 [001] 253971.750961: xfs_iomap_found: dev 253:32 >> ino 0x84 size 0x4 offset 0x0 count 131072 type invalid startoff 0x0 >> startblock 24 blockcount 0x60 >> xfs_io-2946 [001] 253971.751114: xfs_file_buffered_write: >> dev 253:32 ino 0x84 size 0x4 offset 0x2 count 0x2 >> xfs_io-2946 [001] 253971.751128: xfs_iomap_found: dev 253:32 >> ino 0x84 size 0x4 offset 0x2 count 131072 type invalid startoff 0x0 >> startblock 24 blockcount 0x60 >> xfs_io-2946 [001] 253971.751234: xfs_file_buffered_write: >> dev 253:32 ino 0x84 size 0x4 offset 0x4 count 0x2 >> xfs_io-2946 [001] 253971.751236: xfs_iomap_found: dev 253:32 >> ino 0x84 size 0x4 offset 0x4 count 131072 type invalid startoff 0x0 >> startblock 24 blockcount 0x60 >> xfs_io-2946 [001] 253971.751381: xfs_file_buffered_write: >> dev 253:32 ino 0x84 size 0x4 offset 0x6 count 0x2 >> xfs_io-2946 [001] 253971.751415: xfs_iomap_prealloc_size: >> dev 253:32 ino 0x84 prealloc blocks 128 shift 0 m_writeio_blocks 16 >> xfs_io-2946 [001] 253971.751425: xfs_iomap_alloc: dev 253:32 >> ino 0x84 size 0x4 offset 0x6 count 131072 type invalid startoff 0x60 >> startblock -1 blockcount 0x90 >> >> That's the output I need for the complete test - you'll need to use >> a better recording mechanism that this (e.g. trace-cmd record, >> trace-cmd report) because it will generate a lot of events. Compress >> the two report files (they'll be large) and send them to me offlist. > >Can you also send me the output of xfs_info on the filesystem you >are testing? Hi, Dave Here is the xfs_info output: # xfs_info /fs/ram0/ meta-data=/dev/ram0 isize=256agcount=4, agsize=3145728 blks = sectsz=4096 attr=2, projid32bit=1 = crc=0finobt=0 data = bsize=4096 blocks=12582912, imaxpct=25 = sunit=0 swidth=0 blks naming =version 2 bsize=4096 ascii-ci=0 ftype=0 log =internal bsize=4096 blocks=6144, version=2 = sectsz=4096 sunit=1 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 Thanks, Xiaolong > >Cheers, > >Dave. >-- >Dave Chinner >da...@fromorbit.com >___ >LKP mailing list >l...@lists.01.org >https://lists.01.org/mailman/listinfo/lkp
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Thu, Aug 11, 2016 at 11:16:12AM +1000, Dave Chinner wrote: > I need to see these events: > > xfs_file* > xfs_iomap* > xfs_get_block* > > For both kernels. An example trace from 4.8-rc1 running the command > `xfs_io -f -c 'pwrite 0 512k -b 128k' /mnt/scratch/fooey doing an > overwrite and extend of the existing file ends up looking like: > > $ sudo trace-cmd start -e xfs_iomap\* -e xfs_file\* -e xfs_get_blocks\* > $ sudo cat /sys/kernel/tracing/trace_pipe ><...>-2946 [001] 253971.750304: xfs_file_ioctl: dev 253:32 > ino 0x84 > xfs_io-2946 [001] 253971.750938: xfs_file_buffered_write: dev > 253:32 ino 0x84 size 0x4 offset 0x0 count 0x2 > xfs_io-2946 [001] 253971.750961: xfs_iomap_found: dev 253:32 > ino 0x84 size 0x4 offset 0x0 count 131072 type invalid startoff 0x0 > startblock 24 blockcount 0x60 > xfs_io-2946 [001] 253971.751114: xfs_file_buffered_write: dev > 253:32 ino 0x84 size 0x4 offset 0x2 count 0x2 > xfs_io-2946 [001] 253971.751128: xfs_iomap_found: dev 253:32 > ino 0x84 size 0x4 offset 0x2 count 131072 type invalid startoff 0x0 > startblock 24 blockcount 0x60 > xfs_io-2946 [001] 253971.751234: xfs_file_buffered_write: dev > 253:32 ino 0x84 size 0x4 offset 0x4 count 0x2 > xfs_io-2946 [001] 253971.751236: xfs_iomap_found: dev 253:32 > ino 0x84 size 0x4 offset 0x4 count 131072 type invalid startoff 0x0 > startblock 24 blockcount 0x60 > xfs_io-2946 [001] 253971.751381: xfs_file_buffered_write: dev > 253:32 ino 0x84 size 0x4 offset 0x6 count 0x2 > xfs_io-2946 [001] 253971.751415: xfs_iomap_prealloc_size: dev > 253:32 ino 0x84 prealloc blocks 128 shift 0 m_writeio_blocks 16 > xfs_io-2946 [001] 253971.751425: xfs_iomap_alloc: dev 253:32 > ino 0x84 size 0x4 offset 0x6 count 131072 type invalid startoff 0x60 > startblock -1 blockcount 0x90 > > That's the output I need for the complete test - you'll need to use > a better recording mechanism that this (e.g. trace-cmd record, > trace-cmd report) because it will generate a lot of events. Compress > the two report files (they'll be large) and send them to me offlist. Can you also send me the output of xfs_info on the filesystem you are testing? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 10, 2016 at 05:33:20PM -0700, Huang, Ying wrote:
> Linus Torvalds writes:
>
> > On Wed, Aug 10, 2016 at 5:11 PM, Huang, Ying wrote:
> >>
> >> Here is the comparison result with perf-profile data.
> >
> > Heh. The diff is actually harder to read than just showing A/B
> > state. The fact that the call chain shows up as part of the symbol
> > makes it even more so.
> >
> > For example:
> >
> >> 0.00 ± -1%  +Inf%  1.68 ± 1%  perf-profile.cycles-pp.__add_to_page_cache_locked.add_to_page_cache_lru.pagecache_get_page.grab_cache_page_write_begin.iomap_write_begin
> >> 1.80 ± 1%  -100.0%  0.00 ± -1%  perf-profile.cycles-pp.__add_to_page_cache_locked.add_to_page_cache_lru.pagecache_get_page.grab_cache_page_write_begin.xfs_vm_write_begin
> >
> > Ok, so it went from 1.8% to 1.68%, and isn't actually that big of a
> > change, but it shows up as a big change because the caller changed
> > from xfs_vm_write_begin to iomap_write_begin.
> >
> > There's a few other cases of that too.
> >
> > So I think it would actually be easier to just see "what 20 functions
> > were the hottest" (or maybe 50) before and after separately (just
> > sorted by cycles), without the diff part. Because the diff is really
> > hard to read.
>
> Here it is,
>
> Before:
>
> "perf-profile.func.cycles-pp.intel_idle": 16.88,
> "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 3.94,
> "perf-profile.func.cycles-pp.memset_erms": 3.26,
> "perf-profile.func.cycles-pp.__block_commit_write.isra.24": 2.47,
> "perf-profile.func.cycles-pp.___might_sleep": 2.33,
> "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.88,
> "perf-profile.func.cycles-pp.unlock_page": 1.69,
> "perf-profile.func.cycles-pp.up_write": 1.61,
> "perf-profile.func.cycles-pp.__block_write_begin_int": 1.56,
> "perf-profile.func.cycles-pp.down_write": 1.55,
> "perf-profile.func.cycles-pp.mark_buffer_dirty": 1.53,
> "perf-profile.func.cycles-pp.entry_SYSCALL_64_fastpath": 1.47,
> "perf-profile.func.cycles-pp.generic_write_end": 1.36,
> "perf-profile.func.cycles-pp.generic_perform_write": 1.33,
> "perf-profile.func.cycles-pp.__radix_tree_lookup": 1.32,
> "perf-profile.func.cycles-pp.__might_sleep": 1.26,
> "perf-profile.func.cycles-pp._raw_spin_lock": 1.17,
> "perf-profile.func.cycles-pp.vfs_write": 1.14,
> "perf-profile.func.cycles-pp.__xfs_get_blocks": 1.07,

Ok, so that is the old block mapping call in the buffered IO path. I
don't see any of the functions it calls in the profile; specifically
xfs_bmapi_read(), and xfs_iomap_write_delay(), so it appears the extent
mapping and allocation overhead on the old code totals somewhere under
2-3% of the entire CPU usage.

> "perf-profile.func.cycles-pp.xfs_file_write_iter": 1.03,
> "perf-profile.func.cycles-pp.pagecache_get_page": 1.03,
> "perf-profile.func.cycles-pp.native_queued_spin_lock_slowpath": 0.98,
> "perf-profile.func.cycles-pp.get_page_from_freelist": 0.94,
> "perf-profile.func.cycles-pp.rwsem_spin_on_owner": 0.94,
> "perf-profile.func.cycles-pp.__vfs_write": 0.87,
> "perf-profile.func.cycles-pp.iov_iter_copy_from_user_atomic": 0.87,
> "perf-profile.func.cycles-pp.xfs_file_buffered_aio_write": 0.84,
> "perf-profile.func.cycles-pp.find_get_entry": 0.79,
> "perf-profile.func.cycles-pp._raw_spin_lock_irqsave": 0.78,
>
>
> After:
>
> "perf-profile.func.cycles-pp.intel_idle": 16.82,
> "perf-profile.func.cycles-pp.copy_user_enhanced_fast_string": 3.27,
> "perf-profile.func.cycles-pp.memset_erms": 2.6,
> "perf-profile.func.cycles-pp.xfs_bmapi_read": 2.24,

Straight away - that's at least 3x more block mapping lookup overhead
with the iomap code.

> "perf-profile.func.cycles-pp.___might_sleep": 2.04,
> "perf-profile.func.cycles-pp.mark_page_accessed": 1.93,
> "perf-profile.func.cycles-pp.__block_write_begin_int": 1.78,
> "perf-profile.func.cycles-pp.up_write": 1.72,
> "perf-profile.func.cycles-pp.xfs_iext_bno_to_ext": 1.7,

Plus this child.

> "perf-profile.func.cycles-pp.__block_commit_write.isra.24": 1.65,
> "perf-profile.func.cycles-pp.down_write": 1.51,
> "perf-profile.func.cycles-pp.__mark_inode_dirty": 1.51,
> "perf-profile.func.cycles-pp.unlock_page": 1.43,
> "perf-profile.func.cycles-pp.xfs_bmap_search_multi_extents": 1.25,
> "perf-profile.func.cycles-pp.xfs_bmap_search_extents": 1.23,

And these two.

> "perf-profile.func.cycles-pp.mark_buffer_dirty": 1.21,
> "perf-profile.func.cycles-pp.xfs_iomap_write_delay": 1.19,
> "perf-profile.func.cycles-pp.xfs_iomap_eof_want_preallocate.constprop.8": 1.15,

And these two.

So, essentially the old code had maybe 2-3% cpu usage overhead in the
block mapping path on this workload, but the new code is, for some
reason, showing at least 8-9% CPU usage overhead. That, right now, makes
no sense at all to me as we should be doing - at worst - exactly the same
number of block mapping calls as the old code. We need to know what is
happening here.
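[Editor's note: for orientation, here is where that extra mapping work sits in the new path. This is a condensed sketch of iomap_apply() as of roughly 4.8-rc1, simplified and not the exact fs/iomap.c code; locals and the length-trimming logic are omitted. In this aim7 workload every small write() syscall goes through it, so ->iomap_begin(), which for XFS is xfs_file_iomap_begin() and is where the xfs_bmapi_read()/xfs_iomap_write_delay() cycles in the "After" profile come from, runs once per syscall before any pages are touched.]

	static loff_t
	iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
		    struct iomap_ops *ops, void *data, iomap_actor_t actor)
	{
		struct iomap iomap = { 0 };
		loff_t written = 0;
		int ret;

		/* Ask the filesystem for a mapping covering [pos, pos + length). */
		ret = ops->iomap_begin(inode, pos, length, flags, &iomap);
		if (ret)
			return ret;

		/* Copy data page by page within that mapping (iomap_write_actor). */
		written = actor(inode, pos, length, data, &iomap);

		/* Let the filesystem finish off the mapping (e.g. trim unused delalloc). */
		if (ops->iomap_end)
			ret = ops->iomap_end(inode, pos, length,
					     written > 0 ? written : 0, flags, &iomap);

		return written ? written : ret;
	}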
Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression
On Wed, Aug 10, 2016 at 5:33 PM, Huang, Ying wrote:
>
> Here it is,

Thanks.

Appended is a munged "after" list, with the "before" values in
parenthesis. It actually looks fairly similar.

The biggest difference is that we have "mark_page_accessed()" show up
after, and not before. There was also a lot of LRU noise in the
non-profile data. I wonder if that is the reason here: the old model
of using generic_perform_write/block_page_mkwrite didn't mark the
pages accessed, and now with iomap_file_buffered_write() they get
marked as active and that screws up the LRU list, and makes us not
flush out the dirty pages well (because they are seen as active and
not good for writeback), and then you get bad memory use.

I'm not seeing anything that looks like locking-related.

And I may well have screwed up that list munging. I should have
automated it more than I did. Dave, Christoph?

Linus

---
intel_idle                                   16.82  (16.88)
copy_user_enhanced_fast_string                3.27  (3.94)
memset_erms                                   2.6   (3.26)
xfs_bmapi_read                                2.24
___might_sleep                                2.04  (2.33)
mark_page_accessed                            1.93
__block_write_begin_int                       1.78  (1.56)
up_write                                      1.72  (1.61)
xfs_iext_bno_to_ext                           1.7
__block_commit_write.isra.24                  1.65  (2.47)
down_write                                    1.51  (1.55)
__mark_inode_dirty                            1.51  (1.88)
unlock_page                                   1.43  (1.69)
xfs_bmap_search_multi_extents                 1.25
xfs_bmap_search_extents                       1.23
mark_buffer_dirty                             1.21  (1.53)
xfs_iomap_write_delay                         1.19
xfs_iomap_eof_want_preallocate.constprop.8    1.15
iomap_write_actor                             1.14
__might_sleep                                 1.12  (1.26)
__radix_tree_lookup                           1.08  (1.32)
entry_SYSCALL_64_fastpath                     1.07  (1.47)
pagecache_get_page                            0.95  (1.03)
_raw_spin_lock                                0.95  (1.17)
xfs_bmapi_delay                               0.93
vfs_write                                     0.92  (1.14)
xfs_file_write_iter                           0.86