Hi Linus,

Some updates on this, we found the regression is related with the
percpu stuff change and BTRFS, though the exact relation is unknown yet. 

Some details below.

+ Michal who helped providing useful links for checking it.
+ Josef Bacik, for this is BTRFS related 

On Fri, Mar 19, 2021 at 11:21:44AM +0800, Feng Tang wrote:
> Hi Linus,
> 
> On Mon, Mar 15, 2021 at 01:42:50PM -0700, Linus Torvalds wrote:
> > On Sun, Mar 14, 2021 at 11:30 PM kernel test robot
> > <oliver.s...@intel.com> wrote:

> > > in testcase: fxmark
> > > on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz 
> > > with 80G memory
> > 
> > I think this must be some really random memory layout issue that
> > causes some false sharing or similar.
> > 
> > And it's not even that some fundamental data structure gets a
> > different layout, it literally is just either:
> > 
> >  (a) the (now smaller) array is allocated from a differently chunk,
> > and that then causes random cache effects with something else
> > 
> >  (b) the (old, and bigger) array was more spread out, and as a result
> > had different fields in different cachelines and less false sharing
> > 
> > Normally I'd say that (b) is the obvious case, except for the fact
> > that this is a __percpu array.
> > 
> > So in the common case there shouldn't be any cache contention _within_
> > the array itself. Any cache contention should be with something else
> > very hot that the change now randomly makes be in the same cache way
> > or whatever.
> > 
> > Afaik, only the flushing of the vmstats batches does access the percpu
> > arrays from other CPUs, so (b) is not _entirely_ impossible if
> > memcg_flush_percpu_vmstats() were to be very very very hot.
> > 
> > But the profile data doesn't show anything remotely like that.
> > 
> > In fact, the actual report seems to show things improving, ie things
> > like elapsed time going down:
> > 
> > >       1361            -9.5%       1232        fxmark.time.elapsed_time
> > >       1361            -9.5%       1232        fxmark.time.elapsed_time.max
> > >      25.67            +9.1%      28.00        
> > > fxmark.time.percent_of_cpu_this_job_got
> > >     343.68            -2.0%     336.76        fxmark.time.system_time
> > >      23.66            +2.5       26.12        mpstat.cpu.all.sys%
> > 
> > but I don't know what the benchmark actually does, so maybe it just
> > repeats things until it hits a certain confidence interval, and thus
> > "elapsed time" is immaterial.
> 
> I just checked the benchmark, seems it benchmarks the filesystem's
> scalability by doing file/inode opertions with different task numbers
> (from 1 to nr_cpus).
> 
> The benchmark has preparation and cleanup steps before/after every
> run, and for the 100 less seconds of 'fxmark.time.elapsed_time' you
> found, it is due to the newer kernel spends 100 seconds more in the
> cleanup step after run ( a point worth checking) 

Yes, the longer running time is because the cleanup ('umount' specifically)
of older kernel takes longer time, as it complets more FS operations. 

And perf-profile info in the original report was not accurate, as the
test tried 72/63/54/45/36/27/18/9/1 tasks, and perf data recorded may
not reflect the "63 tasks" subcase.

> > Honestly, normally if I were to get a report about "52% regression"
> > for a commit that is supposed to optimize something, I'd just revert
> > the commit as a case of "ok, that optimization clearly didn't work".
> > 
> > But there is absolutely no sign that this commit is actually the
> > culprit, or explanation for why that should be, and what could be
> > going on.
> > 
> > So I'm going to treat this as a "bisection failure, possibly due to
> > random code or data layout changes". Particularly since this seems to
> > be a 4-socket Xeon Phi machine, which I think is likely a very very
> > fragile system if there is some odd cache layout issue.
> 
> Oliver retested it and made it run 12 times in total, and the data
> is consistent. We tried some other test:
> * run other sub cases of this 'fxmark' which sees no regression
> * change 'btrfs' to 'ext4' of this case, and no regression
> * test  on Cascadelake platform, no regression
> 
> So the bitsection seems to be stable, though can't be explained yet. 
> 
> We checked the System map of the 2 kernels, and didn't find obvious
> code/data alignment change, which is expected, as the commit changes
> data structure which is usually dynamically allocated. 

We found with the commit some percpu related ops do have some change,
as shown in perf

old kernel
----------
  1.06%     0.69%  [kernel.kallsyms]         [k] __percpu_counter_sum           
                     -      -
  1.06% 
__percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write

  89.85%    88.17%  [kernel.kallsyms]         [k] 
native_queued_spin_lock_slowpath                    -      -
  45.27% 
native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write
  44.51% 
native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe;write


new kernel
----------
  1.33%     1.14%  [kernel.kallsyms]         [k] __percpu_counter_sum           
                         -      -
  1.33% 
__percpu_counter_sum;need_preemptive_reclaim.part.0;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe

  95.95%    95.31%  [kernel.kallsyms]         [k] 
native_queued_spin_lock_slowpath                        -      -
  48.56% 
native_queued_spin_lock_slowpath;_raw_spin_lock;btrfs_block_rsv_release;btrfs_inode_rsv_release;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe
  47.33% 
native_queued_spin_lock_slowpath;_raw_spin_lock;__reserve_bytes;btrfs_reserve_metadata_bytes;btrfs_delalloc_reserve_metadata;btrfs_buffered_write;btrfs_file_write_iter;new_sync_write;vfs_write;ksys_write;do_syscall_64;entry_SYSCALL_64_after_hwframe

__percpu_counter_sum is usually costy for platform with many CPUs and
it does rise much. It is called in fs/btrfs/space-info.c
        need_preemptive_reclaim
                ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
                delalloc = 
percpu_counter_sum_positive(&fs_info->delalloc_bytes);

And we did 2 experiments:
1. Change the percpu_counter_sum_positive() to percpu_counter_read_positive()
   which skips looping online CPUs to get the sum, inside 
need_preemptive_reclaim(),
   the regression is gone, and even better

2. We add some padding to restore the percpu stuff size, the regression
   is also gone.

 struct batched_lruvec_stat {
        s32 count[NR_VM_NODE_STAT_ITEMS];
+       s32 pad[NR_VM_NODE_STAT_ITEMS];
 };

So we think the regression is related with Muchun's patch and BTRFS.

And another thing I tried is changing the test tasks number from '63' to
'65', and the regression is also gone, and this is something mysterious!

This is only reproduced on a Xeon Phi, which lacks some perf events, so
tool like perf c2c can't be run to pin-point some data or text. 

Thanks,
Feng

Reply via email to