Re: [PATCH] tmpfs: avoid a little creat and stat slowdown

2015-11-13 Thread Hugh Dickins
On Fri, 13 Nov 2015, Huang, Ying wrote:
> 
> c435a390574d is the direct parent of afa2db2fb6f1 in its original git.
> 43819159da2b is your patch applied on top of v4.3-rc7.  The comparison
> of 43819159da2b with v4.3-rc7 is as follows:
...
> So your patch improved 11.9% from its base v4.3-rc7.  I think the other
> differences are caused by other changes.  Sorry for the confusion.

Thanks for getting back on this: that's rather what I was hoping to hear.

Of course, no user will care which commit is responsible for a slowdown,
and we may need to look further; but I couldn't make sense of it before,
so this was a relief.

Hugh


Re: [PATCH] tmpfs: avoid a little creat and stat slowdown

2015-11-13 Thread Huang, Ying
Hugh Dickins  writes:

> On Wed, 4 Nov 2015, Huang, Ying wrote:
>> Hugh Dickins  writes:
>> 
>> > LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
>> > blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
>> > benchmark.
>> >
>> > creat-clo does just what you'd expect from the name, and creat's O_TRUNC
>> > on 0-length file does indeed get into more overhead now shmem_setattr()
>> > tests "0 <= 0" instead of "0 < 0".
>> >
>> > I'm not sure how much we care, but I think it would not be too VW-like
>> > to add in a check for whether any pages (or swap) are allocated: if none
>> > are allocated, there's none to remove from the radix_tree.  At first I
>> > thought that check would be good enough for the unmaps too, but no: we
>> > should not skip the unlikely case of unmapping pages beyond the new EOF,
>> > which were COWed from holes which have now been reclaimed, leaving none.
>> >
>> > This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere,
>> > and running a debug config before and after: I hope those account for
>> > the lesser speedup.
>> >
>> > And probably someone has a benchmark where a thousand threads keep on
>> > stat'ing the same file repeatedly: forestall that report by adjusting
>> > v4.3 commit 44a30220bc0a ("shmem: recalculate file inode when fstat")
>> > not to take the spinlock in shmem_getattr() when there's no work to do.
>> >
>> > Reported-by: Ying Huang 
>> > Signed-off-by: Hugh Dickins 
>> 
>> Hi, Hugh,
>> 
>> Thanks a lot for your support!  The test on LKP shows that this patch
>> restores a big part of the regression!  In the following list,
>> 
>> c435a390574d012f8d30074135d8fcc6f480b484: the parent commit
>> afa2db2fb6f15f860069de94a1257db57589fe95: the first bad commit, which
>> introduced the performance regression
>> 43819159da2b77fedcf7562134d6003dccd6a068: the fixing patch
>
> Hi Ying,
>
> Thank you, for reporting, and for trying out the patch (which is now
> in Linus's tree as commit d0424c429f8e0555a337d71e0a13f2289c636ec9).
>
> But I'm disappointed by the result: do I understand correctly,
> that afa2db2fb6f1 made a -12.5% change, but the fix is still -5.6%
> from your parent comparison point?

Yes.

> If we value that microbenchmark
> at all (debatable), I'd say that's not good enough.

I think that is a good improvement.

> It does match with my own rough measurement, but I'd been hoping
> for better when done in a more controlled environment; and I cannot
> explain why "truncate prealloc blocks past i_size" creat-clo performance
> would not be fully corrected by "avoid a little creat and stat slowdown"
> (unless either patch adds subtle icache or dcache displacements).
>
> I'm not certain of how you performed the comparison.  Was the
> c435a390574d tree measured, then patch afa2db2fb6f1 applied on top
> of that and measured, then patch 43819159da2b applied on top of that
> and measured?  Or were there other intervening changes, which could
> easily add their own interference?

c435a390574d is the direct parent of afa2db2fb6f1 in its original git.
43819159da2b is your patch applied on top of v4.3-rc7.  The comparison
of 43819159da2b with v4.3-rc7 is as follows:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s

commit: 
  32b88194f71d6ae7768a29f87fbba454728273ee
  43819159da2b77fedcf7562134d6003dccd6a068

32b88194f71d6ae7 43819159da2b77fedcf7562134
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
    475224 ±  1%     +11.9%     531968 ±  1%  aim9.creat-clo.ops_per_sec
  10469094 ±201%     -52.3%    4998529 ±130%  latency_stats.avg.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
  18852332 ±223%     -73.5%    4998529 ±130%  latency_stats.max.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
  21758590 ±199%     -77.0%    4998529 ±130%  latency_stats.sum.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
   4817724 ±  0%      +9.6%    5280303 ±  1%  proc-vmstat.numa_hit
   4812582 ±  0%      +9.7%    5280287 ±  1%  proc-vmstat.numa_local
   8499767 ±  4%     +14.2%    9707953 ±  4%  proc-vmstat.pgalloc_normal
   8984075 ±  0%     +10.4%    9919044 ±  1%  proc-vmstat.pgfree
      9.22 ±  8%     +27.4%      11.75 ±  9%  sched_debug.cfs_rq[0]:/.nr_spread_over
      2667 ± 63%     +90.0%       5068 ± 37%  


Re: [PATCH] tmpfs: avoid a little creat and stat slowdown

2015-11-08 Thread Hugh Dickins
On Wed, 4 Nov 2015, Huang, Ying wrote:
> Hugh Dickins  writes:
> 
> > LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
> > blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
> > benchmark.
> >
> > creat-clo does just what you'd expect from the name, and creat's O_TRUNC
> > on 0-length file does indeed get into more overhead now shmem_setattr()
> > tests "0 <= 0" instead of "0 < 0".
> >
> > I'm not sure how much we care, but I think it would not be too VW-like
> > to add in a check for whether any pages (or swap) are allocated: if none
> > are allocated, there's none to remove from the radix_tree.  At first I
> > thought that check would be good enough for the unmaps too, but no: we
> > should not skip the unlikely case of unmapping pages beyond the new EOF,
> > which were COWed from holes which have now been reclaimed, leaving none.
> >
> > This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere,
> > and running a debug config before and after: I hope those account for
> > the lesser speedup.
> >
> > And probably someone has a benchmark where a thousand threads keep on
> > stat'ing the same file repeatedly: forestall that report by adjusting
> > v4.3 commit 44a30220bc0a ("shmem: recalculate file inode when fstat")
> > not to take the spinlock in shmem_getattr() when there's no work to do.
> >
> > Reported-by: Ying Huang 
> > Signed-off-by: Hugh Dickins 
> 
> Hi, Hugh,
> 
> Thanks a lot for your support!  The test on LKP shows that this patch
> restores a big part of the regression!  In the following list,
> 
> c435a390574d012f8d30074135d8fcc6f480b484: the parent commit
> afa2db2fb6f15f860069de94a1257db57589fe95: the first bad commit, which
> introduced the performance regression
> 43819159da2b77fedcf7562134d6003dccd6a068: the fixing patch

Hi Ying,

Thank you, for reporting, and for trying out the patch (which is now
in Linus's tree as commit d0424c429f8e0555a337d71e0a13f2289c636ec9).

But I'm disappointed by the result: do I understand correctly,
that afa2db2fb6f1 made a -12.5% change, but the fix is still -5.6%
from your parent comparison point?  If we value that microbenchmark
at all (debatable), I'd say that's not good enough.

It does match with my own rough measurement, but I'd been hoping
for better when done in a more controlled environment; and I cannot
explain why "truncate prealloc blocks past i_size" creat-clo performance
would not be fully corrected by "avoid a little creat and stat slowdown"
(unless either patch adds subtle icache or dcache displacements).

I'm not certain of how you performed the comparison.  Was the
c435a390574d tree measured, then patch afa2db2fb6f1 applied on top
of that and measured, then patch 43819159da2b applied on top of that
and measured?  Or were there other intervening changes, which could
easily add their own interference?

Hugh

> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
>   gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s
> 
> commit: 
>   c435a390574d012f8d30074135d8fcc6f480b484
>   afa2db2fb6f15f860069de94a1257db57589fe95
>   43819159da2b77fedcf7562134d6003dccd6a068
> 
> c435a390574d012f afa2db2fb6f15f860069de94a1 43819159da2b77fedcf7562134
> ---------------- -------------------------- --------------------------
>          %stddev      %change %stddev        %change %stddev
>              \           |       \              |       \
>     563556 ±  1%     -12.5%     493033 ±  5%      -5.6%     531968 ±  1%  aim9.creat-clo.ops_per_sec
>      11836 ±  7%     +11.4%      13184 ±  7%     +15.0%      13608 ±  5%  numa-meminfo.node1.SReclaimable
>   10121526 ±  3%     -12.1%    8897097 ±  5%      -4.1%    9707953 ±  4%  proc-vmstat.pgalloc_normal
>       9.34 ±  4%     -11.4%       8.28 ±  3%      -4.8%       8.88 ±  2%  time.user_time
>       3480 ±  3%      -2.5%       3395 ±  1%     -28.5%       2488 ±  3%  vmstat.system.cs
>     203275 ± 17%      -6.8%     189453 ±  5%     -34.4%     133352 ± 11%  cpuidle.C1-NHM.usage
>    8081280 ±129%     -93.3%     538377 ± 97%     +31.5%   10625496 ±106%  cpuidle.C1E-NHM.time
>       3144 ± 58%    +619.0%      22606 ± 56%    +903.9%      31563 ±  0%  numa-vmstat.node0.numa_other
>       2958 ±  7%     +11.4%       3295 ±  7%     +15.0%       3401 ±  5%  numa-vmstat.node1.nr_slab_reclaimable
>      45074 ±  5%     -43.4%      25494 ± 57%     -68.7%      14105 ±  2%  numa-vmstat.node2.numa_other
>      56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.active_objs
>       1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.active_slabs
>      56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.num_objs
>       1002 ±  0%      +0.0%   

Re: [PATCH] tmpfs: avoid a little creat and stat slowdown

2015-11-03 Thread Huang, Ying
Hugh Dickins  writes:

> LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
> blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
> benchmark.
>
> creat-clo does just what you'd expect from the name, and creat's O_TRUNC
> on 0-length file does indeed get into more overhead now shmem_setattr()
> tests "0 <= 0" instead of "0 < 0".
>
> I'm not sure how much we care, but I think it would not be too VW-like
> to add in a check for whether any pages (or swap) are allocated: if none
> are allocated, there's none to remove from the radix_tree.  At first I
> thought that check would be good enough for the unmaps too, but no: we
> should not skip the unlikely case of unmapping pages beyond the new EOF,
> which were COWed from holes which have now been reclaimed, leaving none.
>
> This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere,
> and running a debug config before and after: I hope those account for
> the lesser speedup.
>
> And probably someone has a benchmark where a thousand threads keep on
> stat'ing the same file repeatedly: forestall that report by adjusting
> v4.3 commit 44a30220bc0a ("shmem: recalculate file inode when fstat")
> not to take the spinlock in shmem_getattr() when there's no work to do.
>
> Reported-by: Ying Huang 
> Signed-off-by: Hugh Dickins 

Hi, Hugh,

Thanks a lot for your support!  The test on LKP shows that this patch
restores a big part of the regression!  In the following list,

c435a390574d012f8d30074135d8fcc6f480b484: the parent commit
afa2db2fb6f15f860069de94a1257db57589fe95: the first bad commit, which
introduced the performance regression
43819159da2b77fedcf7562134d6003dccd6a068: the fixing patch

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
  gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/lkp-wsx02/creat-clo/aim9/300s

commit: 
  c435a390574d012f8d30074135d8fcc6f480b484
  afa2db2fb6f15f860069de94a1257db57589fe95
  43819159da2b77fedcf7562134d6003dccd6a068

c435a390574d012f afa2db2fb6f15f860069de94a1 43819159da2b77fedcf7562134
---------------- -------------------------- --------------------------
         %stddev      %change %stddev        %change %stddev
             \           |       \              |       \
    563556 ±  1%     -12.5%     493033 ±  5%      -5.6%     531968 ±  1%  aim9.creat-clo.ops_per_sec
     11836 ±  7%     +11.4%      13184 ±  7%     +15.0%      13608 ±  5%  numa-meminfo.node1.SReclaimable
  10121526 ±  3%     -12.1%    8897097 ±  5%      -4.1%    9707953 ±  4%  proc-vmstat.pgalloc_normal
      9.34 ±  4%     -11.4%       8.28 ±  3%      -4.8%       8.88 ±  2%  time.user_time
      3480 ±  3%      -2.5%       3395 ±  1%     -28.5%       2488 ±  3%  vmstat.system.cs
    203275 ± 17%      -6.8%     189453 ±  5%     -34.4%     133352 ± 11%  cpuidle.C1-NHM.usage
   8081280 ±129%     -93.3%     538377 ± 97%     +31.5%   10625496 ±106%  cpuidle.C1E-NHM.time
      3144 ± 58%    +619.0%      22606 ± 56%    +903.9%      31563 ±  0%  numa-vmstat.node0.numa_other
      2958 ±  7%     +11.4%       3295 ±  7%     +15.0%       3401 ±  5%  numa-vmstat.node1.nr_slab_reclaimable
     45074 ±  5%     -43.4%      25494 ± 57%     -68.7%      14105 ±  2%  numa-vmstat.node2.numa_other
     56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.active_objs
      1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.active_slabs
     56140 ±  0%      +0.0%      56158 ±  0%     -94.4%       3120 ±  0%  slabinfo.Acpi-ParseExt.num_objs
      1002 ±  0%      +0.0%       1002 ±  0%     -92.0%      80.00 ±  0%  slabinfo.Acpi-ParseExt.num_slabs
      1079 ±  5%     -10.8%     962.00 ± 10%    -100.0%       0.00 ± -1%  slabinfo.blkdev_ioc.active_objs
      1079 ±  5%     -10.8%     962.00 ± 10%    -100.0%       0.00 ± -1%  slabinfo.blkdev_ioc.num_objs
    110.67 ± 39%     +74.4%     193.00 ± 46%    +317.5%     462.00 ±  8%  slabinfo.blkdev_queue.active_objs
    189.33 ± 23%     +43.7%     272.00 ± 33%    +151.4%     476.00 ± 10%  slabinfo.blkdev_queue.num_objs
      1129 ± 10%      -1.9%       1107 ±  7%     +20.8%       1364 ±  6%  slabinfo.blkdev_requests.active_objs
      1129 ± 10%      -1.9%       1107 ±  7%     +20.8%       1364 ±  6%  slabinfo.blkdev_requests.num_objs
      1058 ±  3%     -10.3%     949.00 ±  9%    -100.0%       0.00 ± -1%  slabinfo.file_lock_ctx.active_objs
      1058 ±  3%     -10.3%     949.00 ±  9%    -100.0%       0.00 ± -1%  slabinfo.file_lock_ctx.num_objs
      4060 ±  1%      -2.1%       3973 ±  1%     -10.5%       3632 ±  1%  slabinfo.files_cache.active_objs
      4060 ±  1%      -2.1%       3973 ±  1%     -10.5%       3632 ±  1%  slabinfo.files_cache.num_objs
     10001 ±  0%      -0.3%       9973 ±  0%     -61.1%       3888 ±  0%  

[PATCH] tmpfs: avoid a little creat and stat slowdown

2015-10-29 Thread Hugh Dickins
LKP reports that v4.2 commit afa2db2fb6f1 ("tmpfs: truncate prealloc
blocks past i_size") causes a 14.5% slowdown in the AIM9 creat-clo
benchmark.

creat-clo does just what you'd expect from the name, and creat's O_TRUNC
on 0-length file does indeed get into more overhead now shmem_setattr()
tests "0 <= 0" instead of "0 < 0".

I'm not sure how much we care, but I think it would not be too VW-like
to add in a check for whether any pages (or swap) are allocated: if none
are allocated, there's none to remove from the radix_tree.  At first I
thought that check would be good enough for the unmaps too, but no: we
should not skip the unlikely case of unmapping pages beyond the new EOF,
which were COWed from holes which have now been reclaimed, leaving none.

This gives me an 8.5% speedup: on Haswell instead of LKP's Westmere,
and running a debug config before and after: I hope those account for
the lesser speedup.

And probably someone has a benchmark where a thousand threads keep on
stat'ing the same file repeatedly: forestall that report by adjusting
v4.3 commit 44a30220bc0a ("shmem: recalculate file inode when fstat")
not to take the spinlock in shmem_getattr() when there's no work to do.
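
To make that second half concrete, here is a minimal, hypothetical sketch
of such a stat-heavy workload: many threads fstat()ing one shared tmpfs
file in a loop, so that shmem_getattr() becomes the hot path.  Thread
count, iteration count and path are assumptions, not taken from any real
report; build with -lpthread.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 64             /* "a thousand" scaled down, arbitrary */
#define ITERS    100000

static int shared_fd;

static void *stat_loop(void *arg)
{
        struct stat st;
        long i;

        (void)arg;
        for (i = 0; i < ITERS; i++)
                fstat(shared_fd, &st);  /* reaches shmem_getattr() on tmpfs */
        return NULL;
}

int main(void)
{
        pthread_t threads[NTHREADS];
        int i;

        /* assumed tmpfs-backed path; adjust to any tmpfs mount */
        shared_fd = open("/dev/shm/stat-test", O_CREAT | O_RDWR, 0600);
        if (shared_fd < 0) {
                perror("open");
                return 1;
        }
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&threads[i], NULL, stat_loop, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(threads[i], NULL);
        close(shared_fd);
        unlink("/dev/shm/stat-test");
        return 0;
}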

Reported-by: Ying Huang 
Signed-off-by: Hugh Dickins 
---
 mm/shmem.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

--- 4.3-rc7/mm/shmem.c  2015-09-12 18:30:20.857039763 -0700
+++ linux/mm/shmem.c    2015-10-25 11:49:19.931973850 -0700
@@ -548,12 +548,12 @@ static int shmem_getattr(struct vfsmount
         struct inode *inode = dentry->d_inode;
         struct shmem_inode_info *info = SHMEM_I(inode);
 
-        spin_lock(&info->lock);
-        shmem_recalc_inode(inode);
-        spin_unlock(&info->lock);
-
+        if (info->alloced - info->swapped != inode->i_mapping->nrpages) {
+                spin_lock(&info->lock);
+                shmem_recalc_inode(inode);
+                spin_unlock(&info->lock);
+        }
         generic_fillattr(inode, stat);
-
         return 0;
 }
 
@@ -586,10 +586,16 @@ static int shmem_setattr(struct dentry *
                 }
                 if (newsize <= oldsize) {
                         loff_t holebegin = round_up(newsize, PAGE_SIZE);
-                        unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
-                        shmem_truncate_range(inode, newsize, (loff_t)-1);
+                        if (oldsize > holebegin)
+                                unmap_mapping_range(inode->i_mapping,
+                                                        holebegin, 0, 1);
+                        if (info->alloced)
+                                shmem_truncate_range(inode,
+                                                        newsize, (loff_t)-1);
                         /* unmap again to remove racily COWed private pages */
-                        unmap_mapping_range(inode->i_mapping, holebegin, 0, 1);
+                        if (oldsize > holebegin)
+                                unmap_mapping_range(inode->i_mapping,
+                                                        holebegin, 0, 1);
                 }
         }
 

