Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-19 Thread Mateusz Guzik
On Mon, Nov 20, 2023 at 03:11:31PM +0800, kernel test robot wrote:
> 
> 
> Hello,
> 
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops 
> on:
> 
> 
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to 
> SLAB_TYPESAFE_BY_RCU")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> 93faf426e3cc000c 0ede61d8589cc2d93aa78230d74
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
[snip]
>      30.90 ±  4%      -20.6       10.35 ±  2%  perf-profile.self.cycles-pp.__fget_light
>       0.00            +26.5       26.48        perf-profile.self.cycles-pp.__get_file_rcu
[snip]

So __fget_light now does a function call to __get_file_rcu.

I don't know if this is worth patching (and benchmarking after), but if
sorting this out is of interest, the triviality below is probably the
easiest way out:

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..d8d3e18800c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -856,14 +856,14 @@ void do_close_on_exec(struct files_struct *files)
 	spin_unlock(&files->file_lock);
 }
 
-static struct file *__get_file_rcu(struct file __rcu **f)
+static __always_inline struct file *__get_file_rcu(struct file __rcu **f)
 {
struct file __rcu *file;
struct file __rcu *file_reloaded;
struct file __rcu *file_reloaded_cmp;
 
file = rcu_dereference_raw(*f);
-   if (!file)
+   if (unlikely(!file))
return NULL;
 
 	if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
@@ -891,7 +891,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
 * If the pointers don't match the file has been reallocated by
 * SLAB_TYPESAFE_BY_RCU.
 */
-   if (file == file_reloaded_cmp)
+   if (likely(file == file_reloaded_cmp))
return file_reloaded;
 
fput(file);


Re: [linus:master] [locking] c8afaa1b0f: stress-ng.zero.ops_per_sec 6.3% improvement

2023-08-15 Thread Mateusz Guzik
On 8/15/23, Linus Torvalds  wrote:
> On Tue, 15 Aug 2023 at 07:12, kernel test robot 
> wrote:
>>
>> kernel test robot noticed a 6.3% improvement of stress-ng.zero.ops_per_sec
>> on:
>
> WTF? That's ridiculous. Why would that even test new_inode() at all?
> And why would it make any difference anyway to prefetch a new inode?
> The 'zero' test claims to just read /dev/zero in a loop...
>
> [ Goes looking ]
>

Ye man, I was puzzled myself but just figured it out and was about to respond ;)

# bpftrace -e 'kprobe:new_inode { @[kstack()] = count(); }'
Attaching 1 probe...

@[
new_inode+1
shmem_get_inode+137
__shmem_file_setup+195
shmem_zero_setup+46
mmap_region+1937
do_mmap+956
vm_mmap_pgoff+224
do_syscall_64+46
entry_SYSCALL_64_after_hwframe+115
]: 2689570

The bench is doing this *A LOT* and it looks fishy -- both that the
benchmark does it and that the kernel goes along with it -- but I'm not
going to dig into any of that.
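
To make the path concrete, here is a minimal userspace sketch (mine, not
from the thread): every MAP_SHARED mapping of /dev/zero goes through
shmem_zero_setup() and allocates a fresh shmem inode, which is the
new_inode() call in the bpftrace stack above.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Opened read-write, as the stressor does for another part of the test. */
	int fd = open("/dev/zero", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (int i = 0; i < 100000; i++) {
		/* A shared mapping of /dev/zero cannot be treated as a plain
		 * private anonymous mapping; the kernel backs it with shmem. */
		char *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		volatile char c = *p;	/* actually read the zero page */
		(void)c;
		munmap(p, 4096);
	}

	close(fd);
	return 0;
}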

>>      39.35            -0.3       39.09
>> perf-profile.calltrace.cycles-pp.inode_sb_list_add.new_inode.shmem_get_inode.__shmem_file_setup.shmem_zero_setup
>
> Ahh. It also does the mmap side, and the shared case ends up always
> creating a new inode.
>
> And while the test only tests *reading* and the mmap is read-only, the
> /dev/zero file descriptor was opened for writing too, for a different
> part of a test.
>
> So even though the mapping is never written to, MAYWRITE is set, and
> so the /dev/zero mapping is done as a shared memory mapping and we
> can't do it as just a private one.
>
> That's kind of stupid and looks unintentional, but whatever.
>
> End result: that benchmark ends up being at least partly (and a fairly
> noticeable part) a shmem setup benchmark, for no actual good reason.
>
> Oh well. I certainly don't mind the removal apparently then also
> helping some odd benchmark case, but I don't think this translates to
> anything real. Very random.
>
> Linus
>


-- 
Mateusz Guzik 


Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

2023-08-09 Thread Mateusz Guzik
On 8/5/23, Suren Baghdasaryan  wrote:
> On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik  wrote:
>>
>> On 8/5/23, Linus Torvalds  wrote:
>> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik  wrote:
>> >>
>> >> I know of these guys, I think they are excluded as is -- they go
>> >> through access_remote_vm, starting with:
>> >> if (mmap_read_lock_killable(mm))
>> >> return 0;
>> >>
>> >> while dup_mmap already write locks the parent's mm.
>> >
>> > Oh, you're only worried about vma_start_write()?
>> >
>> > That's a non-issue. It doesn't take the lock normally, since it starts
>> > off
>> > with
>> >
>> > if (__is_vma_write_locked(vma, &mm_lock_seq))
>> > return;
>> >
>> > which catches on the lock sequence number already being set.
>> >
>> > So no extra locking there.
>> >
>> > Well, technically there's extra locking because the code stupidly
>> > doesn't initialize new vma allocations to the right sequence number,
>> > but that was talked about here:
>> >
>> >
>> > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/
>> >
>> > and it's a separate issue.
>> >
>>
>> I'm going to bet one beer this is the issue.
>>
>> The patch I'm responding to only consists of adding the call to
>> vma_start_write and claims the 5% slowdown from it, while fixing
>> crashes if the forking process is multithreaded.
>>
>> For the fix to work it has to lock something against the parent.
>>
>> VMA_ITERATOR(old_vmi, oldmm, 0);
>> [..]
>> for_each_vma(old_vmi, mpnt) {
>> [..]
>> vma_start_write(mpnt);
>>
>> the added line locks an obj in the parent's vm space.
>>
>> The problem you linked looks like pessimization for freshly allocated
>> vmas, but that's what is being operated on here.
>
> Sorry, now I'm having trouble understanding the problem you are
> describing. We are locking the parent's vma before copying it and the
> newly created vma is locked before it's added into the vma tree. What
> is the problem then?
>

Sorry for the late reply!

Looks like there has been a bunch of weird talking past one another in
this thread, and I don't think trying to straighten it all out is worth
the time.

I think at least the two of us agree that if a single-threaded process
enters dup_mmap and down_writes the mmap semaphore, then no new thread
can pop up in said process, thus no surprise page faults from that
angle. 3rd parties are supposed to use interfaces like access_remote_vm,
which down_read said semaphore and are consequently also not a problem.
The only worry here is someone messing with another process's memory
without taking the semaphore, but that is very unlikely and patchable in
the worst case -- someone(tm) just has to audit. With all these
conditions satisfied one can elide vma_start_write for a perf win.

Finally, I think we agreed you are going to do the audit ;)

Cheers,
-- 
Mateusz Guzik 


Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

2023-08-04 Thread Mateusz Guzik
On 8/5/23, Suren Baghdasaryan  wrote:
> On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik  wrote:
>> However, the other users (that I know of) go through the mmap
>> semaphore to mess with anything which means they will wait for
>> dup_mmap to finish (or do their work first). I would be surprised if
>> there were any cases which don't take the semaphore, given that it was
>> a requirement prior to the vma patchset (unless you patched some to no
>> longer need it?). I would guess worst case the semaphore can be added
>> if missing.
>
> No, the only mmap_lock read-lock that is affected is during the page
> fault, which is expected.
>

I have difficulty parsing your statement.

I am saying that any 3rd parties which can trigger page faults already
read lock mmap_lock or can be made to do it (and I don't know any case
which does not already, but I'm not willing to spend time poking
around to make sure). One can consider 3rd parties as not a problem,
modulo the audit.

Past that, is there any known source of trouble? In my original
e-mail I was worried about processes up the ancestry chain -- perhaps
some of the state is shared(?) and the locking at hand neuters any
problems. I'm guessing this is not necessary.

Bottom line though it looks like this will work fine?

That said, I'm not going to submit a patch I can't confidently defend.
As I did not dig into any of the VMA code and can't be arsed to audit
all places which mess with "foreign" mm, I'm definitely not submitting
this myself. You are most welcome to write your own variant at your
leisure. :)

-- 
Mateusz Guzik 


Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

2023-08-04 Thread Mateusz Guzik
On 8/5/23, Linus Torvalds  wrote:
> On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik  wrote:
>>
>> I know of these guys, I think they are excluded as is -- they go
>> through access_remote_vm, starting with:
>> if (mmap_read_lock_killable(mm))
>> return 0;
>>
>> while dup_mmap already write locks the parent's mm.
>
> Oh, you're only worried about vma_start_write()?
>
> That's a non-issue. It doesn't take the lock normally, since it starts off
> with
>
> if (__is_vma_write_locked(vma, &mm_lock_seq))
> return;
>
> which catches on the lock sequence number already being set.
>
> So no extra locking there.
>
> Well, technically there's extra locking because the code stupidly
> doesn't initialize new vma allocations to the right sequence number,
> but that was talked about here:
>
>
> https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/
>
> and it's a separate issue.
>

I'm going to bet one beer this is the issue.

The patch I'm responding to only consists of adding the call to
vma_start_write and claims the 5% slowdown from it, while fixing
crashes if the forking process is multithreaded.

For the fix to work it has to lock something against the parent.

VMA_ITERATOR(old_vmi, oldmm, 0);
[..]
for_each_vma(old_vmi, mpnt) {
[..]
vma_start_write(mpnt);

the added line locks an obj in the parent's vm space.

The problem you linked looks like pessimization for freshly allocated
vmas, but that's what is being operated on here.
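
For context, the fast path Linus refers to is the sequence-number check
at the top of vma_start_write(). Roughly (paraphrasing the ~6.5 code,
not verbatim):

static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	/* Already write-locked in this mmap_write_lock cycle: nothing to do. */
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

	/* Otherwise take and drop the per-VMA rwsem just to update the seqnum. */
	down_write(&vma->vm_lock->lock);
	WRITE_ONCE(vma->vm_lock_seq, mm_lock_seq);
	up_write(&vma->vm_lock->lock);
}

The fast path only helps when vm_lock_seq already matches mm_lock_seq;
a VMA whose sequence number was not set up front pays for the rwsem once
per mmap_write_lock cycle, which is the pessimization referenced above.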

-- 
Mateusz Guzik 


Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

2023-08-04 Thread Mateusz Guzik
On 8/5/23, Suren Baghdasaryan  wrote:
> On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan 
> wrote:
>>
>> On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds
>>  wrote:
>> >
>> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik  wrote:
>> > >
>> > > I know of these guys, I think they are excluded as is -- they go
>> > > through access_remote_vm, starting with:
>> > > if (mmap_read_lock_killable(mm))
>> > > return 0;
>> > >
>> > > while dup_mmap already write locks the parent's mm.
>> >
>> > Oh, you're only worried about vma_start_write()?
>> >
>> > That's a non-issue. It doesn't take the lock normally, since it starts
>> > off with
>> >
>> > if (__is_vma_write_locked(vma, &mm_lock_seq))
>> > return;
>> >
>> > which catches on the lock sequence number already being set.
>>
>> That check will prevent re-locking but if vma is not already locked
>> then the call will proceed with obtaining the lock and setting
>> vma->vm_lock_seq to mm->mm_lock_seq.
>
> The optimization Mateusz describes looks valid to me. If there is
> nobody else to fault a page and mm_users is stable (which I think it
> is because we are holding mmap_lock for write) then we can skip vma
> locking, I think.
>

mm_users is definitely *not* stable -- it can be bumped by
get_task_mm, which is only synchronized with task lock.
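
For reference, get_task_mm() (roughly as in kernel/fork.c) only holds the
task lock while bumping the count, so mm_users can move under someone who
merely holds mmap_lock:

struct mm_struct *get_task_mm(struct task_struct *task)
{
	struct mm_struct *mm;

	task_lock(task);
	mm = task->mm;
	if (mm) {
		if (task->flags & PF_KTHREAD)
			mm = NULL;
		else
			mmget(mm);	/* bumps mm_users; mmap_lock not involved */
	}
	task_unlock(task);
	return mm;
}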

However, the other users (that I know of) go through the mmap
semaphore to mess with anything which means they will wait for
dup_mmap to finish (or do their work first). I would be surprised if
there were any cases which don't take the semaphore, given that it was
a requirement prior to the vma patchset (unless you patched some to no
longer need it?). I would guess worst case the semaphore can be added
if missing.

What is guaranteed is that if the forking process is single-threaded,
there will be no threads added out of nowhere -- the only thread which
could create one is busy executing dup_mmap. If multithreaded operation
of the forking process was the only problem, that settles it.

>>
>> >
>> > So no extra locking there.
>> >
>> > Well, technically there's extra locking because the code stupidly
>> > doesn't initialize new vma allocations to the right sequence number,
>> > but that was talked about here:
>> >
>> >
>> > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/
>> >
>> > and it's a separate issue.
>> >
>> >   Linus
>


-- 
Mateusz Guzik 


Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

2023-08-04 Thread Mateusz Guzik
On 8/5/23, Linus Torvalds  wrote:
> On Fri, 4 Aug 2023 at 14:46, Mateusz Guzik  wrote:
>>
>> I don't see it mentioned in the discussion, so at a risk of ruffling
>> feathers or looking really bad I'm going to ask: is the locking of any
>> use if the forking process is single-threaded? [...]
>
> Sadly, we've always been able to access the mm from other processes,
> so the locking is - I think - unavoidable.
>
> And some of those "access from other processes" aren't even uncommon
> or special. It's things like "ps" etc, that do it just to see the
> process name and arguments.
>

I know of these guys, I think they are excluded as is -- they go
through access_remote_vm, starting with:
if (mmap_read_lock_killable(mm))
return 0;

while dup_mmap already write locks the parent's mm.

I don't see any surprise relocks of the semaphore.

Granted, should someone *bypass* this mechanism the above would be moot.

-- 
Mateusz Guzik 


Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking

2023-08-04 Thread Mateusz Guzik
On Sat, Jul 08, 2023 at 12:12:12PM -0700, Suren Baghdasaryan wrote:
[..]
> Lock VMAs of the parent process when forking a child, which prevents
> concurrent page faults during fork operation and avoids this issue.
> This fix can potentially regress some fork-heavy workloads. Kernel build
> time did not show noticeable regression on a 56-core machine while a
> stress test mapping 10000 VMAs and forking 5000 times in a tight loop
> shows ~5% regression. If such fork time regression is unacceptable,
> disabling CONFIG_PER_VMA_LOCK should restore its performance. Further
> optimizations are possible if this regression proves to be problematic.
> 
> ---
>  kernel/fork.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b85814e614a5..d2e12b6d2b18 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>   for_each_vma(old_vmi, mpnt) {
>   struct file *file;
>  
> + vma_start_write(mpnt);
>   if (mpnt->vm_flags & VM_DONTCOPY) {
>   vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>   continue;
> 

I don't see it mentioned in the discussion, so at the risk of ruffling
feathers or looking really bad I'm going to ask: is the locking of any
use if the forking process is single-threaded? The singular thread in
this case is occupied executing this very code, so it can't do any other
operation in parallel. Is there anyone else who could trigger a page
fault? Are these VMAs shared with other processes? Cursory reading
suggests a private copy is made here, so my guess is no. But then again,
I landed here freshly from the interwebz.

Or in short: if nobody can mess up the state if the forking process is
single-threaded, why not check for mm_users or whatever other indicator
to elide the slowdown for the (arguably) most common case?

If the state can be messed up anyway, that's a shame, but short
explanation how would be welcome.

to illustrate (totally untested):
diff --git a/kernel/fork.c b/kernel/fork.c
index d2e12b6d2b18..aac6b08a0b21 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -652,6 +652,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
LIST_HEAD(uf);
VMA_ITERATOR(old_vmi, oldmm, 0);
VMA_ITERATOR(vmi, mm, 0);
+   bool singlethreaded = atomic_read(&oldmm->mm_users) == 1;

uprobe_start_dup_mmap();
if (mmap_write_lock_killable(oldmm)) {
@@ -686,7 +687,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
for_each_vma(old_vmi, mpnt) {
struct file *file;

-   vma_start_write(mpnt);
+   if (!singlethreaded)
+   vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
continue;


[PATCH] lockref: stop doing cpu_relax in the cmpxchg loop

2023-01-13 Thread Mateusz Guzik
On the x86-64 architecture even a failing cmpxchg grants exclusive
access to the cacheline, making it preferable to retry the failed op
immediately instead of stalling with the pause instruction.
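
For readers not staring at lib/lockref.c, here is a self-contained
userspace approximation of the loop in question (mine, not the kernel
macro; the union layout and function name are illustrative only):

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct lockref: spinlock word and refcount share one
 * 64-bit quantity so both can be cmpxchg'd together. */
union lockref_like {
	uint64_t lock_count;
	struct {
		uint32_t lock;	/* 0 == unlocked */
		uint32_t count;
	};
};

static bool get_not_zero(union lockref_like *lr)
{
	union lockref_like old, new;

	old.lock_count = __atomic_load_n(&lr->lock_count, __ATOMIC_RELAXED);
	for (int retry = 100; retry; retry--) {
		if (old.lock != 0)	/* spinlock held: fall back to locking */
			break;
		if (old.count == 0)
			return false;
		new = old;
		new.count++;
		/* On failure 'old' is refreshed and we retry immediately --
		 * this is the spot where the kernel loop used to execute
		 * cpu_relax()/pause, which this patch removes. */
		if (__atomic_compare_exchange_n(&lr->lock_count, &old.lock_count,
						new.lock_count, false,
						__ATOMIC_RELAXED, __ATOMIC_RELAXED))
			return true;
	}
	return false;
}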

To illustrate the impact, below are benchmark results obtained by
running various will-it-scale tests on top of the 6.2-rc3 kernel
and Cascade Lake (2 sockets * 24 cores * 2 threads) CPU.

All results in ops/s. Note there is some variance in re-runs, but
the code is consistently faster when contention is present.

open3 ("Same file open/close"):
proc    stock      no-pause
1       805603     814942    (+1%)
2       1054980    1054781   (-0%)
8       1544802    1822858   (+18%)
24      1191064    2199665   (+84%)
48      851582     1469860   (+72%)
96      609481     1427170   (+134%)

fstat2 ("Same file fstat"):
proc    stock      no-pause
1       3013872    3047636   (+1%)
2       4284687    4400421   (+2%)
8       3257721    5530156   (+69%)
24      2239819    5466127   (+144%)
48      1701072    5256609   (+209%)
96      1269157    6649326   (+423%)

Additionally, a kernel with a private patch to help access() scalability:
access2 ("Same file access"):
proc    stock      patched    patched+nopause
24      2378041    2005501    5370335   (-15% / +125%)

That is, fixing the problems in access() itself *reduces* scalability
once all of the cacheline ping-pong happens in lockref and the pause
instruction stalls every retry.

Note that fstat and access benchmarks are not currently integrated into
will-it-scale, but interested parties can find them in pull requests to
said project.

The code at hand has a rather tortured history. The first modification
showed up in d472d9d98b463dd7 ("lockref: Relax in cmpxchg loop"), written
with Itanium in mind. Later it got patched up to use an arch-dependent
macro to stop doing it on s390, where it caused a significant regression.
Said macro underwent several revisions and was ultimately eliminated,
going back to plain cpu_relax.

While I intended to only remove cpu_relax for x86-64, I got the
following comment from Linus:
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.

> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
> else with the potentially broken code"

Additionally, Will Deacon (maintainer of the arm64 port, one of the
architectures previously benchmarked):
> So, from the arm64 side of the fence, I'm perfectly happy just removing
> the cpu_relax() calls from lockref.

As such, come back full circle in history and whack it altogether.

Signed-off-by: Mateusz Guzik 
---
 lib/lockref.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 45e93ece8ba0..2afe4c5d8919 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -23,7 +23,6 @@
 		}							\
 		if (!--retry)						\
 			break;						\
-		cpu_relax();						\
 	}								\
 } while (0)
 
-- 
2.39.0



Re: lockref scalability on x86-64 vs cpu_relax

2023-01-13 Thread Mateusz Guzik
On 1/13/23, Linus Torvalds  wrote:
> Side note on your access() changes - if it turns out that you can
> remove all the cred games, we should possibly then revert my old
> commit d7852fbd0f04 ("access: avoid the RCU grace period for the
> temporary subjective credentials") which avoided the biggest issue
> with the unnecessary cred switching.
>
> I *think* access() is the only user of that special 'non_rcu' thing,
> but it is possible that the whole 'non_rcu' thing ends up mattering
> for cases where the cred actually does change because euid != uid (ie
> suid programs), so this would need a bit more effort to do performance
> testing on.
>

I don't think the games are avoidable. For one I found non-root
processes with non-empty cap_effective even on my laptop, albeit I did
not check how often something like this is doing access().

Discussion for another time.

> On Thu, Jan 12, 2023 at 5:36 PM Mateusz Guzik  wrote:
>> All that said, I think the thing to do here is to replace cpu_relax
>> with a dedicated arch-dependent macro, akin to the following:
>
> I would actually prefer just removing it entirely and see if somebody
> else hollers. You have the numbers to prove it hurts on real hardware,
> and I don't think we have any numbers to the contrary.
>
> So I think it's better to trust the numbers and remove it as a
> failure, than say "let's just remove it on x86-64 and leave everybody
> else with the potentially broken code"
>
[snip]
> Then other architectures can try to run their numbers, and only *if*
> it then turns out that they have a reason to do something else should
> we make this conditional and different on different architectures.
>
> Let's try to keep the code as common as possibly until we have hard
> evidence for special cases, in other words.
>

I did not want to make such a change without redoing the ThunderX2
benchmark, or at least something else arm64-y. I may be able to bench it
tomorrow on whatever arm-y stuff can be found on Amazon's EC2, assuming
no arm64 people show up with their results.

Even then IMHO the safest route is to patch it out on x86-64 and give
other people time to bench their archs as they get around to it, and
ultimately whack the thing if it turns out nobody benefits from it.
I would say that beats backpedaling on the removal, but I'm not going to
fight for it.

That said, does waiting for arm64 numbers and/or producing them for the
removal commit message sound like a plan? If so, I'll post soon(tm).

-- 
Mateusz Guzik