Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-30 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman  
> wrote:
>>
>> For exec all I care about are user space threads.  So it appears the
>> freezer infrastructure adds very little.
>
> Yeah. 99% of the freezer stuff is for just adding the magic notations
> for kernel threads, and for any user space threads it seems the wrong
> interface.
>
>> Now to see if I can find another way to divert a task into a slow path
>> as it wakes up, so I don't need to manually wrap all of the sleeping
>> calls.  Something that plays nice with the scheduler.
>
> The thing is, how many places really care?
>
> Because I think there are like five of them. And they are all marked
> by taking cred_guard_mutex, or the file table lock.
>
> So it seems really excessive to then create some whole new "let's
> serialize every thread", when you actually don't care about any of it,
> except for a couple of very very special cases.
>
> If you care about "thread count stable", you care about exit() and
> clone().  You don't care about threads that are happily running - or
> sleeping - doing their own thing.
>
> So trying to catch those threads and freezing them really feels like
> entirely the wrong interface.

For me stopping the other threads has been a conceptually simple
direction that needs exploration even if it doesn't work out.

On that note I have something that is just a little longer than the
patch I posted that doesn't use the freezer, and leans on the fact that
TASK_INTERRUPTIBLE and TASK_KILLABLE sleeps can occasionally expect a
spurious wake up.  Which means (with the right locking) those two states
can be transformed into TASK_WAKEKILL to keep sleeping processes
sleeping.
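
Roughly, the core of it looks something like this (a sketch only, not
the actual patch; the helper name is made up and the locking needed to
exclude a concurrent wake up is glossed over):

/*
 * Sketch: narrow an interruptible or killable sleep to TASK_WAKEKILL so
 * that only a fatal signal resumes it.  This is only legitimate because
 * such sleepers must already tolerate spurious wake ups.  TASK_STOPPED
 * and TASK_TRACED are skipped as they do not handle spurious wake ups.
 */
static bool exec_keep_asleep(struct task_struct *t)
{
        long state = READ_ONCE(t->state);

        if (state & (__TASK_STOPPED | __TASK_TRACED))
                return false;   /* not handled yet */
        if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
                return false;   /* running, or uninterruptible sleep */

        WRITE_ONCE(t->state, TASK_WAKEKILL);
        return true;
}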

After that I only need one small test in get_signal to catch the
unlikely case of processes running in userspace.
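
Something along these lines inside get_signal, with the siglock held
(the SIGNAL_EXEC_FREEZE flag is invented purely for this illustration,
it is not from the patch):

        for (;;) {
                set_current_state(TASK_KILLABLE);
                if (!(signal->flags & SIGNAL_EXEC_FREEZE) ||
                    __fatal_signal_pending(current)) {
                        __set_current_state(TASK_RUNNING);
                        break;
                }
                spin_unlock_irq(&sighand->siglock);
                schedule();
                spin_lock_irq(&sighand->siglock);
        }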

I have not figured out TASK_STOPPED or TASK_TRACED yet as they do
not handle spurious wake ups.

So I think I can stop and keep stopped the other threads in the group
without too much code or complexity for other parts of the kernel
(assuming spurious wake ups are allowed).




Even with the other threads stopped, the code does not simplify as
much as I had hoped.  The code still has to deal with the existence
of the other threads.  So while races don't happen, and the code is
therefore easier to understand and make correct, the actual work of
walking the threads, making a count, etc. still remains.



The real sticky wicket right now is the unshare_files call.  When Al
moved the call in fd8328be874f ("[PATCH] sanitize handling of shared
descriptor tables in failing execve()") it introduced a regression that
causes POSIX file locks to be improperly removed during exec.
Unfortunately no one noticed for about a decade.

What caused the regression is that unshare_files is a noop if the usage
count is 1.  Which means that after de_thread unshare_files only does
something if our file table is shared by another process.  However when
unshare_files is called before de_thread in a multi-threaded process
unshare_files will always perform work.
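
That is, the behaviour is roughly this (a simplified paraphrase of
unshare_files/unshare_fd from memory, not the exact source):

static int unshare_files_sketch(struct files_struct **displaced)
{
        struct files_struct *old = current->files;
        struct files_struct *copy;
        int error = 0;

        if (atomic_read(&old->count) <= 1) {
                *displaced = NULL;      /* not shared: nothing to do */
                return 0;
        }

        /*
         * Shared: take a private copy.  Before de_thread() the sibling
         * threads still hold references, so a multi-threaded exec
         * always ends up here; after de_thread() this is normally a
         * noop.
         */
        copy = dup_fd(old, &error);
        if (!copy)
                return error;

        task_lock(current);
        current->files = copy;
        task_unlock(current);

        *displaced = old;               /* caller drops the old table */
        return 0;
}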

For the locks that set fl_owner to current->files, unsharing
current->files when it is unnecessary already causes problems, as we
now have an unnecessary change of owner during exec.

After the unnecessary change in owner the existing locks are
eventually removed at the end of bprm_execve:
bprm_execve()
    if (displaced)
        put_files_struct(displaced)
            filp_close()
                locks_remove_posix()
                    /* which removes the POSIX locks */



After 12 years, moving unshare_files back where it used to be is
also problematic, as with the addition of execveat we grew a clear
dependency on other threads not messing with our open files:

        /*
         * Record that a name derived from an O_CLOEXEC fd will be
         * inaccessible after exec. Relies on having exclusive access to
         * current->files (due to unshare_files above).
         */
        if (bprm->fdpath &&
            close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
                bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;


I have taken a cursory look and I don't expect any other issues, as the
only code in exec that messes with file descriptors is in fs/exec.c.
Everything else simply uses "struct file *".


Testing whether the file descriptor is close_on_exec is just there to
prevent trying to run a script through a close_on_exec file descriptor
that is part of the path to the script.  It is a quality of
implementation issue, so that the code does not fail later if userspace
tries something silly.

So I think we can safely just update the comment and say that if
userspace messes with the file descriptor it passes to exec, during
exec, userspace can keep both pieces, as it is a self-inflicted problem.

All of the other issues I see with reverting the location where the
file table is unshared also look like userspace shooting itself in the
foot and not a problem with correctness.

Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-28 Thread Aleksa Sarai
On 2020-07-27, Eric W. Biederman  wrote:
> To the best of my knowledge processes with more than one thread
> calling exec are not common, and as all of the threads will be killed
> by exec there does not appear to be any useful work a thread can
> reliably do during exec.

Every Go program which calls exec (this includes runc, Docker, LXD,
Kubernetes, et al) fills the niche of "multi-threaded program that calls
exec" -- all Go programs are multi-threaded and there's no way of
disabling this.  This will most likely cause a pretty bad performance
regression for basically all container workloads.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-28 Thread Eric W. Biederman
Aleksa Sarai  writes:

> On 2020-07-27, Eric W. Biederman  wrote:
>> To the best of my knowledge processes with more than one thread
>> calling exec are not common, and as all of the threads will be killed
>> by exec there does not appear to be any useful work a thread can
>> reliably do during exec.
>
> Every Go program which calls exec (this includes runc, Docker, LXD,
> Kubernetes, et al) fills the niche of "multi-threaded program that calls
> exec" -- all Go programs are multi-threaded and there's no way of
> disabling this.  This will most likely cause a pretty bad performance
> regression for basically all container workloads.

So it is a good point that container runtimes use Go, and that
fundamentally anything that uses Go will be multi-threaded.  Having just
looked closely at this, I don't think this is an issue in practice, even
at this early state of my code.

If those other threads are sleeping the code I have implemented should
be a no-op.

If those threads aren't sleeping you have undefined behavior, as at
some point the kernel will come through and kill those threads.

Further, unless I am completely mistaken, the container runtimes use
forkAndExecInChild from go/src/syscall/exec_linux.go, which performs a
vfork before performing the exec.
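
In rough C terms the shape is the following (an illustration of the
pattern only, not the Go runtime's actual code):

#include <unistd.h>

/*
 * The execve() runs in a freshly created, effectively single-threaded
 * child, so the multi-threaded parent never performs a multi-threaded
 * exec (and never enters de_thread) itself.
 */
static pid_t spawn(const char *path, char *const argv[], char *const envp[])
{
        pid_t pid = vfork();

        if (pid == 0) {                 /* child */
                execve(path, argv, envp);
                _exit(127);             /* only reached if execve failed */
        }
        return pid;                     /* parent: child pid, or -1 */
}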

Eric



Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-28 Thread Eric W. Biederman
Linus Torvalds  writes:

> On Mon, Jul 27, 2020 at 2:06 PM Eric W. Biederman  
> wrote:
>>
>> Therefore make it simpler to get exec correct by freezing the other
>> threads at the beginning of exec.  This removes an entire class of
>> races, and makes it tractable to fix some of the long standing
>> issues with exec.
>
> I hate the global state part of the freezer.
>
> It's also pointless. We don't want to trigger all the tests that
> various random driver kernel threads do.
>
> I also really don't like how now execve() by any random person will
> suddenly impact everything that might be doing freezing.

Yes.  system_freezing_cnt as an enable/disable that affects all
tasks in the system does seem like a misfeature for use in the context
of exec where only a few tasks need to be dealt with.

> It also makes for a possible _huge_ latency regression for execve(),
> since freezing really has never been a very low-latency operation.
>
> Other threads doing IO can now basically block execve() for a long
> long long time.

Hmm.  Potentially.  The synchronization with the other threads must
happen in a multi-threaded exec in de_thread.

So I need to look at the differences between where de_thread can kill
a thread and where the freezer can not freeze one.  I am hoping that
the freezer has already instrumented most of those sleeps, but I admit
I have not looked yet.

> Finally, I think your patch is fundamentally broken for another
> reason: it depends on CONFIG_FREEZER, and that isn't even required to
> be set!

Very true.  I absolutely missed that detail.

> So no, this is not at all acceptable in that form.
>
> Now, maybe we could _make_ it acceptable, by
>
>  (a) add a per-process freezer count to avoid the global state for this case

Perhaps even a single bit.


>  (b)  make a small subset of the freezing code available for the
> !CONFIG_FREEZER thing

The code that is controlled by CONFIG_FREEZER is just kernel/freezer.c
and include/linux/freezer.h, which are 177 and 303 lines respectively,
so not much.

Or are you thinking about all of the locations that already include
freezable sleeps?

>  (c) fix this "simple freezer" to not actually force wakeups etc, but
> catch things in the

To catch things in the scheduler I presume?

The thing is, the freezer code does not wake up anything that is in a
sleep it has wrapped, which I thought covered just about all of the
significant ones, but I need to verify that.
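
The wrapping amounts to this (paraphrasing freezable_schedule() from
include/linux/freezer.h):

static inline void freezable_schedule_sketch(void)
{
        freezer_do_not_count();  /* freezer counts this task as frozen */
        schedule();              /* normal sleep, never disturbed */
        freezer_count();         /* clears that and calls try_to_freeze() */
}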

> but honestly, at that point nothing of the "CONFIG_FREEZER" code even
> really exists any more. It would be more of a "execve_synchronize()"
> thing, where we'd catch things in the scheduler and/or system call
> entry/exit or whatever.

Yes.  Where we catch things seems to be key.  I believe catching all
killable sleeps plus system call exit should be enough for this to be a
noop, as those are the places where the code can be killed now.

The tricky part is to mark processes that are sleeping in such a way
that when they wake up they go into a slow path and get trapped by the
freezing code.
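
One conceivable shape for the system call exit half, just to make that
concrete (the names are invented and the release path for a failed exec
is glossed over; this is not a posted patch):

/* Runs on the sibling's way back to userspace via task_work: park it
 * until a fatal signal (normally the exec killing it) arrives. */
static void exec_park(struct callback_head *work)
{
        for (;;) {
                set_current_state(TASK_KILLABLE);
                if (fatal_signal_pending(current))
                        break;
                schedule();
        }
        __set_current_state(TASK_RUNNING);
}

/* Called by the execing thread for each sibling t; park_work would
 * have to be a new per-task callback_head. */
static void exec_trap_at_syscall_exit(struct task_struct *t,
                                      struct callback_head *park_work)
{
        init_task_work(park_work, exec_park);
        task_work_add(t, park_work, true);      /* true: kick the task */
}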

> Also, that makes these kinds of nasty hacks that just make the
> existing freezer code even harder to figure out:


>> A new function exec_freeze_threads based upon
>> kernel/power/process.c:try_to_freeze_tasks is added.  To play well
>> with other uses of the kernel freezer it uses a killable sleep wrapped
>> with freezer_do_not_count/freezer_count.
>
> Ugh. Just _ugly_.
>
> And honestly, completely and utterly broken. See above.
>
> I understand the wish to re-use existing infrastructure. But the fact
> is, the FREEZER code is just about the _last_ thing you should want to
> use. That, and stop_machine(), is just too much of a big hammer.

Part of my challenge is that the more layers that get put around a
sleep, the trickier it is to subsequently wrap it.

I can see the point of building something very simple and more
fundamental that doesn't need as much support as the current freezer.
Perhaps something the freezer can then be rebuilt upon.


If the basic idea of stopping other threads early before we kill them in
exec sounds plausible then I will see what I can do.

Eric


Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-28 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Linus Torvalds  writes:
>
>> It also makes for a possible _huge_ latency regression for execve(),
>> since freezing really has never been a very low-latency operation.
>>
>> Other threads doing IO can now basically block execve() for a long
>> long long time.
>
> Hmm.  Potentially.  The synchronization with the other threads must
> happen in a multi-threaded exec in de_thread.
>
> So I need to look at the differences between where de_thread can kill
> a thread and where the freezer can not freeze one.  I am hoping that
> the freezer has already instrumented most of those sleeps, but I admit
> I have not looked yet.

Alright, I have looked at the freezer a bit more and I now see that the
point of marking things freezable is for kernel threads rather than user
space threads.  I think there are 5, maybe 6, sleeps reachable by
userspace threads that are marked as freezable, and most of those are
callable from get_signal.

For exec all I care about are user space threads.  So it appears the
freezer infrastructure adds very little.

Now to see if I can find another way to divert a task into a slow path
as it wakes up, so I don't need to manually wrap all of the sleeping
calls.  Something that plays nice with the scheduler.

Eric



Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-28 Thread Linus Torvalds
On Tue, Jul 28, 2020 at 6:23 AM Eric W. Biederman  wrote:
>
> For exec all I care about are user space threads.  So it appears the
> freezer infrastructure adds very little.

Yeah. 99% of the freezer stuff is for just adding the magic notations
for kernel threads, and for any user space threads it seems the wrong
interface.

> Now to see if I can find another way to divert a task into a slow path
> as it wakes up, so I don't need to manually wrap all of the sleeping
> calls.  Something that plays nice with the scheduler.

The thing is, how many places really care?

Because I think there are like five of them. And they are all marked
by taking cred_guard_mutex, or the file table lock.

So it seems really excessive to then create some whole new "let's
serialize every thread", when you actually don't care about any of it,
except for a couple of very very special cases.

If you care about "thread count stable", you care about exit() and
clone().  You don't care about threads that are happily running - or
sleeping - doing their own thing.

So trying to catch those threads and freezing them really feels like
entirely the wrong interface.

 Linus


Re: [RFC][PATCH] exec: Freeze the other threads during a multi-threaded exec

2020-07-27 Thread Linus Torvalds
On Mon, Jul 27, 2020 at 2:06 PM Eric W. Biederman  wrote:
>
> Therefore make it simpler to get exec correct by freezing the other
> threads at the beginning of exec.  This removes an entire class of
> races, and makes it tractable to fix some of the long standing
> issues with exec.

I hate the global state part of the freezer.

It's also pointless. We don't want to trigger all the tests that
various random driver kernel threads do.

I also really don't like how now execve() by any random person will
suddenly impact everything that might be doing freezing.

It also makes for a possible _huge_ latency regression for execve(),
since freezing really has never been a very low-latency operation.

Other threads doing IO can now basically block execve() for a long
long long time.

Finally, I think your patch is fundamentally broken for another
reason: it depends on CONFIG_FREEZER, and that isn't even required to
be set!

So no, this is not at all acceptable in that form.

Now, maybe we could _make_ it acceptable, by

 (a) add a per-process freezer count to avoid the global state for this case

 (b)  make a small subset of the freezing code available for the
!CONFIG_FREEZER thing

 (c) fix this "simple freezer" to not actually force wakeups etc, but
catch things in the

but honestly, at that point nothing of the "CONFIG_FREEZER" code even
really exists any more. It would be more of a "execve_synchronize()"
thing, where we'd catch things in the scheduler and/or system call
entry/exit or whatever.

Also, that makes these kinds of nasty hacks that just make the
existing freezer code even harder to figure out:

> A new function exec_freeze_threads based upon
> kernel/power/process.c:try_to_freeze_tasks is added.  To play well
> with other uses of the kernel freezer it uses a killable sleep wrapped
> with freezer_do_not_count/freezer_count.

Ugh. Just _ugly_.

And honestly, completely and utterly broken. See above.

I understand the wish to re-use existing infrastructure. But the fact
is, the FREEZER code is just about the _last_ thing you should want to
use. That, and stop_machine(), is just too much of a big hammer.

Linus