On Wed, Dec 3, 2014 at 5:13 PM, Frederic Weisbecker <fweis...@gmail.com> wrote:
> On Wed, Dec 03, 2014 at 04:38:46PM -0800, Andy Lutomirski wrote:
>> On Wed, Dec 3, 2014 at 4:30 PM, Dave Jones <da...@redhat.com> wrote:
>> > On Wed, Dec 03, 2014 at 04:04:31PM -0800, Andy Lutomirski wrote:
>> >  > On Wed, Dec 3, 2014 at 3:58 PM, Frederic Weisbecker 
>> > <fweis...@gmail.com> wrote:
>> >  > > On Wed, Dec 03, 2014 at 03:18:41PM -0800, Andy Lutomirski wrote:
>> >  > >> It appears that some SCHEDULE_USER (asm for schedule_user) callers
>> >  > >> in arch/x86/kernel/entry_64.S are called from RCU kernel context,
>> >  > >> and schedule_user will return in RCU user context.  This causes RCU
>> >  > >> warnings and possible failures.
>> >  > >>
>> >  > >> This is intended to be a minimal fix suitable for 3.18.
>> >  > >>
>> >  > >> Reported-by: Dave Jones <da...@redhat.com>
>> >  > >> Cc: Oleg Nesterov <o...@redhat.com>
>> >  > >> Cc: Frédéric Weisbecker <fweis...@gmail.com>
>> >  > >> Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
>> >  > >> Signed-off-by: Andy Lutomirski <l...@amacapital.net>
>> >  > >
>> >  > > Ah, we sent it about at the same time :-)
>> >  > >
>> >  > > Might be too late for 3.18 though because it's not a regression.
>> >
>> > Wait, so how come that trace didn't start showing up until recently ?
>>
>> Looking at the code, it's because int_careful has the same bug, but
>> syscall_trace_leave does:
>>
>>     /*
>>      * We may come here right after calling schedule_user()
>>      * or do_notify_resume(), in which case we can be in RCU
>>      * user mode.
>>      */
>>     user_exit();
>>
>> which means that this issue was anticipated when that comment was written.
>
> Indeed, in fact it was expected to work as long as the code that follows the
> syscall is limited to schedule_user(), syscall_trace_leave() and 
> do_notify_resume().
> But if anything else is called and uses RCU, this doesn't work anymore.
>
> So user_enter() and user_exit() have been designed to be re-entrant on 
> purpose.
>
>>
>> Prior to the 3.18 seccomp changes and the _TIF_WORK typo fix, it would
>> have been difficult to hit sysret_audit when context tracking was on
>> (you could do it once on the way out from a syscall that enabled
>> context tracking).  So this is 3.18 regression.
>
> I see now.
>
> So the real problem is not on schedule_user(). It's rather that 
> __audit_syscall_exit()
> should we wrapped inside user_exit()/user_enter() or exception_foo(). The 
> latter
> is safer in a sensitive patch. That would be the real and simple regression 
> fix.
> Tweaking schedule_user() is more risky.
>
> Then, if you like, we can rethink the whole later, define 
> syscall_trace_leave()
> as the only place that calls user_enter() and all the other syscall exit 
> functions
> (schedule_user(), do_notify_resume(), __audit_syscall_exit()) can just call
> exception_enter() - exception_exit() if they can be called after 
> syscall_trace_leave().
> Then finally we can make user_enter and user_exit non-reentrant after careful 
> audit
> of how other archs use it (sounds scary though).
>
> Or better yet: if you rework the syscall exit slow path, lets call 
> user_enter() at the
> very end of the syscall.

So, to summarize the choices for 3.18:

1. Revert everything that caused this.  Probably a bad idea -- that
will be quite invasive, and it will no longer revert cleanly, and the
changes themselves were all correct as far as we know.

2. Add a giant hack to phase 1 tracing: if _TIF_WORK, then force phase
2.  That would (I think) restore the 3.17 status quo.  Yuck.

3. My asm deletion patch
(https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=1072a16a8d4ad1b11b8062f76e3236b9771b0fb6).
By itself, that will cause a huge performance hit to syscall auditing
users.

4. This patch.

5. Do something specific to fix __audit_syscall_exit.  That seems a
bit silly to be, since we'll want to revert it as soon as we fix the
larger problem.

I think I prefer choice 4, but 2 and 5 seem okay, too.  Or, if we're
okay with the temporary performance hit (Eric, rgb?) and someone wants
to review it (tglx? hpa? Oleg?) *and* confirm that its parent looks
reasonable in principle for 3.19, we could go with #3.

FWIW, choice 2 might be as simple as commenting out:

        work &= ~_TIF_NOHZ;

in syscall_trace_enter_phase1 and adding a comment that it's a
workaround pending some 3.19 fixes.  However, *it does not fix the
bug* -- it just makes it as hard to trigger as it was in 3.17 as
opposed to being really easy to trigger as in 3.18-rc7.

In the long run, I think that we need to fix SCHEDULE_USER *and* do
something like #3, since sysret_audit appears to have more issues than
just this.

--Andy

>
>>
>> The sysret_audit code is still totally screwed up AFAICT.  At the very
>> least, the whole mess rather strongly suggests that, if both context
>> tracking and audit are on, then __audit_syscall_exit is called *twice*
>> on each syscall.  __audit_syscall_exit seems to be idempotent, so
>> maybe no one has noticed that little glitch.
>>
>> I'll ask the x86 people to include my sysret_audit removal for 3.19,
>> since I think that this schedule_user change is a better last-minute
>> fix than removing a whole chunk of asm.
>>
>> --Andy
>>
>> >
>> >         Dave
>> >
>>
>>
>>
>> --
>> Andy Lutomirski
>> AMA Capital Management, LLC



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Reply via email to