Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-16 Thread Shan Hai

On 07/15/2011 08:20 PM, Benjamin Herrenschmidt wrote:

On Fri, 2011-07-15 at 11:32 -0400, Shan Hai wrote:

I agree with you, the problem could be triggered by accessing
any user space page which has kernel read only permission
in the page fault disabled context, the problem also affects
architectures which depend on SW dirty/young tracking as
stated by Benjamin in this thread.

In the e500 case, the commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef
removed the write permission fixup from TLB miss handlers and left it to
generic code, so it might be right time to fixup the write permission here
in the generic code.

But we can't. The must not modify the PTE from an interrupt context and
the "atomic" variants of user accesses can be called in such contexts.

I think the problem is that we try to actually do things other than just
"peek" at user memory (for backtraces etc...) but actually useful things
in page fault disabled contexts. That's bad and various archs mm were
designed with the assumption that this never happens.



Yes I understood, the *here* above means 'generic code' like futex code,
I am sorry for my ambiguous description.


If the futex case is seldom here, we could probably find a way to work
around in that specific case.



That's what my patch wants to do.


However, I -still- don't understand why gup didn't fixup the write
permission. gup doesn't set dirty ?



Yep, gup doesn't set dirty, because when the page fault
occurs on the kernel accessing a user page which is
read only to the kernel the following conditions hold,
- the page is present, because its a shared page
- the page is writable, because demand paging
sets up the pte for the current  process to so

The follow_page() called in the __get_user_page()
returns non NULL to its caller on the above mentioned
present and writable page, so the gup(.write=1) has no
chance to set pte dirty by calling handle_mm_fault

Thanks
Shan Hai
s

Cheers,
Ben.




___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Benjamin Herrenschmidt
On Fri, 2011-07-15 at 11:32 -0400, Shan Hai wrote:
> 
> I agree with you, the problem could be triggered by accessing
> any user space page which has kernel read only permission
> in the page fault disabled context, the problem also affects
> architectures which depend on SW dirty/young tracking as
> stated by Benjamin in this thread.
> 
> In the e500 case, the commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef
> removed the write permission fixup from TLB miss handlers and left it to
> generic code, so it might be right time to fixup the write permission here
> in the generic code.

But we can't. The must not modify the PTE from an interrupt context and
the "atomic" variants of user accesses can be called in such contexts.

I think the problem is that we try to actually do things other than just
"peek" at user memory (for backtraces etc...) but actually useful things
in page fault disabled contexts. That's bad and various archs mm were
designed with the assumption that this never happens.

If the futex case is seldom here, we could probably find a way to work
around in that specific case.

However, I -still- don't understand why gup didn't fixup the write
permission. gup doesn't set dirty ?

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Benjamin Herrenschmidt
On Fri, 2011-07-15 at 11:32 +0100, David Laight wrote:
> > The fault causing futex_atomic_cmpxchg_inatomic() is
> > protected by pagefault_disable(), so the page fault handler has
> > no chance to toggle the SW dirty/young tracking.
> 
> Perhaps that is the bug!
> Whatever pagefault_disable() does, it shouldn't disable the
> SW dirty/young tracking - which should only needs bits moving
> in the page table itself (and TLB update??) rather than any
> operations on the rest of the data areas.
> 
> It looks to me as though this could happen any time a page
> is marked inaccessible by the dirty/young tracking.
> Not just as a result of COW.

Except that for many architectures, there's a hard wired assumption that
the state of the PTEs won't change at interrupt time.

If we allow the "atomic" user accesses, we'll break that rule (think
about perf backtraces for example), and so would have to at -least-
disable interrupts around all the PTE accessors, or use atomic ops,
which will slow things down all over the place.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Peter Zijlstra
On Fri, 2011-07-15 at 11:32 +0100, David Laight wrote:
> > The fault causing futex_atomic_cmpxchg_inatomic() is
> > protected by pagefault_disable(), so the page fault handler has
> > no chance to toggle the SW dirty/young tracking.
> 
> Perhaps that is the bug!
> Whatever pagefault_disable() does, it shouldn't disable the
> SW dirty/young tracking - which should only needs bits moving
> in the page table itself (and TLB update??) rather than any
> operations on the rest of the data areas.
> 
> It looks to me as though this could happen any time a page
> is marked inaccessible by the dirty/young tracking.
> Not just as a result of COW.

I've thought of that as well, but couldn't find the actual code in the
ppc fault bits. It could be it relies on vma information to know what it
should allow due to lack of bits in the pte.

If it requires the vma, it requires mmap_sem and it thus has to avoid
the whole thing when pagefault_disabled().
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 06:32 AM, David Laight wrote:



The fault causing futex_atomic_cmpxchg_inatomic() is
protected by pagefault_disable(), so the page fault handler has
no chance to toggle the SW dirty/young tracking.

Perhaps that is the bug!
Whatever pagefault_disable() does, it shouldn't disable the
SW dirty/young tracking - which should only needs bits moving
in the page table itself (and TLB update??) rather than any
operations on the rest of the data areas.

It looks to me as though this could happen any time a page
is marked inaccessible by the dirty/young tracking.
Not just as a result of COW.



I agree with you, the problem could be triggered by accessing
any user space page which has kernel read only permission
in the page fault disabled context, the problem also affects
architectures which depend on SW dirty/young tracking as
stated by Benjamin in this thread.

In the e500 case, the commit 6cfd8990e27d3a491c1c605d6cbc18a46ae51fef
removed the write permission fixup from TLB miss handlers and left it to
generic code, so it might be right time to fixup the write permission here
in the generic code.


Thanks
Shan Hai


David




___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


RE: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread David Laight
 
> The fault causing futex_atomic_cmpxchg_inatomic() is
> protected by pagefault_disable(), so the page fault handler has
> no chance to toggle the SW dirty/young tracking.

Perhaps that is the bug!
Whatever pagefault_disable() does, it shouldn't disable the
SW dirty/young tracking - which should only needs bits moving
in the page table itself (and TLB update??) rather than any
operations on the rest of the data areas.

It looks to me as though this could happen any time a page
is marked inaccessible by the dirty/young tracking.
Not just as a result of COW.

David


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 05:50 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:

The whole scenario should be,
- the child process triggers a page fault at the first time access to
  the lock, and it got its own writable page, but its *clean* for
  the reason just for checking the status of the lock.
  I am sorry for above "unbreakable COW".
- the futex_lock_pi() is invoked because of the lock contention,
  and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
  it found out the lock is free so tries to write to the lock for
  reservation, a page fault occurs, because the page is read only
  for kernel(e500 specific), and returns -EFAULT to the caller
- the fault_in_user_writeable() tries to fix the fault,
  but from the get_user_pages() view everything is ok, because
  the COW was already broken, retry futex_lock_pi_atomic()

but that's a bug right there, gup(.write=1) _should_ be a complete write
fault, and as such toggle your sw dirty/young tracking.



The fault causing futex_atomic_cmpxchg_inatomic() is
protected by pagefault_disable(), so the page fault handler has
no chance to toggle the SW dirty/young tracking.

Thanks
Shan Hai


- futex_lock_pi_atomic() -->  futex_atomic_cmpxchg_inatomic(),
  another write protection page fault
- infinite loop


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Peter Zijlstra
On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:
> The whole scenario should be,
> - the child process triggers a page fault at the first time access to
>  the lock, and it got its own writable page, but its *clean* for
>  the reason just for checking the status of the lock.
>  I am sorry for above "unbreakable COW".
> - the futex_lock_pi() is invoked because of the lock contention,
>  and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
>  it found out the lock is free so tries to write to the lock for
>  reservation, a page fault occurs, because the page is read only
>  for kernel(e500 specific), and returns -EFAULT to the caller
> - the fault_in_user_writeable() tries to fix the fault,
>  but from the get_user_pages() view everything is ok, because
>  the COW was already broken, retry futex_lock_pi_atomic()

but that's a bug right there, gup(.write=1) _should_ be a complete write
fault, and as such toggle your sw dirty/young tracking.

> - futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
>  another write protection page fault
> - infinite loop

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Benjamin Herrenschmidt
On Fri, 2011-07-15 at 17:08 +0800, Shan Hai wrote:
> The whole scenario should be,
> - the child process triggers a page fault at the first time access to
>  the lock, and it got its own writable page, but its *clean* for
>  the reason just for checking the status of the lock.
>  I am sorry for above "unbreakable COW".
> - the futex_lock_pi() is invoked because of the lock contention,
>  and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
>  it found out the lock is free so tries to write to the lock for
>  reservation, a page fault occurs, because the page is read only
>  for kernel(e500 specific), and returns -EFAULT to the caller

There is nothing e500 specific about user read only pages being read
only for kernel. All architectures behave the same way here afaik.

_However_ there is something not totally x86-like in the fact that we
require handle_mm_fault() to deal with dirty and young tracking, which
means that we -will- fault for a non-dirty writeable page or for any
non-young page. It's quite possible that the page fault disabling occurs
before that and thus breaks those architectures (it's not only e500 and
afaik not only powerpc) while x86 works fine due to HW update of dirty
and young.

It might be something to look into.

Cheers,
Ben.

> - the fault_in_user_writeable() tries to fix the fault,
>  but from the get_user_pages() view everything is ok, because
>  the COW was already broken, retry futex_lock_pi_atomic()
> - futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
>  another write protection page fault
> - infinite loop
> 
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Benjamin Herrenschmidt
On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
> A page could be set to read only by the kernel (supervisor in the
> powerpc
> literature) on the e500, and that's what the kernel do. Set
> SW(supervisor
> write) bit in the TLB entry to grant write permission to the kernel on
> a
> page.
> 
> And further the SW bit is set according to the DIRTY flag of the PTE,
> PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
> page fault, the PTE.DIRTY never can be set, so do the SW bit,
> unbreakable
> COW occurred, infinite loop followed. 

That would be it ... the SW dirty and young tracking relies on faults to
fixup things in handle_pte_fault(). If the "disable page fault" thingy
happens before we get there, then we have a pretty nasty bug. Note that
this will hit more than just e500 (and in fact any architecture that
relies on SW to do dirty and young tracking).

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Shan Hai

On 07/15/2011 04:44 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:

On 07/15/2011 04:20 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:

The following test case could reveal a bug in the futex_lock_pi()

BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
  on Powerpc e500 core.
Cause: The linux kernel on the e500 core has no write permission on
  the COW page, refer the head comment of the following test code.

ftrace on test case:
[000]   353.990181: futex_lock_pi_atomic<-futex_lock_pi
[000]   353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
[snip]
[000]   353.990191: do_page_fault<-handle_page_fault
[000]   353.990192: bad_page_fault<-handle_page_fault
[000]   353.990193: search_exception_tables<-bad_page_fault
[snip]
[000]   353.990199: get_user_pages<-fault_in_user_writeable
[snip]
[000]   353.990208: mark_page_accessed<-follow_page
[000]   353.990222: futex_lock_pi_atomic<-futex_lock_pi
[snip]
[000]   353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
[ a loop occures here ]


But but but but, that get_user_pages(.write=1, .force=0) should result
in a COW break, getting our own writable page.

What is this e500 thing smoking that this doesn't work?

A page could be set to read only by the kernel (supervisor in the powerpc
literature) on the e500, and that's what the kernel do. Set SW(supervisor
write) bit in the TLB entry to grant write permission to the kernel on a
page.

And further the SW bit is set according to the DIRTY flag of the PTE,
PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
page fault, the PTE.DIRTY never can be set, so do the SW bit, unbreakable
COW occurred, infinite loop followed.

I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
mmap_sem, an pagefaul_disable() is akin to preemp_disable() on mainline.

Also get_user_pages() fully expects to be able to schedule, and in fact
can call the full pf handler path all by its lonesome self.


The whole scenario should be,
- the child process triggers a page fault at the first time access to
the lock, and it got its own writable page, but its *clean* for
the reason just for checking the status of the lock.
I am sorry for above "unbreakable COW".
- the futex_lock_pi() is invoked because of the lock contention,
and the futex_atomic_cmpxchg_inatomic() tries to get the lock,
it found out the lock is free so tries to write to the lock for
reservation, a page fault occurs, because the page is read only
for kernel(e500 specific), and returns -EFAULT to the caller
- the fault_in_user_writeable() tries to fix the fault,
but from the get_user_pages() view everything is ok, because
the COW was already broken, retry futex_lock_pi_atomic()
- futex_lock_pi_atomic() --> futex_atomic_cmpxchg_inatomic(),
another write protection page fault
- infinite loop

Thanks
Shan Hai


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Benjamin Herrenschmidt
On Fri, 2011-07-15 at 10:20 +0200, Peter Zijlstra wrote:
> But but but but, that get_user_pages(.write=1, .force=0) should result
> in a COW break, getting our own writable page.
> 
> What is this e500 thing smoking that this doesn't work? 

Right. That should have triggered the cow & flushed the TLB entry 

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Peter Zijlstra
On Fri, 2011-07-15 at 16:38 +0800, MailingLists wrote:
> On 07/15/2011 04:20 PM, Peter Zijlstra wrote:
> > On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
> >> The following test case could reveal a bug in the futex_lock_pi()
> >>
> >> BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
> >>  on Powerpc e500 core.
> >> Cause: The linux kernel on the e500 core has no write permission on
> >>  the COW page, refer the head comment of the following test code.
> >>
> >> ftrace on test case:
> >> [000]   353.990181: futex_lock_pi_atomic<-futex_lock_pi
> >> [000]   353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
> >> [snip]
> >> [000]   353.990191: do_page_fault<-handle_page_fault
> >> [000]   353.990192: bad_page_fault<-handle_page_fault
> >> [000]   353.990193: search_exception_tables<-bad_page_fault
> >> [snip]
> >> [000]   353.990199: get_user_pages<-fault_in_user_writeable
> >> [snip]
> >> [000]   353.990208: mark_page_accessed<-follow_page
> >> [000]   353.990222: futex_lock_pi_atomic<-futex_lock_pi
> >> [snip]
> >> [000]   353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
> >> [ a loop occures here ]
> >>
> >
> > But but but but, that get_user_pages(.write=1, .force=0) should result
> > in a COW break, getting our own writable page.
> >
> > What is this e500 thing smoking that this doesn't work?
> 
> A page could be set to read only by the kernel (supervisor in the powerpc
> literature) on the e500, and that's what the kernel do. Set SW(supervisor
> write) bit in the TLB entry to grant write permission to the kernel on a
> page.
> 
> And further the SW bit is set according to the DIRTY flag of the PTE,
> PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
> page fault, the PTE.DIRTY never can be set, so do the SW bit, unbreakable
> COW occurred, infinite loop followed.

I'm fairly sure fault_in_user_writeable() has PF enabled as it takes
mmap_sem, an pagefaul_disable() is akin to preemp_disable() on mainline.

Also get_user_pages() fully expects to be able to schedule, and in fact
can call the full pf handler path all by its lonesome self.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread MailingLists

On 07/15/2011 04:20 PM, Peter Zijlstra wrote:

On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:

The following test case could reveal a bug in the futex_lock_pi()

BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi()
 on Powerpc e500 core.
Cause: The linux kernel on the e500 core has no write permission on
 the COW page, refer the head comment of the following test code.

ftrace on test case:
[000]   353.990181: futex_lock_pi_atomic<-futex_lock_pi
[000]   353.990185: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
[snip]
[000]   353.990191: do_page_fault<-handle_page_fault
[000]   353.990192: bad_page_fault<-handle_page_fault
[000]   353.990193: search_exception_tables<-bad_page_fault
[snip]
[000]   353.990199: get_user_pages<-fault_in_user_writeable
[snip]
[000]   353.990208: mark_page_accessed<-follow_page
[000]   353.990222: futex_lock_pi_atomic<-futex_lock_pi
[snip]
[000]   353.990230: cmpxchg_futex_value_locked<-futex_lock_pi_atomic
[ a loop occures here ]



But but but but, that get_user_pages(.write=1, .force=0) should result
in a COW break, getting our own writable page.

What is this e500 thing smoking that this doesn't work?


A page could be set to read only by the kernel (supervisor in the powerpc
literature) on the e500, and that's what the kernel do. Set SW(supervisor
write) bit in the TLB entry to grant write permission to the kernel on a
page.

And further the SW bit is set according to the DIRTY flag of the PTE,
PTE.DIRTY is set in the do_page_fault(), the futex_lock_pi() disabled
page fault, the PTE.DIRTY never can be set, so do the SW bit, unbreakable
COW occurred, infinite loop followed.

Thanks
Shan Hai
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/1] Fixup write permission of TLB on powerpc e500 core

2011-07-15 Thread Peter Zijlstra
On Fri, 2011-07-15 at 16:07 +0800, Shan Hai wrote:
> 
> The following test case could reveal a bug in the futex_lock_pi()
> 
> BUG: On FUTEX_LOCK_PI, there is a infinite loop in the futex_lock_pi() 
> on Powerpc e500 core.
> Cause: The linux kernel on the e500 core has no write permission on
> the COW page, refer the head comment of the following test code.
>  
> ftrace on test case:
> [000]   353.990181: futex_lock_pi_atomic <-futex_lock_pi
> [000]   353.990185: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
> [snip]
> [000]   353.990191: do_page_fault <-handle_page_fault
> [000]   353.990192: bad_page_fault <-handle_page_fault
> [000]   353.990193: search_exception_tables <-bad_page_fault
> [snip]
> [000]   353.990199: get_user_pages <-fault_in_user_writeable
> [snip]
> [000]   353.990208: mark_page_accessed <-follow_page
> [000]   353.990222: futex_lock_pi_atomic <-futex_lock_pi
> [snip]
> [000]   353.990230: cmpxchg_futex_value_locked <-futex_lock_pi_atomic
> [ a loop occures here ]
> 


But but but but, that get_user_pages(.write=1, .force=0) should result
in a COW break, getting our own writable page.

What is this e500 thing smoking that this doesn't work?
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev