Re: [PATCH] powerpc/32: Fix critical and debug interrupts on BOOKE

2021-08-04 Thread Radu Rendec
Hi Christophe,

On Wed, 2021-08-04 at 07:52 +0200, Christophe Leroy wrote:
> Le 07/07/2021 à 07:55, Christophe Leroy a écrit :
> > 32 bits BOOKE have special interrupts for debug and other
> > critical events.
> 
> Were you able to test this patch ?

I tested it three weeks ago and it works like a charm!

I'm so sorry I forgot to let you know. I got distracted testing the
old ptrace() problem. In fact, I wouldn't have been able to test that
if your interrupts patch hadn't been working.

Thanks,
Radu

> > When handling those interrupts, dedicated registers are saved
> > in the stack frame in addition to the standard registers, leading
> > to a shift of the pt_regs struct.
> > 
> > Since commit db297c3b07af ("powerpc/32: Don't save thread.regs on
> > interrupt entry"), the pt_regs struct is expected to be at the
> > same place all the time.
> > 
> > Instead of handling a special struct in addition to pt_regs, just
> > add those special registers to struct pt_regs.
> > 
> > Reported-by: Radu Rendec 
> > Signed-off-by: Christophe Leroy 
> > Fixes: db297c3b07af ("powerpc/32: Don't save thread.regs on
> > interrupt entry")
> > Cc: sta...@vger.kernel.org
> > ---
> >   arch/powerpc/include/asm/ptrace.h | 16 
> >   arch/powerpc/kernel/asm-offsets.c | 31
> > ++-
> >   arch/powerpc/kernel/head_booke.h  | 27
> > +++
> >   3 files changed, 33 insertions(+), 41 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/ptrace.h
> > b/arch/powerpc/include/asm/ptrace.h
> > index 3e5d470a6155..14422e851494 100644
> > --- a/arch/powerpc/include/asm/ptrace.h
> > +++ b/arch/powerpc/include/asm/ptrace.h
> > @@ -70,6 +70,22 @@ struct pt_regs
> > unsigned long __pad[4]; /* Maintain 16 byte
> > interrupt stack alignment */
> > };
> >   #endif
> > +#if defined(CONFIG_PPC32) && defined(CONFIG_BOOKE)
> > +   struct { /* Must be a multiple of 16 bytes */
> > +   unsigned long mas0;
> > +   unsigned long mas1;
> > +   unsigned long mas2;
> > +   unsigned long mas3;
> > +   unsigned long mas6;
> > +   unsigned long mas7;
> > +   unsigned long srr0;
> > +   unsigned long srr1;
> > +   unsigned long csrr0;
> > +   unsigned long csrr1;
> > +   unsigned long dsrr0;
> > +   unsigned long dsrr1;
> > +   };
> > +#endif
> >   };
> >   #endif
> >   
> > diff --git a/arch/powerpc/kernel/asm-offsets.c
> > b/arch/powerpc/kernel/asm-offsets.c
> > index a47eefa09bcb..5bee245d832b 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -309,24 +309,21 @@ int main(void)
> > STACK_PT_REGS_OFFSET(STACK_REGS_IAMR, iamr);
> >   #endif
> >   
> > -#if defined(CONFIG_PPC32)
> > -#if defined(CONFIG_BOOKE) || defined(CONFIG_40x)
> > -   DEFINE(EXC_LVL_SIZE, STACK_EXC_LVL_FRAME_SIZE);
> > -   DEFINE(MAS0, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas0));
> > +#if defined(CONFIG_PPC32) && defined(CONFIG_BOOKE)
> > +   STACK_PT_REGS_OFFSET(MAS0, mas0);
> > /* we overload MMUCR for 44x on MAS0 since they are
> > mutually exclusive */
> > -   DEFINE(MMUCR, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas0));
> > -   DEFINE(MAS1, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas1));
> > -   DEFINE(MAS2, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas2));
> > -   DEFINE(MAS3, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas3));
> > -   DEFINE(MAS6, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas6));
> > -   DEFINE(MAS7, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, mas7));
> > -   DEFINE(_SRR0, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, srr0));
> > -   DEFINE(_SRR1, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, srr1));
> > -   DEFINE(_CSRR0, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, csrr0));
> > -   DEFINE(_CSRR1, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, csrr1));
> > -   DEFINE(_DSRR0, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, dsrr0));
> > -   DEFINE(_DSRR1, STACK_INT_FRAME_SIZE+offsetof(struct
> > exception_regs, dsrr1));
> >

Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers

2021-07-18 Thread Radu Rendec
On Fri, 2021-06-11 at 10:37 -0400, Radu Rendec wrote:
>On Fri, 2021-06-11 at 08:02 +0200, Christophe Leroy wrote:
>>Le 19/06/2019 à 14:57, Radu Rendec a écrit :
>>> On Wed, 2019-06-19 at 10:36 +1000, Daniel Axtens wrote:
>>>> Andreas Schwab <
>>>> sch...@linux-m68k.org
>>>>> writes:
>>>>
>>>>> On Jun 18 2019, Radu Rendec <
>>>>> radu.ren...@gmail.com
>>>>>> wrote:
>>>>>
>>>>>> Since you already have a working setup, it would be nice if you could
>>>>>> add a printk to arch_ptrace() to print the address and confirm what I
>>>>>> believe happens (by reading the gdb source code).
>>>>>
>>>>> A ppc32 ptrace syscall goes through compat_arch_ptrace.
>>>
>>> Right. I completely overlooked that part.
>>>
>>>> Ah right, and that (in ptrace32.c) contains code that will work:
>>>>
>>>>
>>>>/*
>>>> * the user space code considers the floating point
>>>> * to be an array of unsigned int (32 bits) - the
>>>> * index passed in is based on this assumption.
>>>> */
>>>>tmp = ((unsigned int *)child->thread.fp_state.fpr)
>>>>[FPRINDEX(index)];
>>>>
>>>> FPRINDEX is defined above to deal with the various manipulations you
>>>> need to do.
>>>
>>> Correct. Basically it does the same that I did in my patch: it divides
>>> the index again by 2 (it's already divided by 4 in compat_arch_ptrace()
>>> so it ends up divided by 8), then takes the least significant bit and
>>> adds it to the index. I take bit 2 of the original address, which is the
>>> same thing (because in FPRHALF() the address is already divided by 4).
>>>
>>> So we have this in ptrace32.c:
>>>
>>> #define FPRNUMBER(i) (((i) - PT_FPR0) >> 1)
>>> #define FPRHALF(i) (((i) - PT_FPR0) & 1)
>>> #define FPRINDEX(i) TS_FPRWIDTH * FPRNUMBER(i) * 2 + FPRHALF(i)
>>>
>>> index = (unsigned long) addr >> 2;
>>> (unsigned int *)child->thread.fp_state.fpr)[FPRINDEX(index)]
>>>
>>>
>>> And we have this in my patch:
>>>
>>> fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
>>> (void *)>thread.TS_FPR(fpidx) + (addr & 4)
>>>
>>>> Radu: I think we want to copy that working code back into ptrace.c.
>>>
>>> I'm not sure that would work. There's a subtle difference: the code in
>>> ptrace32.c is always compiled on a 64-bit kernel and the user space
>>> calling it is always 32-bit; on the other hand, the code in ptrace.c can
>>> be compiled on either a 64-bit kernel or a 32-bit kernel and the user
>>> space calling it always has the same "bitness" as the kernel.
>>>
>>> One difference is the size of the CPU registers. On 64-bit they are 8
>>> byte long and user space knows that and generates 8-byte aligned
>>> addresses. So you have to divide the address by 8 to calculate the CPU
>>> register index correctly, which compat_arch_ptrace() currently doesn't.
>>>
>>> Another difference is that on 64-bit `long` is 8 bytes, so user space
>>> can read a whole FPU register in a single ptrace call.
>>>
>>> Now that we are all aware of compat_arch_ptrace() (which handles the
>>> special case of a 32-bit process running on a 64-bit kernel) I would say
>>> the patch is correct and does the right thing for both 32-bit and 64-bit
>>> kernels and processes.
>>>
>>>> The challenge will be unpicking the awful mess of ifdefs in ptrace.c
>>>> and making it somewhat more comprehensible.
>>>
>>> I'm not sure what ifdefs you're thinking about. The only that are used
>>> inside arch_ptrace() are PT_FPR0, PT_FPSCR and TS_FPR, which seem to be
>>> correct.
>>>
>>> But perhaps it would be useful to change my patch and add a comment just
>>> before arch_ptrace() that explains how the math is done and that the
>>> code must work on both 32-bit and 64-bit, the user space address
>>> assumptions, etc.
>>>
>>> By the way, I'm not sure the code in compat_arch_ptrace() handles
>>> PT_FPSCR correctly. It might (just because fpscr is right next to fpr[]
>>> in memory - and that's a hack), but I can't figure out if it accesse

Re: Hitting BUG_ON in do_notify_resume() with gdb and SIGTRAP

2021-07-06 Thread Radu Rendec
On Tue, 2021-07-06 at 15:53 +0200, Christophe Leroy wrote:
>Le 06/07/2021 à 15:50, Radu Rendec a écrit :
>> On Tue, 2021-07-06 at 15:16 +0200, Christophe Leroy wrote:
>>> Le 06/07/2021 à 13:56, Radu Rendec a écrit :
>>>> On Tue, 2021-07-06 at 12:43 +0200, Christophe Leroy wrote:
>>>>> Le 04/07/2021 à 23:38, Radu Rendec a écrit :
>>>>>> I'm trying to set up my (virtual) environment to test an old bug in the
>>>>>> PPC32 ptrace() code. I came across a completely different problem,
>>>>>> which seems to make gdb pretty much unusable on PPC32. I'm not sure if
>>>>>> this is a real kernel bug or maybe something wrong with my
>>>>>> configuration.
>>>>>>
>>>>>> I'm running kernel 5.13 in a qemu VM with one e500mc CPU. I am running
>>>>>> native gdb (inside the VM) and setting a breakpoint in main() in a test
>>>>>> "hello world" program. Upon running the test program, I am hitting the
>>>>>> BUG_ON in do_notify_resume() on line 292. The kernel bug log snippet is
>>>>>> included below at the end of the email.
>>>>>>
>>>>>> FWIW, gdb says:
>>>>>> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
>>>>>> The program no longer exists.
>>>>>>
>>>>>> I also added a pr_info() to do_notify_resume() just to see how much
>>>>>> different 'regs' and 'current->thread.regs' are. Surprisingly, they are
>>>>>> just 0x30 apart: regs=c7955f10 cur=c7955f40. Also, 'current' seems to
>>>>>> be OK (pid and comm are consistent with the test program).
>>>>>
>>>>> The TRAP = 0x7d8 is obviously wrong.
>>>>>
>>>>> Need to know which 'TRAP' it is exactly.
>>>>> Could you try to dump what we have at the correct regs ?
>>>>> Something like 'show_regs(current->thread.regs)' should do it.
>>>>
>>>> Sure, please see the output below. It looks to me like the "correct"
>>>> regs are just garbage. Either they are overwritten or current->thread.regs
>>>> is wrong. But in any case, r1 = 0 doesn't look good.
>>>
>>> Yes indeed. I think I identified the problem. For Critical interrupts like 
>>> DEBUG interrupt, struct
>>> exception_regs is added, therefore the frame has 12x4 (0x30) more bytes. 
>>> That's what you see.
>>>
>>> Commit
>>> https://github.com/linuxppc/linux/commit/db297c3b07af7856fb7c666fbc9792d8e37556be#diff-dd6b952a3980da19df4facccdb4f3dddeb8cef56ee384c7f03d02b23b0c6cb26
>>>
>>> Need to find the best solution now to fix that.
>> 
>> Awesome, happy to see you figured it out so quickly.
>> 
>> I'm not sure if it makes any sense, but one thing that comes to mind is
>> to put struct exception_regs before struct pt_regs when the frame is
>> saved. Unless of course other parts of the code expect the opposite.
>
>Yes I think it is a good idea. I think I won't have time to look at that 
>before summer vacation though.

I can take a stab at it. I'm not familiar with that part of the code,
but the best way to learn is to get your hands dirty :) In the worst
case, I won't fix it.




Re: Hitting BUG_ON in do_notify_resume() with gdb and SIGTRAP

2021-07-06 Thread Radu Rendec
On Tue, 2021-07-06 at 15:16 +0200, Christophe Leroy wrote:
>Le 06/07/2021 à 13:56, Radu Rendec a écrit :
>> On Tue, 2021-07-06 at 12:43 +0200, Christophe Leroy wrote:
>>> Le 04/07/2021 à 23:38, Radu Rendec a écrit :
>>>> I'm trying to set up my (virtual) environment to test an old bug in the
>>>> PPC32 ptrace() code. I came across a completely different problem,
>>>> which seems to make gdb pretty much unusable on PPC32. I'm not sure if
>>>> this is a real kernel bug or maybe something wrong with my
>>>> configuration.
>>>>
>>>> I'm running kernel 5.13 in a qemu VM with one e500mc CPU. I am running
>>>> native gdb (inside the VM) and setting a breakpoint in main() in a test
>>>> "hello world" program. Upon running the test program, I am hitting the
>>>> BUG_ON in do_notify_resume() on line 292. The kernel bug log snippet is
>>>> included below at the end of the email.
>>>>
>>>> FWIW, gdb says:
>>>> Program terminated with signal SIGTRAP, Trace/breakpoint trap.
>>>> The program no longer exists.
>>>>
>>>> I also added a pr_info() to do_notify_resume() just to see how much
>>>> different 'regs' and 'current->thread.regs' are. Surprisingly, they are
>>>> just 0x30 apart: regs=c7955f10 cur=c7955f40. Also, 'current' seems to
>>>> be OK (pid and comm are consistent with the test program).
>>>
>>> The TRAP = 0x7d8 is obviously wrong.
>>>
>>> Need to know which 'TRAP' it is exactly.
>>> Could you try to dump what we have at the correct regs ?
>>> Something like 'show_regs(current->thread.regs)' should do it.
>> 
>> Sure, please see the output below. It looks to me like the "correct"
>> regs are just garbage. Either they are overwritten or current->thread.regs
>> is wrong. But in any case, r1 = 0 doesn't look good.
>
>Yes indeed. I think I identified the problem. For Critical interrupts like 
>DEBUG interrupt, struct 
>exception_regs is added, therefore the frame has 12x4 (0x30) more bytes. 
>That's what you see.
>
>Commit 
>https://github.com/linuxppc/linux/commit/db297c3b07af7856fb7c666fbc9792d8e37556be#diff-dd6b952a3980da19df4facccdb4f3dddeb8cef56ee384c7f03d02b23b0c6cb26
>
>Need to find the best solution now to fix that.

Awesome, happy to see you figured it out so quickly.

I'm not sure if it makes any sense, but one thing that comes to mind is
to put struct exception_regs before struct pt_regs when the frame is
saved. Unless of course other parts of the code expect the opposite.

>> regs=c7a0ff10 cur=c7a0ff40 pid=139 comm=test
>> CPU: 0 PID: 139 Comm: test Not tainted 5.13.0-dirty #4
>> NIP:  1338 LR: 0003 CTR: 0003
>> REGS: c7a0ff40 TRAP: 67   Not tainted  (5.13.0-dirty)
>> MSR:  1002d202   CR: 1004  XER: 80670100
>> 
>> GPR00: b7fc36d8   b7fe17b4  b7ffd588 b7ffe8b8 
>> b7ffee10
>> GPR08: b7fff214 b7ffdf40 b7fff208 b858 b970 b7fff130 0001 
>> b960
>> GPR16: b7fff2b0 b7ffd5f0 b7ffe8a8 b850 b7fc3714 1002d002 b7fff208 
>> 0003
>> GPR24: b7fc3714  22000284 b960 07d8 1338 0800 
>> b850
>> NIP [1338] 0x1338
>> LR [0003] 0x3
>> Call Trace:
>> [c7a0fe40] [c0008eac] show_regs+0x4c/0x1b0 (unreliable)
>> [c7a0fe80] [c000969c] do_notify_resume+0x31c/0x320
>> [c7a0fee0] [c0010b94] interrupt_exit_user_prepare+0x94/0xc0
>> [c7a0ff00] [c00151e8] interrupt_return+0x14/0x13c
>> --- interrupt: 7d8 at 0xb7fc3714
>> NIP:  b7fc3714 LR: b7fc3714 CTR: 0003
>> REGS: c7a0ff10 TRAP: 07d8   Not tainted  (5.13.0-dirty)
>> MSR:  1002d002   CR: 22000284  XER: 
>> 
>> GPR00: b7fc3584 b850     00a0 
>> 6474e552
>> GPR08: b7fbe0d4 0001 b7fff230 b850 b7fc36d8   
>> b7fe17b4
>> GPR16:  b7ffd588 b7ffe8b8 b7ffee10 b7fff214 b7ffdf40 b7fff208 
>> b858
>> GPR24: b970 b7fff130 0001 b960 b7fff2b0 b7ffd5f0 b7ffe8a8 
>> b850
>> NIP [b7fc3714] 0xb7fc3714
>> LR [b7fc3714] 0xb7fc3714
>> --- interrupt: 7d8
>> [ cut here ]
>> kernel BUG at arch/powerpc/kernel/signal.c:298!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> BE PAGE_SIZE=4K SMP NR_CPUS=24 QEMU e500
>> Modules linked in:
>> CPU: 0 PID: 139 Comm: test Not tainted 5.13.0-dirty #4
>> NIP:  c000969c LR: c000969c CTR: c065f540
>> REGS: c7a0fdc0 TRAP: 0700   Not tainted  (5.13.0-dirty)
>> MSR:  000280

Re: Hitting BUG_ON in do_notify_resume() with gdb and SIGTRAP

2021-07-06 Thread Radu Rendec
On Tue, 2021-07-06 at 12:43 +0200, Christophe Leroy wrote:
> Le 04/07/2021 à 23:38, Radu Rendec a écrit :
> > I'm trying to set up my (virtual) environment to test an old bug in the
> > PPC32 ptrace() code. I came across a completely different problem,
> > which seems to make gdb pretty much unusable on PPC32. I'm not sure if
> > this is a real kernel bug or maybe something wrong with my
> > configuration.
> >
> > I'm running kernel 5.13 in a qemu VM with one e500mc CPU. I am running
> > native gdb (inside the VM) and setting a breakpoint in main() in a test
> > "hello world" program. Upon running the test program, I am hitting the
> > BUG_ON in do_notify_resume() on line 292. The kernel bug log snippet is
> > included below at the end of the email.
> >
> > FWIW, gdb says:
> > Program terminated with signal SIGTRAP, Trace/breakpoint trap.
> > The program no longer exists.
> >
> > I also added a pr_info() to do_notify_resume() just to see how much
> > different 'regs' and 'current->thread.regs' are. Surprisingly, they are
> > just 0x30 apart: regs=c7955f10 cur=c7955f40. Also, 'current' seems to
> > be OK (pid and comm are consistent with the test program).
>
> The TRAP = 0x7d8 is obviously wrong.
>
> Need to know which 'TRAP' it is exactly.
> Could you try to dump what we have at the correct regs ?
> Something like 'show_regs(current->thread.regs)' should do it.

Sure, please see the output below. It looks to me like the "correct"
regs are just garbage. Either they are overwritten or current->thread.regs
is wrong. But in any case, r1 = 0 doesn't look good.

regs=c7a0ff10 cur=c7a0ff40 pid=139 comm=test
CPU: 0 PID: 139 Comm: test Not tainted 5.13.0-dirty #4
NIP:  1338 LR: 0003 CTR: 0003
REGS: c7a0ff40 TRAP: 67   Not tainted  (5.13.0-dirty)
MSR:  1002d202   CR: 1004  XER: 80670100

GPR00: b7fc36d8   b7fe17b4  b7ffd588 b7ffe8b8 b7ffee10 
GPR08: b7fff214 b7ffdf40 b7fff208 b858 b970 b7fff130 0001 b960 
GPR16: b7fff2b0 b7ffd5f0 b7ffe8a8 b850 b7fc3714 1002d002 b7fff208 0003 
GPR24: b7fc3714  22000284 b960 07d8 1338 0800 b850 
NIP [1338] 0x1338
LR [0003] 0x3
Call Trace:
[c7a0fe40] [c0008eac] show_regs+0x4c/0x1b0 (unreliable)
[c7a0fe80] [c000969c] do_notify_resume+0x31c/0x320
[c7a0fee0] [c0010b94] interrupt_exit_user_prepare+0x94/0xc0
[c7a0ff00] [c00151e8] interrupt_return+0x14/0x13c
--- interrupt: 7d8 at 0xb7fc3714
NIP:  b7fc3714 LR: b7fc3714 CTR: 0003
REGS: c7a0ff10 TRAP: 07d8   Not tainted  (5.13.0-dirty)
MSR:  1002d002   CR: 22000284  XER: 

GPR00: b7fc3584 b850     00a0 6474e552 
GPR08: b7fbe0d4 0001 b7fff230 b850 b7fc36d8   b7fe17b4 
GPR16:  b7ffd588 b7ffe8b8 b7ffee10 b7fff214 b7ffdf40 b7fff208 b858 
GPR24: b970 b7fff130 0001 b960 b7fff2b0 b7ffd5f0 b7ffe8a8 b850 
NIP [b7fc3714] 0xb7fc3714
LR [b7fc3714] 0xb7fc3714
--- interrupt: 7d8
[ cut here ]
kernel BUG at arch/powerpc/kernel/signal.c:298!
Oops: Exception in kernel mode, sig: 5 [#1]
BE PAGE_SIZE=4K SMP NR_CPUS=24 QEMU e500
Modules linked in:
CPU: 0 PID: 139 Comm: test Not tainted 5.13.0-dirty #4
NIP:  c000969c LR: c000969c CTR: c065f540
REGS: c7a0fdc0 TRAP: 0700   Not tainted  (5.13.0-dirty)
MSR:  00028002   CR: 28000282  XER: 2000

GPR00: c000969c c7a0fe80 c70ef500 c70efbd8 c70ef500 0010 c7a0fc38 0002 
GPR08: 0001    28000282   b7fe17b4 
GPR16:  b7ffd588 b7ffe8b8 b7ffee10 b7fff214 b7ffdf40 b7fff208 b858 
GPR24: b970 b7fff130 0001 b960 c7a0ff10 0800 c70ef500 0102 
NIP [c000969c] do_notify_resume+0x31c/0x320
LR [c000969c] do_notify_resume+0x31c/0x320
Call Trace:
[c7a0fe80] [c000969c] do_notify_resume+0x31c/0x320 (unreliable)
[c7a0fee0] [c0010b94] interrupt_exit_user_prepare+0x94/0xc0
[c7a0ff00] [c00151e8] interrupt_return+0x14/0x13c
--- interrupt: 7d8 at 0xb7fc3714
NIP:  b7fc3714 LR: b7fc3714 CTR: 0003
REGS: c7a0ff10 TRAP: 07d8   Not tainted  (5.13.0-dirty)
MSR:  1002d002   CR: 22000284  XER: 

GPR00: b7fc3584 b850     00a0 6474e552 
GPR08: b7fbe0d4 0001 b7fff230 b850 b7fc36d8   b7fe17b4 
GPR16:  b7ffd588 b7ffe8b8 b7ffee10 b7fff214 b7ffdf40 b7fff208 b858 
GPR24: b970 b7fff130 0001 b960 b7fff2b0 b7ffd5f0 b7ffe8a8 b850 
NIP [b7fc3714] 0xb7fc3714
LR [b7fc3714] 0xb7fc3714
--- interrupt: 7d8
Instruction dump:
93a10054 90010064 93c10058 48b95369 80c20398 3c60c0dc 7f84e378 38e204b0 
3863ce30 4809d819 80620704 4bfff7c9 <0fe0> 3d20c0ff 8129c014 2c09 
---[ end trace 1cbf27cd19ae39a0 ]---

> > If I'm not missing something, the 'regs' pointer that is eventual

Hitting BUG_ON in do_notify_resume() with gdb and SIGTRAP

2021-07-04 Thread Radu Rendec
Hi Everyone,

I'm trying to set up my (virtual) environment to test an old bug in the
PPC32 ptrace() code. I came across a completely different problem,
which seems to make gdb pretty much unusable on PPC32. I'm not sure if
this is a real kernel bug or maybe something wrong with my
configuration.

I'm running kernel 5.13 in a qemu VM with one e500mc CPU. I am running
native gdb (inside the VM) and setting a breakpoint in main() in a test
"hello world" program. Upon running the test program, I am hitting the
BUG_ON in do_notify_resume() on line 292. The kernel bug log snippet is
included below at the end of the email.

FWIW, gdb says:
Program terminated with signal SIGTRAP, Trace/breakpoint trap.
The program no longer exists.

I also added a pr_info() to do_notify_resume() just to see how much
different 'regs' and 'current->thread.regs' are. Surprisingly, they are
just 0x30 apart: regs=c7955f10 cur=c7955f40. Also, 'current' seems to
be OK (pid and comm are consistent with the test program).

If I'm not missing something, the 'regs' pointer that is eventually
passed to do_notify_resume() is calculated in interrupt_return() by
adding STACK_FRAME_OVERHEAD (defined to 112) to the value of r1. That
means all registers are saved on the stack before entering the
interrupt handler and, upon returning, a pointer to the register
structure is calculated from the stack pointer. Either the offset
itself is wrong, or the stack pointer is off by 0x30.

This is as far as I have gone. Hopefully this rings a bell to someone
who is familiar with that part of the code and has a better
understanding of PPC32 interrupt handling in general.

Last but not least, my configuration is fairly standard. I'm using the
powerpc-e500mc--glibc--bleeding-edge-2020.08-1 toolchain from Bootlin
to compile everything (kernel and user-space). The qemu version is
5.2.0 (Fedora 34) and the user-space is a small Busybox based rootfs
that I built using Buildroot 2021.05. The gdb version is 9.2.

regs=c7955f10 cur=c7955f40 pid=138 comm=test
[ cut here ]
kernel BUG at arch/powerpc/kernel/signal.c:296!
Oops: Exception in kernel mode, sig: 5 [#1]
BE PAGE_SIZE=4K SMP NR_CPUS=24 QEMU e500
Modules linked in:
CPU: 0 PID: 138 Comm: test Not tainted 5.13.0-dirty #3
NIP:  c0009694 LR: c0009694 CTR: c065f540
REGS: c7955dc0 TRAP: 0700   Not tainted  (5.13.0-dirty)
MSR:  00028002   CR: 28000282  XER: 2000

GPR00: c0009694 c7955e80 c7145100 002c dfbdc3d4 dfbe5d24 0027 dfbdc3d8 
GPR08: c0ffe988    22000282   b7fe17b4 
GPR16:  b7ffd588 b7ffe8b8 b7ffee10 b7fff214 b7ffdf40 b7fff208 b858 
GPR24: b970 b7fff130 0001 b960 c7955f10 0800 c7145100 0102 
NIP [c0009694] do_notify_resume+0x314/0x320
LR [c0009694] do_notify_resume+0x314/0x320
Call Trace:
[c7955e80] [c0009694] do_notify_resume+0x314/0x320 (unreliable)
[c7955ee0] [c0010b94] interrupt_exit_user_prepare+0x94/0xc0
[c7955f00] [c00151e8] interrupt_return+0x14/0x13c
--- interrupt: 7d8 at 0xb7fc3714
NIP:  b7fc3714 LR: b7fc3714 CTR: 0003
REGS: c7955f10 TRAP: 07d8   Not tainted  (5.13.0-dirty)
MSR:  1002d002   CR: 22000284  XER: 

GPR00: b7fc3584 b850     00a0 6474e552 
GPR08: b7fbe0d4 0001 b7fff230 b850 b7fc36d8   b7fe17b4 
GPR16:  b7ffd588 b7ffe8b8 b7ffee10 b7fff214 b7ffdf40 b7fff208 b858 
GPR24: b970 b7fff130 0001 b960 b7fff2b0 b7ffd5f0 b7ffe8a8 b850 
NIP [b7fc3714] 0xb7fc3714
LR [b7fc3714] 0xb7fc3714
--- interrupt: 7d8
Instruction dump:
4b04 7c0802a6 93a10054 90010064 93c10058 48b95369 80c20398 3c60c0dc 
7f84e378 38e204b0 3863ce30 4809d819 <0fe0> 6000 6000 3d20c0ff 
---[ end trace 065671519ba3d526 ]---

Note: the BUG() line is slightly different because I had made this
small change to print the pointers:

diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index 9ded046edb0e..57ea6e500a6f 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -289,7 +289,12 @@ void do_notify_resume(struct pt_regs *regs, unsigned long 
thread_info_flags)
klp_update_patch_state(current);
 
if (thread_info_flags & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL)) {
-   BUG_ON(regs != current->thread.regs);
+   if (regs != current->thread.regs) {
+   pr_info("regs=%px cur=%px pid=%d comm=%s\n",
+   regs, current->thread.regs,
+   current->pid, current->comm);
+   BUG();
+   }
do_signal(current);
}
 




Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers

2021-06-11 Thread Radu Rendec
On Fri, 2021-06-11 at 08:02 +0200, Christophe Leroy wrote:
>Le 19/06/2019 à 14:57, Radu Rendec a écrit :
>> On Wed, 2019-06-19 at 10:36 +1000, Daniel Axtens wrote:
>>> Andreas Schwab <
>>> sch...@linux-m68k.org
>>>> writes:
>>>
>>>> On Jun 18 2019, Radu Rendec <
>>>> radu.ren...@gmail.com
>>>>> wrote:
>>>>
>>>>> Since you already have a working setup, it would be nice if you could
>>>>> add a printk to arch_ptrace() to print the address and confirm what I
>>>>> believe happens (by reading the gdb source code).
>>>>
>>>> A ppc32 ptrace syscall goes through compat_arch_ptrace.
>>
>> Right. I completely overlooked that part.
>>
>>> Ah right, and that (in ptrace32.c) contains code that will work:
>>>
>>>
>>> /*
>>>  * the user space code considers the floating point
>>>  * to be an array of unsigned int (32 bits) - the
>>>  * index passed in is based on this assumption.
>>>  */
>>> tmp = ((unsigned int *)child->thread.fp_state.fpr)
>>> [FPRINDEX(index)];
>>>
>>> FPRINDEX is defined above to deal with the various manipulations you
>>> need to do.
>>
>> Correct. Basically it does the same that I did in my patch: it divides
>> the index again by 2 (it's already divided by 4 in compat_arch_ptrace()
>> so it ends up divided by 8), then takes the least significant bit and
>> adds it to the index. I take bit 2 of the original address, which is the
>> same thing (because in FPRHALF() the address is already divided by 4).
>>
>> So we have this in ptrace32.c:
>>
>> #define FPRNUMBER(i) (((i) - PT_FPR0) >> 1)
>> #define FPRHALF(i) (((i) - PT_FPR0) & 1)
>> #define FPRINDEX(i) TS_FPRWIDTH * FPRNUMBER(i) * 2 + FPRHALF(i)
>>
>> index = (unsigned long) addr >> 2;
>> (unsigned int *)child->thread.fp_state.fpr)[FPRINDEX(index)]
>>
>>
>> And we have this in my patch:
>>
>> fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
>> (void *)>thread.TS_FPR(fpidx) + (addr & 4)
>>
>>> Radu: I think we want to copy that working code back into ptrace.c.
>>
>> I'm not sure that would work. There's a subtle difference: the code in
>> ptrace32.c is always compiled on a 64-bit kernel and the user space
>> calling it is always 32-bit; on the other hand, the code in ptrace.c can
>> be compiled on either a 64-bit kernel or a 32-bit kernel and the user
>> space calling it always has the same "bitness" as the kernel.
>>
>> One difference is the size of the CPU registers. On 64-bit they are 8
>> byte long and user space knows that and generates 8-byte aligned
>> addresses. So you have to divide the address by 8 to calculate the CPU
>> register index correctly, which compat_arch_ptrace() currently doesn't.
>>
>> Another difference is that on 64-bit `long` is 8 bytes, so user space
>> can read a whole FPU register in a single ptrace call.
>>
>> Now that we are all aware of compat_arch_ptrace() (which handles the
>> special case of a 32-bit process running on a 64-bit kernel) I would say
>> the patch is correct and does the right thing for both 32-bit and 64-bit
>> kernels and processes.
>>
>>> The challenge will be unpicking the awful mess of ifdefs in ptrace.c
>>> and making it somewhat more comprehensible.
>>
>> I'm not sure what ifdefs you're thinking about. The only that are used
>> inside arch_ptrace() are PT_FPR0, PT_FPSCR and TS_FPR, which seem to be
>> correct.
>>
>> But perhaps it would be useful to change my patch and add a comment just
>> before arch_ptrace() that explains how the math is done and that the
>> code must work on both 32-bit and 64-bit, the user space address
>> assumptions, etc.
>>
>> By the way, I'm not sure the code in compat_arch_ptrace() handles
>> PT_FPSCR correctly. It might (just because fpscr is right next to fpr[]
>> in memory - and that's a hack), but I can't figure out if it accesses
>> the right half.
>>
>
>Does the issue still exists ? If yes, the patch has to be rebased.

Hard to say. I'm still using 4.9 (stable) on the systems that I created
the patch for. I tried to rebase, and the patch no longer applies. It
looks like there have been some changes around that area, notably your
commit e009fa433542, so it could actually be fixed now.

It's been exactly two years since I sent the patch and I don't remember
all the details. I will have to go back and look. Also, running a recent
kernel on my PPC32 systems is not an option because there are a bunch of
custom patches that would have to be ported. I will try in a VM and get
back to you, hopefully early next week.

Best regards,
Radu



Re: MCE handler gets NIP wrong on MPC8378

2020-02-25 Thread Radu Rendec
On 02/20/2020 at 12:48 PM Christophe Leroy  wrote:
> Le 20/02/2020 à 18:34, Radu Rendec a écrit :
> > On 02/20/2020 at 11:25 AM Christophe Leroy  wrote:
> >> Le 20/02/2020 à 17:02, Radu Rendec a écrit :
> >>> On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:
> >>>> On 02/19/2020 10:39 PM, Radu Rendec wrote:
> >>>>> On 02/19/2020 at 4:21 PM Christophe Leroy  
> >>>>> wrote:
> >>>>>>> Interesting.
> >>>>>>>
> >>>>>>> 0x900 is the adress of the timer interrupt.
> >>>>>>>
> >>>>>>> Would the MCE occur just after the timer interrupt ?
> >>>>>
> >>>>> I doubt that. I'm using a small test module to artificially trigger the
> >>>>> MCE. Basically it's just this (the full code is in my original post):
> >>>>>
> >>>>>bad_addr_base = ioremap(0xf000, 0x100);
> >>>>>x = ioread32(bad_addr_base);
> >>>>>
> >>>>> I find it hard to believe that every time I load the module the lwbrx
> >>>>> instruction that triggers the MCE is executed exactly after the timer
> >>>>> interrupt (or that the timer interrupt always occurs close to the lwbrx
> >>>>> instruction).
> >>>>
> >>>> Can you try to see how much time there is between your read and the MCE ?
> >>>> The below should allow it, you'll see first value in r13 and the other
> >>>> in r14 (mce.c is your test code)
> >>>>
> >>>> Also provide the timebase frequency as reported in /proc/cpuinfo
> >>>
> >>> I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.
> >>>
> >>> # cat /proc/cpuinfo
> >>> processor   : 0
> >>> cpu : e300c4
> >>> clock   : 800.04MHz
> >>> revision: 1.1 (pvr 8086 1011)
> >>> bogomips: 200.00
> >>> timebase: 1
> >>>
> >>> The difference between r14 and r13 is 0x2b. Assuming TB is
> >>> incremented with 'timebase' frequency, that means 20.97 milliseconds
> >>> (although the e300 manual says TB is "incremented once every four core
> >>> input clock cycles").
> >>
> >> I wouldn't be surprised that the internal CPU clock be twice the input
> >> clock.
> >>
> >> So that's long enough to surely get a timer interrupt during every bad
> >> access.
> >>
> >> Now we have to understand why SRR1 contains the address of the timer
> >> exception entry and not the address of the bad access.
> >>
> >> The value of SRR1 confirms that it comes from 0x900 as MSR[IR] and [DR]
> >> are cleared when interrupts are enabled.
> >>
> >> Maybe you should file a support case at NXP. They are usually quite
> >> professionnal at responding.
> >
> > I already did (quite some time ago), but it started off as "why does the
> > MCE occur in the first place". That part has already been figured out,
> > but unfortunately I don't have a viable solution to it. Like you said,
> > now the focus has shifted to understanding why the SRR0 value is not
> > what we expect.
>
> Yes now the point is to understand why it starts processing the timer
> interrupt at 0x900 (with IR and DR cleared as observed in SRR1) just
> before taking the Machine Check.
>
> Allthough the execution of the decrementer interrupt is queue for after
> the completion of the failing memory access, I'd expect the Machine
> Check to take priority.
>
> Note that I have never observed such a behaviour on MPC8321 which has an
> e300c2 core.

I apologize for the silence during the past few days, I've been diverted
with something else. This is the feedback that I got from NXP:

| The e300 core uses SRR0/1 for both non-critical interrupts and machine
| check interrupts and if they happen simultaneously a problem can occur
| where the return address from the first exception is lost when handling
| the second exception concurrently. This only occurs in the rare case
| when the software ISR hasn't had the time to save SRR0/1 to the sw stack.
|
| If the ability to nest interrupts is desired, software then saves off
| enough state (i.e. the contents of SRR0, SRR1, etc) that will allow it
| to recover (i.e. resume handling the current interrupt) if another
| interrupt occurs.

So basically what they describe is a race condition between the MCE and
a regular interrup

Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Radu Rendec
On 02/20/2020 at 11:25 AM Christophe Leroy  wrote:
> Le 20/02/2020 à 17:02, Radu Rendec a écrit :
> > On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:
> >> On 02/19/2020 10:39 PM, Radu Rendec wrote:
> >>> On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:
> >>>>> Interesting.
> >>>>>
> >>>>> 0x900 is the adress of the timer interrupt.
> >>>>>
> >>>>> Would the MCE occur just after the timer interrupt ?
> >>>
> >>> I doubt that. I'm using a small test module to artificially trigger the
> >>> MCE. Basically it's just this (the full code is in my original post):
> >>>
> >>>   bad_addr_base = ioremap(0xf000, 0x100);
> >>>   x = ioread32(bad_addr_base);
> >>>
> >>> I find it hard to believe that every time I load the module the lwbrx
> >>> instruction that triggers the MCE is executed exactly after the timer
> >>> interrupt (or that the timer interrupt always occurs close to the lwbrx
> >>> instruction).
> >>
> >> Can you try to see how much time there is between your read and the MCE ?
> >> The below should allow it, you'll see first value in r13 and the other
> >> in r14 (mce.c is your test code)
> >>
> >> Also provide the timebase frequency as reported in /proc/cpuinfo
> >
> > I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.
> >
> > # cat /proc/cpuinfo
> > processor   : 0
> > cpu : e300c4
> > clock   : 800.04MHz
> > revision: 1.1 (pvr 8086 1011)
> > bogomips: 200.00
> > timebase: 1
> >
> > The difference between r14 and r13 is 0x2b. Assuming TB is
> > incremented with 'timebase' frequency, that means 20.97 milliseconds
> > (although the e300 manual says TB is "incremented once every four core
> > input clock cycles").
>
> I wouldn't be surprised that the internal CPU clock be twice the input
> clock.
>
> So that's long enough to surely get a timer interrupt during every bad
> access.
>
> Now we have to understand why SRR1 contains the address of the timer
> exception entry and not the address of the bad access.
>
> The value of SRR1 confirms that it comes from 0x900 as MSR[IR] and [DR]
> are cleared when interrupts are enabled.
>
> Maybe you should file a support case at NXP. They are usually quite
> professionnal at responding.

I already did (quite some time ago), but it started off as "why does the
MCE occur in the first place". That part has already been figured out,
but unfortunately I don't have a viable solution to it. Like you said,
now the focus has shifted to understanding why the SRR0 value is not
what we expect.

I asked them the question about SRR0 as soon as you helped me get back
on track and figured out there's nothing wrong with the Linux MCE
handler and the NIP value comes from SRR0. What they came up with is
basically this paragraph in the e300 core manual (section 5.5.2):

| Note that the e300 core makes no attempt to force recoverability on a
| machine check; however, it does guarantee that the machine check
| interrupt is always taken immediately upon request, with a nonpredicted
| address saved in SRR0, regardless of the current machine state.

... and with an emphasis on "nonpredicted". To be honest, I am a bit
disappointed with their response and I believe in this context what
"unpredicted" means is that the address that is saved to SRR0 is a
"real" address rather than the result of branch prediction. The support
folks were probably thinking "unpredictable". But that's another word
and the difference is quite subtle :)

I updated the case and added information about the interrupts and the
timing. Let's see what they come up with this time.

Best regards,
Radu


Re: MCE handler gets NIP wrong on MPC8378

2020-02-20 Thread Radu Rendec
On 02/20/2020 at 3:38 AM Christophe Leroy  wrote:
> On 02/19/2020 10:39 PM, Radu Rendec wrote:
> > On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:
> >>> Interesting.
> >>>
> >>> 0x900 is the adress of the timer interrupt.
> >>>
> >>> Would the MCE occur just after the timer interrupt ?
> >
> > I doubt that. I'm using a small test module to artificially trigger the
> > MCE. Basically it's just this (the full code is in my original post):
> >
> >  bad_addr_base = ioremap(0xf000, 0x100);
> >  x = ioread32(bad_addr_base);
> >
> > I find it hard to believe that every time I load the module the lwbrx
> > instruction that triggers the MCE is executed exactly after the timer
> > interrupt (or that the timer interrupt always occurs close to the lwbrx
> > instruction).
>
> Can you try to see how much time there is between your read and the MCE ?
> The below should allow it, you'll see first value in r13 and the other
> in r14 (mce.c is your test code)
>
> Also provide the timebase frequency as reported in /proc/cpuinfo

I just ran a test: r13 is 0xda8e0f91 and r14 is 0xdaae0f9c.

# cat /proc/cpuinfo
processor   : 0
cpu : e300c4
clock   : 800.04MHz
revision: 1.1 (pvr 8086 1011)
bogomips: 200.00
timebase: 1

The difference between r14 and r13 is 0x2b. Assuming TB is
incremented with 'timebase' frequency, that means 20.97 milliseconds
(although the e300 manual says TB is "incremented once every four core
input clock cycles").

I repeated the test twice and the absolute values were of course very
different, but r14-r13 was 0x2c and 0x200011, so it seems to be
quite consistent (within just a few clock cycles).

Just for the fun of it, I repeated the test once more, but with
interrupts disabled. The difference was 0x200014. FWIW, I disabled
interrupts before sampling TB in r13.

> And what's the reason given in the Oops message for the machine check ?
> Is that "Caused by (from SRR1=49030): Transfer error ack signal" or
> something else ?

When interrupts are enabled:
Caused by (from SRR1=41000): Transfer error ack signal

When interrupts are disabled:
Caused by (from SRR1=41030): Transfer error ack signal

> >
> >> Do you use the local bus monitoring driver ?
> >
> > I don't. In fact, I'm not even aware of it. What driver is that?
>
> CONFIG_FSL_LBC

OK, it seems I'm actually using it. I haven't enabled it explicitly, but
it's automatically pulled by CONFIG_MTD_NAND_FSL_ELBC as a prerequisite.

I looked at the code in arch/powerpc/sysdev/fsl_lbc.c and it's quite
small. Most of the code is in fsl_lbc_ctrl_irq, which I guess is
supposed to print a message if/when the LBC catches an error. I've never
seen any of those messages being printed.

Best regards,
Radu


Re: MCE handler gets NIP wrong on MPC8378

2020-02-19 Thread Radu Rendec
On 02/19/2020 at 4:21 PM Christophe Leroy  wrote:
> > Radu Rendec  a écrit :
> >> On 02/19/2020 at 10:11 AM Radu Rendec  wrote:
> >>> On 02/18/2020 at 1:08 PM Christophe Leroy  wrote:
> >>>> Le 18/02/2020 à 18:07, Radu Rendec a écrit :
> >>>> > The saved NIP seems to be broken inside machine_check_exception() on
> >>>> > MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
> >>>> > but I have seen other weird values.
> >>>> >
> >>>> > I've been able to track down the entry code to head_32.S (vector 
> >>>> > 0x200),
> >>>> > but I'm not sure where/how the NIP value (where the exception occurred)
> >>>> > is captured.
> >>>>
> >>>> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> >>>> saved into _NIP(r11) in transfer_to_handler in entry_32.S
> >>>>
> >>>> Can something clobber r12 at some point ?
> >>>>
> >>>
> >>> I did something even simpler: I added the following
> >>>
> >>>  lis r12,0x1234
> >>>
> >>> ... right after
> >>>
> >>>  mfspr r12,SPRN_SRR0
> >>>
> >>> ... and now the NIP value I see in the crash dump is 0x1234. This
> >>> means r12 is not clobbered and most likely the NIP value I normally see
> >>> is the actual SRR0 value.
> >>
> >> I apologize for the noise. I just found out accidentally that the saved
> >> NIP value is correct if interrupts are disabled at the time when the
> >> faulty access that triggers the MCE occurs. This seems to happen
> >> consistently.
> >>
> >> By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
> >> it's basically enough to wrap ioread32 to get the NIP value right.
> >>
> >> Does this make any sense? Maybe it's not a silicon bug after all, or
> >> maybe it is and I just found a workaround. Could this happen on other
> >> PowerPC CPUs as well?
> >
> > Interesting.
> >
> > 0x900 is the adress of the timer interrupt.
> >
> > Would the MCE occur just after the timer interrupt ?

I doubt that. I'm using a small test module to artificially trigger the
MCE. Basically it's just this (the full code is in my original post):

bad_addr_base = ioremap(0xf000, 0x100);
x = ioread32(bad_addr_base);

I find it hard to believe that every time I load the module the lwbrx
instruction that triggers the MCE is executed exactly after the timer
interrupt (or that the timer interrupt always occurs close to the lwbrx
instruction).

> >
> > Can you tell how are configured your IO busses, etc ... ?

Nothing special. The device tree is mostly similar to mpc8379_rdb.dts,
but I can provide the actual dts if you think it's relevant.

> And what's the value of SERSR after the machine check ?

I'm assuming you're talking about the IPIC SERSR register. I modified
machine_check_exception and added a call to ipic_get_mcp_status, which
seems to read IPIC_SERSR. The value is 0, both with interrupts enabled
and disabled (which makes sense, since disabling/enabling interrupts is
local to the CPU core).

> Do you use the local bus monitoring driver ?

I don't. In fact, I'm not even aware of it. What driver is that?

Best regards,
Radu


Re: MCE handler gets NIP wrong on MPC8378

2020-02-19 Thread Radu Rendec
On 02/19/2020 at 10:11 AM Radu Rendec  wrote:
> On 02/18/2020 at 1:08 PM Christophe Leroy  wrote:
> > Le 18/02/2020 à 18:07, Radu Rendec a écrit :
> > > The saved NIP seems to be broken inside machine_check_exception() on
> > > MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
> > > but I have seen other weird values.
> > >
> > > I've been able to track down the entry code to head_32.S (vector 0x200),
> > > but I'm not sure where/how the NIP value (where the exception occurred)
> > > is captured.
> >
> > NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> > saved into _NIP(r11) in transfer_to_handler in entry_32.S
> >
> > Can something clobber r12 at some point ?
> >
>
> I did something even simpler: I added the following
>
>   lis r12,0x1234
>
> ... right after
>
>   mfspr r12,SPRN_SRR0
>
> ... and now the NIP value I see in the crash dump is 0x1234. This
> means r12 is not clobbered and most likely the NIP value I normally see
> is the actual SRR0 value.

I apologize for the noise. I just found out accidentally that the saved
NIP value is correct if interrupts are disabled at the time when the
faulty access that triggers the MCE occurs. This seems to happen
consistently.

By "interrupts are disabled" I mean local_irq_save/local_irq_restore, so
it's basically enough to wrap ioread32 to get the NIP value right.

Does this make any sense? Maybe it's not a silicon bug after all, or
maybe it is and I just found a workaround. Could this happen on other
PowerPC CPUs as well?

Best regards,
Radu


Re: MCE handler gets NIP wrong on MPC8378

2020-02-19 Thread Radu Rendec
On 02/18/2020 at 1:08 PM Christophe Leroy  wrote:
> Le 18/02/2020 à 18:07, Radu Rendec a écrit :
> > The saved NIP seems to be broken inside machine_check_exception() on
> > MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
> > but I have seen other weird values.
> >
> > I've been able to track down the entry code to head_32.S (vector 0x200),
> > but I'm not sure where/how the NIP value (where the exception occurred)
> > is captured.
>
> NIP value is supposed to come from SRR0, loaded in r12 in PROLOG_2 and
> saved into _NIP(r11) in transfer_to_handler in entry_32.S

Thank you so much for the information, it is extremely helpful!

> Can something clobber r12 at some point ?
>
> Maybe add the following at some place to trap when it happens ?
>
> tweqi r12, 0x900
>
> If you put it just after reading SRR0, and just before writing into
> NIP(r11), you'll see if its wrong from the begining or if it is
> overwriten later.

I did something even simpler: I added the following

lis r12,0x1234

... right after

mfspr r12,SPRN_SRR0

... and now the NIP value I see in the crash dump is 0x1234. This
means r12 is not clobbered and most likely the NIP value I normally see
is the actual SRR0 value.

Just to be sure that SRR0 is not clobbered before it's even saved to r12
(very unlikely though) I changed the code to save SRR0 to r8 at the very
beginning of the handler (first instruction, at address 0x200) and then
load r12 from r8 later. This of course clobbers r8, but it's good for
testing. Now in the crash dump I see 0x900 in both NIP and r8.

So I think I ruled out any problem in the Linux MCE handler. MPC8378 has
an e300 core and I double checked with the e300 core reference manual
(e300coreRM.pdf from NXP). I couldn't find anything weird there either.
Quoting from the RM:

| 5.5.2.1 Machine Check Interrupt Enabled (MSR[ME] = 1)
|
| When a machine check interrupt is taken, registers are updated as
| shown in Table 5-14.
|
| Table 5-14. Machine Check Interrupt—Register Settings
|
| SRR0 Set to the address of the next instruction that would have been
|  completed in the interrupted instruction stream. Neither this
|  instruction nor any others beyond it will have been completed.
|  All preceding instructions will have been completed.

At this point I'm assuming a silicon bug, although I couldn't find
anything interesting in the Errata provided by NXP.

Best regards,
Radu


MCE handler gets NIP wrong on MPC8378

2020-02-18 Thread Radu Rendec
Hi Everyone,

The saved NIP seems to be broken inside machine_check_exception() on
MPC8378, running Linux 4.9.191. The value is 0x900 most of the times,
but I have seen other weird values.

I've been able to track down the entry code to head_32.S (vector 0x200),
but I'm not sure where/how the NIP value (where the exception occurred)
is captured.

I also noticed most of the code has moved to head_32.h in newer kernel
versions, but EXCEPTION_PROLOG_1 and EXCEPTION_PROLOG_2 look pretty much
the same. I guess the same thing happens on a recent kernel, even though
I don't have an easy way to test it.

The original MCE that I see is triggered by a failed PCIe transaction,
but I was able to reproduce it by just reading from a (physically)
unmapped memory area. Sample code and kernel crash dump are included
below.

Can anyone please provide any suggestion as to what to look at next?

Thanks,
Radu

8<

#include 
#include 
#include 

static void __iomem *bad_addr_base;

static int __init test_mce_init(void)
{
unsigned int x;

bad_addr_base = ioremap(0xf000, 0x100);

if (bad_addr_base) {
__asm__ __volatile__ ("isync");
x = ioread32(bad_addr_base);
pr_info("Test: %#0x\n", x);
} else
pr_err("Cannot map\n");

return 0;
}

static void __exit test_mce_exit(void)
{
iounmap(bad_addr_base);
}

module_init(test_mce_init);
module_exit(test_mce_exit);

MODULE_LICENSE("GPL");

8<

[   14.977053] mce: loading out-of-tree module taints kernel.
[   15.004285] Disabling lock debugging due to kernel taint
[   15.026151] Machine check in kernel mode.
[   15.030153] Caused by (from SRR1=41000): [   15.033982] Transfer
error ack signal
[   15.037652] Oops: Machine check 1, sig: 7 [#1]
[   15.042088] PREEMPT [   15.044091] MPC8378_CUSTOM
[   15.046967] Modules linked in: mce(O+) iptable_filter ip_tables
x_tables ipv6 mpc8xxx_wdt yaffs spidev spi_fsl_spi spi_fsl_lib
spi_fsl_cpm fsl_mph_dr_of ehci_fsl ehci_hcd
[   15.067486] CPU: 0 PID: 1213 Comm: insmod Tainted: G   M C O
4.9.191-default-mpc8378-p3c692a64ae1d #31
[   15.078175] task: 9e83e550 task.stack: 9ea2e000
[   15.082699] NIP: 0900 LR: b147e030 CTR: 80015d50
[   15.087659] REGS: 9ea2fca0 TRAP: 0200   Tainted: G   M C O
(4.9.191-default-mpc8378-p3c692a64ae1d)
[   15.098084] MSR: 00041000 [   15.100973]   CR: 42002228  XER: 
[   15.104976] DAR: 80017414 DSISR: 
GPR00: b147e030 9ea2fd50 9e83e550  b148 9c652200 9ea2fd18 
GPR08: 9c652200  b148 1032 80015d50 100b93d0 b147e308 805eb3d8
GPR16: 003a 0550 b1473b5c b147c2a4 8048e444 80082b08  b147c0e8
GPR24: 0124 0578   b147c0a0 b147e000 9eb7c280 b147c0a0
NIP [0900] 0x900
[   15.139310] LR [b147e030] test_mce_init+0x30/0xa8 [mce]
[   15.144528] Call Trace:
[   15.146973] [9ea2fd50] [b147e000] test_mce_init+0x0/0xa8 [mce] (unreliable)
[   15.153940] [9ea2fd60] [b147e030] test_mce_init+0x30/0xa8 [mce]
[   15.159864] [9ea2fd70] [80003ac4] do_one_initcall+0xbc/0x184
[   15.165527] [9ea2fde0] [804857e8] do_init_module+0x64/0x1e4
[   15.171107] [9ea2fe00] [80086014] load_module+0x1c78/0x2268
[   15.176680] [9ea2fec0] [80086780] SyS_init_module+0x17c/0x190
[   15.182433] [9ea2ff40] [80010acc] ret_from_syscall+0x0/0x38
[   15.188005] --- interrupt: c01 at 0xfdfdb64
[   15.188005] LR = 0x10013c64
[   15.195309] Instruction dump:
[   15.198274]      
 
[   15.206047]     7d5043a6 
7d400026 
[   15.213824] ---[ end trace d38922938e009d45 ]---
[   15.218434]
[   16.219951] Kernel panic - not syncing: Fatal exception
[   16.225174] Rebooting in 1 seconds..


Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers

2019-06-19 Thread Radu Rendec
On Wed, 2019-06-19 at 10:36 +1000, Daniel Axtens wrote:
> Andreas Schwab <
> sch...@linux-m68k.org
> > writes:
> 
> > On Jun 18 2019, Radu Rendec <
> > radu.ren...@gmail.com
> > > wrote:
> > 
> > > Since you already have a working setup, it would be nice if you could
> > > add a printk to arch_ptrace() to print the address and confirm what I
> > > believe happens (by reading the gdb source code).
> > 
> > A ppc32 ptrace syscall goes through compat_arch_ptrace.

Right. I completely overlooked that part.

> Ah right, and that (in ptrace32.c) contains code that will work:
> 
> 
>   /*
>* the user space code considers the floating point
>* to be an array of unsigned int (32 bits) - the
>* index passed in is based on this assumption.
>*/
>   tmp = ((unsigned int *)child->thread.fp_state.fpr)
>   [FPRINDEX(index)];
> 
> FPRINDEX is defined above to deal with the various manipulations you
> need to do.

Correct. Basically it does the same that I did in my patch: it divides
the index again by 2 (it's already divided by 4 in compat_arch_ptrace()
so it ends up divided by 8), then takes the least significant bit and
adds it to the index. I take bit 2 of the original address, which is the
same thing (because in FPRHALF() the address is already divided by 4).

So we have this in ptrace32.c:

#define FPRNUMBER(i) (((i) - PT_FPR0) >> 1)
#define FPRHALF(i) (((i) - PT_FPR0) & 1)
#define FPRINDEX(i) TS_FPRWIDTH * FPRNUMBER(i) * 2 + FPRHALF(i)

index = (unsigned long) addr >> 2;
(unsigned int *)child->thread.fp_state.fpr)[FPRINDEX(index)]


And we have this in my patch:

fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
(void *)>thread.TS_FPR(fpidx) + (addr & 4)

> Radu: I think we want to copy that working code back into ptrace.c. 

I'm not sure that would work. There's a subtle difference: the code in
ptrace32.c is always compiled on a 64-bit kernel and the user space
calling it is always 32-bit; on the other hand, the code in ptrace.c can
be compiled on either a 64-bit kernel or a 32-bit kernel and the user
space calling it always has the same "bitness" as the kernel.

One difference is the size of the CPU registers. On 64-bit they are 8
byte long and user space knows that and generates 8-byte aligned
addresses. So you have to divide the address by 8 to calculate the CPU
register index correctly, which compat_arch_ptrace() currently doesn't.

Another difference is that on 64-bit `long` is 8 bytes, so user space
can read a whole FPU register in a single ptrace call. 

Now that we are all aware of compat_arch_ptrace() (which handles the
special case of a 32-bit process running on a 64-bit kernel) I would say
the patch is correct and does the right thing for both 32-bit and 64-bit 
kernels and processes.

> The challenge will be unpicking the awful mess of ifdefs in ptrace.c
> and making it somewhat more comprehensible.

I'm not sure what ifdefs you're thinking about. The only that are used
inside arch_ptrace() are PT_FPR0, PT_FPSCR and TS_FPR, which seem to be
correct.

But perhaps it would be useful to change my patch and add a comment just
before arch_ptrace() that explains how the math is done and that the
code must work on both 32-bit and 64-bit, the user space address
assumptions, etc.

By the way, I'm not sure the code in compat_arch_ptrace() handles
PT_FPSCR correctly. It might (just because fpscr is right next to fpr[]
in memory - and that's a hack), but I can't figure out if it accesses
the right half.

Radu




Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers

2019-06-18 Thread Radu Rendec
On Tue, 2019-06-18 at 16:42 +1000, Daniel Axtens wrote:
> Radu Rendec <
> radu.ren...@gmail.com
> > writes:
> 
> > On Mon, 2019-06-17 at 11:19 +1000, Daniel Axtens wrote:
> > > Radu Rendec <
> > > radu.ren...@gmail.com
> > > 
> > > > writes:
> > > > Hi Everyone,
> > > > 
> > > > I'm following up on the ptrace() problem that I reported a few days ago.
> > > > I believe my version of the code handles all cases correctly. While the
> > > > problem essentially boils down to dividing the fpidx by 2 on PPC32, it
> > > > becomes tricky when the same code must work correctly on both PPC32 and
> > > > PPC64.
> > > > 
> > > > One other thing that I believe was handled incorrectly in the previous
> > > > version is the unused half of fpscr on PPC32. Note that while PT_FPSCR
> > > > is defined as (PT_FPR0 + 2*32 + 1), making only the upper half visible,
> > > > PT_FPR0 + 2*32 still corresponds to a possible address that userspace
> > > > can pass. In that case, comparing fpidx to (PT_FPSCR - PT_FPR0) would
> > > > cause an invalid access to the FPU registers array.
> > > > 
> > > > I tested the patch on 4.9.179, but that part of the code is identical in
> > > > recent kernels so it should work just the same.
> > > > 
> > > > I wrote a simple test program than can be used to quickly test (on an
> > > > x86_64 host) that all cases are handled correctly for both PPC32/PPC64.
> > > > The code is included below.
> > > > 
> > > > I also tested with gdbserver (test patch included below) and verified
> > > > that it generates two ptrace() calls for each FPU register, with
> > > > addresses between 0xc0 and 0x1bc.
> > > 
> > > Thanks for looking in to this. I can confirm your issue. What I'm
> > > currently wondering is: what is the behaviour with a 32-bit userspace on
> > > a 64-bit kernel? Should they also be going down the 32-bit path as far
> > > as calculating offsets goes?
> > 
> > Thanks for reviewing this. I haven't thought about the 32-bit userspace
> > on a 64-bit kernel, that is a very good question. Userspace passes a
> > pointer, so in theory it could go down either path as long as the
> > pointer points to the right data type.
> > 
> > I will go back to the gdb source code and try to figure out if that case
> > is handled in a special way. If not, it's probably safe to assume that a
> > 32-bit userspace should always go down the 32-bit path regardless of the
> > kernel bitness (in which case I think I have to change my patch).
> 
> It doesn't seem to reproduce on a 64-bit kernel with 32-bit
> userspace. Couldn't tell you why - if you can figure it out from the gdb
> source code I'd love to know! I have, however, tried it - and all the fp
> registers look correct and KASAN doesn't pick up any memory corruption.

I looked at the gdb source code and all it seems to care about is the
architecture it was compiled for. In other words, 32-bit gdb always
assumes 32-bit register layout, regardless of whether it's running on a
32-bit or 64-bit kernel.

So it's no surprise that the problem didn't reproduce and KASAN didn't
pick up anything in your experiment. Since the kernel is 64-bit, it
divides addresses by 8, so all indexes are within bounds. The part that
I don't get is how the FP registers looked correct.

Since you already have a working setup, it would be nice if you could
add a printk to arch_ptrace() to print the address and confirm what I
believe happens (by reading the gdb source code).

So for 32-bit gdb on 64-bit kernel, I think gdb will generate 48 x 4-
byte aligned addresses for the CPU registers, then 32 x 2 x 4-byte
aligned addresses for the FPU registers. The indexes will not overflow,
but access to all registers (CPU and FPU) will be broken because:
 * The kernel always divides by 8. The CPU register address range that
   gdb generates will be half of what the kernel expects and "odd" (i.e.
   non 8-byte aligned) addresses will fail with EIO because the kernel
   code checks that the last 3 bits are all zero.
 * The FPU register indexes will be off by 24. For example, when gdb
   thinks it asks for FPR0, it calculates the address as 4 x 48, but the
   kernel divides by 8, so it will get index 24, which it thinks is a
   CPU register.

When it stops on a breakpoint, gdb seems to read all registers (both CPU
and FPU) in ascending index order. So if you print the address in the
kernel it's actually very easy to verify my theory. I expect addresses
from 0 to 48 x 4 in increments of 4 (for the CPU registers), then
addresses from 48 x 4 to 48 x 4 + 32 x 2 x 4 in increments of 4 (for the
FPU registers).

Best regards,
Radu




Re: [PATCH 0/1] PPC32: fix ptrace() access to FPU registers

2019-06-16 Thread Radu Rendec
On Mon, 2019-06-17 at 11:19 +1000, Daniel Axtens wrote:
> Radu Rendec <
> radu.ren...@gmail.com
> > writes:
> 
> > Hi Everyone,
> > 
> > I'm following up on the ptrace() problem that I reported a few days ago.
> > I believe my version of the code handles all cases correctly. While the
> > problem essentially boils down to dividing the fpidx by 2 on PPC32, it
> > becomes tricky when the same code must work correctly on both PPC32 and
> > PPC64.
> > 
> > One other thing that I believe was handled incorrectly in the previous
> > version is the unused half of fpscr on PPC32. Note that while PT_FPSCR
> > is defined as (PT_FPR0 + 2*32 + 1), making only the upper half visible,
> > PT_FPR0 + 2*32 still corresponds to a possible address that userspace
> > can pass. In that case, comparing fpidx to (PT_FPSCR - PT_FPR0) would
> > cause an invalid access to the FPU registers array.
> > 
> > I tested the patch on 4.9.179, but that part of the code is identical in
> > recent kernels so it should work just the same.
> > 
> > I wrote a simple test program than can be used to quickly test (on an
> > x86_64 host) that all cases are handled correctly for both PPC32/PPC64.
> > The code is included below.
> > 
> > I also tested with gdbserver (test patch included below) and verified
> > that it generates two ptrace() calls for each FPU register, with
> > addresses between 0xc0 and 0x1bc.
> 
> Thanks for looking in to this. I can confirm your issue. What I'm
> currently wondering is: what is the behaviour with a 32-bit userspace on
> a 64-bit kernel? Should they also be going down the 32-bit path as far
> as calculating offsets goes?

Thanks for reviewing this. I haven't thought about the 32-bit userspace
on a 64-bit kernel, that is a very good question. Userspace passes a
pointer, so in theory it could go down either path as long as the
pointer points to the right data type.

I will go back to the gdb source code and try to figure out if that case
is handled in a special way. If not, it's probably safe to assume that a
32-bit userspace should always go down the 32-bit path regardless of the
kernel bitness (in which case I think I have to change my patch).

Best regards,
Radu

> > 8<--- Makefile -
> > .PHONY: all clean
> > 
> > all: ptrace-fpregs-32 ptrace-fpregs-64
> > 
> > ptrace-fpregs-32: ptrace-fpregs.c
> > $(CC) -o ptrace-fpregs-32 -Wall -O2 -m32 ptrace-fpregs.c
> > 
> > ptrace-fpregs-64: ptrace-fpregs.c
> > $(CC) -o ptrace-fpregs-64 -Wall -O2 ptrace-fpregs.c
> > 
> > clean:
> > rm -f ptrace-fpregs-32 ptrace-fpregs-64
> > 8<--- ptrace-fpregs.c --
> > #include 
> > #include 
> > 
> > #define PT_FPR0 48
> > 
> > #ifndef __x86_64
> > 
> > #define PT_FPR31 (PT_FPR0 + 2*31)
> > #define PT_FPSCR (PT_FPR0 + 2*32 + 1)
> > 
> > #else
> > 
> > #define PT_FPSCR (PT_FPR0 + 32)
> > 
> > #endif
> > 
> > int test_access(unsigned long addr)
> > {
> > int ret;
> > 
> > do {
> > unsigned long index, fpidx;
> > 
> > ret = -EIO;
> > 
> > /* convert to index and check */
> > index = addr / sizeof(long);
> > if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
> > break;
> > 
> > if (index < PT_FPR0) {
> > ret = printf("ptrace_put_reg(%lu)", index);
> > break;
> > }
> > 
> > ret = 0;
> > #ifndef __x86_64
> > if (index == PT_FPSCR - 1) {
> > /* corner case for PPC32; do nothing */
> > printf("corner_case");
> > break;
> > }
> > #endif
> > if (index == PT_FPSCR) {
> > printf("fpscr");
> > break;
> > }
> > 
> > /*
> >  * FPR is always 64-bit; on PPC32, userspace does two 32-bit
> >  * accesses. Add bit2 to allow accessing the upper half on
> >  * 32-bit; on 64-bit, bit2 is always 0 (we validate it above).
> >  */
> > fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
> > printf("TS_FPR[%lu] + %lu", fpidx, addr & 4);
> > break;
> > } while (0);
> > 
> > ret

[PATCH 1/1] PPC32: fix ptrace() access to FPU registers

2019-06-10 Thread Radu Rendec
This patch addresses several issues with ptrace() access to FPU
registers through PTRACE_PEEKUSR/PTRACE_POKEUSR.

Standard CPU registers are of course the size of the machine word on
both PPC32/PPC64, but FPU registers are always 64-bit. Because the
ptrace() can only transfer one `long` at a time with PTRACE_PEEKUSR and
PTRACE_POKEUSR, on PPC32 userspace must do two separate ptrace() calls
to access a whole FPU register.

This patch fixes the code that translates between ptrace() addresses and
indexes into (struct thread_fp_state).fpr, taking into account all cases
for both PPC32/PPC64. In the previous version, on PPC32, the index was
double the correct value, allowing memory to be accessed beyond the
register array. This had the following side effects:
* Access to all FPU registers (except for FPR0) was broken.
* PTRACE_POKEUSR could corrupt memory following the FPU register array.
  That means the remainder of thread_struct, which is by design the last
  field of task_struct. For large enough ptrace() addresses, memory
  access could go even outside task_struct, corrupting the adjacent
  task_struct.

Note that gdb (which is probably the most frequent user of ptrace() with
PTRACE_PEEKUSR/PTRACE_POKEUSR) seems to always read/write all FPU
registers whenever a traced task stops.

Signed-off-by: Radu Rendec 
---
 arch/powerpc/kernel/ptrace.c | 85 ++--
 1 file changed, 52 insertions(+), 33 deletions(-)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index 684b0b315c32..060e5ed0fad9 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -2991,69 +2991,88 @@ long arch_ptrace(struct task_struct *child, long 
request,
switch (request) {
/* read the word at location addr in the USER area. */
case PTRACE_PEEKUSR: {
-   unsigned long index, tmp;
+   unsigned long index, fpidx, tmp = 0;
 
ret = -EIO;
/* convert to index and check */
+   index = addr / sizeof(long);
+   if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
+   break;
 #ifdef CONFIG_PPC32
-   index = addr >> 2;
-   if ((addr & 3) || (index > PT_FPSCR)
-   || (child->thread.regs == NULL))
-#else
-   index = addr >> 3;
-   if ((addr & 7) || (index > PT_FPSCR))
-#endif
+   if (child->thread.regs == NULL)
break;
+#endif
 
CHECK_FULL_REGS(child->thread.regs);
if (index < PT_FPR0) {
ret = ptrace_get_reg(child, (int) index, );
if (ret)
break;
-   } else {
-   unsigned int fpidx = index - PT_FPR0;
+   goto out_peekusr;
+   }
 
-   flush_fp_to_thread(child);
-   if (fpidx < (PT_FPSCR - PT_FPR0))
-   memcpy(, >thread.TS_FPR(fpidx),
-  sizeof(long));
-   else
-   tmp = child->thread.fp_state.fpscr;
+   flush_fp_to_thread(child);
+#ifdef CONFIG_PPC32
+   if (index == PT_FPSCR - 1)
+   /* corner case for PPC32; do nothing */
+   goto out_peekusr;
+#endif
+   if (index == PT_FPSCR) {
+   tmp = child->thread.fp_state.fpscr;
+   goto out_peekusr;
}
+   /*
+* FPR is always 64-bit; on PPC32, userspace does two 32-bit
+* accesses. Add bit2 to allow accessing the upper half on
+* 32-bit; on 64-bit, bit2 is always 0 (we validate it above).
+*/
+   fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
+   memcpy(, (void *)>thread.TS_FPR(fpidx) + (addr & 4),
+   sizeof(long));
+out_peekusr:
ret = put_user(tmp, datalp);
break;
}
 
/* write the word at location addr in the USER area */
case PTRACE_POKEUSR: {
-   unsigned long index;
+   unsigned long index, fpidx;
 
ret = -EIO;
/* convert to index and check */
+   index = addr / sizeof(long);
+   if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
+   break;
 #ifdef CONFIG_PPC32
-   index = addr >> 2;
-   if ((addr & 3) || (index > PT_FPSCR)
-   || (child->thread.regs == NULL))
-#else
-   index = addr >> 3;
-   if ((addr & 7) || (index > PT_FPSCR))
-#endif
+   if (child->thread.regs == NULL)
br

[PATCH 0/1] PPC32: fix ptrace() access to FPU registers

2019-06-10 Thread Radu Rendec
Hi Everyone,

I'm following up on the ptrace() problem that I reported a few days ago.
I believe my version of the code handles all cases correctly. While the
problem essentially boils down to dividing the fpidx by 2 on PPC32, it
becomes tricky when the same code must work correctly on both PPC32 and
PPC64.

One other thing that I believe was handled incorrectly in the previous
version is the unused half of fpscr on PPC32. Note that while PT_FPSCR
is defined as (PT_FPR0 + 2*32 + 1), making only the upper half visible,
PT_FPR0 + 2*32 still corresponds to a possible address that userspace
can pass. In that case, comparing fpidx to (PT_FPSCR - PT_FPR0) would
cause an invalid access to the FPU registers array.

I tested the patch on 4.9.179, but that part of the code is identical in
recent kernels so it should work just the same.

I wrote a simple test program than can be used to quickly test (on an
x86_64 host) that all cases are handled correctly for both PPC32/PPC64.
The code is included below.

I also tested with gdbserver (test patch included below) and verified
that it generates two ptrace() calls for each FPU register, with
addresses between 0xc0 and 0x1bc.

8<--- Makefile -
.PHONY: all clean

all: ptrace-fpregs-32 ptrace-fpregs-64

ptrace-fpregs-32: ptrace-fpregs.c
$(CC) -o ptrace-fpregs-32 -Wall -O2 -m32 ptrace-fpregs.c

ptrace-fpregs-64: ptrace-fpregs.c
$(CC) -o ptrace-fpregs-64 -Wall -O2 ptrace-fpregs.c

clean:
rm -f ptrace-fpregs-32 ptrace-fpregs-64
8<--- ptrace-fpregs.c --
#include 
#include 

#define PT_FPR0 48

#ifndef __x86_64

#define PT_FPR31 (PT_FPR0 + 2*31)
#define PT_FPSCR (PT_FPR0 + 2*32 + 1)

#else

#define PT_FPSCR (PT_FPR0 + 32)

#endif

int test_access(unsigned long addr)
{
int ret;

do {
unsigned long index, fpidx;

ret = -EIO;

/* convert to index and check */
index = addr / sizeof(long);
if ((addr & (sizeof(long) - 1)) || (index > PT_FPSCR))
break;

if (index < PT_FPR0) {
ret = printf("ptrace_put_reg(%lu)", index);
break;
}

ret = 0;
#ifndef __x86_64
if (index == PT_FPSCR - 1) {
/* corner case for PPC32; do nothing */
printf("corner_case");
break;
}
#endif
if (index == PT_FPSCR) {
printf("fpscr");
break;
}

/*
 * FPR is always 64-bit; on PPC32, userspace does two 32-bit
 * accesses. Add bit2 to allow accessing the upper half on
 * 32-bit; on 64-bit, bit2 is always 0 (we validate it above).
 */
fpidx = (addr - PT_FPR0 * sizeof(long)) / 8;
printf("TS_FPR[%lu] + %lu", fpidx, addr & 4);
break;
} while (0);

return ret;
}

int main(void)
{
unsigned long addr;
int rc;

for (addr = 0; addr < PT_FPSCR * sizeof(long) + 16; addr++) {
printf("0x%04lx: ", addr);
rc = test_access(addr);
if (rc < 0)
printf("!err!");
printf("\t<%d>\n", rc);
}

return 0;
}
8<--- gdb.patch 
--- gdb/gdbserver/linux-low.c.orig  2019-06-10 11:45:53.810882669 -0400
+++ gdb/gdbserver/linux-low.c   2019-06-10 11:49:32.272929766 -0400
@@ -4262,6 +4262,8 @@ store_register (struct regcache *regcach
   pid = lwpid_of (get_thread_lwp (current_inferior));
   for (i = 0; i < size; i += sizeof (PTRACE_XFER_TYPE))
 {
+  printf("writing register #%d offset %d at address %#x\n",
+ regno, i, (unsigned int)regaddr);
   errno = 0;
   ptrace (PTRACE_POKEUSER, pid,
/* Coerce to a uintptr_t first to avoid potential gcc warning
8<--

Radu Rendec (1):
  PPC32: fix ptrace() access to FPU registers

 arch/powerpc/kernel/ptrace.c | 85 ++--
 1 file changed, 52 insertions(+), 33 deletions(-)

-- 
2.20.1



Re: PowerPC arch_ptrace() writes beyond thread_struct/task_struct

2019-06-06 Thread Radu Rendec
On Thu, 2019-06-06 at 07:15 +0200, Christophe Leroy wrote:
> 
> Le 05/06/2019 à 23:45, Radu Rendec a écrit :
> > Hi Everyone,
> > 
> > I'm seeing some weird memory corruption that I have been able to isolate
> > to arch_ptrace() [arch/powerpc/kernel/ptrace.c] and PTRACE_POKEUSR. I am
> > on PowerPC 32 (MPC8378), kernel 4.9.179.
> > 
> > It's not very easy for me to test on the latest kernel, but I guess
> > little has changed since 4.9 in either the architecture specific ptrace
> > code or PowerPC register data structures.
> > 
> > What happens is that gdb calls ptrace(PTRACE_POKEUSER) with addr=0x158.
> > This goes down to arch_ptrace() [arch/powerpc/kernel/ptrace.c], inside
> > `case PTRACE_POKEUSR`, on the branch that does this:
> > 
> >  memcpy(>thread.TS_FPR(fpidx), ,
> >  sizeof(long));
> > 
> > where:
> >  index = addr >> 2 = 0x56 = 86
> >  fpidx = index - PT_FPR0 = 86 - 48 = 38
> 
> In struct thread_fp_state, fpr field is u64, so I guess we should have 
> the following on PPC32:
> 
> fpidx = (index - PT_FPR0) >> 1;

I guess this would only apply to PPC32, since everything up to fpidx is
calculated in units of sizeof(long) - which is 4 on PPC32 and 8 on
PPC64. But fpr[0:31] is always u64.

It also looks odd that only sizeof(long) bytes are ever copied for any
given fpr[fpidx], which means one half of the u64 is never accessible on
PPC32.

Ont other thing I don't get is the "+1" in the definition of PT_FPSCR
for PPC32:

#define PT_FPSCR (PT_FPR0 + 2*32 + 1)

Looking at struct thread_fp_state, fpscr follows immediately after
fpr[31]. Is the FPSCR register only 32-bit on PPC32? Is it stored in the
2nd half of (struct thread_fp_state).fpscr? This line:

child->thread.fp_state.fpscr = data;

suggests so. And in that case, the "+1" in PT_FPSCR makes sense, but
only for big endian: assigning `data` (which is "long", 32-bit) to the
`fpscr` field (which is "u64") would go to the higher address, which is
indeed "+1" in units of 32-bit words.

Then there is also a problem is the condition that determines whether
memcpy() is used to access one of the fpr[0:31] or fpscr is assigned
directly:

if (fpidx < (PT_FPSCR - PT_FPR0))

The case when the supplied addr points to the lower half of fpscr (which
is unused on PPC32?) erroneously indexes into fpr[0:31].

Is there any documentation of what "addr" is supposed to mean?


> >  >thread.TS_FPR(fpidx) = (void *)child + 1296
> > 
> >  offsetof(struct task_struct, thread) = 960
> >  sizeof(struct thread_struct) = 336
> >  sizeof(struct task_struct) = 1296
> > 
> > In other words, the memcpy() call writes just beyond thread_struct
> > (which is also beyond task_struct, for that matter).
> > 
> > This should never get past the bounds checks for `index`, so perhaps
> > there is a mismatch between ptrace macros and the actual register data
> > structures layout.
> > 
> > I will continue to investigate, but I'm not familiar with the PowerPC
> > registers so it will take a while before I make sense of all the data
> > structures and macros. Hopefully this rings a bell to someone who is
> > already familiar with those and could figure out quickly what the
> > problem is.




PowerPC arch_ptrace() writes beyond thread_struct/task_struct

2019-06-05 Thread Radu Rendec
Hi Everyone,

I'm seeing some weird memory corruption that I have been able to isolate
to arch_ptrace() [arch/powerpc/kernel/ptrace.c] and PTRACE_POKEUSR. I am
on PowerPC 32 (MPC8378), kernel 4.9.179.

It's not very easy for me to test on the latest kernel, but I guess
little has changed since 4.9 in either the architecture specific ptrace
code or PowerPC register data structures.

What happens is that gdb calls ptrace(PTRACE_POKEUSER) with addr=0x158.
This goes down to arch_ptrace() [arch/powerpc/kernel/ptrace.c], inside
`case PTRACE_POKEUSR`, on the branch that does this:

memcpy(>thread.TS_FPR(fpidx), ,
sizeof(long));

where:
index = addr >> 2 = 0x56 = 86
fpidx = index - PT_FPR0 = 86 - 48 = 38
>thread.TS_FPR(fpidx) = (void *)child + 1296

offsetof(struct task_struct, thread) = 960
sizeof(struct thread_struct) = 336
sizeof(struct task_struct) = 1296

In other words, the memcpy() call writes just beyond thread_struct
(which is also beyond task_struct, for that matter).

This should never get past the bounds checks for `index`, so perhaps
there is a mismatch between ptrace macros and the actual register data
structures layout.

I will continue to investigate, but I'm not familiar with the PowerPC
registers so it will take a while before I make sense of all the data
structures and macros. Hopefully this rings a bell to someone who is
already familiar with those and could figure out quickly what the
problem is.

Best regards,
Radu Rendec


Re: [PATCH 0/1] Fix NULL pointer access in PowerPC MSI teardown code

2018-11-28 Thread Radu Rendec
Hi Michael,

On Wed, Nov 28, 2018 at 6:00 AM Michael Ellerman  wrote:
>
> Radu Rendec  writes:
> >
> > The assumption in arch_teardown_msi_irqs() is wrong and results in a
> > function call on a NULL pointer. An example of how this can happen is
> > included in the actual patch header. In my case, it happens when the PCI
> > hardware is configured during kernel start-up, because my controller
> > doesn't support MSI and the ops are NULL.
>
> What hardware are you on?

I'm on Freescale MPC8378 - old stuff, but still going strong :)
The MSI capable device is a Broadcom PEX 8613 (a 3-port PCIe switch).

> > I'm proposing the attached patch to fix the problem. It basically just
> > checks the pointer before the function call.
>
> Yeah that patch looks good to me.
>
> I suspect this bug was introduced in:
>
>   6b2fd7efeb88 ("PCI/MSI/PPC: Remove arch_msi_check_device()")
>
> Previously we had that check routine which would run before any of the
> MSI setup had been done, and so if there were no MSI ops then we bailed
> out early and didn't call teardown.
>
> I guess since then (2014) we haven't tested an MSI capable device on a
> system that isn't MSI capable?

Thanks for looking into this. You're probably right, it really looks like
that patch could have introduced this bug.

Cheers,
Radu


[PATCH 1/1] Fix NULL pointer access in PowerPC MSI teardown code

2018-11-27 Thread Radu Rendec
The arch_teardown_msi_irqs() function assumes that controller ops
pointers were already checked in arch_setup_msi_irqs(), but this
assumption is wrong: arch_teardown_msi_irqs() can be called even when
arch_setup_msi_irqs() returns an error (-ENOSYS).

This can happen in the following scenario:

  * msi_capability_init() calls pci_msi_setup_msi_irqs()
  * pci_msi_setup_msi_irqs() returns -ENOSYS
  * msi_capability_init() notices the error and calls free_msi_irqs()
  * free_msi_irqs() calls pci_msi_teardown_msi_irqs()

This is easier to see when CONFIG_PCI_MSI_IRQ_DOMAIN is not set and
pci_msi_setup_msi_irqs() and pci_msi_teardown_msi_irqs() are just
aliases to arch_setup_msi_irqs() and arch_teardown_msi_irqs().

The call to free_msi_irqs() upon pci_msi_setup_msi_irqs() failure seems
legit, as it does additional cleanup; e.g. list_del(>list) and
kfree(entry) inside free_msi_irqs() do happen (MSI descriptors are
allocated before pci_msi_setup_msi_irqs() is called and need to be
cleaned up if that fails).

Signed-off-by: Radu Rendec 
---
 arch/powerpc/kernel/msi.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index dab616a33b8d..83c2043cc685 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -34,5 +34,12 @@ void arch_teardown_msi_irqs(struct pci_dev *dev)
 {
struct pci_controller *phb = pci_bus_to_host(dev->bus);
 
-   phb->controller_ops.teardown_msi_irqs(dev);
+   /*
+* We can be called even when arch_setup_msi_irqs() returns -ENOSYS,
+* so check the pointer again. Example: msi_capability_init() calls
+* pci_msi_setup_msi_irqs(), then free_msi_irqs(), which in turn calls
+* pci_msi_teardown_msi_irqs().
+*/
+   if (phb->controller_ops.teardown_msi_irqs)
+   phb->controller_ops.teardown_msi_irqs(dev);
 }
-- 
2.17.2



[PATCH 0/1] Fix NULL pointer access in PowerPC MSI teardown code

2018-11-27 Thread Radu Rendec
Hi everyone,

It seems there's an unchecked access to a NULL pointer (to a function)
in the PowerPC MSI teardown code. I found this on kernel 4.9, but the
code looks identical in the latest 4.20-rc. I don't see any reason why
this wouldn't happen on recent kernels too.

The PowerPC architecture specific MSI setup and teardown functions are
in arch/powerpc/kernel/msi.c:

  * arch_setup_msi_irqs() checks pointers for both the setup_msi_irqs
and teardown_msi_irqs ops and returns -ENOSYS if either one is NULL.

  * arch_teardown_msi_irqs() calls on the teardown_msi_irqs op pointer
without checking it and assumes the function is never called unless
arch_setup_msi_irqs() returns successfully.

The assumption in arch_teardown_msi_irqs() is wrong and results in a
function call on a NULL pointer. An example of how this can happen is
included in the actual patch header. In my case, it happens when the PCI
hardware is configured during kernel start-up, because my controller
doesn't support MSI and the ops are NULL.

I'm proposing the attached patch to fix the problem. It basically just
checks the pointer before the function call.

The patch is against v4.20-rc4, but I only actually tested it on
v4.9.115. On the other hand, the patch is trivial and I did check that
the NULL pointer dereference scenario is still valid on v4.20-rc4.

Best regards,
Radu Rendec


Radu Rendec (1):
  Fix NULL pointer access in PowerPC MSI teardown code

 arch/powerpc/kernel/msi.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

-- 
2.17.2



Regression: compile error on mpc83xx

2018-10-18 Thread Radu Rendec
Hi everyone,

I'm getting the following compile error for mpc83xx in v4.9.115:

arch/powerpc/sysdev/built-in.o: In function `fsl_of_msi_probe':
fsl_msi.c:(.text+0x1548): undefined reference to `fsl_mpic_primary_get_version'

This seems to have been fixed in v3.12-rc1 by commit df1024ad8728. Then,
somewhere between v4.0 and v4.1 the following two commits changed the
arch/powerpc/include/asm/mpic.h file and that function again:
5e86bfde9cd9
807d38b73b63

After those two commits, it seems we're back to square one. I haven't
tested, but it looks like there haven't been any other changes to that
file up to the latest master in Linus' tree, so I believe the problem is
still there in the latest kernel.

Should commit df1024ad8728 be reapplied?

Thanks,
Radu Rendec


Re: MPC83xx reset status register (RSR, offset 0x910)

2018-09-18 Thread Radu Rendec
Hi Christophe,

On Thu, 2018-09-13 at 10:21 +0200, Christophe LEROY wrote:
>
> Le 11/09/2018 à 00:17, Radu Rendec a écrit :
> >
> > The MPC83xx also has a watchdog and the kernel driver (mpc8xxx_wdt.c)
> > could also be improved to support the WDIOC_GETBOOTSTATUS ioctl and
> > properly report if the system rebooted due to a watchdog.
>
> Very good idea.
>
> I just submitted a patch for that. Please look at it.
> I'm sure any driver which needs reset status information can do the same.

Thanks for submitting the patch and sorry for the late reply! I followed
the conversation between you and Guenter and it seems your patches are
almost accepted. That's a good thing.

> If we want to do something more central, maybe we should look at what
> was done on ARM:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04fef228fb00

Thanks for pointing out that commit. It's very similar to what I wanted
to do (for MPC83xx) in the first place: read the RSR value on start-up
into a variable and export it as a symbol to make it available to other
drivers.

I would take on the work to implement something similar for PowerPC, but
I need some guidance as to what goes where. For instance, what would be
the PowerPC equivalent of arch/arm/mach-pxa/reset.c, which defines the
reset_status variable?

Another question is if the device tree should be used. We already have
separate directories for each platform in arch/powerpc/platforms and I
guess for each platform the RSR is always there and at a fixed, well
known address.

Thanks,
Radu



Re: MPC83xx reset status register (RSR, offset 0x910)

2018-09-10 Thread Radu Rendec
Hi,

On Mon, 2018-09-10 at 07:37 +0200, Christophe Leroy wrote:
> Le 10/09/2018 à 01:13, Radu Rendec a écrit :
> >
> > I'm using U-boot as well, but it's just not configured to read or clear
> > the RSR. I'm curious: if U-boot reads/clears the RSR in your case, how
> > do you make the initial value available to user space programs running
> > under Linux?
>
> I'm surprised. When looking at U-boot code, I don't see any way to
> configure that. It seems just do by default in function cpu_init_f():
>
> https://elixir.bootlin.com/u-boot/v2018.07/source/arch/powerpc/cpu/mpc83xx/cpu_init.c#L217
>
> /* RSR - Reset Status Register - clear all status (4.6.1.3) */
> gd->arch.reset_status = __raw_readl(>reset.rsr);
> __raw_writel(~(RSR_RES), >reset.rsr);

I'm working as a contractor in a large embedded project, so I don't know
all the bits and pieces. I just checked the U-boot code. Whoever was
maintaining it, "configured" it by commenting out the __raw_writel()
that clears the register :)

Probably the reason was specifically to be able to read it from Linux,
but unfortunately the guy is not here any more to ask him.

It may make more sense to read it from U-boot, but (1) the value must
still be passed to Linux somehow and (2) in my case, I would prefer not
to touch U-boot.

> Do you know any user space program in Linux that needs this value ?

I don't know of any "standard" program that needs it. In the project I'm
working on, there are multiple peripherals on the board and initialization
is slightly different when the reset line is physically asserted vs. a
soft CPU reset. Besides, we need to show the reset reason to the user.

I guess in the embedded world this is a fairly common use case, so
perhaps others can benefit from that if I fix it in a way that can be
pushed upstream.

> > Thank you very much for the patches. Is there any chance they can be
> > submitted upstream?
>
> I see no problem submitting them upstream, but are they really worth it
> ? Adding Michael in copy to get his opinion.

I guess it's worth if they are changed to make the value available to
the kernel and user space rather than just decoding/printing it, for the
reasons I mentioned above.

The MPC83xx also has a watchdog and the kernel driver (mpc8xxx_wdt.c)
could also be improved to support the WDIOC_GETBOOTSTATUS ioctl and
properly report if the system rebooted due to a watchdog.

> > I tried to look for something similar on other platforms or architectures,
> > but couldn't find anything.
>
> I believe furst thing is to identify some app needing such an
> information, then we'll be able to investigate how to handle it.

Well, I guess I explained my reasons and use case. If there is any
interest in that, I will gladly implement it in a way that makes sense
to upstream. Let's see what Michael thinks.

Thanks,
Radu Rendec


Re: MPC83xx reset status register (RSR, offset 0x910)

2018-09-09 Thread Radu Rendec
Hi,

On Fri, 2018-08-24 at 16:20 +, Christophe Leroy wrote:
> > On 08/03/2018 04:36 PM, Radu Rendec wrote:
> >
> > Is there any kernel code that handles the "reset status register" (RSR)
> > on MPC83xx? I looked at arch/powerpc/platforms/83xx/misc.c, but it seems
> > to only map the reset register area and it's static. The watchdog driver
> > (drivers/watchdog/mpc8xxx_wdt.c) doesn't seem to look at it either (for
> > the bootstatus flags).
>
> How do you boot your Linux kernel ?
>
> My 832x board boots using U-boot, and U-boot reads the RSR then clears
> it. So when Linux kernel reads it, it is just 0.

I'm using U-boot as well, but it's just not configured to read or clear
the RSR. I'm curious: if U-boot reads/clears the RSR in your case, how 
do you make the initial value available to user space programs running 
under Linux?

> > Basically I need to check the CPU reset reason and I thought I would ask
> > first, before starting to write any code of my own.
>
> Anyway, find below a set of two patches I used for testing. Feel free to
> use them if you bootloader doesn't clear the register

Thank you very much for the patches. Is there any chance they can be 
submitted upstream?

Of course, just printing the decoded bits is only helpful for testing. I
was thinking of a way to make the value available to both the kernel 
(as an exported symbol) and user space (e.g. via sysfs). Is there a
standard or preferred way to do this?

I tried to look for something similar on other platforms or architectures,
but couldn't find anything.

Thanks,
Radu Rendec




MPC83xx reset status register (RSR, offset 0x910)

2018-08-03 Thread Radu Rendec
Hi Everyone,

Is there any kernel code that handles the "reset status register" (RSR)
on MPC83xx? I looked at arch/powerpc/platforms/83xx/misc.c, but it seems
to only map the reset register area and it's static. The watchdog driver
(drivers/watchdog/mpc8xxx_wdt.c) doesn't seem to look at it either (for
the bootstatus flags).

Basically I need to check the CPU reset reason and I thought I would ask
first, before starting to write any code of my own.

Thanks,
Radu Rendec