Re: [PATCH] cxl: Check if vphb exists before iterating over AFU devices

2017-12-21 Thread Greg KH
On Thu, Dec 21, 2017 at 09:36:19AM +0530, Vaibhav Jain wrote:
> commit 12841f87b7a8ceb3d54f171660f72a86941bfcb3 upstream, for 4.3.

Thanks, now applied to 4.4.

greg k-h


Re: [PATCH] On ppc64le we HAVE_RELIABLE_STACKTRACE

2017-12-21 Thread Michael Ellerman
Josh Poimboeuf  writes:

> On Tue, Dec 19, 2017 at 12:28:33PM +0100, Torsten Duwe wrote:
>> On Mon, Dec 18, 2017 at 12:56:22PM -0600, Josh Poimboeuf wrote:
>> > On Mon, Dec 18, 2017 at 03:33:34PM +1000, Nicholas Piggin wrote:
>> > > On Sun, 17 Dec 2017 20:58:54 -0600
>> > > Josh Poimboeuf  wrote:
>> > > 
>> > > > On Fri, Dec 15, 2017 at 07:40:09PM +1000, Nicholas Piggin wrote:
>> > > > > On Tue, 12 Dec 2017 08:05:01 -0600
>> > > > > Josh Poimboeuf  wrote:
>> > > > >   
>> > > > > > What about leaf functions?  If a leaf function doesn't establish a stack
>> > > > > > frame, and it has inline asm which contains a blr to another function,
>> > > > > > this ABI is broken.
>> > > > 
>> > > > Oops, I meant to say "bl" instead of "blr".
>> 
>> You need to save LR, one way or the other. If gcc thinks it's a leaf function and
>> does not do it, nor does your asm code, you'll return in an endless loop => bug.
>
> Ah, so the function's return path would be corrupted, and an unreliable
> stack trace would be the least of our problems.

That's mostly true.

It is possible to save LR somewhere other than the correct stack slot,
in which case you can return correctly but still confuse the unwinder. A
function can hide its caller that way.

It's stupid and we should never do it, but it's not impossible.

...

> So with your proposal, I think I'm convinced that we don't need objtool
> for ppc64le.  Does anyone disagree?

I don't disagree, but I'd be happier if we did have objtool support.

Just because it would give us a lot more certainty that we're doing the
right thing everywhere, including in hand-coded asm and inline asm.

It's easy to write powerpc asm such that stack traces are reliable, but
it is *possible* to break them.

> There are still a few more things that need to be looked at:
>
> 1) With function graph tracing enabled, is the unwinder smart enough to
>    get the original function return address, e.g. by calling
>    ftrace_graph_ret_addr()?

No I don't think so.

> 2) Similar question for kretprobes.
>
> 3) Any other issues with generated code (e.g., bpf, ftrace trampolines),
>    runtime patching (e.g., CPU feature alternatives), kprobes, paravirt,
>    etc, that might confuse the unwinder?

We'll have to look, I can't be sure off the top of my head.

> 4) As a sanity check, it *might* be a good idea for
>    save_stack_trace_tsk_reliable() to ensure that it always reaches the
>    end of the stack.  There are several ways to do that:
>
>    - If the syscall entry stack frame is always the same size, then the
>      "end" would simply mean that the stack pointer is at a certain
>      offset from the end of the task stack page.  However this might not
>      work for kthreads and idle tasks, unless their stacks also start at
>      the same offset.  (On x86 we actually standardized the end of stack
>      location for all tasks, both user and kernel.)

Yeah it differs between user and kernel.

>    - If the unwinder can get to the syscall frame, it can presumably
>      examine regs->msr to check the PR bit to ensure it got all the way
>      to syscall entry.  But again this might only work for user tasks,
>      depending on how kernel task stacks are set up.

That sounds like a good idea. We could possibly mark the last frame of
kernel tasks somehow.

>    - Or a different approach would be to do error checking along the
>      way, and reporting an error for any unexpected conditions.
>
>    However, given that backlink/LR corruption doesn't seem possible with
>    this architecture, maybe #4 would be overkill.  Personally I would
>    feel more comfortable with an "end" check and a WARN() if it doesn't
>    reach the end.

Yeah I agree.

cheers
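
For illustration, the "end" check discussed above can be modelled in
userspace C. This is only a rough sketch under simplifying assumptions, not
kernel code: struct frame is a cut-down stand-in for a ppc64le stack frame
(back chain first, then the CR and LR save slots), and walk_stack_reliable()
and its parameters are hypothetical names. The walk is reported as reliable
only if it follows strictly increasing back chains and finishes exactly at
the expected last frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical model of a stack frame header. */
struct frame {
	uintptr_t back_chain;	/* address of the caller's frame */
	uintptr_t cr_save;
	uintptr_t lr_save;	/* saved return address */
};

/*
 * Walk the back chain starting at sp and record the saved return addresses.
 * Returns 0 only if the walk ends exactly at expected_end; returns -1 for
 * anything unexpected (in the kernel this is where a WARN() could go).
 */
int walk_stack_reliable(uintptr_t sp, uintptr_t expected_end,
			uintptr_t *trace, size_t max, size_t *count)
{
	size_t n = 0;

	assert(trace != NULL && count != NULL);

	while (sp != expected_end) {
		const struct frame *f;

		/* Null, misaligned, or already past the expected end. */
		if (sp == 0 || sp % sizeof(uintptr_t) != 0 || sp > expected_end)
			return -1;
		if (n == max)
			return -1;

		f = (const struct frame *)sp;
		trace[n++] = f->lr_save;

		/* The stack grows down, so each caller frame must be higher. */
		if (f->back_chain <= sp)
			return -1;
		sp = f->back_chain;
	}

	*count = n;
	return 0;
}
```

A caller would pass the address of the innermost frame and the address where
the task's last frame is expected to live; any mismatch makes the whole trace
unreliable rather than silently truncated.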


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-21 Thread Michael Ellerman
Matthew Wilcox  writes:

> On Mon, Dec 18, 2017 at 01:35:47PM -0700, Ross Zwisler wrote:
>> What I'm hoping to do with this series is to just provide a sysfs
>> representation of the HMAT so that applications can know which NUMA nodes to
>> select with existing utilities like numactl.  This series does not currently
>> alter any kernel behavior, it only provides a sysfs interface.
>> 
>> Say for example you had a system with some high bandwidth memory (HBM), and
>> you wanted to use it for a specific application.  You could use the sysfs
>> representation of the HMAT to figure out which memory target held your HBM.
>> You could do this by looking at the local bandwidth values for the various
>> memory targets, so:
>> 
>>  # grep . /sys/devices/system/hmat/mem_tgt*/local_init/write_bw_MBps
>>  /sys/devices/system/hmat/mem_tgt2/local_init/write_bw_MBps:81920
>>  /sys/devices/system/hmat/mem_tgt3/local_init/write_bw_MBps:40960
>>  /sys/devices/system/hmat/mem_tgt4/local_init/write_bw_MBps:40960
>>  /sys/devices/system/hmat/mem_tgt5/local_init/write_bw_MBps:40960
>> 
>> and look for the one that corresponds to your HBM speed. (These numbers are
>> made up, but you get the idea.)
>
> Presumably ACPI-based platforms will not be the only ones who have the
> ability to expose different bandwidth memories in the future.  I think
> we need a platform-agnostic way ... right, PowerPC people?

Yes!

I don't have any detail at hand but will try and rustle something up.

cheers
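
As a rough sketch of the selection step in Ross's example, the following C
helpers pick the memory target with the highest reported write bandwidth from
lines in the "path:value" form that grep prints. The sysfs paths are the ones
proposed in the series and the numbers are the made-up ones from the mail;
parse_bw_line() and best_target() are hypothetical names, not part of any
existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Parse the value after the last ':' in "path:value"; -1 on malformed input. */
long parse_bw_line(const char *line)
{
	const char *colon;
	char *end;
	long val;

	assert(line != NULL);

	colon = strrchr(line, ':');
	if (colon == NULL || colon[1] == '\0')
		return -1;

	val = strtol(colon + 1, &end, 10);
	if (end == colon + 1 || (*end != '\0' && *end != '\n'))
		return -1;

	return val < 0 ? -1 : val;
}

/*
 * Return the index of the line with the largest bandwidth, or -1 if no line
 * could be parsed. On success the bandwidth is stored in *best_bw.
 */
int best_target(const char *const lines[], size_t n, long *best_bw)
{
	int best = -1;
	long max = -1;

	for (size_t i = 0; i < n; i++) {
		long bw = parse_bw_line(lines[i]);

		if (bw > max) {
			max = bw;
			best = (int)i;
		}
	}

	if (best >= 0 && best_bw != NULL)
		*best_bw = max;
	return best;
}
```

With the four sample lines from the mail, the first entry (mem_tgt2, 81920
MB/s) would be selected as the likely HBM target.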


[PATCH 1/1] powerpc/pseries: Use the system workqueue as fallback to hotplug workqueue

2017-12-21 Thread Jose Ricardo Ziviani
The hotplug engine uses its own workqueue to handle IRQ requests. The
problem is that this workqueue is initialized relatively late in the boot
process.

Thus, once the kernel is ready to handle IRQ requests (after the system
workqueue is initialized), there is a window in which any hotplug issued
by the client results in a kernel panic. That window lasts until the
hotplug workqueue is initialized.

It would be good to have the hotplug workqueue initialized as early as
the system workqueue, but I don't think that is possible. So this patch
uses the system workqueue as a fallback to handle such IRQs.

Signed-off-by: Jose Ricardo Ziviani 
---
 arch/powerpc/platforms/pseries/dlpar.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 6e35780c5962..0474aa14b5f6 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -399,7 +399,15 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
 		work->errlog = hp_errlog_copy;
 		work->hp_completion = hotplug_done;
 		work->rc = rc;
-		queue_work(pseries_hp_wq, (struct work_struct *)work);
+
+		/* The hotplug workqueue may happen to be NULL at the moment
+		 * this code is executed, during the boot phase. So, in this
+		 * scenario, we can fallback to the system workqueue.
+		 */
+		if (unlikely(pseries_hp_wq == NULL))
+			schedule_work((struct work_struct *)work);
+		else
+			queue_work(pseries_hp_wq, (struct work_struct *)work);
 	} else {
 		*rc = -ENOMEM;
 		kfree(hp_errlog_copy);
-- 
2.14.1



[PATCH 0/1] Uses the system workqueue as fallback

2017-12-21 Thread Jose Ricardo Ziviani
In order to avoid a kernel panic after memory hotplug in the early stages of
the boot process (when the kernel is already able to handle IRQs), this patch
uses the system workqueue as a fallback to the hotplug workqueue.

After this patch I'm no longer able to reproduce the problem, and the memory
is successfully plugged at any stage of the boot process.

Thank you

Error scenario:

Booting Linux via __start() @ 0x0200 ...
[0.00] Detected Power 8 processor 
[0.00] Warning: Processor - this hardware has not undergone testing by Red Hat and might not be certified. Please consult https://hardware.redhat.com for certified hardware.
 -> smp_release_cpus()
spinning_secondaries = 3
 <- smp_release_cpus()
Linux ppc64le
#1 SMP Wed Nov 2[0.021319] Unable to handle kernel paging request for data at address 0x0100
[0.021379] Faulting instruction address: 0xc015c420
[0.021423] Oops: Kernel access of bad area, sig: 11 [#1]
[0.021457] LE SMP NR_CPUS=2048 NUMA pSeries
[0.021493] Modules linked in:
[0.021522] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.14.0-9.el7a.ppc64le #1
[0.021572] task: c0047bb8 task.stack: c0047bbc
[0.021615] NIP:  c015c420 LR: c015cae4 CTR: 
[0.021666] REGS: c0047ffeb920 TRAP: 0380   Not tainted  (4.14.0-9.el7a.ppc64le)
[0.021715] MSR:  82009033   CR: 2842  XER: 2000
[0.021769] CFAR: c015cae0 SOFTE: 0 
[0.021769] GPR00: c015cae4 c0047ffebba0 c14ca600 0800
[0.021769] GPR04:  c0047e1e5000 01a0 c00e19e0
[0.021769] GPR08: 000fffe1  000fffe0 02001001
[0.021769] GPR12: c00e0ea0 c7ac c000d0b8
[0.021769] GPR16:  c0047e1e5000
[0.021769] GPR20:  0001 0002 0015
[0.021769] GPR24: c001fdc075b8 0001 c001fdc07400
[0.021769] GPR28: 0800   0800
[0.022196] NIP [c015c420] __queue_work+0x80/0x690
[0.022231] LR [c015cae4] queue_work_on+0xb4/0xf0
[0.022264] Call Trace:
[0.022283] [c0047ffebba0] [c017d948] ttwu_do_wakeup+0x228/0x290 (unreliable)
[0.022334] [c0047ffebc90] [c015cae4] queue_work_on+0xb4/0xf0
[0.022377] [c0047ffebcd0] [c00e36d0] queue_hotplug_event+0xe0/0x160
[0.022428] [c0047ffebd20] [c00e0fe0] ras_hotplug_interrupt+0x140/0x160
[0.022480] [c0047ffebdb0] [c01d0a20] __handle_irq_event_percpu+0xa0/0x330
...
[1.024963] Kernel panic - not syncing: Fatal exception in interrupt
[1.027080] Rebooting in 10 seconds..

Test case 1: Hotplug during the boot process, after the hotplug wq
initialization

[0.554391] rtas_flash: no firmware flash support
[0.554464]  [devlog pseries_hp_wq] ALLOCed
[0.555021] Initialise system trusted keyrings
...
...
Welcome to Red Hat Enterprise Linux Server 7.4 (Maipo) dracut-033-502.el7 (Initramfs)!
...
[  OK  ] Started dracut cmdline hook.
 Starting dracut pre-udev hook...
(qemu) object_add memory-backend-ram,id=mem1,size=10G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
[0.754432]  [devlog pseries_hp_wq] 0xfec52400L
[0.765465] pseries-hotplug-mem: Attempting to hot-add 40 LMB(s) at index 8010
[0.765710] radix-mmu: Mapped 0xc001-0xc0011000 with 2.00 MiB pages
...
(qemu) info memory-devices
Memory device [dimm]: "dimm1"
  addr: 0x1
  slot: 0
  node: 0
  size: 10737418240
  memdev: /objects/mem1
  hotplugged: true
  hotpluggable: true

Test case 2: Hotplug during the boot process, before the hotplug wq
initialization

[0.028103] NET: Registered protocol family 1
[0.028745] Unpacking initramfs...
(qemu) object_add memory-backend-ram,id=mem1,size=10G
(qemu) device_add pc-dimm,id=dimm1,memdev=mem1
[0.407070]  [devlog pseries_hp_wq] 0x0 (using system wq)
[0.407420] pseries-hotplug-mem: Attempting to hot-add 40 LMB(s) at index 8010
[0.407749] radix-mmu: Mapped 0xc001-0xc0011000 with 2.00 MiB pages
...
[0.627554] rtas_flash: no firmware flash support
[0.627674]  [devlog pseries_hp_wq] ALLOCed
[0.628451] Initialise system trusted keyrings
...
(qemu) info memory-devices
Memory device [dimm]: "dimm1"
  addr: 0x1
  slot: 0
  node: 0
  size: 10737418240
  memdev: /objects/mem1
  hotplugged: true
  hotpluggable: true

Jose Ricardo Ziviani (1):
  powerpc/pseries: Use the system workqueue as fallback to hotplug
workqueue

 arch/powerpc/platforms/pseries/dlpar.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

-- 
2.14.1



Re: [PATCH v4 00/11] ASoC: fsl_ssi: Clean up - coding style level

2017-12-21 Thread Caleb Crome
On Thu, Dec 21, 2017 at 8:08 AM, Caleb Crome  wrote:


On Wed, Dec 20, 2017 at 3:40 AM, Arnaud Mouiche
 wrote:
>
>
>
> On 19/12/2017 01:25, Caleb Crome wrote:
>
>> On Mon, Dec 18, 2017 at 3:02 PM, Nicolin Chen  wrote:
>>>
>>> On Mon, Dec 18, 2017 at 02:19:08PM -0800, Caleb Crome wrote:
>>>
>>>
>>>>> Acked-by: Timur Tabi 
>>>
>>> --- To Mark ---
>>>
>>>
>>> Mark, can you still take these changes first? Since this failed
>>>
>>> test that Caleb reported here is already existing on the top of
>>>
>>> the mainline tree, I would like to treat this mail as a separate
>>>
>>> bug report and fix it with a separate patch.
>>>
>>>
>>> Besides, this series of changes don't change any function flow.
>>>
>>>
>>> Thank you
>>>
>> Sorry!  I should have created a separate thread for this subject.  My
>>
>> comments have *nothing* to do with this patch set, except they are
>>
>> about the same source files.
>>
>>
>>> --- To Caleb ---
>>>
>>>
>>>> I'm re-setting up my loopback test to try to verify these most recent
>>>> changes.
>>>
>>> I really appreciate your verification and help.
>>
>> Of course!  I have this wandboard permanently set up for this
>>
>> verification test, so that I can easily repeat whenever I touch our
>>
>> kernel.
>>
>>
>> It's a dead-simple hardware mod just to connect TX to RX.
>>
>>
>>>> warn:   11a0 11a1 1160 11a3 11a4 11a5 11a6 11a7
>>>> warn: Valid frame after 1 invalid frames
>>>> warn:   11c0 11c1 11c2 11c3 11c4 11c5 11c6 11c7
>>>> warn: first invalid frame while expecting frame 0x00a0
>>>> warn:   13e7 1400 1401 1402 1403 1404 1405 1404
>>>> warn:   1407 1420 1421 1422 1423 1424 1425 1426
>>>> warn:   1427 1440 1441 1442 1443 1444 1445 1484
>>>> warn:   1447 1460 1461 1462 1463 1464 1465 1466
>>>>
>>>> Those last 4 lines are the channel slips -- the least significant
>>>> nibble should be the channel number:  i.e. should go 0, 1, 2, 3, 4, 5,
>>>> 6, 7.
>>>>
>>>> Ugh, so it's basically quite broken again -- before these patches.
>>>
>>> I remember Arnaud reviewed one of my changes back to September.
>>>
>>> So I suppose the test should be fine at that time -- so a change
>>> being merged recently might have impacted the test result.
>>
>>
>> It's certainly possible that I'm doing something wrong again -- it
>> wouldn't be the first time :-)
>
[resend -- previous wasn't in plain text mode]

Okay, operator error on my part.  There was an old clock setting in my
ssi3 dtsi file that (falsely) modified the ssi baud clock frequency.
Nicolin's patch

ASoC: fsl_ssi: Caculate bit clock rate using slot number and width

now properly computes the master clock, and the old dtsi settings that
were necessary to fake things into the right speed are now obsolete.

So... basically, everything is back to working properly.  It wasn't
broken at all -- just my oversight on an ssi clock setting in the dtb.

-Caleb
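
For anyone reading the loopback output quoted above, the channel-slip check
can be sketched in a few lines of C. This is an illustrative sketch, not the
actual test program: it assumes 8-channel frames of 16-bit samples whose
least significant nibble carries the channel number, as described in the
mail, and first_slipped_channel() is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHANNELS 8

/*
 * Return the index of the first sample whose low nibble does not match its
 * position in the frame, or -1 if the frame is aligned (0, 1, ..., 7).
 */
int first_slipped_channel(const uint16_t frame[CHANNELS])
{
	assert(frame != NULL);

	for (int ch = 0; ch < CHANNELS; ch++) {
		if ((frame[ch] & 0xF) != ch)
			return ch;
	}
	return -1;
}
```

Applied to the quoted output, the frame starting with 11c0 is aligned, the
frame starting with 11a0 has its first mismatch at index 2 (1160), and the
frame starting with 13e7 is already off at index 0.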


Re: [PATCH v4 00/11] ASoC: fsl_ssi: Clean up - coding style level

2017-12-21 Thread Nicolin Chen
On Thu, Dec 21, 2017 at 08:10:07AM -0800, Caleb Crome wrote:

> >>> the mainline tree, I would like to treat this mail as a separate
> >>>
> >>> bug report and fix it with a separate patch.

> >>>> warn:   11a0 11a1 1160 11a3 11a4 11a5 11a6 11a7
> >>>> warn: Valid frame after 1 invalid frames
> >>>> warn:   11c0 11c1 11c2 11c3 11c4 11c5 11c6 11c7
> >>>> warn: first invalid frame while expecting frame 0x00a0
> >>>> warn:   13e7 1400 1401 1402 1403 1404 1405 1404
> >>>> warn:   1407 1420 1421 1422 1423 1424 1425 1426
> >>>> warn:   1427 1440 1441 1442 1443 1444 1445 1484
> >>>> warn:   1447 1460 1461 1462 1463 1464 1465 1466
> >>>>
> >>>> Those last 4 lines are the channel slips -- the least significant
> >>>> nibble should be the channel number:  i.e. should go 0, 1, 2, 3, 4, 5,
> >>>> 6, 7.
> >>>>
> >>>> Ugh, so it's basically quite broken again -- before these patches.

> Okay, operator error on my part.  There was an old clock setting in my
> ssi3 dtsi file that (falsely) modified the ssi baud clock frequency.
> Nicolin's patch
> 
> ASoC: fsl_ssi: Caculate bit clock rate using slot number and width
> 
> now properly computes the master clock, and the old dtsi settings that
> were necessary to fake things into the right speed are now obsolete.
> 
> So... basically, everything is back to working properly.  It wasn't
> broken at all -- just my oversight on an ssi clock setting in the dtb.

Well, that's good news :) Thanks for your efforts over the past few days
to track down every corner case.

Happy holiday.
Nicolin


Re: [PATCH v4 00/11] ASoC: fsl_ssi: Clean up - coding style level

2017-12-21 Thread Caleb Crome
On Wed, Dec 20, 2017 at 3:40 AM, Arnaud Mouiche 
wrote:

>
>
> On 19/12/2017 01:25, Caleb Crome wrote:
>
>> On Mon, Dec 18, 2017 at 3:02 PM, Nicolin Chen 
>> wrote:
>>
>>> On Mon, Dec 18, 2017 at 02:19:08PM -0800, Caleb Crome wrote:
>>>
>>> Acked-by: Timur Tabi 
>
 --- To Mark ---
>>>
>>> Mark, can you still take these changes first? Since this failed
>>> test that Caleb reported here is already existing on the top of
>>> the mainline tree, I would like to treat this mail as a separate
>>> bug report and fix it with a separate patch.
>>>
>>> Besides, this series of changes don't change any function flow.
>>>
>>> Thank you
>>>
>>> Sorry!  I should have created a separate thread for this subject.  My
>> comments have *nothing* to do with this patch set, except they are
>> about the same source files.
>>
>> --- To Caleb ---
>>>
>>> I'm re-setting up my loopback test to try to verify these most recent
 changes.

>>> I really appreciate your verification and help.
>>>
>> Of course!  I have this wandboard permanently set up for this
>> verification test, so that I can easily repeat whenever I touch our
>> kernel.
>>
>> It's a dead-simple hardware mod just to connect TX to RX.
>>
>> warn:   11a0 11a1 1160 11a3 11a4 11a5 11a6 11a7
 warn: Valid frame after 1 invalid frames
 warn:   11c0 11c1 11c2 11c3 11c4 11c5 11c6 11c7
 warn: first invalid frame while expecting frame 0x00a0
 warn:   13e7 1400 1401 1402 1403 1404 1405 1404
 warn:   1407 1420 1421 1422 1423 1424 1425 1426
 warn:   1427 1440 1441 1442 1443 1444 1445 1484
 warn:   1447 1460 1461 1462 1463 1464 1465 1466

 Those last 4 lines are the channel slips -- the least significant
 nibble should be the channel number:  i.e. should go 0, 1, 2, 3, 4, 5,
 6, 7.

 Ugh, so it's basically quite broken again -- before these patches.

>>> I remember Arnaud reviewed one of my changes back to September.
>>> So I suppose the test should be fine at that time -- so a change
>>> being merged recently might have impacted the test result.
>>>
>>
>> It's certainly possible that I'm doing something wrong again -- it
>> wouldn't be the first time :-)
>>
>
> Hi All,
>
> Sorry but I will be busy until mid January, I could help testing and
> fixing broken multi channel after.
> Anyway, I don't see specific issues with Nicolin patches.
> We can take time to fix what was broken before this patch set... after.
>
> Arnaud
>
>
>
>> I guess I need to go backwards in time and see what rev re-broke it.
 I don't really have time to dig too deep on this again.

 I'd be happy to provide the hardware to anybody that can diagnose and
 debug this more quickly than I can.  I'm very inefficient at kernel
 drivers I think.   My day job is acoustical and electrical
 engineering.

 Here's what the hardware looks like for anybody that's interested.
 Just a single wire loopback on the wandboard header.

>>> I would definitely like to take the hardware to debug it as long
>>> as you are willing to provide me. Can you send me a private mail
>>> to discuss about it?
>>>
>> Absolutely.
>> -Caleb
>>
>>
>> Thanks
>>> Nicolin
>>>
>>
>
Okay, operator error on my part.  There was an old clock setting in my ssi3
dtsi file that (falsely) modified the ssi baud clock frequency.  Nicolin's
patch

ASoC: fsl_ssi: Caculate bit clock rate using slot number and width

now properly computes the master clock, and the old dtsi settings that were
necessary to fake things into the right speed are now obsolete.

So... basically, everything is back to working properly.  It wasn't broken
at all -- just my oversight on an ssi clock setting in the dtb.

-Caleb


ASoC: fsl_ssi: Bringing up the SSI port in multi-channel TDM mode

2017-12-21 Thread Caleb Crome
Hi,
   I just posted a little write up for helping people get started with
the Freescale SSI port in TDM mode.

https://medium.com/@caleb_22836/how-to-get-the-mx6-ssi-port-up-and-running-in-tdm-mode-dbce02a15e81

I'm just posting here in case anybody is searching the google for this
information and finds this post.

(Also, it assumes that v4.15 will be released by the time anybody
actually uses it...)

-Caleb


Re: [PATCH 1/1] powerpc/pseries: Use the system workqueue as fallback to hotplug workqueue

2017-12-21 Thread David Gibson
On Thu, Dec 21, 2017 at 01:44:48PM -0200, Jose Ricardo Ziviani wrote:
> The hotplug engine uses its own workqueue to handle IRQ requests. The
> problem is that this workqueue is initialized relatively late in the boot
> process.
> 
> Thus, once the kernel is ready to handle IRQ requests (after the system
> workqueue is initialized), there is a window in which any hotplug issued
> by the client results in a kernel panic. That window lasts until the
> hotplug workqueue is initialized.
> 
> It would be good to have the hotplug workqueue initialized as early as
> the system workqueue, but I don't think that is possible. So this patch
> uses the system workqueue as a fallback to handle such IRQs.
> 
> Signed-off-by: Jose Ricardo Ziviani 

I don't think this is the right approach.

It seems to me the bug is that the hotplug interrupt is registered in
init_ras_IRQ(), before the work queue is initialized in
pseries_dlpar_init().  We need to correct that ordering.

> ---
>  arch/powerpc/platforms/pseries/dlpar.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
> index 6e35780c5962..0474aa14b5f6 100644
> --- a/arch/powerpc/platforms/pseries/dlpar.c
> +++ b/arch/powerpc/platforms/pseries/dlpar.c
> @@ -399,7 +399,15 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
>   work->errlog = hp_errlog_copy;
>   work->hp_completion = hotplug_done;
>   work->rc = rc;
> - queue_work(pseries_hp_wq, (struct work_struct *)work);
> +
> + /* The hotplug workqueue may happen to be NULL at the moment
> +  * this code is executed, during the boot phase. So, in this
> +  * scenario, we can fallback to the system workqueue.
> +  */
> + if (unlikely(pseries_hp_wq == NULL))
> + schedule_work((struct work_struct *)work);
> + else
> + queue_work(pseries_hp_wq, (struct work_struct *)work);
>   } else {
>   *rc = -ENOMEM;
>   kfree(hp_errlog_copy);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-21 Thread Brice Goglin
On 20/12/2017 at 23:41, Ross Zwisler wrote:
> On Wed, Dec 20, 2017 at 02:29:56PM -0800, Dan Williams wrote:
>> On Wed, Dec 20, 2017 at 1:24 PM, Ross Zwisler
>>  wrote:
>>> On Wed, Dec 20, 2017 at 01:16:49PM -0800, Matthew Wilcox wrote:
>>>> On Wed, Dec 20, 2017 at 12:22:21PM -0800, Dave Hansen wrote:
>>>>> On 12/20/2017 10:19 AM, Matthew Wilcox wrote:
>>>>>> I don't know what the right interface is, but my laptop has a set of
>>>>>> /sys/devices/system/memory/memoryN/ directories.  Perhaps this is the
>>>>>> right place to expose write_bw (etc).
>>>>> Those directories are already too redundant and wasteful.  I think we'd
>>>>> really rather not add to them.  In addition, it's technically possible
>>>>> to have a memory section span NUMA nodes and have different performance
>>>>> properties, which make it impossible to represent there.
>>>>>
>>>>> In any case, ACPI PXM's (Proximity Domains) are guaranteed to have
>>>>> uniform performance properties in the HMAT, and we just so happen to
>>>>> always create one NUMA node per PXM.  So, NUMA nodes really are a good fit.
>>>> I think you're missing my larger point which is that I don't think this
>>>> should be exposed to userspace as an ACPI feature.  Because if you do,
>>>> then it'll also be exposed to userspace as an openfirmware feature.
>>>> And sooner or later a devicetree feature.  And then writing a portable
>>>> program becomes an exercise in suffering.
>>>>
>>>> So, what's the right place in sysfs that isn't tied to ACPI?  A new
>>>> directory or set of directories under /sys/devices/system/memory/ ?
>>> Oh, the current location isn't at all tied to acpi except that it happens to
>>> be named 'hmat'.  When it was all named 'hmem' it was just:
>>>
>>> /sys/devices/system/hmem
>>>
>>> Which has no ACPI-isms at all.  I'm happy to move it under
>>> /sys/devices/system/memory/hmat if that's helpful, but I think we still have
>>> the issue that the data represented therein is still pulled right from the
>>> HMAT, and I don't know how to abstract it into something more platform
>>> agnostic until I know what data is provided by those other platforms.
>>>
>>> For example, the HMAT provides latency information and bandwidth information
>>> for both reads and writes.  Will the devicetree/openfirmware/etc version have
>>> this same info, or will it be just different enough that it won't translate
>>> into whatever I choose to stick in sysfs?
>> For the initial implementation do we need to have a representation of
>> all the performance data? Given that
>> /sys/devices/system/node/nodeX/distance is the only generic
>> performance attribute published by the kernel today it is already the
>> case that applications that need to target specific memories need to
>> go parse information that is not provided by the kernel by default.
>> The question is can those specialized applications stay special and go
>> parse the platform specific data sources, like raw HMAT, directly, or
>> do we expect general purpose applications to make use of this data? I
>> think a firmware-id to numa-node translation facility
>> (/sys/devices/system/node/nodeX/fwid) is a simple start that we can
>> build on with more information as specific use cases arise.
> We don't represent all the performance data, we only represent the data for
> local initiator/target pairs.  I do think that this is useful to have in sysfs
> because it provides a way to easily answer the most commonly asked questions
> (or at least what I'm guessing will be the most commonly asked questions),
> i.e. "given a CPU, what are the speeds of the various types of memory attached
> to it", and "given a chunk of memory, how fast is it and to which CPU is it
> local"?  By providing this base level of information I'm hoping to prevent
> most applications from having to parse the HMAT directly.
>
> The question of whether or not to include this local performance information
> was one of the main questions of the initial RFC patch series, and I did get
> feedback (albeit off-list) that the local performance information was
> valuable to at least some users.  I did intentionally structure my (now very
> short) set so that the performance information was added as a separate patch,
> so we can get to the place you're talking about where we only provide firmware
> id <=> proximity domain mappings by just leaving off the last patch in the
> series.
>

Hello

I can confirm that HPC runtimes are going to use these patches (at least
all runtimes that use hwloc for topology discovery, but that's the vast
majority of HPC anyway).

We really didn't like KNL exposing a hacky SLIT table [1]. We had to
explicitly detect that specific crazy table to find out which NUMA nodes
were local to which cores, and to find out which NUMA nodes were
HBM/MCDRAM or DDR. And then we had to hide the SLIT values from the
application because the reported latencies didn't match reality. Quite
annoying.

With Ross' patches, we can easily get what

Re: [net] Revert "net: core: maybe return -EEXIST in __dev_alloc_name"

2017-12-21 Thread Michael Ellerman
Rasmus Villemoes  writes:
> On Tue, Dec 19 2017, Michael Ellerman  wrote:
>>> From: Johannes Berg 
>>> 
>>> This reverts commit d6f295e9def0; some userspace (in the case
>>
>> This revert seems to have broken networking on one of my powerpc
>> machines, according to git bisect.
>>
>> The symptom is DHCP fails and I don't get a link, I didn't dig any
>> further than that. I can if it's helpful.
>>
>> I think the problem is that 87c320e51519 ("net: core: dev_get_valid_name
>> is now the same as dev_alloc_name_ns") only makes sense while
>> d6f295e9def0 remains in the tree.
>
> I'm sorry about all of this, I really didn't think there would be such
> consequences of changing an errno return. Indeed, d6f29 was preparation
> for unifying the two functions that do the exact same thing (and how we
> ever got into that situation is somewhat unclear), except for
> their behaviour in the case the requested name already exists. So one of
> the two interfaces had to change its return value, and as I wrote, I
> thought EEXIST was the saner choice when an explicit name (no %d) had
> been requested.

No worries.

>> ie. before the entire series, dev_get_valid_name() would return EEXIST,
>> and that was retained when 87c320e51519 was merged, but now that
>> d6f295e9def0 has been reverted dev_get_valid_name() is returning ENFILE.
>>
>> I can get the network up again if I also revert 87c320e51519 ("net:
>> core: dev_get_valid_name is now the same as dev_alloc_name_ns"), or with
>> the gross patch below.
>
> I don't think changing -ENFILE to -EEXIST would be right either, since
> dev_get_valid_name() used to be able to return both (-EEXIST in the case
> where there's no %d, -ENFILE in the case where we end up calling
> dev_alloc_name_ns()). If anything, we could do the check for the old
> -EEXIST condition first, and then call dev_alloc_name_ns(). But I'm also
> fine with reverting.

Yeah I think a revert would be best, given it's nearly rc5.

My userspace is not exotic AFAIK, just debian something, so presumably
this will affect other people too.

cheers
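
To make the errno discussion above concrete, here is a small userspace model
of the ordering Rasmus describes: check the old -EEXIST condition first
(explicit name, no '%', already taken), and only then fall through to the
allocator, which reports -ENFILE on failure. This is a sketch under stated
assumptions, not the kernel implementation: the name table, MAX_UNITS,
name_in_use(), alloc_name() and get_valid_name() are all hypothetical
stand-ins for the real netdev code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_UNITS 4	/* tiny illustrative limit on "%d" unit numbers */

bool name_in_use(const char *const names[], size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(names[i], name) == 0)
			return true;
	return false;
}

/* Stand-in for the allocator: any failure is reported as -ENFILE. */
int alloc_name(const char *const names[], size_t n, const char *pattern,
	       char *out, size_t outlen)
{
	const char *p = strstr(pattern, "%d");

	if (p == NULL) {
		if (name_in_use(names, n, pattern))
			return -ENFILE;
		snprintf(out, outlen, "%s", pattern);
		return 0;
	}

	for (int unit = 0; unit < MAX_UNITS; unit++) {
		/* Expand the "%d" by hand: prefix, unit number, suffix. */
		snprintf(out, outlen, "%.*s%d%s",
			 (int)(p - pattern), pattern, unit, p + 2);
		if (!name_in_use(names, n, out))
			return unit;
	}
	return -ENFILE;
}

/* Explicit-name collision is checked first, so it yields -EEXIST. */
int get_valid_name(const char *const names[], size_t n, const char *name,
		   char *out, size_t outlen)
{
	assert(name != NULL && out != NULL && outlen > 0);

	if (strchr(name, '%') == NULL && name_in_use(names, n, name))
		return -EEXIST;
	return alloc_name(names, n, name, out, outlen);
}
```

In this model a request for an existing explicit name such as "eth0" fails
with -EEXIST, while an exhausted "eth%d" pattern still fails with -ENFILE, so
both historical return values are preserved.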


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-21 Thread Michael Ellerman
Laurent Vivier  writes:

> On 12/12/2017 13:02, Cédric Le Goater wrote:
>> When restoring a pending interrupt, we are setting the Q bit to force
>> a retrigger in xive_finish_unmask(). But we also need to force an EOI
>> in this case to reach the same initial state : P=1, Q=0.
>> 
>> This can be done by not setting 'old_p' for pending interrupts which
>> will inform xive_finish_unmask() that an EOI needs to be sent.
>> 
>> Suggested-by: Benjamin Herrenschmidt 
>> Signed-off-by: Cédric Le Goater 
>> ---
>> 
>>  Tested with a guest running iozone.
>> 
>>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> We really need this patch to fix VM migration on POWER9.
> When will it be merged?

Paul is away, so I'll merge it via the powerpc tree.

I'll mark it:

  Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
  Cc: sta...@vger.kernel.org # v4.12+

cheers


Re: powerpc/perf: Fix nest-imc cpuhotplug callback failure

2017-12-21 Thread Michael Ellerman
On Tue, 2017-12-05 at 05:30:38 UTC, Anju T Sudhakar wrote:
> Call trace observed during boot:
>
> Faulting instruction address: 0xc0248340
> cpu 0x0: Vector: 380 (Data Access Out of Range) at [c00ff66fb850]
> pc: c0248340: event_function_call+0x50/0x1f0
> lr: c024878c: perf_remove_from_context+0x3c/0x100
> sp: c00ff66fbad0
>    msr: 90009033
>    dar: 7d20e2a6f92d03c0
>   current = 0xc00ff6679200
>   paca= 0xcfd4   softe: 0  irq_happened: 0x01
> pid   = 14, comm = cpuhp/0
> Linux version 4.14.0-rc2-42789-ge8eae4b (rgrimm@) (gcc version 5.4.0
> 20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4)) #1 SMP Thu Nov 16 14:35:14 CST
> 2017
> enter ? for help
> [c00ff66fbb80] c024878c perf_remove_from_context+0x3c/0x100
> [c00ff66fbbc0] c024e84c perf_pmu_migrate_context+0x10c/0x380
> [c00ff66fbc60] c00ca050 ppc_nest_imc_cpu_offline+0x1b0/0x210
> [c00ff66fbcb0] c00d5d54 cpuhp_invoke_callback+0x194/0x620
> [c00ff66fbd20] c00d702c cpuhp_thread_fun+0x7c/0x1b0
> [c00ff66fbd60] c010ad90 smpboot_thread_fn+0x290/0x2a0
> [c00ff66fbdc0] c0104818 kthread+0x168/0x1b0
> [c00ff66fbe30] c000b5a0 ret_from_kernel_thread+0x5c/0xbc
>
> While registering the cpuhotplug callbacks for nest-imc, if we fail in the
> cpuhotplug online path for any random node in a multi node system (because
> the opal call to stop nest-imc counters fails for that node),
> ppc_nest_imc_cpu_offline() will get invoked for other nodes which successfully
> returned from cpuhotplug online path.
>
> This call trace is generated since in the ppc_nest_imc_cpu_offline()
> path we are trying to migrate the event context, when nest-imc counters are
> not even initialized.
>
> Patch to add a check to ensure that nest-imc is registered before migrating
> the event context.
>
> Note:
> Madhavan Srinivasan has recently sent a skiboot patch to have a check in the
> skiboot code to make sure that the microcode is initialized in all the chips,
> before enabling the nest units.
> https://patchwork.ozlabs.org/patch/844047/ (v2)
>
> Signed-off-by: Anju T Sudhakar   
> Reviewed-by: Madhavan Srinivasan 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/ad2b6e01024ef23bddc3ce0bcb115e

cheers
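
For readers following the archive, the check described in the patch above can be modelled in user-space C roughly as follows. This is only a sketch under stated assumptions: struct nest_pmu, its registered flag, migrate_context() and nest_cpu_offline() are hypothetical names invented for illustration. They are not the kernel's structures and this is not the code that was applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a nest PMU; not the kernel's own structure. */
struct nest_pmu {
	bool registered;   /* set once PMU registration has completed */
	int  migrations;   /* counts simulated context migrations */
};

/* Stand-in for the real context migration; it only records the call. */
static void migrate_context(struct nest_pmu *pmu)
{
	assert(pmu != NULL);
	pmu->migrations++;
}

/*
 * Simulated offline callback. It always returns 0, but it migrates the
 * event context only when the PMU has actually been registered.
 */
static int nest_cpu_offline(struct nest_pmu *pmu)
{
	if (pmu == NULL || !pmu->registered)
		return 0;          /* nothing has been set up, so nothing to migrate */

	migrate_context(pmu);
	return 0;
}
```

Calling nest_cpu_offline() on a PMU whose registered flag is still false leaves migrations at zero, which models the case where the offline callback runs for a node whose counters were never initialized.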


Re: powerpc/perf: Fix kfree memory allocated for nest pmus

2017-12-21 Thread Michael Ellerman
On Thu, 2017-12-07 at 17:23:27 UTC, Anju T Sudhakar wrote:
> imc_common_cpuhp_mem_free() is the common function for all IMC (In-memory
> Collection counters) domains to unregister cpuhotplug callback and free memory.
> Since kfree of memory allocated for nest-imc (per_nest_pmu_arr) is in the common
> code, all domains (core/nest/thread) can do the kfree in the failure case.
> 
> This could potentially create a call trace as shown below, where core(/thread/nest)
> imc pmu initialization fails and in the failure path imc_common_cpuhp_mem_free()
> frees the memory (per_nest_pmu_arr), which is allocated by successfully registered
> nest units.
> 
> 
> The call trace is generated in a scenario where core-imc initialization is
> made to fail and a cpuhotplug is performed in a p9 system.
> During cpuhotplug ppc_nest_imc_cpu_offline() tries to access per_nest_pmu_arr,
> which is already freed by core-imc.
> 
> [  136.563618] NIP [c0cb6a94] mutex_lock+0x34/0x90
> [  136.563653] LR [c0cb6a88] mutex_lock+0x28/0x90
> [  136.563687] Call Trace:
> [  136.563707] [c016b7a93b90] [c0cb6a88] mutex_lock+0x28/0x90 (unreliable)
> [  136.563762] [c016b7a93bc0] [c02bc720] perf_pmu_migrate_context+0x90/0x3a0
> [  136.563814] [c016b7a93c60] [c00f7a40] ppc_nest_imc_cpu_offline+0x190/0x1f0
> [  136.563867] [c016b7a93cb0] [c0108140] cpuhp_invoke_callback+0x160/0x820
> [  136.563918] [c016b7a93d30] [c010939c] cpuhp_thread_fun+0x1bc/0x270
> [  136.563970] [c016b7a93d60] [c013d2b0] smpboot_thread_fn+0x250/0x290
> [  136.564022] [c016b7a93dc0] [c0136f18] kthread+0x1a8/0x1b0
> [  136.564067] [c016b7a93e30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
> 
> To address this scenario do the kfree(per_nest_pmu_arr) only in case of
> nest-imc initialization failure, and when there are no other nest units
> registered.
> 
> 
> Fixes: 73ce9aec65b1 ("powerpc/perf: Fix IMC_MAX_PMU macro")
> Signed-off-by: Anju T Sudhakar 
> Reviewed-by: Madhavan Srinivasan 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/110df8bd3e418b3476cae80babe8ad

cheers
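
The ownership rule stated in the patch description above (free per_nest_pmu_arr only on a nest-imc failure, and only when no nest unit is still registered) can be modelled in user-space C as below. This is a sketch under assumptions: enum imc_domain, struct nest_state and init_failure_cleanup() are hypothetical names chosen for illustration, and free() stands in for kfree(). It is not the code that was applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical domain tags for the three IMC domains named in the thread. */
enum imc_domain { DOMAIN_NEST, DOMAIN_CORE, DOMAIN_THREAD };

/* Hypothetical shared state standing in for per_nest_pmu_arr and its users. */
struct nest_state {
	void **pmu_arr;    /* models per_nest_pmu_arr */
	int    nest_units; /* nest units that registered successfully */
};

/*
 * Failure-path cleanup. The shared array is released only when the failing
 * domain is nest and no registered nest unit still depends on it.
 * Returns true when the array was freed.
 */
static bool init_failure_cleanup(struct nest_state *st, enum imc_domain domain)
{
	assert(st != NULL);

	if (domain != DOMAIN_NEST || st->nest_units > 0)
		return false;      /* not ours to free, or still in use */

	free(st->pmu_arr);
	st->pmu_arr = NULL;    /* avoid leaving a dangling pointer behind */
	return true;
}
```

With this rule a core or thread initialization failure never touches the array, so a later hotplug callback for a registered nest unit cannot run into freed memory.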


Re: powerpc/perf: Dereference bhrb entries safely

2017-12-21 Thread Michael Ellerman
On Tue, 2017-12-12 at 12:29:15 UTC, Ravi Bangoria wrote:
> It may very well happen that branch instructions recorded by
> bhrb entries already get unmapped before they get processed by
> the kernel. Hence, trying to dereference such memory location
> will end up in a crash. For example:
> 
> Unable to handle kernel paging request for data at address 0xc00819c41764
> Faulting instruction address: 0xc0084a14
> NIP [c0084a14] branch_target+0x4/0x70
> LR [c00eb828] record_and_restart+0x568/0x5c0
> Call Trace:
> [c00eb3b4] record_and_restart+0xf4/0x5c0 (unreliable)
> [c00ec378] perf_event_interrupt+0x298/0x460
> [c0027964] performance_monitor_exception+0x54/0x70
> [c0009ba4] performance_monitor_common+0x114/0x120
> 
> Fix this by dereferencing them safely.
> 
> Suggested-by: Naveen N. Rao 
> Signed-off-by: Ravi Bangoria 
> Reviewed-by: Naveen N. Rao 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/f41d84dddc66b164ac16acf3f584c2

cheers
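
The idea of dereferencing safely, as described in the patch above, is to read through a helper that reports failure instead of faulting. The user-space sketch below models that with a single simulated region. struct region, safe_read() and read_branch_insn() are hypothetical names made up for illustration. The kernel has its own fault-tolerant read helpers for this job, which are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one mapped region; 'mapped' is cleared on unmap. */
struct region {
	uint64_t       base;   /* simulated start address */
	size_t         len;    /* size of the region in bytes */
	const uint8_t *data;   /* backing storage */
	int            mapped; /* non-zero while the region is still mapped */
};

/*
 * Copy 'size' bytes from simulated address 'addr' into 'dst'.
 * Returns 0 on success, or -1 when the range is not fully inside a
 * mapped region, in which case nothing is copied.
 */
static int safe_read(const struct region *r, uint64_t addr,
		     void *dst, size_t size)
{
	assert(r != NULL && dst != NULL);

	if (!r->mapped || addr < r->base || size > r->len ||
	    addr - r->base > r->len - size)
		return -1;

	memcpy(dst, r->data + (addr - r->base), size);
	return 0;
}

/* Return the instruction word at 'addr', or 0 when it cannot be read. */
static uint32_t read_branch_insn(const struct region *r, uint64_t addr)
{
	uint32_t insn;

	if (safe_read(r, addr, &insn, sizeof(insn)))
		return 0;

	return insn;
}
```

A caller that gets 0 back can skip the entry rather than crash, which mirrors the situation in the thread where the recorded branch address has already been unmapped.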


Re: [PATCH] powerpc/pseries: Increase memory block size to 1GB on radix

2017-12-21 Thread Michael Ellerman
Michael Ellerman  writes:

> When we're using the Radix MMU we map the kernel linear mapping with
> 1G pages. That means we must do memory hot remove in blocks of at
> least that size. Otherwise the linear mapping can end up not mapping
> all of memory because we've removed part of a 1G region but unmapped
> the entire 1G region from the linear mapping.
>
> Currently on pseries we consult the device tree to find out the
> "LMB" (Logical Memory Block) size. This is the unit of memory hotplug
> communicated to us by the hypervisor, but it does not take into
> account anything the kernel has done itself, such as use 1G pages for
> the linear mapping.

This patch failed to survive contact with reality, i.e. it doesn't work.

NAK.

cheers
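
The problem statement quoted above (removing a block smaller than the 1G linear-mapping page unmaps memory that is still present) comes down to alignment arithmetic. The sketch below illustrates only that arithmetic, not the NAKed patch or any fix. The 256M block size, the SZ_256M and SZ_1G macros, and both function names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_256M (UINT64_C(256) << 20)   /* assumed hotplug block size */
#define SZ_1G   (UINT64_C(1) << 30)     /* linear-mapping page size */

/* Start of the mapping page that contains 'addr'; size is a power of two. */
static uint64_t mapping_page_start(uint64_t addr, uint64_t page_size)
{
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
	return addr & ~(page_size - 1);
}

/*
 * True when unmapping every mapping page touched by [start, start + len)
 * would also unmap memory outside that range, that is, when the range does
 * not cover whole mapping pages.
 */
static bool remove_unmaps_neighbours(uint64_t start, uint64_t len,
				     uint64_t page_size)
{
	uint64_t first, last_end;

	assert(len > 0);
	first = mapping_page_start(start, page_size);
	last_end = mapping_page_start(start + len - 1, page_size) + page_size;

	return first != start || last_end != start + len;
}
```

For a 256M block at address 1G + 256M, remove_unmaps_neighbours() returns true with 1G pages, because the containing page starts at 1G and also covers three neighbouring blocks. A whole 1G-aligned, 1G-sized block returns false.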


Re: [PATCH] powerpc/pseries: Increase memory block size to 1GB on radix

2017-12-21 Thread Balbir Singh
On Fri, Dec 22, 2017 at 4:06 PM, Michael Ellerman  wrote:
> Michael Ellerman  writes:
>
>> When we're using the Radix MMU we map the kernel linear mapping with
>> 1G pages. That means we must do memory hot remove in blocks of at
>> least that size. Otherwise the linear mapping can end up not mapping
>> all of memory because we've removed part of a 1G region but unmapped
>> the entire 1G region from the linear mapping.
>>
>> Currently on pseries we consult the device tree to find out the
>> "LMB" (Logical Memory Block) size. This is the unit of memory hotplug
>> communicated to us by the hypervisor, but it does not take into
>> account anything the kernel has done itself, such as use 1G pages for
>> the linear mapping.
>
> This patch failed to survive contact with reality, i.e. it doesn't work.
>
> NAK.


I have patches to split the size of a region. I guess the right thing to do is
to split the size of the mapping during hotplug. I can look at doing that once I
am back. What broke for you during testing?

Balbir Singh.


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-21 Thread Paul Mackerras
On Fri, Dec 22, 2017 at 03:34:20PM +1100, Michael Ellerman wrote:
> Laurent Vivier  writes:
> 
> > On 12/12/2017 13:02, Cédric Le Goater wrote:
> >> When restoring a pending interrupt, we are setting the Q bit to force
> >> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> >> in this case to reach the same initial state : P=1, Q=0.
> >> 
> >> This can be done by not setting 'old_p' for pending interrupts which
> >> will inform xive_finish_unmask() that an EOI needs to be sent.
> >> 
> >> Suggested-by: Benjamin Herrenschmidt 
> >> Signed-off-by: Cédric Le Goater 
> >> ---
> >> 
> >>  Tested with a guest running iozone.
> >> 
> >>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > We really need this patch to fix VM migration on POWER9.
> > When will it be merged?
> 
> Paul is away, so I'll merge it via the powerpc tree.
> 
> I'll mark it:
> 
>   Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
>   Cc: sta...@vger.kernel.org # v4.12+

Thanks for doing that.

If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
handling under pHyp" with my acked-by, that would be fine too.  The
commit message needs a little work - the reason for using HPTE_R_M is
not just because it seems to work, but because current POWER
processors require M set on mappings for normal pages, and pHyp
enforces that.

Cheers,
Paul.


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-21 Thread Laurent Vivier
On 22/12/2017 08:54, Paul Mackerras wrote:
> On Fri, Dec 22, 2017 at 03:34:20PM +1100, Michael Ellerman wrote:
>> Laurent Vivier  writes:
>>
>>> On 12/12/2017 13:02, Cédric Le Goater wrote:
>>>> When restoring a pending interrupt, we are setting the Q bit to force
>>>> a retrigger in xive_finish_unmask(). But we also need to force an EOI
>>>> in this case to reach the same initial state : P=1, Q=0.
>>>>
>>>> This can be done by not setting 'old_p' for pending interrupts which
>>>> will inform xive_finish_unmask() that an EOI needs to be sent.
>>>>
>>>> Suggested-by: Benjamin Herrenschmidt 
>>>> Signed-off-by: Cédric Le Goater 
>>>> ---
>>>>
>>>>  Tested with a guest running iozone.
>>>>
>>>>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> We really need this patch to fix VM migration on POWER9.
>>> When will it be merged?
>>
>> Paul is away, so I'll merge it via the powerpc tree.
>>
>> I'll mark it:
>>
>>   Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
>>   Cc: sta...@vger.kernel.org # v4.12+
> 
> Thanks for doing that.
> 
> If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
> handling under pHyp" with my acked-by, that would be fine too.  The
> commit message needs a little work - the reason for using HPTE_R_M is
> not just because it seems to work, but because current POWER
> processors require M set on mappings for normal pages, and pHyp
> enforces that.

We also need:

KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()

Thanks,
Laurent