Re: [PATCH] powerpc: Add aacraid and nvme to powernv_defconfig

2017-12-22 Thread Benjamin Herrenschmidt
On Sat, 2017-12-23 at 11:40 +1100, Alexey Kardashevskiy wrote:
> On 20/12/17 13:14, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-12-20 at 12:59 +1100, Alexey Kardashevskiy wrote:
> > > On 20/12/17 12:51, Benjamin Herrenschmidt wrote:
> > > > These adapters can be found in a number of our systems, so let's
> > > > enable the corresponding drivers by default.
> > > > 
> > > > Signed-off-by: Benjamin Herrenschmidt 
> > > > ---
> > > > --- a/arch/powerpc/configs/powernv_defconfig	2017-12-19 18:37:24.803470591 -0600
> > > > +++ b/arch/powerpc/configs/powernv_defconfig	2017-12-19 19:47:57.931952417 -0600
> > > > @@ -97,6 +97,7 @@
> > > >  CONFIG_BLK_DEV_RAM=y
> > > >  CONFIG_BLK_DEV_RAM_SIZE=65536
> > > >  CONFIG_VIRTIO_BLK=m
> > > > +CONFIG_BLK_DEV_NVME=y
> > > >  CONFIG_IDE=y
> > > >  CONFIG_BLK_DEV_IDECD=y
> > > >  CONFIG_BLK_DEV_GENERIC=y
> > > > @@ -113,6 +114,7 @@
> > > >  CONFIG_SCSI_CXGB4_ISCSI=m
> > > >  CONFIG_SCSI_BNX2_ISCSI=m
> > > >  CONFIG_BE2ISCSI=m
> > > > +CONFIG_SCSI_AACRAID=y
> > > >  CONFIG_SCSI_MPT2SAS=m
> > > >  CONFIG_SCSI_SYM53C8XX_2=m
> > > >  CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=0
> > > > 
> > > 
> > > "y", not "m"?
> > 
> > Yes. I like my boot devices to be "y"
> 
> 
> Then rename it to powernv_benh_defconfig :) Everything else seems to cope
> well with "m" - just look above.

No. Most of the above aren't needed as boot devices on powernv systems.

If you look at IPR for example, it's Y, not M.

Cheers,
Ben.

> 
> 


Re: [PATCH] On ppc64le we HAVE_RELIABLE_STACKTRACE

2017-12-22 Thread Josh Poimboeuf
On Thu, Dec 21, 2017 at 11:10:46PM +1100, Michael Ellerman wrote:
> Josh Poimboeuf  writes:
> 
> > On Tue, Dec 19, 2017 at 12:28:33PM +0100, Torsten Duwe wrote:
> >> On Mon, Dec 18, 2017 at 12:56:22PM -0600, Josh Poimboeuf wrote:
> >> > On Mon, Dec 18, 2017 at 03:33:34PM +1000, Nicholas Piggin wrote:
> >> > > On Sun, 17 Dec 2017 20:58:54 -0600
> >> > > Josh Poimboeuf  wrote:
> >> > > 
> >> > > > On Fri, Dec 15, 2017 at 07:40:09PM +1000, Nicholas Piggin wrote:
> >> > > > > On Tue, 12 Dec 2017 08:05:01 -0600
> >> > > > > Josh Poimboeuf  wrote:
> >> > > > >   
> >> > > > > > What about leaf functions?  If a leaf function doesn't establish 
> >> > > > > > a stack
> >> > > > > > frame, and it has inline asm which contains a blr to another 
> >> > > > > > function,
> >> > > > > > this ABI is broken.  
> >> > > > 
> >> > > > Oops, I meant to say "bl" instead of "blr".
> >> 
> >> You need to save LR, one way or the other. If gcc thinks it's a leaf 
> >> function and
> >> does not do it, nor does your asm code, you'll return in an endless loop 
> >> => bug.
> >
> > Ah, so the function's return path would be corrupted, and an unreliable
> > stack trace would be the least of our problems.
> 
> That's mostly true.
> 
> It is possible to save LR somewhere other than the correct stack slot,
> in which case you can return correctly but still confuse the unwinder. A
> function can hide its caller that way.
> 
> It's stupid and we should never do it, but it's not impossible.
> 
> ...
> 
> > So with your proposal, I think I'm convinced that we don't need objtool
> > for ppc64le.  Does anyone disagree?
> 
> I don't disagree, but I'd be happier if we did have objtool support.
> 
> Just because it would give us a lot more certainty that we're doing the
> right thing everywhere, including in hand-coded asm and inline asm.
> 
> It's easy to write powerpc asm such that stack traces are reliable, but
> it is *possible* to break them.

In the unlikely case where some asm code had its own custom stack
format, I guess there are two things which could go wrong:

1) bad LR:

   If LR isn't a kernel text address, the unwinder can stop the stack
   trace, WARN(), and report an error.  Although if we were _extremely_
   unlucky and a random leftover text address just happened to be in the
   LR slot, then the real function would get skipped in the stack trace.
   But even then, it's probably only going to be an asm function getting
   skipped, and we don't patch asm functions anyway, so it shouldn't
   affect livepatch.

2) bad back chain pointer:

   I'm not sure if this is even a reasonable concern.  I doubt it.  But
   if it were to happen, presumably the unwinder would abort the stack
   trace after reading the bad value.  In this case I think the "end"
   check (#4 below) would be sufficient to catch it.

So even if there were some stupid ppc asm code out there with its own
stack magic, it still sounds to me like objtool wouldn't be needed.
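
To illustrate check (1), the LR test in a reliable unwinder might look
roughly like this (a sketch assuming the standard ppc64 frame layout
with the saved LR at STACK_FRAME_LR_SAVE; not actual kernel code):

	#include <linux/kernel.h>	/* __kernel_text_address() */
	#include <asm/ptrace.h>		/* STACK_FRAME_LR_SAVE */

	/* Sketch: reject a frame whose saved LR is not kernel text. */
	static int check_frame_lr(unsigned long sp)
	{
		unsigned long *stack = (unsigned long *)sp;
		unsigned long ip = stack[STACK_FRAME_LR_SAVE];	/* LR save slot */

		if (!__kernel_text_address(ip)) {
			WARN_ON_ONCE(1);	/* report, per (1) above */
			return -EINVAL;		/* treat the trace as unreliable */
		}
		return 0;
	}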

> > There are still a few more things that need to be looked at:
> >
> > 1) With function graph tracing enabled, is the unwinder smart enough to
> >get the original function return address, e.g. by calling
> >ftrace_graph_ret_addr()?
> 
> No I don't think so.
> 
> > 2) Similar question for kretprobes.
> >
> > 3) Any other issues with generated code (e.g., bpf, ftrace trampolines),
> >runtime patching (e.g., CPU feature alternatives), kprobes, paravirt,
> >etc, that might confuse the unwinder?
> 
> We'll have to look, I can't be sure off the top of my head.
> 
> > 4) As a sanity check, it *might* be a good idea for
> >save_stack_trace_tsk_reliable() to ensure that it always reaches the
> >end of the stack.  There are several ways to do that:
> >
> >- If the syscall entry stack frame is always the same size, then the
> >  "end" would simply mean that the stack pointer is at a certain
> >  offset from the end of the task stack page.  However this might not
> >  work for kthreads and idle tasks, unless their stacks also start at
> >  the same offset.  (On x86 we actually standardized the end of stack
> >  location for all tasks, both user and kernel.)
> 
> Yeah it differs between user and kernel.
> 
> >- If the unwinder can get to the syscall frame, it can presumably
> >  examine regs->msr to check the PR bit to ensure it got all the way
> >  to syscall entry.  But again this might only work for user tasks,
> >  depending on how kernel task stacks are set up.
> 
> That sounds like a good idea. We could possibly mark the last frame of
> kernel tasks somehow.
> 
> >- Or a different approach would be to do error checking along the
> >  way, and reporting an error for any unexpected conditions.
> >
> >However, given that backlink/LR corruption doesn't seem possible with
> >this architecture, maybe #4 would be overkill.  Personally I would
> >feel more 

Re: [PATCH 04/17] mm: pass the vmem_altmap to arch_add_memory and __add_pages

2017-12-22 Thread Dan Williams
On Fri, Dec 15, 2017 at 6:09 AM, Christoph Hellwig  wrote:
> We can just pass this on instead of having to do a radix tree lookup
> without proper locking 2 levels into the callchain.
>
> Signed-off-by: Christoph Hellwig 
[..]
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 8acdc35c2dfa..e26ade50ae18 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -772,12 +772,12 @@ static void update_end_of_memory_vars(u64 start, u64 size)
> }
>  }
>
> -int add_pages(int nid, unsigned long start_pfn,
> - unsigned long nr_pages, bool want_memblock)
> +int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
> +   struct vmem_altmap *altmap, bool want_memblock)
>  {
> int ret;
>
> -   ret = __add_pages(nid, start_pfn, nr_pages, want_memblock);
> +   ret = __add_pages(nid, start_pfn, nr_pages, NULL, want_memblock);
> WARN_ON_ONCE(ret);

Should be 'altmap' instead of NULL.
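
That is, the fixed call would presumably read:

	ret = __add_pages(nid, start_pfn, nr_pages, altmap, want_memblock);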


Re: [PATCH 04/17] mm: pass the vmem_altmap to arch_add_memory and __add_pages

2017-12-22 Thread Dan Williams
On Fri, Dec 15, 2017 at 6:09 AM, Christoph Hellwig  wrote:
> We can just pass this on instead of having to do a radix tree lookup
> without proper locking 2 levels into the callchain.
>
> Signed-off-by: Christoph Hellwig 
[..]
> diff --git a/kernel/memremap.c b/kernel/memremap.c
> index 403ab9cdb949..16456117a1b1 100644
> --- a/kernel/memremap.c
> +++ b/kernel/memremap.c
> @@ -427,7 +427,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
> goto err_pfn_remap;
>
> mem_hotplug_begin();
> -   error = arch_add_memory(nid, align_start, align_size, false);
> +   error = arch_add_memory(nid, align_start, align_size, altmap, false);
> if (!error)
> 
> move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
> align_start >> PAGE_SHIFT,

Subtle bug here. This altmap is the one that was passed in that we
copy into its permanent location in the pgmap, so it looks like this
patch needs to fold the following fix:

diff --git a/kernel/memremap.c b/kernel/memremap.c
index f277bf5b8c57..157a3756e1d5 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -382,6 +382,7 @@ void *devm_memremap_pages(struct device *dev, struct resource *res,
if (altmap) {
memcpy(&page_map->altmap, altmap, sizeof(*altmap));
pgmap->altmap = &page_map->altmap;
+   altmap = pgmap->altmap;
}
pgmap->ref = ref;
pgmap->res = &page_map->res;


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Rafael J. Wysocki
On Sat, Dec 23, 2017 at 12:57 AM, Dan Williams  wrote:
> On Fri, Dec 22, 2017 at 3:22 PM, Ross Zwisler
>  wrote:
>> On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
>>> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin  
>>> wrote:
> >>> > On 20/12/2017 at 23:41, Ross Zwisler wrote:
>>> [..]
>>> > Hello
>>> >
>>> > I can confirm that HPC runtimes are going to use these patches (at least
>>> > all runtimes that use hwloc for topology discovery, but that's the vast
>>> > majority of HPC anyway).
>>> >
>>> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>>> > explicitly detect that specific crazy table to find out which NUMA nodes
>>> > were local to which cores, and to find out which NUMA nodes were
>>> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values to the
>>> > application because the reported latencies didn't match reality. Quite
>>> > annoying.
>>> >
>>> > With Ross' patches, we can easily get what we need:
>>> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>>> > can only report a single local node per CPU (doesn't work for KNL and
>>> > upcoming architectures with HBM+DDR+...)
>>> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
>>> > And we can still look at SLIT under /sys/devices/system/node if really
>>> > needed.
>>> >
>>> > And of course having this in sysfs is much better than parsing ACPI
>>> > tables that are only accessible to root :)
>>>
>>> On this point, it's not clear to me that we should allow these sysfs
>>> entries to be world readable. Given /proc/iomem now hides physical
>>> address information from non-root we at least need to be careful not
>>> to undo that with new sysfs HMAT attributes.
>>
>> This enabling does not expose any physical addresses to userspace.  It only
>> provides performance numbers from the HMAT and associates them with existing
>> NUMA nodes.  Are you worried that exposing performance numbers to non-root
>> users via sysfs poses a security risk?
>
> It's an information disclosure that it's not clear we need to make to
> non-root processes.
>
> I'm more worried about userspace growing dependencies on the absolute
> numbers when those numbers can change from platform to platform.
> Differentiated memory on one platform may be the common memory pool on
> another.
>
> To me this has parallels with storage device hinting where
> specifications like T10 have a complex enumeration of all the
> performance hints that can be passed to the device, but the Linux
> enabling effort aims for a sanitized set of relative hints that make
> sense. It's more flexible if userspace specifies a relative intent
> rather than an absolute performance target. Putting all the HMAT
> information into sysfs gives userspace more information than it could
> possibly do anything reasonable with, at least outside of specialized
> apps that are hand-tuned for a given hardware platform.

That's a valid point IMO.

It is sort of tempting to expose everything to user space verbatim,
especially early in the enabling process when the kernel has not yet
found suitable ways to utilize the given information, but the very act
of exposing it may affect what can be done with it in the future.

User space interfaces need to stay around and be supported forever, at
least potentially, so adding every one of them is a serious
commitment.

Thanks,
Rafael


Re: [PATCH] powerpc: Add aacraid and nvme to powernv_defconfig

2017-12-22 Thread Alexey Kardashevskiy
On 20/12/17 13:14, Benjamin Herrenschmidt wrote:
> On Wed, 2017-12-20 at 12:59 +1100, Alexey Kardashevskiy wrote:
>> On 20/12/17 12:51, Benjamin Herrenschmidt wrote:
>>> These adapters can be found in a number of our systems, so let's
>>> enable the corresponding drivers by default.
>>>
>>> Signed-off-by: Benjamin Herrenschmidt 
>>> ---
>>> --- a/arch/powerpc/configs/powernv_defconfig	2017-12-19 18:37:24.803470591 -0600
>>> +++ b/arch/powerpc/configs/powernv_defconfig	2017-12-19 19:47:57.931952417 -0600
>>> @@ -97,6 +97,7 @@
>>>  CONFIG_BLK_DEV_RAM=y
>>>  CONFIG_BLK_DEV_RAM_SIZE=65536
>>>  CONFIG_VIRTIO_BLK=m
>>> +CONFIG_BLK_DEV_NVME=y
>>>  CONFIG_IDE=y
>>>  CONFIG_BLK_DEV_IDECD=y
>>>  CONFIG_BLK_DEV_GENERIC=y
>>> @@ -113,6 +114,7 @@
>>>  CONFIG_SCSI_CXGB4_ISCSI=m
>>>  CONFIG_SCSI_BNX2_ISCSI=m
>>>  CONFIG_BE2ISCSI=m
>>> +CONFIG_SCSI_AACRAID=y
>>>  CONFIG_SCSI_MPT2SAS=m
>>>  CONFIG_SCSI_SYM53C8XX_2=m
>>>  CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=0
>>>
>>
>> "y", not "m"?
> 
> Yes. I like my boot devices to be "y"


Then rename it to powernv_benh_defconfig :) Everything else seems to cope
well with "m" - just look above.


-- 
Alexey


[RFC] macio airport how standard pccard?

2017-12-22 Thread René Rebe
Hi all,

I have a nice 1.2 GHz G4 Cube on my desk, and if I could somehow get USB 2 into 
it, it would be way more useful as a “thin client” ;-)

I was looking at the macio airport kernel glue, but could not immediately 
figure out how much of it is standard pcmcia/cardbus.
Is there any chance I could hack up some kernel patch glue to get a 
pcmcia/pccard USB 2.0 NEC chip based card working in there - or is it totally 
hopeless?

Thanks for any tip and merry christmas!

René Rebe

-- 
 ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
 http://exactcode.com | http://exactscan.com | http://ocrkit.com | 
http://t2-project.org | http://rene.rebe.de



Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Dan Williams
On Fri, Dec 22, 2017 at 3:22 PM, Ross Zwisler
 wrote:
> On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
>> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin  
>> wrote:
> >> > On 20/12/2017 at 23:41, Ross Zwisler wrote:
>> [..]
>> > Hello
>> >
>> > I can confirm that HPC runtimes are going to use these patches (at least
>> > all runtimes that use hwloc for topology discovery, but that's the vast
>> > majority of HPC anyway).
>> >
>> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
>> > explicitly detect that specific crazy table to find out which NUMA nodes
>> > were local to which cores, and to find out which NUMA nodes were
>> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values to the
>> > application because the reported latencies didn't match reality. Quite
>> > annoying.
>> >
>> > With Ross' patches, we can easily get what we need:
>> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
>> > can only report a single local node per CPU (doesn't work for KNL and
>> > upcoming architectures with HBM+DDR+...)
>> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
>> > And we can still look at SLIT under /sys/devices/system/node if really
>> > needed.
>> >
>> > And of course having this in sysfs is much better than parsing ACPI
>> > tables that are only accessible to root :)
>>
>> On this point, it's not clear to me that we should allow these sysfs
>> entries to be world readable. Given /proc/iomem now hides physical
>> address information from non-root we at least need to be careful not
>> to undo that with new sysfs HMAT attributes.
>
> This enabling does not expose any physical addresses to userspace.  It only
> provides performance numbers from the HMAT and associates them with existing
> NUMA nodes.  Are you worried that exposing performance numbers to non-root
> users via sysfs poses a security risk?

It's an information disclosure that it's not clear we need to make to
non-root processes.

I'm more worried about userspace growing dependencies on the absolute
numbers when those numbers can change from platform to platform.
Differentiated memory on one platform may be the common memory pool on
another.

To me this has parallels with storage device hinting where
specifications like T10 have a complex enumeration of all the
performance hints that can be passed to the device, but the Linux
enabling effort aims for a sanitized set of relative hints that make
sense. It's more flexible if userspace specifies a relative intent
rather than an absolute performance target. Putting all the HMAT
information into sysfs gives userspace more information than it could
possibly do anything reasonable with, at least outside of specialized
apps that are hand-tuned for a given hardware platform.
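
To make the relative-hint idea concrete, here is a purely illustrative
sketch (by analogy with the T10 comparison above; not an actual or
proposed kernel interface):

	/* Illustrative only: relative memory-performance classes. */
	enum mem_perf_class {
		MEM_PERF_BASELINE,	/* comparable to base "System RAM" */
		MEM_PERF_FASTER,	/* e.g. HBM/MCDRAM */
		MEM_PERF_SLOWER,	/* e.g. persistent memory */
	};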


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Ross Zwisler
On Fri, Dec 22, 2017 at 02:53:42PM -0800, Dan Williams wrote:
> On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin  wrote:
> > On 20/12/2017 at 23:41, Ross Zwisler wrote:
> [..]
> > Hello
> >
> > I can confirm that HPC runtimes are going to use these patches (at least
> > all runtimes that use hwloc for topology discovery, but that's the vast
> > majority of HPC anyway).
> >
> > We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> > explicitly detect that specific crazy table to find out which NUMA nodes
> > were local to which cores, and to find out which NUMA nodes were
> > HBM/MCDRAM or DDR. And then we had to hide the SLIT values to the
> > application because the reported latencies didn't match reality. Quite
> > annoying.
> >
> > With Ross' patches, we can easily get what we need:
> > * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> > can only report a single local node per CPU (doesn't work for KNL and
> > upcoming architectures with HBM+DDR+...)
> > * which NUMA nodes are slow/fast (for both bandwidth and latency)
> > And we can still look at SLIT under /sys/devices/system/node if really
> > needed.
> >
> > And of course having this in sysfs is much better than parsing ACPI
> > tables that are only accessible to root :)
> 
> On this point, it's not clear to me that we should allow these sysfs
> entries to be world readable. Given /proc/iomem now hides physical
> address information from non-root we at least need to be careful not
> to undo that with new sysfs HMAT attributes.

This enabling does not expose any physical addresses to userspace.  It only
provides performance numbers from the HMAT and associates them with existing
NUMA nodes.  Are you worried that exposing performance numbers to non-root
users via sysfs poses a security risk?

> Once you need to be root for this info, is parsing binary HMAT vs sysfs a
> blocker for the HPC use case?
> 
> Perhaps we can enlist /proc/iomem or a similar enumeration interface
> to tell userspace the NUMA node and whether the kernel thinks it has
> better or worse performance characteristics relative to base
> system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
> publishing absolute numbers in sysfs, userspace will default to looking
> for specific magic numbers in sysfs vs asking the kernel for memory
> that has performance characteristics relative to base "System RAM". In
> other words the absolute performance information that the HMAT
> publishes is useful to the kernel, but it's not clear that userspace
> needs that vs a relative indicator for making NUMA node preference
> decisions.


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Dan Williams
On Thu, Dec 21, 2017 at 12:31 PM, Brice Goglin  wrote:
> On 20/12/2017 at 23:41, Ross Zwisler wrote:
[..]
> Hello
>
> I can confirm that HPC runtimes are going to use these patches (at least
> all runtimes that use hwloc for topology discovery, but that's the vast
> majority of HPC anyway).
>
> We really didn't like KNL exposing a hacky SLIT table [1]. We had to
> explicitly detect that specific crazy table to find out which NUMA nodes
> were local to which cores, and to find out which NUMA nodes were
> HBM/MCDRAM or DDR. And then we had to hide the SLIT values to the
> application because the reported latencies didn't match reality. Quite
> annoying.
>
> With Ross' patches, we can easily get what we need:
> * which NUMA nodes are local to which CPUs? /sys/devices/system/node/
> can only report a single local node per CPU (doesn't work for KNL and
> upcoming architectures with HBM+DDR+...)
> * which NUMA nodes are slow/fast (for both bandwidth and latency)
> And we can still look at SLIT under /sys/devices/system/node if really
> needed.
>
> And of course having this in sysfs is much better than parsing ACPI
> tables that are only accessible to root :)

On this point, it's not clear to me that we should allow these sysfs
entries to be world readable. Given /proc/iomem now hides physical
address information from non-root we at least need to be careful not
to undo that with new sysfs HMAT attributes. Once you need to be root
for this info, is parsing binary HMAT vs sysfs a blocker for the HPC
use case?

Perhaps we can enlist /proc/iomem or a similar enumeration interface
to tell userspace the NUMA node and whether the kernel thinks it has
better or worse performance characteristics relative to base
system-RAM, i.e. new IORES_DESC_* values. I'm worried that if we start
publishing absolute numbers in sysfs, userspace will default to looking
for specific magic numbers in sysfs vs asking the kernel for memory
that has performance characteristics relative to base "System RAM". In
other words the absolute performance information that the HMAT
publishes is useful to the kernel, but it's not clear that userspace
needs that vs a relative indicator for making NUMA node preference
decisions.


Re: [PATCH v3 0/3] create sysfs representation of ACPI HMAT

2017-12-22 Thread Ross Zwisler
On Thu, Dec 21, 2017 at 01:41:15AM +0000, Elliott, Robert (Persistent Memory) wrote:
> 
> 
> > -Original Message-
> > From: Linux-nvdimm [mailto:linux-nvdimm-boun...@lists.01.org] On Behalf Of
> > Ross Zwisler
> ...
> > 
> > On Wed, Dec 20, 2017 at 10:19:37AM -0800, Matthew Wilcox wrote:
> ...
> > > initiator is a CPU?  I'd have expected you to expose a memory controller
> > > abstraction rather than re-use storage terminology.
> > 
> > Yea, I agree that at first blush it seems weird.  It turns out that
> > looking at it in sort of a storage initiator/target way is beneficial,
> > though, because it allows us to cut down on the number of data values
> > we need to represent.
> > 
> > For example the SLIT, which doesn't differentiate between initiator and
> > target proximity domains (and thus nodes) always represents a system
> > with N proximity domains using a NxN distance table.  This makes sense
> > if every node contains both CPUs and memory.
> > 
> > With the introduction of the HMAT, though, we can have memory-only
> > target nodes and we can explicitly associate them with their local 
> > CPU.  This is necessary so that we can separate memory with different
> > performance characteristics (HBM vs normal memory vs persistent memory,
> > for example) that are all attached to the same CPU.
> > 
> > So, say we now have a system with 4 CPUs, and each of those CPUs has 3
> > different types of memory attached to it.  We now have 16 total proximity
> > domains, 4 CPU and 12 memory.
> 
> The CPU cores that make up a node can have performance restrictions of
> their own; for example, they might max out at 10 GB/s even though the
> memory controller supports 120 GB/s (meaning you need to use 12 cores
> on the node to fully exercise memory).  It'd be helpful to report this,
> so software can decide how many cores to use for bandwidth-intensive work.
> 
> > If we represent this with the SLIT we end up with a 16 X 16 distance table
> > (256 entries), most of which don't matter because they are memory-to-
> > memory distances which don't make sense.
> > 
> > In the HMAT, though, we separate out the initiators and the targets and
> > put them into separate lists.  (See 5.2.27.4 System Locality Latency and
> > Bandwidth Information Structure in ACPI 6.2 for details.)  So, this same
> > config in the HMAT only has 4*12=48 performance values of each type, all
> > of which convey meaningful information.
> > 
> > The HMAT indeed even uses the storage "initiator" and "target"
> > terminology. :)
> 
> Centralized DMA engines (e.g., as used by the "DMA based blk-mq pmem
> driver") have performance differences too.  A CPU might include
> CPU cores that reach 10 GB/s, DMA engines that reach 60 GB/s, and
> memory controllers that reach 120 GB/s.  I guess these would be
> represented as extra initiators on the node?

For both of your comments I think all of this comes down to how you want to
represent your platform in the HMAT.  The sysfs representation just shows you
what is in the HMAT.

Each initiator node is just a single NUMA node (think of it as a NUMA node
which has the characteristic that it can initiate memory requests), so I don't
think there is a way to have "extra initiators on the node".  I think what
you're talking about is separating the DMA engines and CPU cores into separate
NUMA nodes, both of which are initiators.  I think this is probably fine as it
conveys useful info.

I don't think the HMAT has a concept of increasing bandwidth for number of CPU
cores used - it just has a single bandwidth number (well, one for read and one
for write) per initiator/target pair.  I don't think we want to add this,
either - the HMAT is already very complex.
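
As a quick sanity check of those counts, a trivial standalone sketch
(assumes the 4-initiator/12-target example above):

	#include <stdio.h>

	int main(void)
	{
		int initiators = 4, targets = 12;
		int prox_domains = initiators + targets;	/* 16 */

		/* SLIT: square matrix over all proximity domains. */
		printf("SLIT entries: %d\n", prox_domains * prox_domains);	/* 256 */
		/* HMAT: initiator/target pairs, per metric. */
		printf("HMAT entries per metric: %d\n", initiators * targets);	/* 48 */
		return 0;
	}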


Re: [RFC] macio airport how standard pccard?

2017-12-22 Thread Benjamin Herrenschmidt
On Fri, 2017-12-22 at 16:18 +0100, René Rebe wrote:
> Hi all,
> 
> I have a nice 1.2 GHz G4 Cube on my desk, and if I could somehow get USB 2 
> into it, it would be way more useful as a “thin client” ;-)
> 
> I was looking at the macio airport kernel glue, but could not immediately 
> figure out how much standard pcmcia/cardbus that is.
> Is there any chance I could hack up some kernel patch glue to get a 
> pcmcia/pccard USB 2.0 NEC chip based card working in there - or is it totally 
> hopeless?
> 
> Thanks for any tip and merry christmas!

I'm not 100% sure. I think it's some kind of PCMCIA card but PCMCIA is
basically just some kind of ISA bus with control lines.

I honestly don't know much more about it.

Cheers,
Ben.
 


[RFC PATCH for 4.16 03/11] powerpc: membarrier: Skip memory barrier in switch_mm() (v7)

2017-12-22 Thread Mathieu Desnoyers
Allow PowerPC to skip the full memory barrier in switch_mm(), and
only issue the barrier when scheduling into a task belonging to a
process that has registered to use expedited private.

Threads targeting the same VM but which belong to different thread
groups are a tricky case. It has a few consequences:

It turns out that we cannot rely on get_nr_threads(p) to count the
number of threads using a VM. We can use
(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
instead to skip the synchronize_sched() for cases where the VM only has
a single user, and that user only has a single thread.

It also turns out that we cannot use for_each_thread() to set
thread flags in all threads using a VM, as it only iterates on the
thread group.

Therefore, test the membarrier state variable directly rather than
relying on thread flags. This means
membarrier_register_private_expedited() needs to set the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
private expedited membarrier commands to succeed.
membarrier_arch_switch_mm() now tests for the
MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Changes since v1:
- Use test_ti_thread_flag(next, ...) instead of test_thread_flag() in
  powerpc membarrier_arch_sched_in(), given that we want to specifically
  check the next thread state.
- Add missing ARCH_HAS_MEMBARRIER_HOOKS in Kconfig.
- Use task_thread_info() to pass thread_info from task to
  *_ti_thread_flag().

Changes since v2:
- Move membarrier_arch_sched_in() call to finish_task_switch().
- Check for NULL t->mm in membarrier_arch_fork().
- Use membarrier_sched_in() in generic code, which invokes the
  arch-specific membarrier_arch_sched_in(). This fixes allnoconfig
  build on PowerPC.
- Move asm/membarrier.h include under CONFIG_MEMBARRIER, fixing
  allnoconfig build on PowerPC.
- Build and runtime tested on PowerPC.

Changes since v3:
- Simply rely on copy_mm() to copy the membarrier_private_expedited mm
  field on fork.
- powerpc: test thread flag instead of reading
  membarrier_private_expedited in membarrier_arch_fork().
- powerpc: skip memory barrier in membarrier_arch_sched_in() if coming
  from kernel thread, since mmdrop() implies a full barrier.
- Set membarrier_private_expedited to 1 only after arch registration
  code, thus eliminating a race where concurrent commands could succeed
  when they should fail if issued concurrently with process
  registration.
- Use READ_ONCE() for membarrier_private_expedited field access in
  membarrier_private_expedited. Matches WRITE_ONCE() performed in
  process registration.

Changes since v4:
- Move powerpc hook from sched_in() to switch_mm(), based on feedback
  from Nicholas Piggin.

Changes since v5:
- Rebase on v4.14-rc6.
- Fold "Fix: membarrier: Handle CLONE_VM + !CLONE_THREAD correctly on
  powerpc (v2)"

Changes since v6:
- Rename MEMBARRIER_STATE_SWITCH_MM to MEMBARRIER_STATE_PRIVATE_EXPEDITED.

Signed-off-by: Mathieu Desnoyers 
CC: Peter Zijlstra 
CC: Paul E. McKenney 
CC: Boqun Feng 
CC: Andrew Hunter 
CC: Maged Michael 
CC: Avi Kivity 
CC: Benjamin Herrenschmidt 
CC: Paul Mackerras 
CC: Michael Ellerman 
CC: Dave Watson 
CC: Alan Stern 
CC: Will Deacon 
CC: Andy Lutomirski 
CC: Ingo Molnar 
CC: Alexander Viro 
CC: Nicholas Piggin 
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-a...@vger.kernel.org
---
 MAINTAINERS   |  1 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/membarrier.h | 26 ++
 arch/powerpc/mm/mmu_context.c |  7 +++
 include/linux/sched/mm.h  | 13 -
 init/Kconfig  |  3 +++
 kernel/sched/core.c   | 10 --
 kernel/sched/membarrier.c |  8 
 8 files changed, 58 insertions(+), 11 deletions(-)
 create mode 100644 arch/powerpc/include/asm/membarrier.h

diff --git a/MAINTAINERS b/MAINTAINERS
index a6e86e20761e..bd1666217c73 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8934,6 +8934,7 @@ L:linux-ker...@vger.kernel.org
 S: Supported
 F: kernel/sched/membarrier.c
 F: include/uapi/linux/membarrier.h
+F: arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L: linux...@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c51e6ce42e7a..a63adb082c0a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_GCOV_PROFILE_ALL
	select ARCH_HAS_PMEM_API	if PPC64
+   
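
(The diff is truncated at this point in the archive.) Based on the
changelog above, the powerpc switch_mm() hook is roughly the following;
a sketch reconstructed from the description, not the verbatim patch:

	static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
						     struct mm_struct *next,
						     struct task_struct *tsk)
	{
		/*
		 * Skip the full barrier unless scheduling into a task whose
		 * process registered for private expedited membarrier.
		 * Coming from a kernel thread, mmdrop() already implies a
		 * full barrier.
		 */
		if (likely(!(atomic_read(&next->membarrier_state) &
			     MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
			return;
		smp_mb();
	}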

Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-22 Thread Greg Kurz
On Fri, 22 Dec 2017 12:58:47 +0100
Greg Kurz  wrote:

> On Fri, 22 Dec 2017 22:22:08 +1100
> Michael Ellerman  wrote:
> 
> > Paul Mackerras  writes:
[...]
> > >
> > > Thanks for doing that.
> > >
> > > If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
> > > handling under pHyp" with my acked-by, that would be fine too.  The
> > > commit message needs a little work - the reason for using HPTE_R_M is
> > > not just because it seems to work, but because current POWER
> > > processors require M set on mappings for normal pages, and pHyp
> > > enforces that.
> > 
> > OK. I saw this too late, but I'll pick that one up next week. If someone
> > sends me an updated change log I will merge all of their patches for
> > ever.
> >   
> 
> Really ? Opportunity makes the thief, so here's my take :P
> 
> 8<-->8
> KVM: PPC: Book3S: fix XIVE migration of pending interrupts

Oops! Paste error... Title should be:

KVM: PPC: Book3S PR: Fix WIMG handling under pHyp

> 
> 96df226 "KVM: PPC: Book3S PR: Preserve storage control bits" added WIMG
> bit preservation but missed two special cases:
> - a magic page in kvmppc_mmu_book3s_64_xlate() and
> - guest real mode in kvmppc_handle_pagefault().
> 
> For these PTEs, WIMG was 0 and pHyp failed on them, causing the guest to
> stop at the very beginning at NIP=0x100 (due to bd9166ffe
> "KVM: PPC: Book3S PR: Exit KVM on failed mapping").
> 
> According to LoPAPR v1.1 14.5.4.1.2 H_ENTER:
> 
>  The hypervisor checks that the WIMG bits within the PTE are appropriate
>  for the physical page number else H_Parameter return. (For System Memory
>  pages WIMG=0010, or, 1110 if the SAO option is enabled, and for IO pages
>  WIMG=01**.)
> 
> This hence initializes WIMG to the non-zero value HPTE_R_M (0x10), as
> expected by pHyp.
> 
> Fixes: 96df226 "KVM: PPC: Book3S PR: Preserve storage control bits"
> Signed-off-by: Alexey Kardashevskiy 
> 8<-->8
> 
> Cheers,
> 
> --
> Greg
> 
> > cheers
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html  
> 



Re: [RFC PATCH 4/8] powerpc/64s: put io_sync bit into r14

2017-12-22 Thread Thiago Jung Bauermann

Hello Nicholas,

Just a small comment about syntax. I'm afraid I can't comment much about
the substance of the patch.

Nicholas Piggin  writes:
> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index b9ebc3085fb7..182bb9304c79 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -40,16 +40,9 @@
>  #endif
>
>  #if defined(CONFIG_PPC64) && defined(CONFIG_SMP)
> -#define CLEAR_IO_SYNC(get_paca()->io_sync = 0)
> -#define SYNC_IO  do {   \
> - if (unlikely(get_paca()->io_sync)) {\
> - mb();   \
> - get_paca()->io_sync = 0;\
> - }   \
> - } while (0)
> +#define CLEAR_IO_SYNCdo { r14_clear_bits(R14_BIT_IO_SYNC); } while(0)

Is there a reason for the do { } while(0) idiom here? If
r14_clear_bits() is an inline function, isn't it a single statement
already?
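
(For reference: the idiom matters when a macro expands to more than one
statement, so that it behaves as a single statement in an unbraced
if/else, e.g.:

	#define BAD_CLEAR()	f(); g()
	#define GOOD_CLEAR()	do { f(); g(); } while (0)

	if (cond)
		BAD_CLEAR();	/* g() escapes the if; a following else won't parse */

With a single inline-function call, as here, the wrapper is indeed
redundant and is presumably kept for consistency with multi-statement
variants.)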

-- 
Thiago Jung Bauermann
IBM Linux Technology Center



Re: [PATCH 1/1] powerpc/pseries: Use the system workqueue as fallback to hotplug workqueue

2017-12-22 Thread joserz
On Fri, Dec 22, 2017 at 11:54:10AM +1100, David Gibson wrote:
> On Thu, Dec 21, 2017 at 01:44:48PM -0200, Jose Ricardo Ziviani wrote:
> > The hotplug engine uses its own workqueue to handle IRQ requests; the
> > problem is that this workqueue is not initialized until fairly late in
> > the boot process.
> > 
> > Thus, there is a window, from the point when the kernel is ready to
> > handle IRQ requests (after the system workqueue is initialized) until
> > the hotplug workqueue is initialized, in which any hotplug issued by
> > the client results in a kernel panic.
> > 
> > It would be good to have the hotplug workqueue initialized as early as
> > the system workqueue, but I don't think that is possible. So, this patch
> > uses the system workqueue as a fallback to handle such IRQs.
> > 
> > Signed-off-by: Jose Ricardo Ziviani 
> 
> I don't think this is the right approach.
> 
> It seems to me the bug is that the hotplug interrupt is registered in
> init_ras_IRQ(), before the work queue is initialized in
> pseries_dlpar_init().  We need to correct that ordering.
> 

Oh, this makes much more sense. I'm going to make some tests and send a
v2 then.

Thanks for reviewing it!

> > ---
> >  arch/powerpc/platforms/pseries/dlpar.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
> > index 6e35780c5962..0474aa14b5f6 100644
> > --- a/arch/powerpc/platforms/pseries/dlpar.c
> > +++ b/arch/powerpc/platforms/pseries/dlpar.c
> > @@ -399,7 +399,15 @@ void queue_hotplug_event(struct pseries_hp_errorlog *hp_errlog,
> > work->errlog = hp_errlog_copy;
> > work->hp_completion = hotplug_done;
> > work->rc = rc;
> > -   queue_work(pseries_hp_wq, (struct work_struct *)work);
> > +
> > +   /* The hotplug workqueue may happen to be NULL at the moment
> > +* this code is executed, during the boot phase. So, in this
> > +* scenario, we can fallback to the system workqueue.
> > +*/
> > +   if (unlikely(pseries_hp_wq == NULL))
> > +   schedule_work((struct work_struct *)work);
> > +   else
> > +   queue_work(pseries_hp_wq, (struct work_struct *)work);
> > } else {
> > *rc = -ENOMEM;
> > kfree(hp_errlog_copy);
> 
> -- 
> David Gibson  | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au| minimalist, thank you.  NOT _the_ 
> _other_
>   | _way_ _around_!
> http://www.ozlabs.org/~dgibson
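
A sketch of the ordering fix suggested above (names from the thread; the
initcall level is illustrative, not the eventual patch):

	/* arch/powerpc/platforms/pseries/dlpar.c: create the hotplug
	 * workqueue at an initcall level that runs before init_ras_IRQ()
	 * registers the hotplug interrupt, so it can never be NULL when
	 * the IRQ fires. */
	static int __init pseries_dlpar_init(void)
	{
		pseries_hp_wq = alloc_workqueue("pseries hotplug workqueue",
						WQ_UNBOUND, 1);
		return pseries_hp_wq ? 0 : -ENOMEM;
	}
	machine_arch_initcall(pseries, pseries_dlpar_init);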




Re: [RFC PATCH 3/8] powerpc/64s: put the per-cpu data_offset in r14

2017-12-22 Thread Nicholas Piggin
On Wed, 20 Dec 2017 18:53:24 +0100
Gabriel Paubert  wrote:

> On Thu, Dec 21, 2017 at 12:52:01AM +1000, Nicholas Piggin wrote:
> > Shifted left by 16 bits, so the low 16 bits of r14 remain available.
> > This allows per-cpu pointers to be dereferenced with a single extra
> > shift whereas previously it was a load and add.
> > ---
> >  arch/powerpc/include/asm/paca.h   |  5 +
> >  arch/powerpc/include/asm/percpu.h |  2 +-
> >  arch/powerpc/kernel/entry_64.S|  5 -
> >  arch/powerpc/kernel/head_64.S |  5 +
> >  arch/powerpc/kernel/setup_64.c| 11 +--
> >  5 files changed, 16 insertions(+), 12 deletions(-)
> > 
> > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> > index cd6a9a010895..4dd4ac69e84f 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -35,6 +35,11 @@
> >  
> >  register struct paca_struct *local_paca asm("r13");
> >  #ifdef CONFIG_PPC_BOOK3S
> > +/*
> > + * The top 32-bits of r14 is used as the per-cpu offset, shifted by PAGE_SHIFT.
> 
> Top 32, really? It's 48 in later comments.

Yep, I used 32 to start with but it wasn't enough. Will fix.

Thanks,
Nick


[PATCH v5 2/2] cxl: read PHB indications from the device tree

2017-12-22 Thread Philippe Bergheaud
Configure the P9 XSL_DSNCTL register with PHB indications found
in the device tree, or else use legacy hard-coded values.

Signed-off-by: Philippe Bergheaud 
---
Changelog:

v2: New patch. Use the new device tree property "ibm,phb-indications".

v3: No change.

v4: No functional change.
Drop cosmetic fix in comment.

v5: get_phb_indications():
  - make static variables local to function.
  - return static variable values by arguments.

This patch depends on the following skiboot prerequisite:

https://patchwork.ozlabs.org/patch/849162/
---
 drivers/misc/cxl/cxl.h|  2 +-
 drivers/misc/cxl/cxllib.c |  2 +-
 drivers/misc/cxl/pci.c| 42 +-
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index e46a4062904a..5a6e9a921c2b 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1062,7 +1062,7 @@ int cxl_psl_purge(struct cxl_afu *afu);
 int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
  u32 *phb_index, u64 *capp_unit_id);
 int cxl_slot_is_switched(struct pci_dev *dev);
-int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg);
+int cxl_get_xsl9_dsnctl(struct pci_dev *dev, u64 capp_unit_id, u64 *reg);
 u64 cxl_calculate_sr(bool master, bool kernel, bool real_mode, bool p9);
 
 void cxl_native_irq_dump_regs_psl9(struct cxl_context *ctx);
diff --git a/drivers/misc/cxl/cxllib.c b/drivers/misc/cxl/cxllib.c
index dc9bc1807fdf..61f80d586279 100644
--- a/drivers/misc/cxl/cxllib.c
+++ b/drivers/misc/cxl/cxllib.c
@@ -99,7 +99,7 @@ int cxllib_get_xsl_config(struct pci_dev *dev, struct cxllib_xsl_config *cfg)
if (rc)
return rc;
 
-   rc = cxl_get_xsl9_dsnctl(capp_unit_id, &cfg->dsnctl);
+   rc = cxl_get_xsl9_dsnctl(dev, capp_unit_id, &cfg->dsnctl);
if (rc)
return rc;
if (cpu_has_feature(CPU_FTR_POWER9_DD1)) {
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 19969ee86d6f..1d38fff2139f 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -409,21 +409,53 @@ int cxl_calc_capp_routing(struct pci_dev *dev, u64 *chipid,
return 0;
 }
 
-int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg)
+static int get_phb_indications(struct pci_dev *dev, u64* capiind, u64 *asnind,
+  u64 *nbwind)
+{
+   static u64 nbw, asn, capi = 0;
+   struct device_node *np;
+   const __be32 *prop;
+
+   if (!capi) {
+   if (!(np = pnv_pci_get_phb_node(dev)))
+   return -1;
+
+   prop = of_get_property(np, "ibm,phb-indications", NULL);
+   if (!prop) {
+   nbw = 0x0300UL; /* legacy values */
+   asn = 0x0400UL;
+   capi = 0x0200UL;
+   } else {
+   nbw = (u64)be32_to_cpu(prop[2]);
+   asn = (u64)be32_to_cpu(prop[1]);
+   capi = (u64)be32_to_cpu(prop[0]);
+   }
+   of_node_put(np);
+   }
+   *capiind = capi;
+   *asnind = asn;
+   *nbwind = nbw;
+   return 0;
+}
+
+int cxl_get_xsl9_dsnctl(struct pci_dev *dev, u64 capp_unit_id, u64 *reg)
 {
u64 xsl_dsnctl;
+   u64 capiind, asnind, nbwind;
 
/*
 * CAPI Identifier bits [0:7]
 * bit 61:60 MSI bits --> 0
 * bit 59 TVT selector --> 0
 */
+   if (get_phb_indications(dev, &capiind, &asnind, &nbwind))
+   return -1;
 
/*
 * Tell XSL where to route data to.
 * The field chipid should match the PHB CAPI_CMPM register
 */
-   xsl_dsnctl = ((u64)0x2 << (63-7)); /* Bit 57 */
+   xsl_dsnctl = (capiind << (63-15)); /* Bit 57 */
xsl_dsnctl |= (capp_unit_id << (63-15));
 
/* nMMU_ID Defaults to: b’01001’*/
@@ -437,14 +469,14 @@ int cxl_get_xsl9_dsnctl(u64 capp_unit_id, u64 *reg)
 * nbwind=0x03, bits [57:58], must include capi indicator.
 * Not supported on P9 DD1.
 */
-   xsl_dsnctl |= ((u64)0x03 << (63-47));
+   xsl_dsnctl |= (nbwind << (63-55));
 
/*
 * Upper 16b address bits of ASB_Notify messages sent to the
 * system. Need to match the PHB’s ASN Compare/Mask Register.
 * Not supported on P9 DD1.
 */
-   xsl_dsnctl |= ((u64)0x04 << (63-55));
+   xsl_dsnctl |= asnind;
}
 
*reg = xsl_dsnctl;
@@ -464,7 +496,7 @@ static int init_implementation_adapter_regs_psl9(struct cxl *adapter,
if (rc)
return rc;
 
-   rc = cxl_get_xsl9_dsnctl(capp_unit_id, &xsl_dsnctl);
+   rc = cxl_get_xsl9_dsnctl(dev, capp_unit_id, &xsl_dsnctl);
if (rc)
return rc;
 
-- 
2.15.1



[PATCH v5 1/2] powerpc/powernv: Enable tunneled operations

2017-12-22 Thread Philippe Bergheaud
P9 supports PCI tunneled operations (atomics and as_notify). This
patch adds support for tunneled operations on powernv, with a new
API, to be called by device drivers:

pnv_pci_get_tunnel_ind()
   Tell driver the 16-bit ASN indication used by kernel.

pnv_pci_set_tunnel_bar()
   Tell kernel the Tunnel BAR Response address used by driver.
   This function uses two new OPAL calls, as the PBCQ Tunnel BAR
   register is configured by skiboot.

pnv_pci_get_as_notify_info()
   Return the ASN info of the thread to be woken up.

Signed-off-by: Philippe Bergheaud 
---
Changelog:

v2: Do not set the ASN indication. Get it from the device tree.

v3: Make pnv_pci_get_phb_node() available when compiling without cxl.

v4: Add pnv_pci_get_as_notify_info().
Rebase opal call numbers on skiboot 5.9.6.

v5: pnv_pci_get_tunnel_ind():
  - fix node reference count
pnv_pci_get_as_notify_info():
  - fail if task == NULL
  - read pid from mm->context.id
  - explain that thread.tidr requires CONFIG_PPC64

This patch depends on the following skiboot prerequisites:

https://patchwork.ozlabs.org/patch/849162/
https://patchwork.ozlabs.org/patch/849163/
---
 arch/powerpc/include/asm/opal-api.h|   4 +-
 arch/powerpc/include/asm/opal.h|   2 +
 arch/powerpc/include/asm/pnv-pci.h |   5 ++
 arch/powerpc/platforms/powernv/opal-wrappers.S |   2 +
 arch/powerpc/platforms/powernv/pci-cxl.c   |   8 --
 arch/powerpc/platforms/powernv/pci.c   | 106 +
 6 files changed, 118 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 233c7504b1f2..b901f4d9f009 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -201,7 +201,9 @@
 #define OPAL_SET_POWER_SHIFT_RATIO 155
 #define OPAL_SENSOR_GROUP_CLEAR156
 #define OPAL_PCI_SET_P2P   157
-#define OPAL_LAST  157
+#define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   159
+#define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   160
+#define OPAL_LAST  160
 
 /* Device tree flags */
 
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index 0c545f7fc77b..8705e422b893 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -198,6 +198,8 @@ int64_t opal_unregister_dump_region(uint32_t id);
 int64_t opal_slw_set_reg(uint64_t cpu_pir, uint64_t sprn, uint64_t val);
 int64_t opal_config_cpu_idle_state(uint64_t state, uint64_t flag);
int64_t opal_pci_set_phb_cxl_mode(uint64_t phb_id, uint64_t mode, uint64_t pe_number);
+int64_t opal_pci_get_pbcq_tunnel_bar(uint64_t phb_id, uint64_t *addr);
+int64_t opal_pci_set_pbcq_tunnel_bar(uint64_t phb_id, uint64_t addr);
 int64_t opal_ipmi_send(uint64_t interface, struct opal_ipmi_msg *msg,
uint64_t msg_len);
 int64_t opal_ipmi_recv(uint64_t interface, struct opal_ipmi_msg *msg,
diff --git a/arch/powerpc/include/asm/pnv-pci.h b/arch/powerpc/include/asm/pnv-pci.h
index 3e5cf251ad9a..c69de3276b5e 100644
--- a/arch/powerpc/include/asm/pnv-pci.h
+++ b/arch/powerpc/include/asm/pnv-pci.h
@@ -29,6 +29,11 @@ extern int pnv_pci_set_power_state(uint64_t id, uint8_t state,
 extern int pnv_pci_set_p2p(struct pci_dev *initiator, struct pci_dev *target,
   u64 desc);
 
+extern int pnv_pci_get_tunnel_ind(struct pci_dev *dev, uint64_t *ind);
+extern int pnv_pci_set_tunnel_bar(struct pci_dev *dev, uint64_t addr,
+ int enable);
+extern int pnv_pci_get_as_notify_info(struct task_struct *task, u32 *lpid,
+ u32 *pid, u32 *tid);
 int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode);
 int pnv_cxl_ioda_msi_setup(struct pci_dev *dev, unsigned int hwirq,
   unsigned int virq);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index 6f4b00a2ac46..5da790fb7fef 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -320,3 +320,5 @@ OPAL_CALL(opal_set_powercap,		OPAL_SET_POWERCAP);
 OPAL_CALL(opal_get_power_shift_ratio,  OPAL_GET_POWER_SHIFT_RATIO);
 OPAL_CALL(opal_set_power_shift_ratio,  OPAL_SET_POWER_SHIFT_RATIO);
 OPAL_CALL(opal_sensor_group_clear, OPAL_SENSOR_GROUP_CLEAR);
+OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,	OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
+OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,	OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c b/arch/powerpc/platforms/powernv/pci-cxl.c
index 94498a04558b..cee003de63af 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -16,14 +16,6 @@
 
 #include "pci.h"
 

[PATCH] SB600 for the Nemo board has non-zero devices on non-root bus

2017-12-22 Thread Christian Zigotzky

Hi Michael,

Thanks a lot for your reply! :-)

I have found two interesting lines in the device tree:

compatible   "pasemi,nemo"
model    "pasemi,nemo"

What do you think?

Please find attached the output of `lsprop /proc/device-tree`.

Thanks,
Christian


On 22.12.2017 12:19, Michael Ellerman wrote:
>
> I suspect Darren was referring to all of sb600_set_flag().
>
> What we'd really like is to be able to do something like:
>
> void __init pas_pci_init(void)
> {
>     ...
>
>     if (of_find_compatible_node(NULL, NULL, "nemo-something"))
>         pci_set_flag(PCI_SCAN_ALL_PCIE_DEVS).
>
>
> But I don't know if there's anything in the NEMO device tree that we can
> use to uniquely identify those machines? ie. the "nemo-something" string.
>
> Can you attach the output of `lsprop /proc/device-tree` ?
>
> cheers
>


compatible   "pasemi,nemo"
 "pasemi,pa6t-1682m"
 "PA6T-1682M"
 "pasemi,pwrficient"
 "pasemi"
device_type  "bootrom"
model"pasemi,nemo"
#interrupt-cells 0002
#address-cells   0002
#size-cells  0002
linux,phandle7fdff018 (2145382424)
platform-open-pic  fc00  00041000
name ""

/proc/device-tree/sdc@fc00:
compatible   "1682m-sdc"
 "pasemi,pwrficient-sdc"
 "pasemi,sdc"
device_type  "sdc"
#address-cells   0001
#size-cells  0001
reg   fc00  0080
linux,phandle7fe2f458 (2145580120)
name "sdc"

/proc/device-tree/sdc@fc00/rng@fc105000:
compatible   "1682m-rng"
 "pasemi,pwrficient-rng"
 "pasemi,rng"
device_type  "rng"
reg  fc105000 1000
linux,phandle7fe2fdd0 (2145582544)
name "rng"

/proc/device-tree/sdc@fc00/mdio@0:
compatible   "gpio-mdio"
mdc-pin  0005
#address-cells   0001
#size-cells  
reg   
linux,phandle7fe3d5a0 (2145637792)
mdio-pin 0006
name "mdio"

/proc/device-tree/sdc@fc00/mdio@0/ethernet-phy@0:
interrupt-parent 7fe2f6e8 (2145580776)
interrupts   0007 0001
reg  
linux,phandle7fe3d860 (2145638496)
name "ethernet-phy"

/proc/device-tree/sdc@fc00/openpic@fc00:
compatible   "pasemi,pwrficient-openpic"
 "chrp,open-pic"
device_type  "open-pic"
msi-available-ranges 0200 0200
#interrupt-cells 0002
#address-cells   
reg  fc00 0010
linux,phandle7fe2f6e8 (2145580776)
name "openpic"
interrupt-controller

/proc/device-tree/sdc@fc00/gizmo@fc104000:
compatible   "1682m-gizmo"
 "pasemi,pwrficient-gizmo"
 "pasemi,gizmo"
device_type  "gizmo"
reg  fc104000 1000
linux,phandle7fe2fbf0 (2145582064)
name "gizmo"

/proc/device-tree/sdc@fc00/gpio@fc103000:
compatible   "1682m-gpio"
 "pasemi,pwrficient-gpio"
 "pasemi,gpio"
device_type  "gpio"
reg  fc103000 1000
linux,phandle7fe2fa18 (2145581592)
name "gpio"

/proc/device-tree/options:
MENU_2_LABEL "Debian Sid/experimental Kernel 4.9"
MENU_4_COMMAND   "set pmu -astate=A4 ; ramdisk -z -addr=0x2400 -fatfs 
cf0:slitaz25.gz ; boot -elf -noints -fatfs cf0:vmlinux-3.13.14"
ETH0_HWADDR  "00:50:C2:20:DA:9E"
CFE_MEMORYSIZE   "8192"
MENU_5_LABEL "Fedora 17 Kernel 3.13.9"
MENU_8_LABEL "ubuntu MATE 16.04.2 LTS Kernel 4.9"
MENU_1_COMMAND   "setenv amigaboot_quiet Y ;boot -fs=iso atapi0.1:amigaboot.of"
MENU_8_COMMAND   "set pmu -astate=A4 ; setenv bootargs "root=/dev/sdb1 quiet ro 
splash" ; boot -elf -noints -fatfs cf0:vmlinux-4.9"
bootargs "root=/dev/sda4"
STARTUP  "speed;menu"
MENU_DEFAULT "0"
MENU_0_LABEL "AmigaOS"
MENU_5_COMMAND   73657420 706d7520 2d617374 6174653d
 4134203b 20736574 656e7620 626f6f74
 61726773 20227264 2e6d643d 30207264
 2e6c766d 3d302072 642e646d 3d302053
 5953464f 4e543d54 72756520 4b455954
 41424c45 3d646520 72642e6c 756b733d
 3020726f 6f743d2f 6465762f 73646233
 204c414e 473d6465 5f44452e 5554462d
 [191 bytes total]
MENU_3_LABEL "ubuntu MATE 17.04 Kernel 4.9"
MENU_6_LABEL "Fedora 25 PPC64 Kernel 4.9"
MENU_2_COMMAND   "set pmu -astate=A4 ; setenv bootargs "root=/dev/sda4" ; boot 
-elf -noints -fatfs cf0:vmlinux-4.9"
MENU_9_LABEL "openSUSE Tumbleweed Kernel 4.14"
speed"set pmu -astate=A4"
MENU_9_COMMAND   "set pmu -astate=A4 ; setenv bootargs "root=/dev/sdb6 
splash=silent" ; boot -elf -noints -fatfs cf0:vmlinux-4.14"
BOOT_CONSOLE "pcconsole0"
CFE_VERSION  "PAS-2.0.30"
little-endian?   
MENU_6_COMMAND   "set pmu -astate=A4 ; setenv bootargs 
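
Given the top-level compatible and model entries in the dump above, the
check Michael sketched could presumably key on "pasemi,nemo", along the
lines of:

	/* Sketch only: relies on "pasemi,nemo" being unique to these
	 * boards, which is exactly the open question above. */
	if (of_machine_is_compatible("pasemi,nemo"))
		pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS);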

Re: [PATCH] powerpc/mm: Simplify _PAGE_RO handling in page table dump

2017-12-22 Thread Christophe LEROY

Ping ?

On 09/05/2017 at 16:16, Christophe Leroy wrote:

Commit fd893fe56a130 ("powerpc/mm: Fix missing page attributes in
page table dump") added support for the _PAGE_RO attribute.

This patch simplifies that handling.

Signed-off-by: Christophe Leroy 
---
  arch/powerpc/mm/dump_linuxpagetables.c | 7 +--
  1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/arch/powerpc/mm/dump_linuxpagetables.c b/arch/powerpc/mm/dump_linuxpagetables.c
index d659345a98d6..eeef51107cff 100644
--- a/arch/powerpc/mm/dump_linuxpagetables.c
+++ b/arch/powerpc/mm/dump_linuxpagetables.c
@@ -121,13 +121,8 @@ static const struct flag_info flag_array[] = {
.set= "user",
.clear  = "",
}, {
-#if _PAGE_RO == 0
-   .mask   = _PAGE_RW,
+   .mask   = _PAGE_RW | _PAGE_RO,
.val= _PAGE_RW,
-#else
-   .mask   = _PAGE_RO,
-   .val= 0,
-#endif
.set= "rw",
.clear  = "ro",
}, {
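
For reference, the single combined entry covers both configurations
because exactly one of _PAGE_RO/_PAGE_RW is non-zero on any given
platform:

	/* _PAGE_RO == 0: mask = _PAGE_RW, val = _PAGE_RW => "rw" iff RW set   */
	/* _PAGE_RW == 0: mask = _PAGE_RO, val = 0        => "rw" iff RO clear */
	/* Either way, (pte & mask) == val prints "rw", otherwise "ro".        */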



Re: [PATCH v5] powerpc/mm: Only read faulting instruction when necessary in do_page_fault()

2017-12-22 Thread Christophe LEROY

Hi Michael,

Did you have a chance to have a look ?

Christophe

On 08/08/2017 at 09:08, Christophe Leroy wrote:

Commit a7a9dcd882a67 ("powerpc: Avoid taking a data miss on every
userspace instruction miss") has shown that limiting the read of
faulting instruction to likely cases improves performance.

This patch goes further in this direction by limiting the read
of the faulting instruction to the only cases where it is definitely
needed.

On an MPC885, with the same benchmark app as in the commit referred
above, we see a reduction of 4000 dTLB misses (approx 3%):

Before the patch:
  Performance counter stats for './fault 500' (10 runs):

  720495838  cpu-cycles ( +-  0.04% )
 141769  dTLB-load-misses   ( +-  0.02% )
  52722  iTLB-load-misses   ( +-  0.01% )
  19611  faults ( +-  0.02% )

5.750535176 seconds time elapsed( +-  0.16% )

With the patch:
  Performance counter stats for './fault 500' (10 runs):

  717669123  cpu-cycles ( +-  0.02% )
 137344  dTLB-load-misses   ( +-  0.03% )
  52731  iTLB-load-misses   ( +-  0.01% )
  19614  faults ( +-  0.03% )

5.728423115 seconds time elapsed( +-  0.14% )

The proper work of the huge stack expansion was tested with the
following app:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
char buf[1024 * 1025];

sprintf(buf, "Hello world !\n");
printf(buf);

exit(0);
}

Signed-off-by: Christophe Leroy 
---
  I'm wondering if it is really worth it to do something so complex. Is
  there really a chance that the get_user() faults? It would mean that an
  instruction that has just been executed has in the meantime been swapped
  out. Is that really a possibility? I'd expect not, which would mean that
  we could limit it to __get_user_inatomic() and then not implement this
  complex unlocking and retry stuff.

  v5: Reworked to fit after Benh do_fault improvement and rebased on top of 
powerpc/merge (65152902e43fef)

  v4: Rebased on top of powerpc/next (f718d426d7e42e) and doing access_ok() 
verification before __get_user_xxx()

  v3: Do a first try with pagefault disabled before releasing the semaphore

  v2: Changes 'if (cond1) if (cond2)' by 'if (cond1 && cond2)'

  arch/powerpc/mm/fault.c | 90 +++--
  1 file changed, 65 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index f88fac3d281b..7a218f69f956 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -68,26 +68,58 @@ static inline bool notify_page_fault(struct pt_regs *regs)
  /*
   * Check whether the instruction at regs->nip is a store using
   * an update addressing form which will update r1.
+ * If no, returns STACK_EXPANSION_BAD
+ * If yes, returns STACK_EXPANSION_GOOD
+ * In addition, the result is ored with STACK_EXPANSION_UNLOCKED if the
+ * semaphore has been released
   */
-static bool store_updates_sp(struct pt_regs *regs)
+
+#define STACK_EXPANSION_BAD0
+#define STACK_EXPANSION_GOOD   1
+#define STACK_EXPANSION_LOCKED 0
+#define STACK_EXPANSION_UNLOCKED   2
+
+int store_updates_sp(struct pt_regs *regs)
  {
unsigned int inst;
+   unsigned int __user *nip = (unsigned int __user *)regs->nip;
+   int ret;
+   int sema = STACK_EXPANSION_LOCKED;
+
+   /*
+* We want to do this outside mmap_sem, because reading code around nip
+* can result in fault, which will cause a deadlock when called with
+* mmap_sem held. However, we do a first try with pagefault disabled as
+* a fault here is very unlikely.
+*/
+   if (!access_ok(VERIFY_READ, nip, sizeof(inst)))
+   return STACK_EXPANSION_BAD | STACK_EXPANSION_LOCKED;
+
+   pagefault_disable();
+   ret = __get_user_inatomic(inst, nip);
+   pagefault_enable();
+   if (ret) {
+   up_read(&current->mm->mmap_sem);
+   sema = STACK_EXPANSION_UNLOCKED;
+   if (__get_user(inst, nip))
+   return STACK_EXPANSION_BAD | STACK_EXPANSION_UNLOCKED;
+   }
  
-	if (get_user(inst, (unsigned int __user *)regs->nip))

-   return false;
/* check for 1 in the rA field */
if (((inst >> 16) & 0x1f) != 1)
-   return false;
+   return STACK_EXPANSION_BAD | sema;
+
/* check major opcode */
switch (inst >> 26) {
+	case 62:	/* std or stdu */
+		if ((inst & 3) == 0)
+			break;
 	case 37:	/* stwu */
 	case 39:	/* stbu */
 	case 45:	/* sthu */
 	case 53:	/* stfsu */
 	case 55:	/* stfdu */
-		return true;
-	case 62:
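
A hedged sketch of how a caller might consume the combined status bits
(the caller is not shown above, so everything other than the
STACK_EXPANSION_* flags is an assumption):

	int rc = store_updates_sp(regs);

	/* The GOOD bit says whether this looks like a legitimate
	 * stack-growing store... */
	if (!(rc & STACK_EXPANSION_GOOD))
		goto bad_area;
	/* ...and the UNLOCKED bit says whether mmap_sem was dropped,
	 * in which case the VMA lookup must be redone. */
	if (rc & STACK_EXPANSION_UNLOCKED)
		goto retry;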

Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-22 Thread Greg Kurz
On Fri, 22 Dec 2017 22:22:08 +1100
Michael Ellerman  wrote:

> Paul Mackerras  writes:
> 
> > On Fri, Dec 22, 2017 at 03:34:20PM +1100, Michael Ellerman wrote:  
> >> Laurent Vivier  writes:
> >>   
> >> > On 12/12/2017 13:02, Cédric Le Goater wrote:  
> >> >> When restoring a pending interrupt, we are setting the Q bit to force
> >> >> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> >> >> in this case to reach the same initial state : P=1, Q=0.
> >> >> 
> >> >> This can be done by not setting 'old_p' for pending interrupts which
> >> >> will inform xive_finish_unmask() that an EOI needs to be sent.
> >> >> 
> >> >> Suggested-by: Benjamin Herrenschmidt 
> >> >> Signed-off-by: Cédric Le Goater 
> >> >> ---
> >> >> 
> >> >>  Tested with a guest running iozone.
> >> >> 
> >> >>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
> >> >>  1 file changed, 2 insertions(+), 2 deletions(-)  
> >> >
> >> > We really need this patch to fix VM migration on POWER9.
> >> > When will it be merged?  
> >> 
> >> Paul is away, so I'll merge it via the powerpc tree.
> >> 
> >> I'll mark it:
> >> 
> >>   Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE 
> >> interrupt controller")
> >>   Cc: sta...@vger.kernel.org # v4.12+  
> >
> > Thanks for doing that.
> >
> > If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
> > handling under pHyp" with my acked-by, that would be fine too.  The
> > commit message needs a little work - the reason for using HPTE_R_M is
> > not just because it seems to work, but because current POWER
> > processors require M set on mappings for normal pages, and pHyp
> > enforces that.  
> 
> OK. I saw this too late, but I'll pick that one up next week. If someone
> sends me an updated change log I will merge all of their patches for
> ever.
> 

Really ? Opportunity makes the thief, so here's my take :P

8<-->8
KVM: PPC: Book3S PR: Fix WIMG handling under pHyp

96df226 "KVM: PPC: Book3S PR: Preserve storage control bits" added WIMG
bits preserving but it missed 2 special cases:
- a magic page in kvmppc_mmu_book3s_64_xlate() and
- guest real mode in kvmppc_handle_pagefault().

For these PTEs, WIMG was 0 and pHyp failed on them, causing the guest to
stop at the very beginning at NIP=0x100 (due to bd9166ffe
"KVM: PPC: Book3S PR: Exit KVM on failed mapping").

According to LoPAPR v1.1 14.5.4.1.2 H_ENTER:

 The hypervisor checks that the WIMG bits within the PTE are appropriate
 for the physical page number else H_Parameter return. (For System Memory
 pages WIMG=0010, or, 1110 if the SAO option is enabled, and for IO pages
 WIMG=01**.)

Hence, initialize WIMG to the non-zero value HPTE_R_M (0x10), as expected
by pHyp.
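
For illustration, the change amounts to something like the following in the
two paths named above (a sketch; the field name is taken from the Book3S PR
MMU code, not quoted from Alexey's patch):

	/* The guest PTE carries no WIMG here, so default to M=1
	 * (coherent), which pHyp's H_ENTER accepts for system memory. */
	gpte->wimg = HPTE_R_M;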

Fixes: 96df226 ("KVM: PPC: Book3S PR: Preserve storage control bits")
Signed-off-by: Alexey Kardashevskiy 
8<-->8

Cheers,

--
Greg

> cheers
> --
> To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



Re: WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4

2017-12-22 Thread Christophe LEROY



On 22/12/2017 at 10:32, Christophe LEROY wrote:



On 20/12/2017 at 13:17, Christophe LEROY wrote:
Trying to malloc() with libhugetlbfs, it runs indefinitely doing page 
faults in do_page_fault()/hugetlb_fault().
When interrupting the blocked app with CTRL+C, I get the following 
WARNING:


Any idea of what can be wrong ? I'm on a 8xx with 512k huge pages.



It looks like something goes wrong when the app tries to mmap a 
hugetlbpage at a given address.

When it requests the page with a NULL address, it works well.

Any idea ?


Now I have found the reason:

I have something allocated

10000000-10001000 r-xp 00000000 00:0f 2597   /root/malloc
10010000-10011000 rwxp 00000000 00:0f 2597   /root/malloc

And mmap() accepts the hint, which falls in the same PMD, and that PMD is 
not a huge PMD:


mmap(0x10080000, 524288, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000


Apparently, hugetlb_get_unmapped_area() doesn't care about that.

What should we do to handle it? Have our own 
hugetlb_get_unmapped_area(), which does all the same but checks this in 
addition?


Christophe



Christophe



[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W   4.14.6-s3k-dev-ga8e8e8b176-svn9134 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W    (4.14.6-s3k-dev-ga8e8e8b176-svn9134)
[162980.035854] MSR:  00029032   CR: 24044224 XER: 2000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 0010 c6869410 1008 0000 77fb4000
[162980.036003] GPR08: 0001 0683c001  ff80 44028228 10018a34 00004008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c c668fe78 c668e000 c6835ba0 c668fd48
[162980.036003] GPR24:  73ff 7400 0001 77fb4000 100f 1010 1010
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 8137 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
[162980.038014] 2f89 419e0010 712a0ff0 408200e0 <0fe0> 54a9000a 7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1

Christophe


Re: KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()

2017-12-22 Thread Michael Ellerman
On Tue, 2017-12-12 at 17:23:56 UTC, Laurent Vivier wrote:
> When we migrate a VM from a POWER8 host (XICS) to a POWER9 host
> (XICS-on-XIVE), we have an error:
> 
> qemu-kvm: Unable to restore KVM interrupt controller state \
>   (0xff000000) for CPU 0: Invalid argument
> 
> This is because kvmppc_xics_set_icp() checks the new state
> is internally consistent, and especially:
> 
> ...
>1129 if (xisr == 0) {
>1130 if (pending_pri != 0xff)
>1131 return -EINVAL;
> ...
> 
> On the other side, kvmppc_xive_get_icp() sets
> neither the pending_pri value nor the xisr value (set to 0)
> (and kvmppc_xive_set_icp() ignores the pending_pri value)
> 
> As xisr is 0, pending_pri must be set to 0xff.
> 
> Signed-off-by: Laurent Vivier 
> Acked-by: Benjamin Herrenschmidt 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/7333b5aca412d6ad02667b5a513485

cheers


Re: KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-22 Thread Michael Ellerman
On Tue, 2017-12-12 at 12:02:04 UTC, Cédric Le Goater wrote:
> When restoring a pending interrupt, we are setting the Q bit to force
> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> in this case to reach the same initial state : P=1, Q=0.
> 
> This can be done by not setting 'old_p' for pending interrupts which
> will inform xive_finish_unmask() that an EOI needs to be sent.
> 
> Suggested-by: Benjamin Herrenschmidt 
> Signed-off-by: Cédric Le Goater 
> Reviewed-by: Laurent Vivier 
> Tested-by: Laurent Vivier 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/dc1c4165d189350cb51bdd3057deb6

cheers


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-22 Thread Michael Ellerman
Laurent Vivier  writes:

> On 22/12/2017 08:54, Paul Mackerras wrote:
>> On Fri, Dec 22, 2017 at 03:34:20PM +1100, Michael Ellerman wrote:
>>> Laurent Vivier  writes:
>>>
 On 12/12/2017 13:02, Cédric Le Goater wrote:
> When restoring a pending interrupt, we are setting the Q bit to force
> a retrigger in xive_finish_unmask(). But we also need to force an EOI
> in this case to reach the same initial state : P=1, Q=0.
>
> This can be done by not setting 'old_p' for pending interrupts which
> will inform xive_finish_unmask() that an EOI needs to be sent.
>
> Suggested-by: Benjamin Herrenschmidt 
> Signed-off-by: Cédric Le Goater 
> ---
>
>  Tested with a guest running iozone.
>
>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

 We really need this patch to fix VM migration on POWER9.
 When will it be merged?
>>>
>>> Paul is away, so I'll merge it via the powerpc tree.
>>>
>>> I'll mark it:
>>>
>>>   Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE 
>>> interrupt controller")
>>>   Cc: sta...@vger.kernel.org # v4.12+
>> 
>> Thanks for doing that.
>> 
>> If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
>> handling under pHyp" with my acked-by, that would be fine too.  The
>> commit message needs a little work - the reason for using HPTE_R_M is
>> not just because it seems to work, but because current POWER
>> processors require M set on mappings for normal pages, and pHyp
>> enforces that.
>
> We also need:
>
> KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()

I've merged that one.

cheers


Re: [PATCH] KVM: PPC: Book3S: fix XIVE migration of pending interrupts

2017-12-22 Thread Michael Ellerman
Paul Mackerras  writes:

> On Fri, Dec 22, 2017 at 03:34:20PM +1100, Michael Ellerman wrote:
>> Laurent Vivier  writes:
>> 
>> > On 12/12/2017 13:02, Cédric Le Goater wrote:
>> >> When restoring a pending interrupt, we are setting the Q bit to force
>> >> a retrigger in xive_finish_unmask(). But we also need to force an EOI
>> >> in this case to reach the same initial state : P=1, Q=0.
>> >> 
>> >> This can be done by not setting 'old_p' for pending interrupts which
>> >> will inform xive_finish_unmask() that an EOI needs to be sent.
>> >> 
>> >> Suggested-by: Benjamin Herrenschmidt 
>> >> Signed-off-by: Cédric Le Goater 
>> >> ---
>> >> 
>> >>  Tested with a guest running iozone.
>> >> 
>> >>  arch/powerpc/kvm/book3s_xive.c | 4 ++--
>> >>  1 file changed, 2 insertions(+), 2 deletions(-)
>> >
>> > We really need this patch to fix VM migration on POWER9.
>> > When will it be merged?
>> 
>> Paul is away, so I'll merge it via the powerpc tree.
>> 
>> I'll mark it:
>> 
>>   Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE 
>> interrupt controller")
>>   Cc: sta...@vger.kernel.org # v4.12+
>
> Thanks for doing that.
>
> If you felt like merging Alexey's patch "KVM: PPC: Book3S PR: Fix WIMG
> handling under pHyp" with my acked-by, that would be fine too.  The
> commit message needs a little work - the reason for using HPTE_R_M is
> not just because it seems to work, but because current POWER
> processors require M set on mappings for normal pages, and pHyp
> enforces that.

OK. I saw this too late, but I'll pick that one up next week. If someone
sends me an updated change log I will merge all of their patches for
ever.

cheers


Re: [PATCH] SB600 for the Nemo board has non-zero devices on non-root bus

2017-12-22 Thread Michael Ellerman
Christian Zigotzky  writes:

> Hi Bjorn,
>
> Sorry to bother you again. Is this small out of tree init routine in 
> the Nemo patch? I haven't gotten an answer from Darren yet and I didn't 
> find the small out of tree init routine in the Nemo patch. Please find 
> attached the Nemo patch. Maybe you can find this small out of tree init 
> routine.
>
> What do you think of this following code?
>
> if (sb600_bus == -1)
> +   {
> +   busp = pci_find_bus(0, 0);
> +   pa_pxp_read_config(busp, PCI_DEVFN(17,0), PCI_SECONDARY_BUS, 1, &val);
> +
> +   sb600_bus = val;
> +
> +   printk(KERN_CRIT "NEMO SB600 on bus %d.\n",sb600_bus);
> +   }

I suspect Darren was referring to all of sb600_set_flag().

What we'd really like is to be able to do something like:

void __init pas_pci_init(void)
{
...

if (of_find_compatible_node(NULL, NULL, "nemo-something"))
		pci_set_flag(PCI_SCAN_ALL_PCIE_DEVS);


But I don't know if there's anything in the NEMO device tree that we can
use to uniquely identify those machines? i.e. the "nemo-something" string.

Can you attach the output of `lsprop /proc/device-tree` ?

cheers


[PATCH 9/9] powerpc/64s: Use array of lppaca pointers and allocate lppacas individually

2017-12-22 Thread Nicholas Piggin
Similarly to the previous patch, allocate LPPACAs individually.

We no longer allocate lppacas in an array, so this patch removes the 1kB
static alignment for the structure and enforces the PAPR alignment
requirements at allocation time. We cannot reduce the 1kB allocation size,
however, due to existing KVM hypervisors.
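
A minimal sketch of the resulting allocation rule (reusing the memblock
call that appears elsewhere in this series; an approximation of the diff
below, not a quote from it):

	/* 1kB size keeps pre-v4.14 KVM happy; 1kB alignment guarantees
	 * that the 640-byte structure cannot cross a 4kB boundary. */
	struct lppaca *lp = __va(memblock_alloc_base(0x400, 0x400, limit));
	init_lppaca(lp);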

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/lppaca.h  | 24 -
 arch/powerpc/kernel/machine_kexec_64.c | 15 --
 arch/powerpc/kernel/paca.c | 89 --
 arch/powerpc/kvm/book3s_hv.c   |  3 +-
 arch/powerpc/mm/numa.c |  4 +-
 arch/powerpc/platforms/pseries/kexec.c |  7 ++-
 6 files changed, 63 insertions(+), 79 deletions(-)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 6e4589eee2da..65d589689f01 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -36,14 +36,16 @@
 #include 
 
 /*
- * We only have to have statically allocated lppaca structs on
- * legacy iSeries, which supports at most 64 cpus.
- */
-#define NR_LPPACAS 1
-
-/*
- * The Hypervisor barfs if the lppaca crosses a page boundary.  A 1k
- * alignment is sufficient to prevent this
+ * The lppaca is the "virtual processor area" registered with the hypervisor,
+ * H_REGISTER_VPA etc.
+ *
+ * According to PAPR, the structure is 640 bytes long, must be L1 cache line
+ * aligned, and must not cross a 4kB boundary. Its size field must be at
+ * least 640 bytes (but may be more).
+ *
+ * Pre-v4.14 KVM hypervisors reject the VPA if its size field is smaller than
+ * 1kB, so we dynamically allocate 1kB and advertise size as 1kB, but keep
+ * this structure as the canonical 640 byte size.
  */
 struct lppaca {
/* cacheline 1 contains read-only data */
@@ -97,11 +99,9 @@ struct lppaca {
 
__be32  page_ins;   /* CMO Hint - # page ins by OS */
u8  reserved11[148];
-	volatile __be64 dtl_idx;		/* Dispatch Trace Log head index */
+	volatile __be64 dtl_idx;	/* Dispatch Trace Log head index */
u8  reserved12[96];
-} __attribute__((__aligned__(0x400)));
-
-extern struct lppaca lppaca[];
+} ____cacheline_aligned;
 
 #define lppaca_of(cpu) (*paca_ptrs[cpu]->lppaca_ptr)
 
diff --git a/arch/powerpc/kernel/machine_kexec_64.c b/arch/powerpc/kernel/machine_kexec_64.c
index a250e3331f94..1044bf15d5ed 100644
--- a/arch/powerpc/kernel/machine_kexec_64.c
+++ b/arch/powerpc/kernel/machine_kexec_64.c
@@ -323,17 +323,24 @@ void default_machine_kexec(struct kimage *image)
kexec_stack.thread_info.cpu = current_thread_info()->cpu;
 
/* We need a static PACA, too; copy this CPU's PACA over and switch to
-* it.  Also poison per_cpu_offset to catch anyone using non-static
-* data.
+* it. Also poison per_cpu_offset and NULL lppaca to catch anyone using
+* non-static data.
 */
	memcpy(&kexec_paca, get_paca(), sizeof(struct paca_struct));
	kexec_paca.data_offset = 0xedeaddeadeeeeeeeUL;
+#ifdef CONFIG_PPC_PSERIES
+   kexec_paca.lppaca_ptr = NULL;
+#endif
	paca_ptrs[kexec_paca.paca_index] = &kexec_paca;
+
	setup_paca(&kexec_paca);
 
-   /* XXX: If anyone does 'dynamic lppacas' this will also need to be
-* switched to a static version!
+   /*
+* The lppaca should be unregistered at this point so the HV won't
+* touch it. In the case of a crash, none of the lppacas are
+* unregistered so there is not much we can do about it here.
 */
+
/*
 * On Book3S, the copy must happen with the MMU off if we are either
 * using Radix page tables or we are not in an LPAR since we can
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index eef4891c9af6..6cddb9bdc151 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -23,82 +23,50 @@
 #ifdef CONFIG_PPC_PSERIES
 
 /*
- * The structure which the hypervisor knows about - this structure
- * should not cross a page boundary.  The vpa_init/register_vpa call
- * is now known to fail if the lppaca structure crosses a page
- * boundary.  The lppaca is also used on POWER5 pSeries boxes.
- * The lppaca is 640 bytes long, and cannot readily
- * change since the hypervisor knows its layout, so a 1kB alignment
- * will suffice to ensure that it doesn't cross a page boundary.
+ * See asm/lppaca.h for more detail.
+ *
+ * lppaca structures must be 1kB in size, L1 cache line aligned,
+ * and must not cross a 4kB boundary. A 1kB size and 1kB alignment will satisfy
+ * these requirements.
  */
-struct lppaca lppaca[] = {
-   [0 ... (NR_LPPACAS-1)] = {
+static inline void init_lppaca(struct lppaca *lppaca)
+{
+   BUILD_BUG_ON(sizeof(struct lppaca) != 640);
+
+   *lppaca = (struct lppaca) {
.desc = cpu_to_be32(0xd397d781),/* "LpPa" */
-   .size = 

[PATCH 8/9] powerpc/64: Use array of paca pointers and allocate pacas individually

2017-12-22 Thread Nicholas Piggin
Change the paca array into an array of pointers to pacas. Allocate
pacas individually.

This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocating all possible ones up-front and then freeing the unused ones.

This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.
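
For clarity, the before/after access pattern (both forms appear in the
diff below):

	/* before: one contiguous array */
	extern struct paca_struct *paca;
	... paca[cpu].hw_cpu_id ...

	/* after: an array of pointers, each paca allocated individually */
	extern struct paca_struct **paca_ptrs;
	... paca_ptrs[cpu]->hw_cpu_id ...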

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_ppc.h   |  8 ++--
 arch/powerpc/include/asm/lppaca.h|  2 +-
 arch/powerpc/include/asm/paca.h  |  4 +-
 arch/powerpc/include/asm/smp.h   |  4 +-
 arch/powerpc/kernel/crash.c  |  2 +-
 arch/powerpc/kernel/head_64.S| 19 
 arch/powerpc/kernel/machine_kexec_64.c   | 22 -
 arch/powerpc/kernel/paca.c   | 70 +++-
 arch/powerpc/kernel/setup_64.c   | 18 +++
 arch/powerpc/kernel/smp.c| 10 ++--
 arch/powerpc/kernel/sysfs.c  |  2 +-
 arch/powerpc/kvm/book3s_hv.c | 31 ++--
 arch/powerpc/kvm/book3s_hv_builtin.c |  2 +-
 arch/powerpc/mm/tlb-radix.c  |  2 +-
 arch/powerpc/platforms/85xx/smp.c|  8 ++--
 arch/powerpc/platforms/cell/smp.c|  4 +-
 arch/powerpc/platforms/powernv/idle.c| 13 +++---
 arch/powerpc/platforms/powernv/setup.c   |  4 +-
 arch/powerpc/platforms/powernv/smp.c |  2 +-
 arch/powerpc/platforms/powernv/subcore.c |  2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |  2 +-
 arch/powerpc/platforms/pseries/lpar.c|  4 +-
 arch/powerpc/platforms/pseries/setup.c   |  2 +-
 arch/powerpc/platforms/pseries/smp.c |  4 +-
 arch/powerpc/sysdev/xics/icp-native.c|  2 +-
 arch/powerpc/xmon/xmon.c |  2 +-
 26 files changed, 140 insertions(+), 105 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 941c2a3f231b..bde76baf8ae3 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -432,15 +432,15 @@ struct openpic;
 extern void kvm_cma_reserve(void) __init;
 static inline void kvmppc_set_xics_phys(int cpu, unsigned long addr)
 {
-   paca[cpu].kvm_hstate.xics_phys = (void __iomem *)addr;
+   paca_ptrs[cpu]->kvm_hstate.xics_phys = (void __iomem *)addr;
 }
 
 static inline void kvmppc_set_xive_tima(int cpu,
unsigned long phys_addr,
void __iomem *virt_addr)
 {
-   paca[cpu].kvm_hstate.xive_tima_phys = (void __iomem *)phys_addr;
-   paca[cpu].kvm_hstate.xive_tima_virt = virt_addr;
+   paca_ptrs[cpu]->kvm_hstate.xive_tima_phys = (void __iomem *)phys_addr;
+   paca_ptrs[cpu]->kvm_hstate.xive_tima_virt = virt_addr;
 }
 
 static inline u32 kvmppc_get_xics_latch(void)
@@ -454,7 +454,7 @@ static inline u32 kvmppc_get_xics_latch(void)
 
 static inline void kvmppc_set_host_ipi(int cpu, u8 host_ipi)
 {
-   paca[cpu].kvm_hstate.host_ipi = host_ipi;
+   paca_ptrs[cpu]->kvm_hstate.host_ipi = host_ipi;
 }
 
 static inline void kvmppc_fast_vcpu_kick(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index d0a2a2f99564..6e4589eee2da 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -103,7 +103,7 @@ struct lppaca {
 
 extern struct lppaca lppaca[];
 
-#define lppaca_of(cpu) (*paca[cpu].lppaca_ptr)
+#define lppaca_of(cpu) (*paca_ptrs[cpu]->lppaca_ptr)
 
 /*
  * We are using a non architected field to determine if a partition is
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index f83fc885fba8..09efab1a9854 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -236,10 +236,10 @@ struct paca_struct {
struct sibling_subcore_state *sibling_subcore_state;
 #endif
 #endif
-};
+} ____cacheline_aligned;
 
 extern void copy_mm_to_paca(struct mm_struct *mm);
-extern struct paca_struct *paca;
+extern struct paca_struct **paca_ptrs;
 extern void initialise_paca(struct paca_struct *new_paca, int cpu);
 extern void setup_paca(struct paca_struct *new_paca);
 extern void allocate_pacas(void);
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index fac963e10d39..ec7b299350d9 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -170,12 +170,12 @@ static inline const struct cpumask *cpu_sibling_mask(int cpu)
 #ifdef CONFIG_PPC64
 static inline int get_hard_smp_processor_id(int cpu)
 {
-   return paca[cpu].hw_cpu_id;
+   return paca_ptrs[cpu]->hw_cpu_id;
 }
 
 static inline void set_hard_smp_processor_id(int cpu, int phys)
 {
-   

[PATCH 7/9] powerpc/64s: do not allocate lppaca if we are not virtualized

2017-12-22 Thread Nicholas Piggin
The "lppaca" is a structure registered with the hypervisor. This
is unnecessary when running on non-virtualised platforms. One field
from the lppaca (pmcregs_in_use) is also used by the host, so move
the host part out into the paca (lppaca field is still updated in
guest mode).
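
Callers are unchanged; for example, the perf code keeps doing something
like this (a usage sketch, not a line from this patch):

	ppc_set_pmu_inuse(1);	/* PMU SPRs are about to be used */
	/* ... program and read the performance counters ... */
	ppc_set_pmu_inuse(0);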

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/paca.h |  8 ++--
 arch/powerpc/include/asm/pmc.h  | 13 -
 arch/powerpc/kernel/asm-offsets.c   |  5 +
 arch/powerpc/kernel/paca.c  | 16 +---
 arch/powerpc/kvm/book3s_hv_interrupts.S |  3 +--
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  3 +--
 6 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 3892db93b837..f83fc885fba8 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -58,7 +58,7 @@ struct task_struct;
  * processor.
  */
 struct paca_struct {
-#ifdef CONFIG_PPC_BOOK3S
+#ifdef CONFIG_PPC_PSERIES
/*
 * Because hw_cpu_id, unlike other paca fields, is accessed
 * routinely from other CPUs (from the IRQ code), we stick to
@@ -67,7 +67,8 @@ struct paca_struct {
 */
 
struct lppaca *lppaca_ptr;  /* Pointer to LpPaca for PLIC */
-#endif /* CONFIG_PPC_BOOK3S */
+#endif /* CONFIG_PPC_PSERIES */
+
/*
 * MAGIC: the spinlock functions in arch/powerpc/lib/locks.c 
 * load lock_token and paca_index with a single lwz
@@ -159,6 +160,9 @@ struct paca_struct {
u64 saved_r1;   /* r1 save for RTAS calls or PM */
u64 saved_msr;  /* MSR saved here by enter_rtas */
u16 trap_save;  /* Used when bad stack is encountered */
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   u8 pmcregs_in_use;  /* pseries puts this in lppaca */
+#endif
u8 soft_enabled;/* irq soft-enable flag */
u8 irq_happened;/* irq happened while soft-disabled */
u8 io_sync; /* writel() needs spin_unlock sync */
diff --git a/arch/powerpc/include/asm/pmc.h b/arch/powerpc/include/asm/pmc.h
index 5a9ede4962cb..7ac3586c38ab 100644
--- a/arch/powerpc/include/asm/pmc.h
+++ b/arch/powerpc/include/asm/pmc.h
@@ -31,10 +31,21 @@ void ppc_enable_pmcs(void);
 
 #ifdef CONFIG_PPC_BOOK3S_64
 #include 
+#include 
 
 static inline void ppc_set_pmu_inuse(int inuse)
 {
-   get_lppaca()->pmcregs_in_use = inuse;
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_BOOK3S_HV_POSSIBLE)
+   if (firmware_has_feature(FW_FEATURE_LPAR)) {
+#ifdef CONFIG_PPC_PSERIES
+   get_lppaca()->pmcregs_in_use = inuse;
+#endif
+   } else {
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   get_paca()->pmcregs_in_use = inuse;
+#endif
+   }
+#endif
 }
 
 extern void power4_enable_pmcs(void);
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index 3f6316bcde4e..a7c355c8c467 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -221,12 +221,17 @@ int main(void)
OFFSET(PACA_EXMC, paca_struct, exmc);
OFFSET(PACA_EXSLB, paca_struct, exslb);
OFFSET(PACA_EXNMI, paca_struct, exnmi);
+#ifdef CONFIG_PPC_PSERIES
OFFSET(PACALPPACAPTR, paca_struct, lppaca_ptr);
+#endif
OFFSET(PACA_SLBSHADOWPTR, paca_struct, slb_shadow_ptr);
	OFFSET(SLBSHADOW_STACKVSID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].vsid);
	OFFSET(SLBSHADOW_STACKESID, slb_shadow, save_area[SLB_NUM_BOLTED - 1].esid);
OFFSET(SLBSHADOW_SAVEAREA, slb_shadow, save_area);
OFFSET(LPPACA_PMCINUSE, lppaca, pmcregs_in_use);
+#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
+   OFFSET(PACA_PMCINUSE, paca_struct, pmcregs_in_use);
+#endif
OFFSET(LPPACA_DTLIDX, lppaca, dtl_idx);
OFFSET(LPPACA_YIELDCOUNT, lppaca, yield_count);
OFFSET(PACA_DTL_RIDX, paca_struct, dtl_ridx);
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 95ffedf14885..5900540e2ff8 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -20,7 +20,7 @@
 
 #include "setup.h"
 
-#ifdef CONFIG_PPC_BOOK3S
+#ifdef CONFIG_PPC_PSERIES
 
 /*
  * The structure which the hypervisor knows about - this structure
@@ -47,6 +47,9 @@ static long __initdata lppaca_size;
 
 static void __init allocate_lppacas(int nr_cpus, unsigned long limit)
 {
+   if (early_cpu_has_feature(CPU_FTR_HVMODE))
+   return;
+
if (nr_cpus <= NR_LPPACAS)
return;
 
@@ -60,6 +63,9 @@ static struct lppaca * __init new_lppaca(int cpu)
 {
struct lppaca *lp;
 
+   if (early_cpu_has_feature(CPU_FTR_HVMODE))
+   return NULL;
+
if (cpu < NR_LPPACAS)
		return &lppaca[cpu];
 
@@ -73,6 +79,9 @@ static void __init free_lppacas(void)
 {
long new_size = 0, nr;
 
+   if 

[PATCH 6/9] powerpc/64s: Relax PACA address limitations

2017-12-22 Thread Nicholas Piggin
Book3S PACA memory allocation is restricted by the RMA limit and also
must not take SLB faults when accessed in virtual mode. Currently a
fixed 256MB limit is used for this, which is imprecise and sub-optimal.

Update the paca allocation limits to use ppc64_rma_size for the RMA
limit, and share the safe_stack_limit() that is currently used for stack
allocations that must not take virtual mode faults.

The safe_stack_limit() name is changed to ppc64_bolted_size() to match
ppc64_rma_size, and some comments are updated, but the code is unchanged.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/paca.c | 13 +++--
 arch/powerpc/kernel/setup.h|  4 
 arch/powerpc/kernel/setup_64.c | 22 ++
 3 files changed, 25 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index d6597038931d..95ffedf14885 100644
--- a/arch/powerpc/kernel/paca.c
+++ b/arch/powerpc/kernel/paca.c
@@ -18,6 +18,8 @@
 #include 
 #include 
 
+#include "setup.h"
+
 #ifdef CONFIG_PPC_BOOK3S
 
 /*
@@ -208,15 +210,14 @@ void __init allocate_pacas(void)
u64 limit;
int cpu;
 
-   limit = ppc64_rma_size;
-
 #ifdef CONFIG_PPC_BOOK3S_64
/*
-* We can't take SLB misses on the paca, and we want to access them
-* in real mode, so allocate them within the RMA and also within
-* the first segment.
+* We access pacas in real mode, and cannot take SLB faults
+* on them when in virtual mode, so allocate them accordingly.
 */
-	limit = min(0x10000000ULL, limit);
+   limit = min(ppc64_bolted_size(), ppc64_rma_size);
+#else
+   limit = ppc64_rma_size;
 #endif
 
paca_size = PAGE_ALIGN(sizeof(struct paca_struct) * nr_cpu_ids);
diff --git a/arch/powerpc/kernel/setup.h b/arch/powerpc/kernel/setup.h
index 21c18071d9d5..3fc11e30308f 100644
--- a/arch/powerpc/kernel/setup.h
+++ b/arch/powerpc/kernel/setup.h
@@ -51,6 +51,10 @@ void record_spr_defaults(void);
 static inline void record_spr_defaults(void) { };
 #endif
 
+#ifdef CONFIG_PPC64
+u64 ppc64_bolted_size(void);
+#endif
+
 /*
  * Having this in kvm_ppc.h makes include dependencies too
  * tricky to solve for setup-common.c so have it here.
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index d3124c302146..a2b731052084 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -565,24 +565,30 @@ void __init initialize_cache_info(void)
DBG(" <- initialize_cache_info()\n");
 }
 
-/* This returns the limit below which memory accesses to the linear
- * mapping are guarnateed not to cause a TLB or SLB miss. This is
- * used to allocate interrupt or emergency stacks for which our
- * exception entry path doesn't deal with being interrupted.
+/*
+ * This returns the limit below which memory accesses to the linear
+ * mapping are guaranteed not to cause an architectural exception (e.g.,
+ * TLB or SLB miss fault).
+ *
+ * This is used to allocate PACAs and various interrupt stacks that
+ * are accessed early in interrupt handlers that must not cause
+ * re-entrant interrupts.
  */
-static __init u64 safe_stack_limit(void)
+__init u64 ppc64_bolted_size(void)
 {
 #ifdef CONFIG_PPC_BOOK3E
/* Freescale BookE bolts the entire linear mapping */
+   /* XXX: BookE ppc64_rma_limit setup seems to disagree? */
if (mmu_has_feature(MMU_FTR_TYPE_FSL_E))
return linear_map_top;
/* Other BookE, we assume the first GB is bolted */
return 1ul << 30;
 #else
+   /* BookS radix, does not take faults on linear mapping */
if (early_radix_enabled())
return ULONG_MAX;
 
-   /* BookS, the first segment is bolted */
+   /* BookS hash, the first segment is bolted */
if (mmu_has_feature(MMU_FTR_1T_SEGMENT))
return 1UL << SID_SHIFT_1T;
return 1UL << SID_SHIFT;
@@ -591,7 +597,7 @@ static __init u64 safe_stack_limit(void)
 
 void __init irqstack_early_init(void)
 {
-   u64 limit = safe_stack_limit();
+   u64 limit = ppc64_bolted_size();
unsigned int i;
 
/*
@@ -676,7 +682,7 @@ void __init emergency_stack_init(void)
 * initialized in kernel/irq.c. These are initialized here in order
 * to have emergency stacks available as early as possible.
 */
-   limit = min(safe_stack_limit(), ppc64_rma_size);
+   limit = min(ppc64_bolted_size(), ppc64_rma_size);
 
for_each_possible_cpu(i) {
struct thread_info *ti;
-- 
2.15.0



[PATCH 5/9] powerpc/pseries: lift RTAS limit for hash

2017-12-22 Thread Nicholas Piggin
With the previous patch to switch to 64-bit mode after returning from
RTAS and before doing any memory accesses, the RMA limit need not be
clamped to 1GB to avoid RTAS bugs.

Keep the 1GB limit for older firmware (although this is more of a kernel
concern than RTAS), and remove it starting with POWER9.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/hash_utils_64.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 8922e069b073..3c1bc1e6fa91 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1830,11 +1830,13 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 * non-virtualized 64-bit hash MMU systems don't have a limitation
 * on real mode access.
 *
-* We also clamp it to 1G to avoid some funky things
-* such as RTAS bugs etc...
+	 * For guests on platforms before POWER9, we clamp the limit to 1G
+* to avoid some funky things such as RTAS bugs etc...
 */
if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
-		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
+   ppc64_rma_size = first_memblock_size;
+   if (!early_cpu_has_feature(CPU_FTR_ARCH_300))
+			ppc64_rma_size = min_t(u64, ppc64_rma_size, 0x40000000);
 
/* Finally limit subsequent allocations */
memblock_set_current_limit(ppc64_rma_size);
-- 
2.15.0



[PATCH 4/9] powerpc/pseries: lift RTAS limit for radix

2017-12-22 Thread Nicholas Piggin
With the previous patch to switch to 64-bit mode after returning from
RTAS and before doing any memory accesses, the RMA limit need not be
clamped to 1GB to avoid RTAS bugs.

Keep the 1GB limit for older firmware (although this is more of a kernel
concern than RTAS), and remove it starting with POWER9.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/pgtable-radix.c | 21 -
 1 file changed, 4 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 6606216f1992..b8c49e6623ae 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -623,23 +623,10 @@ void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 */
BUG_ON(first_memblock_base != 0);
 
-   if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
-   /*
-* Radix mode guests are not limited by RMA / VRMA addressing.
-*
-* We do clamp addresses to 1GB to avoid some funky things
-* such as RTAS bugs.
-*/
-		ppc64_rma_size = 0x40000000;
-   /*
-* Finally limit subsequent allocations. We really don't want
-* to limit the memblock allocations to rma_size. FIXME!! should
-* we even limit at all ?
-*/
-		memblock_set_current_limit(first_memblock_base + first_memblock_size);
-   } else {
-   ppc64_rma_size = ULONG_MAX;
-   }
+   /*
+* Radix mode is not limited by RMA / VRMA addressing.
+*/
+   ppc64_rma_size = ULONG_MAX;
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-- 
2.15.0



[PATCH 3/9] powerpc/64: rtas avoid accessing paca in 32-bit mode

2017-12-22 Thread Nicholas Piggin
Commit 177ba7c647f3 ("powerpc/mm/radix: Limit paca allocation in radix")
limited the paca allocation address to 1G on pSeries because RTAS return
accesses the paca in 32-bit mode:

On return from RTAS we access the paca variables and we have 64 bit
disabled. This requires us to limit paca in 32 bit range.

Fix this by setting ppc64_rma_size to first_memblock_size/1G range.

Avoid this limit by switching to 64-bit mode (MSR[SF]) before accessing any memory.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/entry_64.S | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 36878b6ee8b8..371c05fe250a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -1083,6 +1083,17 @@ __enter_rtas:
 rtas_return_loc:
FIXUP_ENDIAN
 
+	/*
+	 * Clear RI and set SF before anything.
+	 */
+	mfmsr	r6
+	li	r0,MSR_RI
+	andc	r6,r6,r0
+	sldi	r0,r0,(MSR_SF_LG - MSR_RI_LG)
+	or	r6,r6,r0
+	sync
+	mtmsrd	r6
+
/* relocation is off at this point */
GET_PACA(r4)
clrldi  r4,r4,2 /* convert to realmode address */
@@ -1091,12 +1102,6 @@ rtas_return_loc:
0:	mflr	r3
ld  r3,(1f-0b)(r3)  /* get _restore_regs */
 
-   mfmsr   r6
-   li  r0,MSR_RI
-	andc	r6,r6,r0
-   sync
-   mtmsrd  r6
-
 	ld	r1,PACAR1(r4)	/* Restore our SP */
 	ld	r4,PACASAVEDMSR(r4)	/* Restore our MSR */
 
-- 
2.15.0



[PATCH 2/9] powerpc/pseries: radix is not subject to RMA limit, remove it

2017-12-22 Thread Nicholas Piggin
The radix guest is not subject to the paravirtualized HPT VRMA limit,
so remove that from ppc64_rma_size calculation for that platform.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/pgtable-radix.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index d73816960825..6606216f1992 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -625,15 +625,12 @@ void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 
if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
/*
-* We limit the allocation that depend on ppc64_rma_size
-* to first_memblock_size. We also clamp it to 1GB to
-* avoid some funky things such as RTAS bugs.
+* Radix mode guests are not limited by RMA / VRMA addressing.
 *
-* On radix config we really don't have a limitation
-* on real mode access. But keeping it as above works
-* well enough.
+* We do clamp addresses to 1GB to avoid some funky things
+* such as RTAS bugs.
 */
-		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
+		ppc64_rma_size = 0x40000000;
/*
 * Finally limit subsequent allocations. We really don't want
 * to limit the memblock allocations to rma_size. FIXME!! should
-- 
2.15.0



[PATCH 1/9] powerpc/powernv: Remove real mode access limit for early allocations

2017-12-22 Thread Nicholas Piggin
This removes the RMA limit on powernv platform, which constrains
early allocations such as PACAs and stacks. There are still other
restrictions that must be followed, such as bolted SLB limits, but
real mode addressing has no constraints.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/hash_utils_64.c | 20 +---
 arch/powerpc/mm/pgtable-radix.c | 37 +
 2 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index 655a5a9a183d..8922e069b073 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -1825,16 +1825,22 @@ void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 */
BUG_ON(first_memblock_base != 0);
 
-   /* On LPAR systems, the first entry is our RMA region,
-* non-LPAR 64-bit hash MMU systems don't have a limitation
-* on real mode access, but using the first entry works well
-* enough. We also clamp it to 1G to avoid some funky things
+   /*
+* On virtualized systems the first entry is our RMA region aka VRMA,
+* non-virtualized 64-bit hash MMU systems don't have a limitation
+* on real mode access.
+*
+* We also clamp it to 1G to avoid some funky things
 * such as RTAS bugs etc...
 */
-	ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
+   if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
+		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
 
-   /* Finally limit subsequent allocations */
-   memblock_set_current_limit(ppc64_rma_size);
+   /* Finally limit subsequent allocations */
+   memblock_set_current_limit(ppc64_rma_size);
+   } else {
+   ppc64_rma_size = ULONG_MAX;
+   }
 }
 
 #ifdef CONFIG_DEBUG_FS
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index cfbbee941a76..d73816960825 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -622,22 +622,27 @@ void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 * physical on those processors
 */
BUG_ON(first_memblock_base != 0);
-   /*
-* We limit the allocation that depend on ppc64_rma_size
-* to first_memblock_size. We also clamp it to 1GB to
-* avoid some funky things such as RTAS bugs.
-*
-* On radix config we really don't have a limitation
-* on real mode access. But keeping it as above works
-* well enough.
-*/
-	ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
-   /*
-* Finally limit subsequent allocations. We really don't want
-* to limit the memblock allocations to rma_size. FIXME!! should
-* we even limit at all ?
-*/
-   memblock_set_current_limit(first_memblock_base + first_memblock_size);
+
+   if (!early_cpu_has_feature(CPU_FTR_HVMODE)) {
+   /*
+* We limit the allocation that depend on ppc64_rma_size
+* to first_memblock_size. We also clamp it to 1GB to
+* avoid some funky things such as RTAS bugs.
+*
+* On radix config we really don't have a limitation
+* on real mode access. But keeping it as above works
+* well enough.
+*/
+		ppc64_rma_size = min_t(u64, first_memblock_size, 0x40000000);
+   /*
+* Finally limit subsequent allocations. We really don't want
+* to limit the memblock allocations to rma_size. FIXME!! should
+* we even limit at all ?
+*/
+		memblock_set_current_limit(first_memblock_base + first_memblock_size);
+   } else {
+   ppc64_rma_size = ULONG_MAX;
+   }
 }
 
 #ifdef CONFIG_MEMORY_HOTPLUG
-- 
2.15.0



[PATCH 0/9] modernize early memory allocation limits and

2017-12-22 Thread Nicholas Piggin
This series improves (mostly relaxes) limits on early memory
allocations for stacks, pacas, etc. on 64s.

It also avoids allocating lppacas for bare metal, and it changes
allocation of paca and lppaca from single big arrays to individual
allocations.

The main goal of this is toward allocating these basic structures
per-node. We're not there yet, but closer.

Thanks,
Nick

Nicholas Piggin (9):
  powerpc/powernv: Remove real mode access limit for early allocations
  powerpc/pseries: radix is not subject to RMA limit, remove it
  powerpc/64: rtas avoid accessing paca in 32-bit mode
  powerpc/pseries: lift RTAS limit for radix
  powerpc/pseries: lift RTAS limit for hash
  powerpc/64s: Relax PACA address limitations
  powerpc/64s: do not allocate lppaca if we are not virtualized
  powerpc/64: Use array of paca pointers and allocate pacas individually
  powerpc/64s: Use array of lppaca pointers and allocate lppacas
individually

 arch/powerpc/include/asm/kvm_ppc.h   |   8 +-
 arch/powerpc/include/asm/lppaca.h|  26 ++--
 arch/powerpc/include/asm/paca.h  |  12 +-
 arch/powerpc/include/asm/pmc.h   |  13 +-
 arch/powerpc/include/asm/smp.h   |   4 +-
 arch/powerpc/kernel/asm-offsets.c|   5 +
 arch/powerpc/kernel/crash.c  |   2 +-
 arch/powerpc/kernel/entry_64.S   |  17 ++-
 arch/powerpc/kernel/head_64.S|  19 +--
 arch/powerpc/kernel/machine_kexec_64.c   |  37 +++---
 arch/powerpc/kernel/paca.c   | 174 ++-
 arch/powerpc/kernel/setup.h  |   4 +
 arch/powerpc/kernel/setup_64.c   |  40 +++---
 arch/powerpc/kernel/smp.c|  10 +-
 arch/powerpc/kernel/sysfs.c  |   2 +-
 arch/powerpc/kvm/book3s_hv.c |  34 +++---
 arch/powerpc/kvm/book3s_hv_builtin.c |   2 +-
 arch/powerpc/kvm/book3s_hv_interrupts.S  |   3 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S  |   3 +-
 arch/powerpc/mm/hash_utils_64.c  |  24 ++--
 arch/powerpc/mm/numa.c   |   4 +-
 arch/powerpc/mm/pgtable-radix.c  |  17 +--
 arch/powerpc/mm/tlb-radix.c  |   2 +-
 arch/powerpc/platforms/85xx/smp.c|   8 +-
 arch/powerpc/platforms/cell/smp.c|   4 +-
 arch/powerpc/platforms/powernv/idle.c|  13 +-
 arch/powerpc/platforms/powernv/setup.c   |   4 +-
 arch/powerpc/platforms/powernv/smp.c |   2 +-
 arch/powerpc/platforms/powernv/subcore.c |   2 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   2 +-
 arch/powerpc/platforms/pseries/kexec.c   |   7 +-
 arch/powerpc/platforms/pseries/lpar.c|   4 +-
 arch/powerpc/platforms/pseries/setup.c   |   2 +-
 arch/powerpc/platforms/pseries/smp.c |   4 +-
 arch/powerpc/sysdev/xics/icp-native.c|   2 +-
 arch/powerpc/xmon/xmon.c |   2 +-
 36 files changed, 289 insertions(+), 229 deletions(-)

-- 
2.15.0



[GIT PULL] Please pull powerpc/linux.git powerpc-4.15-5 tag

2017-12-22 Thread Michael Ellerman
Hi Linus,

Please pull some more powerpc fixes for 4.15.

This is all fairly boring, except that there's two KVM fixes that you'd
normally get via Paul's kvm-ppc tree. He's away so I picked them up. I
was waiting to see if he would apply them, which is why they have only
been in my tree since today. But they were on the list for a while and
have been tested on the relevant hardware.

cheers


The following changes since commit d8104182087319fd753d6d8e0afcd95d84c2aa2f:

  powerpc/xmon: Don't print hashed pointers in xmon (2017-12-07 00:27:01 +1100)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-4.15-5

for you to fetch changes up to 7333b5aca412d6ad02667b5a513485838a91b136:

  KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp() 
(2017-12-22 15:36:24 +1100)


powerpc fixes for 4.15 #5

Of note is two fixes for KVM XIVE (Power9 interrupt controller). These would
normally go via the KVM tree but Paul is away so I've picked them up.

Other than that, two fixes for error handling in the IMC driver, and one for a
potential oops in the BHRB code if the hardware records a branch address that
has subsequently been unmapped, and finally a s/%p/%px/ in our oops code.

Thanks to:
  Anju T Sudhakar, Cédric Le Goater, Laurent Vivier, Madhavan Srinivasan, Naveen
  N. Rao, Ravi Bangoria.


Anju T Sudhakar (2):
  powerpc/perf/imc: Fix nest-imc cpuhotplug callback failure
  powerpc/perf: Fix kfree memory allocated for nest pmus

Cédric Le Goater (1):
  KVM: PPC: Book3S: fix XIVE migration of pending interrupts

Laurent Vivier (1):
  KVM: PPC: Book3S HV: Fix pending_pri value in kvmppc_xive_get_icp()

Michael Ellerman (1):
  powerpc/kernel: Print actual address of regs when oopsing

Ravi Bangoria (1):
  powerpc/perf: Dereference BHRB entries safely

 arch/powerpc/kernel/process.c   |  2 +-
 arch/powerpc/kvm/book3s_xive.c  |  7 ---
 arch/powerpc/perf/core-book3s.c |  8 ++--
 arch/powerpc/perf/imc-pmu.c | 17 -
 4 files changed, 27 insertions(+), 7 deletions(-)


signature.asc
Description: PGP signature


[PATCH] SB600 for the Nemo board has non-zero devices on non-root bus

2017-12-22 Thread Christian Zigotzky
I mean: I haven't gotten an answer from Darren yet. Sorry about my 
English. I am still learning.


-- Christian


On 22 December 2017 at 10:57AM, Christian Zigotzky wrote:
> Hi Bjorn,
>
> Sorry to bother you again. Is this small out of tree init routine
> in the Nemo patch? I haven't gotten an answer from Darren yet and I didn't
> find the small out of tree init routine in the Nemo patch. Please find
> attached the Nemo patch. Maybe you can find this small out of tree init
> routine.

>
> What do you think of this following code?
>
> if (sb600_bus == -1)
> +   {
> +   busp = pci_find_bus(0, 0);
> +   pa_pxp_read_config(busp, PCI_DEVFN(17,0), PCI_SECONDARY_BUS, 1, &val);
> +
> +   sb600_bus = val;
> +
> +   printk(KERN_CRIT "NEMO SB600 on bus %d.\n",sb600_bus);
> +   }
>
> Thanks,
> Christian
>
>
> On 04 December 2017 at 12:40PM, Darren Stevens wrote:
> > Hello Bjorn
> >
> > Firstly sorry for not being able to join in this discussion, I have been
> > moving house and only got my X1000 set up again yesterday..
> >
> > On 30/11/2017, Bjorn Helgaas wrote:
> >> I *think* something like the patch below should make this work if you
> >> use the "pci=pcie_scan_all" parameter.  We have some x86 DMI quirks
> >> that set PCI_SCAN_ALL_PCIE_DEVS automatically.  I don't know how to do
> >> something similar on powerpc, but maybe you do?
> >
> > Actually the root ports on the Nemo's PA6T processor don't respond to the
> > SB600 unless we turn on a special 'relax pci-e' bit in one of its control
> > registers. We use a small out of tree init routine to do this, and there
> > would be the ideal place to put a call to
> > pci_set_flag(PCI_SCAN_ALL_PCIE_DEVS).
> >
> > This patch fixes the last major hurdle to getting the X1000 fully supported in
> > the linux kernel, so thanks very much for that.
> >
> > Regards
> > Darren
> >
> >
>




[PATCH] SB600 for the Nemo board has non-zero devices on non-root bus

2017-12-22 Thread Christian Zigotzky

Hi Bjorn,

Sorry to bother you again. Is this small out of tree init routine in 
the Nemo patch? I haven't gotten an answer from Darren yet and I didn't 
find the small out of tree init routine in the Nemo patch. Please find 
attached the Nemo patch. Maybe you can find this small out of tree init 
routine.


What do you think of this following code?

if (sb600_bus == -1)
+   {
+   busp = pci_find_bus(0, 0);
+   pa_pxp_read_config(busp, PCI_DEVFN(17,0), PCI_SECONDARY_BUS, 1, &val);
+
+   sb600_bus = val;
+
+   printk(KERN_CRIT "NEMO SB600 on bus %d.\n",sb600_bus);
+   }

Thanks,
Christian


On 04 December 2017 at 12:40PM, Darren Stevens wrote:
> Hello Bjorn
>
> Firstly sorry for not being able to join in this discussion, I have been
> moving house and only got my X1000 set up again yesterday..
>
> On 30/11/2017, Bjorn Helgaas wrote:
>> I *think* something like the patch below should make this work if you
>> use the "pci=pcie_scan_all" parameter.  We have some x86 DMI quirks
>> that set PCI_SCAN_ALL_PCIE_DEVS automatically.  I don't know how to do
>> something similar on powerpc, but maybe you do?
>
> Actually the root ports on the Nemo's PA6T processor don't respond to the
> SB600 unless we turn on a special 'relax pci-e' bit in one of its control
> registers. We use a small out of tree init routine to do this, and there
> would be the ideal place to put a call to
> pci_set_flag(PCI_SCAN_ALL_PCIE_DEVS).
>
> This patch fixes the last major hurdle to getting the X1000 fully supported in
> the linux kernel, so thanks very much for that.
>
> Regards
> Darren
>
>

diff -rupN a/arch/powerpc/platforms/pasemi/pci.c b/arch/powerpc/platforms/pasemi/pci.c
--- a/arch/powerpc/platforms/pasemi/pci.c	2017-09-11 17:04:18.257586417 +0200
+++ b/arch/powerpc/platforms/pasemi/pci.c	2017-09-11 17:03:43.040599938 +0200
@@ -27,6 +27,7 @@
 #include 
 
 #include 
+#include 
 #include 
 
 #include 
@@ -108,6 +109,69 @@ static int workaround_5945(struct pci_bu
 	return 1;
 }
 
+#ifdef CONFIG_PPC_PASEMI_NEMO
+static int sb600_bus = 5;
+static void __iomem *iob_mapbase = NULL;
+
+static int pa_pxp_read_config(struct pci_bus *bus, unsigned int devfn,
+ int offset, int len, u32 *val);
+
+static void sb600_set_flag(int bus)
+{
+	struct resource res;
+	struct device_node *dn;
+	struct pci_bus *busp;
+	u32 val;
+	int err;
+
+	if (sb600_bus == -1)
+	{
+		busp = pci_find_bus(0, 0);
+		pa_pxp_read_config(busp, PCI_DEVFN(17,0), PCI_SECONDARY_BUS, 1, &val);
+
+		sb600_bus = val;
+
+		printk(KERN_CRIT "NEMO SB600 on bus %d.\n", sb600_bus);
+	}
+
+	if (iob_mapbase == NULL)
+	{
+		dn = of_find_compatible_node(NULL, "isa", "pasemi,1682m-iob");
+		if (!dn)
+		{
+			printk(KERN_CRIT "NEMO SB600 missing iob node\n");
+			return;
+		}
+
+		err = of_address_to_resource(dn, 0, &res);
+		of_node_put(dn);
+
+		if (err)
+		{
+			printk(KERN_CRIT "NEMO SB600 missing resource\n");
+			return;
+		}
+
+		printk(KERN_CRIT "NEMO SB600 IOB base %08lx\n", res.start);
+
+		iob_mapbase = ioremap(res.start + 0x100, 0x94);
+	}
+
+	if (iob_mapbase != NULL)
+	{
+		if (bus == sb600_bus)
+		{
+			out_le32(iob_mapbase + 4, in_le32(iob_mapbase + 4) | 0x800);
+		}
+		else
+		{
+			out_le32(iob_mapbase + 4, in_le32(iob_mapbase + 4) & ~0x800);
+		}
+	}
+}
+#endif
+
+
 static int pa_pxp_read_config(struct pci_bus *bus, unsigned int devfn,
 			  int offset, int len, u32 *val)
 {
@@ -126,6 +190,10 @@ static int pa_pxp_read_config(struct pci
 
 	addr = pa_pxp_cfg_addr(hose, bus->number, devfn, offset);
 
+#ifdef CONFIG_PPC_PASEMI_NEMO
+   sb600_set_flag(bus->number);
+#endif
+
 	/*
 	 * Note: the caller has already checked that offset is
 	 * suitably aligned and that len is 1, 2 or 4.
@@ -210,6 +278,9 @@ static int __init pas_add_bridge(struct
 	/* Interpret the "ranges" property */
 	pci_process_bridge_OF_ranges(hose, dev, 1);
 
+	/* Scan for an isa bridge. */
+	isa_bridge_find_early(hose);
+
 	return 0;
 }
 
diff -rupN a/arch/powerpc/platforms/pasemi/setup.c b/arch/powerpc/platforms/pasemi/setup.c
--- a/arch/powerpc/platforms/pasemi/setup.c	2017-09-11 17:04:18.256586450 +0200
+++ b/arch/powerpc/platforms/pasemi/setup.c	2017-09-11 17:03:43.042599888 +0200
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -72,6 +73,17 @@ static void __noreturn pas_restart(char
 		out_le32(reset_reg, 0x600);
 }
 
+#ifdef CONFIG_PPC_PASEMI_NEMO
+void pas_shutdown(void)
+{
+   /* (added by DStevens 19/06/13)
+  Set the PLD bit that 

Re: WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4

2017-12-22 Thread Christophe LEROY



On 20/12/2017 at 13:17, Christophe LEROY wrote:
Trying to malloc() with libhugetlbfs, it runs indefinitely doing page 
faults in do_page_fault()/hugetlb_fault().

When interrupting the blocked app with CTRL+C, I get the following WARNING:

Any idea of what can be wrong ? I'm on a 8xx with 512k huge pages.



It looks like something goes wrong when the app tries to mmap a 
hugetlbpage at a given address.

When it requests the page with a NULL address, it works well.

Any idea ?

Christophe



[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W   4.14.6-s3k-dev-ga8e8e8b176-svn9134 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W    (4.14.6-s3k-dev-ga8e8e8b176-svn9134)
[162980.035854] MSR:  00029032   CR: 24044224 XER: 2000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 0010 c6869410 1008 0000 77fb4000
[162980.036003] GPR08: 0001 0683c001  ff80 44028228 10018a34 00004008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c c668fe78 c668e000 c6835ba0 c668fd48
[162980.036003] GPR24:  73ff 7400 0001 77fb4000 100f 1010 1010
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 8137 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
[162980.038014] 2f89 419e0010 712a0ff0 408200e0 <0fe0> 54a9000a 7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1

Christophe