Re: [RFC PATCH v2 1/1] of: introduce event tracepoints for dynamic device_node lifecyle

2018-01-24 Thread Frank Rowand
On 01/24/18 22:48, Frank Rowand wrote:
> On 01/21/18 06:31, Wolfram Sang wrote:
>> From: Tyrel Datwyler 
>>
>> This patch introduces event tracepoints for tracking a device_nodes
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR)
>> dynamic logical partitioning code (ie. POWER speak for hotplugging
>> virtual and physical resources at runtime such as cpus or IOAs). These
>> tracepoints provide an easy and quick mechanism for validating the
>> reference counting of device_nodes during their lifetime.
>>
>> Further, when pseries lpars are migrated to a different machine we
>> perform a live update of our device tree to bring it into alignment with
>> the configuration of the new machine. The of_reconfig_notify trace point
>> provides a mechanism that can be turned on for debugging the device tree
>> modifications without having to build a custom kernel to get at the
>> DEBUG code introduced by commit 00aa37206e1a54 ("of/reconfig: Add debug
>> output for OF_RECONFIG notifiers").
>>
>> The following trace events are provided: of_node_get, of_node_put,
>> of_node_release, and of_reconfig_notify. These trace points require a
> 
> Please add a note that the of_reconfig_notify trace event is not an
> added bit of debug info, but is instead replacing information that
> was previously available via pr_debug() when DEBUG was defined.

I got a little carried away, "when DEBUG was defined" is extra
un-needed detail for the commit message.


> 
> 
>> kernel built with ftrace support to be enabled. In a typical environment
>> where debugfs is mounted at /sys/kernel/debug the entire set of
>> tracepoints can be set with the following:
>>
>>   echo "of:*" > /sys/kernel/debug/tracing/set_event
>>
>> or
>>
>>   echo 1 > /sys/kernel/debug/tracing/events/of/enable
>>
>> The following shows the trace point data from a DLPAR remove of a cpu
>> from a pseries lpar:
>>
>> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
>>
>> cpuhp/23-147   [023]    128.324827: of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829: of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829: of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324831: of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>> drmgr-7284  [009]    128.439000: of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>> drmgr-7284  [009]    128.439002: of_reconfig_notify: action=DETACH_NODE, dn->full_name=/cpus/PowerPC,POWER8@10, prop->name=null, old_prop->name=null
>> drmgr-7284  [009]    128.439015: of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>> drmgr-7284  [009]    128.439016: of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
>>
>> Signed-off-by: Tyrel Datwyler 
> 
> The following belongs in a list of version 2 changes, below the "---" line:
> 
>> [wsa: fixed commit abbrev and one of the sysfs paths in commit desc,
>> removed trailing space and fixed pointer declaration in code]
> 
>> Signed-off-by: Wolfram Sang 
>> ---
>>  drivers/of/dynamic.c  | 32 ++--
>>  include/trace/events/of.h | 93 +++
>>  2 files changed, 105 insertions(+), 20 deletions(-)
>>  create mode 100644 include/trace/events/of.h
> 
> mode looks incorrect.  Existing files in include/trace/events/ are -rw-rw
> 
> 
>> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>> index ab988d88704da0..b0d6ab5a35b8c6 100644
>> --- a/drivers/of/dynamic.c
>> +++ b/drivers/of/dynamic.c
>> @@ -21,6 +21,9 @@ static struct device_node *kobj_to_device_node(struct kobject *kobj)
>>  return container_of(kobj, struct device_node, kobj);
>>  }
>>  
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/of.h>
>> +
>>  /**
>>   * of_node_get() - Increment refcount of a node
>>   * @node:   Node to inc refcount, NULL is supported to simplify writing of
>> @@ -30,8 +33,10 @@ static struct device_node *kobj_to_device_node(struct kobject *kobj)
>>   */
>>  struct device_node *of_node_get(struct device_node *node)
>>  {
>> -if (node)
>> +if (node) {
>>  kobject_get(&node->kobj);
>> +trace_of_node_get(refcount_read(&node->kobj.kref.refcount), node->full_name);
> 
> See the comment from Ron that I mentioned in my previous email.
   
   Rob, darn it.


> Also, the path has been removed from node->full_name.  Does using it here
> still give all of the information 

Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
Hi Steve,

On 01/21/18 06:31, Wolfram Sang wrote:
> I got a bug report for a DT node refcounting problem in the I2C subsystem. This
> patch was a huge help in validating the bug report and the proposed solution.
> So, I thought I bring it to attention again. Thanks Tyrel, for the initial
> work!
> 
> Note that I did not test the dynamic updates, only of_node_{get|put} so far. I
> read that Tyrel checked dynamic updates extensively with this patch. And since
> DT overlays are also used within our Renesas dev team, this will help there, as
> well.
> 
> Tested on a Renesas Salvator-XS board (R-Car H3).
> 
> Changes since RFC v1:
>   * rebased to v4.15-rc8
>   * fixed commit abbrev and one of the sysfs paths in commit desc
>   * removed trailing space and fixed pointer declaration in code
> 
> I consider all the remaining checkpatch issues irrelevant for this patch.
> 
> So what about applying it?
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Tyrel Datwyler (1):
>   of: introduce event tracepoints for dynamic device_node lifecyle
> 
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 

Off the top of your head, can you tell me how early in the boot
process a trace_event can be called and successfully provide the
data to someone trying to debug early boot issues?

Also, way back when version 1 of this patch was being discussed,
a question about stacktrace triggers:

 >>> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
 >>> # cat trace | grep -A6 "/pci@8002018"  
 >>
 >> Just to let you know that there is now stacktrace event triggers, where
 >> you don't need to stacktrace all events, you can pick and choose. And
 >> even filter the stack trace on specific fields of the event.  
 >
 > This is great, and I did figure that out this afternoon. One thing I was
 > still trying to determine though was whether its possible to set these
 > triggers at boot? As far as I could tell I'm still limited to
 > "trace_options=stacktrace" as a kernel boot parameter to get the stack
 > for event tracepoints.

 No not yet. But I'll add that to the todo list.

 Thanks,

 -- Steve

Is this still on your todo list, or is it now available?

Thanks,

Frank


Re: [RFC PATCH v2 1/1] of: introduce event tracepoints for dynamic device_node lifecyle

2018-01-24 Thread Frank Rowand
On 01/21/18 06:31, Wolfram Sang wrote:
> From: Tyrel Datwyler 
> 
> This patch introduces event tracepoints for tracking a device_nodes
> reference cycle as well as reconfig notifications generated in response
> to node/property manipulations.
> 
> With the recent upstreaming of the refcount API several device_node
> underflows and leaks have come to my attention in the pseries (DLPAR)
> dynamic logical partitioning code (ie. POWER speak for hotplugging
> virtual and physical resources at runtime such as cpus or IOAs). These
> tracepoints provide an easy and quick mechanism for validating the
> reference counting of device_nodes during their lifetime.
> 
> Further, when pseries lpars are migrated to a different machine we
> perform a live update of our device tree to bring it into alignment with
> the configuration of the new machine. The of_reconfig_notify trace point
> provides a mechanism that can be turned on for debugging the device tree
> modifications without having to build a custom kernel to get at the
> DEBUG code introduced by commit 00aa37206e1a54 ("of/reconfig: Add debug
> output for OF_RECONFIG notifiers").
> 
> The following trace events are provided: of_node_get, of_node_put,
> of_node_release, and of_reconfig_notify. These trace points require a

Please add a note that the of_reconfig_notify trace event is not an
added bit of debug info, but is instead replacing information that
was previously available via pr_debug() when DEBUG was defined.


> kernel built with ftrace support to be enabled. In a typical environment
> where debugfs is mounted at /sys/kernel/debug the entire set of
> tracepoints can be set with the following:
> 
>   echo "of:*" > /sys/kernel/debug/tracing/set_event
> 
> or
> 
>   echo 1 > /sys/kernel/debug/tracing/events/of/enable
> 
> The following shows the trace point data from a DLPAR remove of a cpu
> from a pseries lpar:
> 
> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
> 
> cpuhp/23-147   [023]    128.324827: of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829: of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829: of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324831: of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
> drmgr-7284  [009]    128.439000: of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
> drmgr-7284  [009]    128.439002: of_reconfig_notify: action=DETACH_NODE, dn->full_name=/cpus/PowerPC,POWER8@10, prop->name=null, old_prop->name=null
> drmgr-7284  [009]    128.439015: of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
> drmgr-7284  [009]    128.439016: of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
> 
> Signed-off-by: Tyrel Datwyler 

The following belongs in a list of version 2 changes, below the "---" line:

> [wsa: fixed commit abbrev and one of the sysfs paths in commit desc,
> removed trailing space and fixed pointer declaration in code]

> Signed-off-by: Wolfram Sang 
> ---
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h

mode looks incorrect.  Existing files in include/trace/events/ are -rw-rw


> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index ab988d88704da0..b0d6ab5a35b8c6 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -21,6 +21,9 @@ static struct device_node *kobj_to_device_node(struct kobject *kobj)
>   return container_of(kobj, struct device_node, kobj);
>  }
>  
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/of.h>
> +
>  /**
>   * of_node_get() - Increment refcount of a node
>   * @node:Node to inc refcount, NULL is supported to simplify writing of
> @@ -30,8 +33,10 @@ static struct device_node *kobj_to_device_node(struct kobject *kobj)
>   */
>  struct device_node *of_node_get(struct device_node *node)
>  {
> - if (node)
> + if (node) {
>   kobject_get(&node->kobj);
> + trace_of_node_get(refcount_read(&node->kobj.kref.refcount), node->full_name);

See the comment from Ron that I mentioned in my previous email.

Also, the path has been removed from node->full_name.  Does using it here
still give all of the information that is desired?  Same for all others uses
of full_name in this patch.

The trace point should have a single argument, node.  Accessing the two
fields can be done in the tracepoint assignment.  Or is there some
reason that can't be done?  Same for the trace_of_node_put() tracepoint.


> + }
>   return node;
>  }
>  
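
The single-argument form asked for above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: the
*_sketch types are stand-ins for the kernel structures, struct
of_node_event is a hypothetical record layout, and the function plays the
role that the assignment step of a real trace event would play. The call
site passes only the node, and the two fields are pulled out in one place.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace stand-ins for the kernel types; not the real definitions. */
struct kref_sketch { unsigned int refcount; };
struct kobject_sketch { struct kref_sketch kref; };
struct device_node_sketch {
    const char *full_name;
    struct kobject_sketch kobj;
};

/* Hypothetical layout of one recorded event. */
struct of_node_event {
    unsigned int refcount;
    char name[64];
};

/*
 * Takes only the node. The field accesses happen here, which mimics
 * doing them in the tracepoint assignment instead of at every caller.
 */
static struct of_node_event
trace_of_node_get_sketch(const struct device_node_sketch *node)
{
    struct of_node_event ev;

    assert(node != NULL);
    memset(&ev, 0, sizeof(ev));
    ev.refcount = node->kobj.kref.refcount;
    /* snprintf() truncates safely if the name is longer than the buffer. */
    snprintf(ev.name, sizeof(ev.name), "%s", node->full_name);
    return ev;
}
```

With this shape a caller would write trace_of_node_get_sketch(node) and
nothing else, so a later change to which fields get recorded would not
touch of_node_get() or of_node_put().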

Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/21/18 06:31, Wolfram Sang wrote:
> I got a bug report for a DT node refcounting problem in the I2C subsystem. This
> patch was a huge help in validating the bug report and the proposed solution.
> So, I thought I bring it to attention again. Thanks Tyrel, for the initial
> work!
> 
> Note that I did not test the dynamic updates, only of_node_{get|put} so far. I
> read that Tyrel checked dynamic updates extensively with this patch. And since
> DT overlays are also used within our Renesas dev team, this will help there, as
> well.

It's been nine months since version 1.  If you are going to include the
dynamic updates part of the patch then please test them.


> Tested on a Renesas Salvator-XS board (R-Car H3).
> 
> Changes since RFC v1:
>   * rebased to v4.15-rc8
>   * fixed commit abbrev and one of the sysfs paths in commit desc
>   * removed trailing space and fixed pointer declaration in code
> 

> I consider all the remaining checkpatch issues irrelevant for this patch.

I am OK with the line length warnings in this patch.

Why can't the macro error be fixed?

A file entry needs to be added to MAINTAINERS.


> 
> So what about applying it?
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Tyrel Datwyler (1):
>   of: introduce event tracepoints for dynamic device_node lifecyle
> 
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 



Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/22/18 03:49, Wolfram Sang wrote:
> Hi Frank,
> 
>> Please go back and read the thread for version 1.  Simply resubmitting a
>> forward port is ignoring that whole conversation.
>>
>> There is a lot of good info in that thread.  I certainly learned stuff in it.
> 
> Yes, I did that and learned stuff, too. My summary of the discussion was:
> 
> - you mentioned some drawbacks you saw (like the mixture of trace output
>   and printk output)
> - most of them look like addressed to me? (e.g. Steven showed a way to redirect
>   printk to trace
> - you posted your version (which was, however, marked as "not user friendly"
>   even by yourself)

Not exactly a fair quoting.  There were two things I said:

  "Here is a patch that I have used.  It is not as user friendly in terms
  of human readable stack traces (though a very small user space program
  should be able to fix that)."

 So easy to fix using existing userspace programs to convert kernel
 addresses to symbols.

  "FIXME: Currently using pr_err() so I don't need to set loglevel on boot.

  So obviously not a user friendly tool!!!
  The process is:
 - apply patch
 - configure, build, boot kernel
 - analyze data
 - remove patch"

 So not friendly because it uses pr_err() instead of pr_debug().  In
 a reply I said if I submitted my patches I would change it to use
 pr_debug() instead.  So not an issue.

 And not user friendly because it requires patching the kernel.
 Again a NOP if I submitted my patch, because the patch would
 already be in the kernel.

But whatever, let's ignore that - a poor quoting is not a reason to
reject this version of the patch.


> - The discussion stalled over having two approaches

Then you should have stated such when you resubmitted.


> So, I thought reposting would be a good way of finding out if your
> concerns were addressed in the discussion or not. If I overlooked

Then you should have stated that there were concerns raised in the
discussion and asked me if my concerns were addressed.


> something, I am sorry for that. Still, my intention is to continue the
> discussion, not to ignore it. Because as it stands, we don't have such a
> debugging mechanism in place currently, and with people working with DT
> overlays, I'd think it would be nice to have.
> 
> Kind regards,
> 
>Wolfram
> 


Rob suggested:

 >
 > @@ -25,8 +28,10 @@
 >   */
 >  struct device_node *of_node_get(struct device_node *node)
 >  {
 > -   if (node)
 > +   if (node) {
 > kobject_get(&node->kobj);
 > +   trace_of_node_get(refcount_read(&node->kobj.kref.refcount), node->full_name);

 Seems like there should be a kobj wrapper to read the refcount.

As far as I noticed, that was never addressed.  I don't know the answer, but
the question was asked.  And if there is no such function, then there is at
least kref_read(), which would improve the code a little bit.

I'll reply to the patch 0/1 and patch 1/1 emails with review comments.

-Frank
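
To make the accessor point concrete, here is a minimal userspace sketch.
It assumes stand-in types (the *_sketch names are not kernel
definitions), kref_read_sketch() only mirrors the shape of the kernel's
kref_read(), and kobject_refcount_sketch() is a hypothetical wrapper of
the kind Rob asked about, not an existing kernel function.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins; the kernel uses refcount_t inside struct kref. */
struct kref_sketch { unsigned int refcount; };
struct kobject_sketch { struct kref_sketch kref; };
struct device_node_sketch {
    const char *full_name;
    struct kobject_sketch kobj;
};

/* Same shape as kref_read(): hand back the current count. */
static unsigned int kref_read_sketch(const struct kref_sketch *kref)
{
    assert(kref != NULL);
    return kref->refcount;
}

/* Hypothetical kobject-level wrapper, so callers never name kref fields. */
static unsigned int kobject_refcount_sketch(const struct kobject_sketch *kobj)
{
    return kref_read_sketch(&kobj->kref);
}

/* Stand-in for the trace output: remembers the last traced count. */
static unsigned int last_traced_refcount;

static struct device_node_sketch *
of_node_get_sketch(struct device_node_sketch *node)
{
    if (node) {
        node->kobj.kref.refcount++;   /* kernel: kobject_get(&node->kobj) */
        last_traced_refcount = kobject_refcount_sketch(&node->kobj);
    }
    return node;
}
```

The call site no longer reaches through kobj.kref.refcount by hand, which
is the small improvement that using kref_read() would already give.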


Re: [PATCH] powerpc: pseries: use irq_of_parse_and_map helper

2018-01-24 Thread Michael Ellerman
Rob Herring  writes:

> On Tue, Jan 23, 2018 at 12:53 AM, Michael Ellerman wrote:
>> Rob Herring  writes:
>>
>>> Instead of calling both of_irq_parse_one and irq_create_of_mapping, call
>>> of_irq_parse_and_map instead which does the same thing. This gets us closer
>>> to making the former 2 functions static.
...
>> Are you trying to remove the low-level routines or is this just a
>> cleanup?
>
> The former, but I'm not sure that will happen. There's a handful of
> others left, but they aren't simply a call to of_irq_parse_one and
> then irq_create_of_mapping.
>
>> The patch below works, it loses the error handling if the interrupts
>> property is corrupt/empty, but that's probably overly paranoid anyway.
>
> Not quite. Previously, it was silent if parsing failed. Only the
> mapping would give an error which would mean the interrupt parent had
> some error.
>
> Actually, we could use of_irq_get here to preserve the error handling.
> It will return error codes from parsing, 0 on mapping failure, or the
> Linux irq number. It adds an irq_find_host call for deferred probe,
> but that should be harmless. I'll respin it.

OK thanks.

cheers
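
The three outcomes Rob describes for of_irq_get (a negative error code
from parsing, 0 on mapping failure, a positive Linux irq number
otherwise) can be told apart as in the sketch below. It is a userspace
illustration only: the enum and classify_irq_lookup() are hypothetical
names, and the integer passed in stands for the value of_irq_get would
return.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for the three cases described in the thread. */
enum irq_lookup_result {
    IRQ_LOOKUP_PARSE_ERROR,  /* negative errno from parsing */
    IRQ_LOOKUP_MAP_FAILED,   /* 0: parsed, but the mapping failed */
    IRQ_LOOKUP_OK            /* positive: usable Linux irq number */
};

static enum irq_lookup_result classify_irq_lookup(int ret)
{
    if (ret < 0)
        return IRQ_LOOKUP_PARSE_ERROR;
    if (ret == 0)
        return IRQ_LOOKUP_MAP_FAILED;
    return IRQ_LOOKUP_OK;
}
```

A caller that wants the old behaviour could stay silent on the parse
error case and report only the mapping failure.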


[RFC PATCH] powerpc/powernv: Provide a way to force a core into SMT4 mode

2018-01-24 Thread Paul Mackerras
POWER9 processors up to and including "Nimbus" v2.2 have hardware
bugs relating to transactional memory and thread reconfiguration.
One of these bugs has a workaround which is to get the core into
SMT4 state temporarily.  This workaround is only needed when
running bare-metal.

This patch provides a function which gets the core into SMT4 mode
by preventing threads from going to a stop state, and waking up
those which are already in a stop state.  Once at least 3 threads
are not in a stop state, the core will be in SMT4 and we can
continue.

To do this, we add a "dont_stop" flag to the paca to tell the
thread not to go into a stop state.  If this flag is set,
power9_idle_stop() just returns immediately with a return value
of 0.  The pnv_power9_force_smt4() function does the following:

1. Set the dont_stop flag for each thread in the core, except
   ourselves (in fact we use an atomic_inc() in case more than
   one thread is calling this function concurrently).
2. See how many threads are awake, indicated by their
   requested_psscr field in the paca being 0.  If this is at
   least 3, skip to step 5.
3. Send a doorbell interrupt to each thread that was seen as
   being in a stop state in step 2.
4. Until at least 3 threads are awake, scan the threads to which
   we sent a doorbell interrupt and check if they are awake now.
5. Clear (actually atomic_dec()) the dont_stop flag for each
   thread in the core, except for ourselves.

This relies on the following properties:

- Once dont_stop is non-zero, requested_psscr can't go from zero to
  non-zero, except transiently (and without the thread doing stop).
- requested_psscr being zero guarantees that the thread isn't in
  a state-losing stop state where thread reconfiguration could occur.
- Doing stop with a PSSCR value of 0 won't be a state-losing stop
  and thus won't allow thread reconfiguration.

This does add a sync to power9_idle_stop(), which is necessary to
provide the correct ordering between setting requested_psscr and
checking dont_stop.  The overhead of the sync should be unnoticeable
compared to the latency of going into and out of a stop state.
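
The five steps above can be sketched as a single-threaded userspace
simulation. This is not the patch's code and makes several assumptions:
struct thread_state is a stand-in for the paca fields, plain ints replace
atomic_t and the sync, the simulated doorbell wakes its target
immediately, and step 4 rescans every thread instead of only those that
were sent a doorbell.

```c
#include <assert.h>
#include <stddef.h>

#define THREADS_PER_CORE 4
#define SMT4_MIN_AWAKE   3

/* Stand-in for the per-thread paca fields named in the description. */
struct thread_state {
    int dont_stop;                  /* atomic_t in the patch */
    unsigned long requested_psscr;  /* 0 means the thread is awake */
    int doorbell_sent;              /* bookkeeping for this sketch only */
};

/* Simulated doorbell: the target wakes up at once in this sketch. */
static void send_doorbell_sketch(struct thread_state *t)
{
    t->doorbell_sent = 1;
    t->requested_psscr = 0;
}

static int count_awake(const struct thread_state *core)
{
    int awake = 0;

    for (int i = 0; i < THREADS_PER_CORE; i++)
        if (core[i].requested_psscr == 0)
            awake++;
    return awake;
}

/* Returns how many threads are awake once the core counts as SMT4. */
static int force_smt4_sketch(struct thread_state *core, int self)
{
    int awake;

    /* Step 1: ask every sibling not to enter a stop state. */
    for (int i = 0; i < THREADS_PER_CORE; i++)
        if (i != self)
            core[i].dont_stop++;

    /* Step 2: see how many threads are awake already. */
    awake = count_awake(core);

    if (awake < SMT4_MIN_AWAKE) {
        /* Step 3: doorbell each sibling seen in a stop state. */
        for (int i = 0; i < THREADS_PER_CORE; i++)
            if (i != self && core[i].requested_psscr != 0)
                send_doorbell_sketch(&core[i]);

        /* Step 4: rescan until at least three threads are awake. */
        while ((awake = count_awake(core)) < SMT4_MIN_AWAKE)
            ;
    }

    /* Step 5: drop our request on every sibling again. */
    for (int i = 0; i < THREADS_PER_CORE; i++)
        if (i != self)
            core[i].dont_stop--;

    return awake;
}
```

The increment and decrement in steps 1 and 5, instead of a plain set and
clear, are what let several callers overlap without one of them
releasing another caller's request early.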

Signed-off-by: Paul Mackerras 
---
 arch/powerpc/include/asm/paca.h   |  3 ++
 arch/powerpc/kernel/asm-offsets.c |  1 +
 arch/powerpc/kernel/idle_book3s.S | 15 +
 arch/powerpc/platforms/powernv/idle.c | 62 +++
 4 files changed, 81 insertions(+)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 23ac7fc..71b5c34 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 
 register struct paca_struct *local_paca asm("r13");
 
@@ -177,6 +178,8 @@ struct paca_struct {
u8 thread_mask;
/* Mask to denote subcore sibling threads */
u8 subcore_sibling_mask;
+   /* Flag to request this thread not to stop */
+   atomic_t dont_stop;
/*
 * Pointer to an array which contains pointer
 * to the sibling threads' paca.
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index ff6ce2f..91cb8df 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -758,6 +758,7 @@ int main(void)
OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
OFFSET(PACA_REQ_PSSCR, paca_struct, requested_psscr);
+   OFFSET(PACA_DONT_STOP, paca_struct, dont_stop);
 #define STOP_SPR(x, f) OFFSET(x, paca_struct, stop_sprs.f)
STOP_SPR(STOP_PID, pid);
STOP_SPR(STOP_LDBAR, ldbar);
diff --git a/arch/powerpc/kernel/idle_book3s.S b/arch/powerpc/kernel/idle_book3s.S
index 01e1c19..4a7f88c 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -430,10 +430,23 @@ ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_ARCH_207S, 66);  \
  */
 _GLOBAL(power9_idle_stop)
std r3, PACA_REQ_PSSCR(r13)
+   sync
+   lwz r5, PACA_DONT_STOP(r13)
+   cmpwi   r5, 0
+   bne 1f
mtspr   SPRN_PSSCR,r3
LOAD_REG_ADDR(r4,power_enter_stop)
b   pnv_powersave_common
/* No return */
+1:
+   /*
+* We get here when TM / thread reconfiguration bug workaround
+* code wants to get the CPU into SMT4 mode, and therefore
+* we are being asked not to stop.
+*/
+   li  r3, 0
+   std r3, PACA_REQ_PSSCR(r13)
+   blr /* return 0 for wakeup cause / SRR1 value */
 
 /*
  * On waking up from stop 0,1,2 with ESL=1 on POWER9 DD1,
@@ -584,6 +597,8 @@ FTR_SECTION_ELSE_NESTED(71)
mfspr   r5, SPRN_PSSCR
rldicl  r5,r5,4,60
 ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_POWER9_DD1, 71)
+   li  r0, 0   /* clear requested_psscr to say we're awake */
+   std r0, 

Re: [PATCH 5/5] powerpc/ftw: Document FTW API/usage

2018-01-24 Thread Sukadev Bhattiprolu
Randy Dunlap [rdun...@infradead.org] wrote:

> > +struct ftw_setup_attr ftwattr;
> > +
> > +fd = open("/dev/ftw", O_RDWR);
> > +
> > +memset(&rxattr, 0, sizeof(rxattr));
> 
> Is that supposed to be ftwattr (2x above)?

Yes. I agree with your other comments as well and will send a new version.

Thanks for the detailed review.

Sukadev



Re: [PATCH 5/5] powerpc/ftw: Document FTW API/usage

2018-01-24 Thread Randy Dunlap
On 01/16/2018 06:50 PM, Sukadev Bhattiprolu wrote:
> Document the usage of the VAS Fast thread-wakeup API and add an entry in
> MAINTAINERS file.
> 
> Thanks for input/comments from Benjamin Herrenschmidt, Michael Neuling,
> Michael Ellerman, Robert Blackmore, Ian Munsie, Haren Myneni and Paul
> Mackerras.
> 
> Signed-off-by: Sukadev Bhattiprolu 
> ---
> 
> Changelog[v2]
>   - [Michael Neuling] Update API to use a single, VAS_FTW_SETUP ioctl
> rather than two ioctls.
>   - [Michael Neuling] Drop "nx" from name "nx-ftw".
> 
> ---
>  Documentation/powerpc/ftw-api.txt | 283 ++
>  MAINTAINERS   |   8 ++
>  2 files changed, 291 insertions(+)
>  create mode 100644 Documentation/powerpc/ftw-api.txt
> 
> diff --git a/Documentation/powerpc/ftw-api.txt b/Documentation/powerpc/ftw-api.txt
> new file mode 100644
> index 000..a107628
> --- /dev/null
> +++ b/Documentation/powerpc/ftw-api.txt
> @@ -0,0 +1,283 @@
> +Virtual Accelerator Switchboard and Fast Thread-Wakeup API
> +
...
> +
> +Application access to the FTW mechanism is provided through the FTW
> +device node (/dev/ftw) implemented by the FTW device driver.
> +
> +A multi-threaded software processes that intends to use the FTW

 process

> +mechanism must first setup a channel (consisting of a pair of VAS
> +windows) for the waiting and waking threads to communicate. The
> +channel is set up by opening the FTW device and issuing the FTW_SETUP
> +ioctl. Upon successful return from the ioctl, the waiting side of
> +channel is complete and a thread can issue the "Wait" instruction
> +to wait for an event.
> +
> +After the successful return from the FTW_SETUP ioctl, the waking
> +thread must use mmap() system call on the same file descriptor and
> +obtain a virtual address known as the "paste address".
> +
> +Once the mmap() call succeeds the setup of "waking" side of the channel
> +is complete. To wake up a waiting thread, the waking thread should use
> +the "COPY" and "PASTE" instructions to write a zero-filled CRB to the
> +paste-address.
> +
> +The wait and wake up operations can be repeated as long as the paste
> +address and the FTW file descriptor are valid (i.e until munmap() of
> +the paste address or a close() of the FTW fd).
> +
> +1. FTW Device Node
> +
> +There is one /dev/ftw node in the system and it provides access to the
> +VAS/FTW functionality.
> +
> +The only valid operations (system calls) on the FTW node are:
> +
> +- open() the device for read and write.
> +
> +- issue the FTW_SETUP ioctl to set up a channel.
> +
> +- mmap() the file descriptor
> +
> +- close the device node.
> +
> +Other file operations on the FTW node are undefined.
> +
> +Note that the COPY and PASTE operations go directly to the hardware
> +and do not involve system calls or go through the FTW device.
> +
> +Although a system may have several instances of the VAS in the system
> +(typically, one per P9 chip) there is just one FTW device node in
> +the system.
> +
> +When the FTW device node is opened, the kernel assigns a suitable
> +instance of VAS to the process. Kernel will make a best-effort attempt
> +to assign an optimal instance of VAS for the process - based on the CPU/
> +chip that the process is running on. In the initial release, the kernel
> +does not support migrating the VAS instance if the process migrates from
> +a CPU on one chip to a CPU on another chip.
> +
> +Applications may chose a specific instance of the VAS using the 'vas_id'

choose

> +field in the FTW_SETUP ioctl as detailed below.
> +
> +2. Open FTW node
> +
> +The device should be opened for read and write. No special privileges
> +are needed to open the device. The device may be opened multiple times.
> +
> +Each open() of the FTW device is associated with one channel of
> +communication. There is a system-wide limit (currently 64K windows per
> +chip and since some are reserved for hardware, there are about 32K
> +channels per chip). If no more channels are available, the open() system
> +call will fail.
> +
> +See open(2) system call man pages for other details such as return
> +values, error codes and restrictions.
> +
> +3. Setup a communication channel (FTW_SETUP ioctl)
> +
> +A process that intends to use the Fast Thread-wakeup mechanism must
> +first setup a channel by issuing the FTW_SETUP ioctl.
> +
> +#include 
> +
> +struct ftw_setup_attr ftwattr;
> +
> +rc = ioctl(fd, FTW_SETUP, &ftwattr);
> +
> +The attributes of ftwattr are as follows:
> +
> +struct ftw_setup_attr {
> +int16_t   version;
> +int16_t   vas_id;
> +

Re: Are those hacks still valid on powerpc kernel ?

2018-01-24 Thread Benjamin Herrenschmidt
On Wed, 2018-01-24 at 11:17 +0100, Christophe LEROY wrote:
> Below comments are very old.
> 
> Aren't new glibc and binutils now able to go without this ?
> 
> Note that the code inside the #if 0 is wrong as we have no vma defined 
> in the function.
> 
> Or does it just have no performance impact anyway ?
> 
> 
>  From /arch/powerpc/mm/mem.c:
> 
> void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
> {
>   clear_page(page);
> 
>   /*
>* We shouldn't have to do this, but some versions of glibc
>* require it (ld.so assumes zero filled pages are icache clean)
>* - Anton
>*/
>   flush_dcache_page(pg);
> }
> EXPORT_SYMBOL(clear_user_page);

Well, I think it would be a security issue to potentially leave garbage
icache content (possibly instructions from another process) accessible
to userspace. So I don't think we can avoid that one.

> void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
>   struct page *pg)
> {
>   copy_page(vto, vfrom);
> 
>   /*
>* We should be able to use the following optimisation, however
>* there are two problems.
>* Firstly a bug in some versions of binutils meant PLT sections
>* were not marked executable.
>* Secondly the first word in the GOT section is blrl, used
>* to establish the GOT address. Until recently the GOT was
>* not marked executable.
>* - Anton
>*/
> #if 0
>   if (!vma->vm_file && ((vma->vm_flags & VM_EXEC) == 0))
>   return;
> #endif

Well, we try not to break userspace. This doesn't affect newer CPUs
that much because they have CPU_FTR_COHERENT_ICACHE, so
flush_dcache_page is pretty much a nop on them.

Cheers,
Ben.

>   flush_dcache_page(pg);
> }
> 
> Christophe


[PATCH] powerpc: dts: use 'atmel' as at24 manufacturer for kmcent2

2018-01-24 Thread Bartosz Golaszewski
Using compatible strings without the <manufacturer> part for at24 is
now deprecated. Use a correct 'atmel,<model>' value.

Signed-off-by: Bartosz Golaszewski 
---
 arch/powerpc/boot/dts/fsl/kmcent2.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/fsl/kmcent2.dts b/arch/powerpc/boot/dts/fsl/kmcent2.dts
index 5922c1ea0e96..3094df05f5ea 100644
--- a/arch/powerpc/boot/dts/fsl/kmcent2.dts
+++ b/arch/powerpc/boot/dts/fsl/kmcent2.dts
@@ -130,7 +130,7 @@
#size-cells = <0>;
 
eeprom@54 {
-   compatible = "24c02";
+   compatible = "atmel,24c02";
reg = <0x54>;
pagesize = <2>;
read-only;
-- 
2.16.1



[PATCH] powerpc: dts: use a correct at24 compatible fallback in ac14xx

2018-01-24 Thread Bartosz Golaszewski
Using 'at24' as fallback is now deprecated - use the full
'atmel,<model>' string.

Signed-off-by: Bartosz Golaszewski 
---
 arch/powerpc/boot/dts/ac14xx.dts | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/boot/dts/ac14xx.dts b/arch/powerpc/boot/dts/ac14xx.dts
index 83bcfd865167..0be5c4f3265d 100644
--- a/arch/powerpc/boot/dts/ac14xx.dts
+++ b/arch/powerpc/boot/dts/ac14xx.dts
@@ -176,12 +176,12 @@
clock-frequency = <40>;
 
at24@30 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x30>;
};
 
at24@31 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x31>;
};
 
@@ -191,42 +191,42 @@
};
 
at24@50 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x50>;
};
 
at24@51 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x51>;
};
 
at24@52 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x52>;
};
 
at24@53 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x53>;
};
 
at24@54 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x54>;
};
 
at24@55 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x55>;
};
 
at24@56 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x56>;
};
 
at24@57 {
-   compatible = "at24,24c01";
+   compatible = "atmel,24c01";
reg = <0x57>;
};
 
-- 
2.16.1



[PATCH] powerpc: dts: use 'atmel' as at24 manufacturer for pdm360ng

2018-01-24 Thread Bartosz Golaszewski
Using 'at' as the <manufacturer> part of the compatible string is now
deprecated. Use a correct string: 'atmel,<model>'.

Signed-off-by: Bartosz Golaszewski 
---
 arch/powerpc/boot/dts/pdm360ng.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/pdm360ng.dts b/arch/powerpc/boot/dts/pdm360ng.dts
index 445b88114009..df1283b63d9b 100644
--- a/arch/powerpc/boot/dts/pdm360ng.dts
+++ b/arch/powerpc/boot/dts/pdm360ng.dts
@@ -98,7 +98,7 @@
fsl,preserve-clocking;
 
eeprom@50 {
-   compatible = "at,24c01";
+   compatible = "atmel,24c01";
reg = <0x50>;
};
 
-- 
2.16.1
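
For readers following these two dts patches: a compatible string has the
form 'manufacturer,model', and the patches only change the part before the
comma. Below is a minimal sketch in plain C of checking that prefix. The
helper compatible_has_vendor() is hypothetical, written only for
illustration; it is not part of the at24 driver or of the kernel's OF API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true when 'compatible' has the form
 * "<vendor>,<model>" and its vendor part equals 'vendor' exactly.
 */
bool compatible_has_vendor(const char *compatible, const char *vendor)
{
	const char *comma;
	size_t vlen;

	assert(compatible != NULL && vendor != NULL);

	comma = strchr(compatible, ',');
	if (comma == NULL)
		return false;	/* no vendor prefix at all */

	vlen = strlen(vendor);
	return (size_t)(comma - compatible) == vlen &&
	       strncmp(compatible, vendor, vlen) == 0;
}
```

With this sketch, the old strings "at,24c01" and "at24,24c01" do not match
the vendor "atmel", while the new "atmel,24c01" does.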



Re: [PATCH v4 3/7] platforms/pseries: Set eeh_pe of EEH_PE_VF type

2018-01-24 Thread Bryant G. Ly

On 1/23/18 7:14 PM, Michael Ellerman wrote:

> "Bryant G. Ly"  writes:
>
>> To correctly use EEH code one has to make
>> sure that the EEH_PE_VF is set for dynamic created
>> VFs. Therefore this patch allocates an eeh_pe of
>> eeh type EEH_PE_VF and associates PE with parent.
>>
>> Signed-off-by: Bryant G. Ly 
>> Signed-off-by: Juan J. Alvarez 
>> ---
>>  arch/powerpc/include/asm/pci-bridge.h|  5 -
>>  arch/powerpc/platforms/pseries/eeh_pseries.c | 17 +
>>  2 files changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h 
>> b/arch/powerpc/include/asm/pci-bridge.h
>> index 9f66ddebb799..16d70740a76f 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -211,7 +211,10 @@ struct pci_dn {
>>  unsigned int *pe_num_map;   /* PE# for the first VF PE or array */
>>  boolm64_single_mode;/* Use M64 BAR in Single Mode */
>>  #define IODA_INVALID_M64(-1)
>> -int (*m64_map)[PCI_SRIOV_NUM_BARS];
>> +union {
>> +int (*m64_map)[PCI_SRIOV_NUM_BARS]; /*Only used in powernv 
>> */
>> +int last_allow_rc;  /* Only used in pSeries */
>> +};
>>  #endif /* CONFIG_PCI_IOV */
>>  int mps;/* Maximum Payload Size */
>>  struct list_head child_list;
> I don't see the point of using a union to save 4 bytes.
>
> And if you look at the current layout of the struct there's actually a 4
> byte hole after mps, so it doesn't actually save any space at all.
>
> I can remove it before applying, unless there's some compelling reason
> for it I'm not seeing.
>
> cheers

No specific reason for the union, you can go ahead and remove it before 
applying. 
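
The padding argument above can be checked with a small user-space sketch.
The two structs below are simplified stand-ins, not the real struct pci_dn:
only the three fields under discussion plus a list_head-like pair of
pointers are kept, and NUM_BARS, dn_union, dn_plain and the helper names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* The sketch assumes the usual 4-byte int. */
static_assert(sizeof(int) == 4, "sketch assumes 4-byte int");

#define NUM_BARS 6	/* stand-in for PCI_SRIOV_NUM_BARS */

struct list_head_like {
	void *next, *prev;
};

/* Variant A: the union proposed in the patch. */
struct dn_union {
	union {
		int (*m64_map)[NUM_BARS];
		int last_allow_rc;
	};
	int mps;
	struct list_head_like child_list;
};

/* Variant B: last_allow_rc as a plain field, no union. */
struct dn_plain {
	int (*m64_map)[NUM_BARS];
	int last_allow_rc;
	int mps;
	struct list_head_like child_list;
};

/* Bytes the union saves compared with the plain layout. */
size_t union_saving(void)
{
	return sizeof(struct dn_plain) - sizeof(struct dn_union);
}

/* Padding between mps and child_list in the union layout. */
size_t hole_after_mps(void)
{
	return offsetof(struct dn_union, child_list) -
	       (offsetof(struct dn_union, mps) + sizeof(int));
}
```

On a typical 64-bit ABI with 8-byte pointers, the plain int lands in the
padding that otherwise follows mps, so both layouts have the same size. On a
32-bit ABI the union would save one int.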

Thanks!

Bryant




Re: [PATCH] macintosh/ams-input: Use true and false for boolean values

2018-01-24 Thread Michael Hanselmann
On 24.01.2018 02:48, Gustavo A. R. Silva wrote:
> Assign true or false to boolean variables instead of an integer value.
> 
> This issue was detected with the help of Coccinelle
> 
> Signed-off-by: Gustavo A. R. Silva 

Reviewed-by: Michael Hanselmann 



Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Christophe LEROY



On 24/01/2018 at 11:08, Aneesh Kumar K.V wrote:



On 01/24/2018 03:33 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:51, Aneesh Kumar K.V wrote:



On 01/24/2018 03:09 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:35, Aneesh Kumar K.V wrote:




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as 
I suggested in my last email on this subject (22/01/2018 9:22) ?



yes. The test ran fine for me


You tried with 0x3000, it works as well on PPC32.

I'd really like you to try with 0x1100 which is in the same 
slice as the 1002-1003 range.





Now that explains it better. But then the requested HEAPBASE was not 
free and hence topdown search got an address in the below range.


7efffd00-7f00 rw-p  00:0d 1082770 /anon_hugepage (deleted)



The new range allocated is such that there is no scope for expansion 
of heap if we do a topdown search. But why should that require us to 
change from topdown/bottomup search?



1000-1001 r-xp  fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p  fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p  00:0d 1082770 /anon_hugepage (deleted)
72d4-77d6 rw-p  00:00 0
77d6-77f1 r-xp  fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp  fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp  00:00 0 [vdso]
77fa-77fe r-xp  fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p  00:00 0 [stack]


For the specific test, one should pass the HEAPBASE value such that 
it can be expanded if required isn't it ?


For the test, yes, it is dumb to pass an unusable HEAPBASE, but what 
happens in real life:
* PPC32: No HEAPBASE, hugetlbfs defines a HEAPBASE at sbrk(0) + 
PAGE_SIZE = 0x1080 ==> This is in the same slice as already 
allocated ==> the kernel does as if mmap() had been called with no 
hint address and allocates something unusable instead.
* PPC64: No HEAPBASE, hugetlbfs seems to define a HEAPBASE at 
1000, which doesn't conflict with an already allocated mapping 
==> it works.


Now, when we take the generic case, ie when slice is not activated, 
when you call mmap() without a hint address, it allocates a suitable 
address because it does bottom-up. Why do differently with slices ?




IIUC that is largely arch dependent, PPC64 always did topdown search. 
Even for regular non hugetlb mmap it did topdown search. If you set 
legacy mmap we selected bottom up approach. You can check 
arch_pick_mmap_layout() for more details. Now x86 is slightly different.
For the default search if we can't find a mapping address it will try a 
bottomup search. Having said that if you think libhugetlbfs made 
assumptions with respect to 8xx and you don't want to break it, make
8xx unmapped area search bottomup.



Or would there be a way to make libhugetlbfs aware of the slices 
constraints and make it choose a suitable hint address at first try ?


Christophe


Are those hacks still valid on powerpc kernel ?

2018-01-24 Thread Christophe LEROY

Below comments are very old.

Aren't new glibc and binutils now able to go without this ?

Note that the code inside the #if 0 is wrong as we have no vma defined 
in the function.


Or does it just have no performance impact anyway ?


From /arch/powerpc/mm/mem.c:

void clear_user_page(void *page, unsigned long vaddr, struct page *pg)
{
	clear_page(page);

	/*
	 * We shouldn't have to do this, but some versions of glibc
	 * require it (ld.so assumes zero filled pages are icache clean)
	 * - Anton
	 */
	flush_dcache_page(pg);
}
EXPORT_SYMBOL(clear_user_page);

void copy_user_page(void *vto, void *vfrom, unsigned long vaddr,
		    struct page *pg)
{
	copy_page(vto, vfrom);

	/*
	 * We should be able to use the following optimisation, however
	 * there are two problems.
	 * Firstly a bug in some versions of binutils meant PLT sections
	 * were not marked executable.
	 * Secondly the first word in the GOT section is blrl, used
	 * to establish the GOT address. Until recently the GOT was
	 * not marked executable.
	 * - Anton
	 */
#if 0
	if (!vma->vm_file && ((vma->vm_flags & VM_EXEC) == 0))
		return;
#endif

	flush_dcache_page(pg);
}

Christophe


Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Aneesh Kumar K.V



On 01/24/2018 03:33 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:51, Aneesh Kumar K.V wrote:



On 01/24/2018 03:09 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:35, Aneesh Kumar K.V wrote:




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I 
suggested in my last email on this subject (22/01/2018 9:22) ?



yes. The test ran fine for me


You tried with 0x3000, it works as well on PPC32.

I'd really like you to try with 0x1100 which is in the same slice 
as the 1002-1003 range.





Now that explains it better. But then the requested HEAPBASE was not 
free and hence topdown search got an address in the below range.


7efffd00-7f00 rw-p  00:0d 1082770 /anon_hugepage 
(deleted)



The new range allocated is such that there is no scope for expansion 
of heap if we do a topdown search. But why should that require us to 
change from topdown/bottomup search?



1000-1001 r-xp  fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p  fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p  00:0d 1082770 /anon_hugepage 
(deleted)

72d4-77d6 rw-p  00:00 0
77d6-77f1 r-xp  fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp  fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0

77f8-77fa r-xp  00:00 0 [vdso]
77fa-77fe r-xp  fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so

7ffd-8000 rw-p  00:00 0 [stack]


For the specific test, one should pass the HEAPBASE value such that it 
can be expanded if required isn't it ?


For the test, yes, it is dumb to pass an unusable HEAPBASE, but what 
happens in real life:
* PPC32: No HEAPBASE, hugetlbfs defines a HEAPBASE at sbrk(0) + 
PAGE_SIZE = 0x1080 ==> This is in the same slice as already 
allocated ==> the kernel does as if mmap() had been called with no hint 
address and allocates something unusable instead.
* PPC64: No HEAPBASE, hugetlbfs seems to define a HEAPBASE at 
1000, which doesn't conflict with an already allocated mapping 
==> it works.


Now, when we take the generic case, ie when slice is not activated, when 
you call mmap() without a hint address, it allocates a suitable address 
because it does bottom-up. Why do differently with slices ?




IIUC that is largely arch dependent, PPC64 always did topdown search. 
Even for regular non hugetlb mmap it did topdown search. If you set 
legacy mmap we selected bottom up approach. You can check 
arch_pick_mmap_layout() for more details. Now x86 is slightly different.
For the default search if we can't find a mapping address it will try a 
bottomup search. Having said that if you think libhugetlbfs made 
assumptions with respect to 8xx and you don't want to break it, make
8xx unmapped area search bottomup.

-aneesh



Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Christophe LEROY



On 24/01/2018 at 10:51, Aneesh Kumar K.V wrote:



On 01/24/2018 03:09 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:35, Aneesh Kumar K.V wrote:




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I 
suggested in my last email on this subject (22/01/2018 9:22) ?



yes. The test ran fine for me


You tried with 0x3000, it works as well on PPC32.

I'd really like you to try with 0x1100 which is in the same slice 
as the 1002-1003 range.





Now that explains it better. But then the requested HEAPBASE was not 
free and hence topdown search got an address in the below range.


7efffd00-7f00 rw-p  00:0d 1082770 /anon_hugepage 
(deleted)



The new range allocated is such that there is no scope for expansion of 
heap if we do a topdown search. But why should that require us to change 
from topdown/bottomup search?



1000-1001 r-xp  fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p  fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
7efffd00-7f00 rw-p  00:0d 1082770 /anon_hugepage 
(deleted)

72d4-77d6 rw-p  00:00 0
77d6-77f1 r-xp  fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp  fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0

77f8-77fa r-xp  00:00 0 [vdso]
77fa-77fe r-xp  fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so

7ffd-8000 rw-p  00:00 0 [stack]


For the specific test, one should pass the HEAPBASE value such that it 
can be expanded if required isn't it ?


For the test, yes, it is dumb to pass an unusable HEAPBASE, but what 
happens in real life:
* PPC32: No HEAPBASE, hugetlbfs defines a HEAPBASE at sbrk(0) + 
PAGE_SIZE = 0x1080 ==> This is in the same slice as already 
allocated ==> the kernel does as if mmap() had been called with no hint 
address and allocates something unusable instead.
* PPC64: No HEAPBASE, hugetlbfs seems to define a HEAPBASE at 
1000, which doesn't conflict with an already allocated mapping 
==> it works.


Now, when we take the generic case, ie when slice is not activated, when 
you call mmap() without a hint address, it allocates a suitable address 
because it does bottom-up. Why do differently with slices ?


Christophe


Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Aneesh Kumar K.V



On 01/24/2018 03:09 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:35, Aneesh Kumar K.V wrote:




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I 
suggested in my last email on this subject (22/01/2018 9:22) ?



yes. The test ran fine for me


You tried with 0x3000, it works as well on PPC32.

I'd really like you to try with 0x1100 which is in the same slice as 
the 1002-1003 range.





Now that explains it better. But then the requested HEAPBASE was not 
free and hence topdown search got an address in the below range.


7efffd00-7f00 rw-p  00:0d 1082770 
/anon_hugepage (deleted)



The new range allocated is such that there is no scope for expansion of 
heap if we do a topdown search. But why should that require us to change 
from topdown/bottomup search?



1000-1001 r-xp  fc:00 9044312 
/home/kvaneesh/a.out
1001-1002 r--p  fc:00 9044312 
/home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 
/home/kvaneesh/a.out
7efffd00-7f00 rw-p  00:0d 1082770 
/anon_hugepage (deleted)

72d4-77d6 rw-p  00:00 0
77d6-77f1 r-xp  fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp  fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp  00:00 0 
[vdso]
77fa-77fe r-xp  fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p  00:00 0 
[stack]



For the specific test, one should pass the HEAPBASE value such that it 
can be expanded if required isn't it ?


-aneesh



Re: [PATCH 25/26] KVM: PPC: Book3S PR: Support TAR handling for PR KVM HTM.

2018-01-24 Thread Paul Mackerras
On Thu, Jan 11, 2018 at 06:11:38PM +0800, wei.guo.si...@gmail.com wrote:
> From: Simon Guo 
> 
> Currently guest kernel doesn't handle TAR fac unavailable and it always
> runs with TAR bit on. PR KVM will lazily enable TAR. TAR is not a
> frequent-use reg and it is not included in SVCPU struct.
> 
> To make it work for transaction memory at PR KVM:
> 1). Flush/giveup TAR at kvmppc_save_tm_pr().
> 2) If we are receiving a TAR fac unavail exception inside a transaction,
> the checkpointed TAR might be a TAR value from another process. So we need
> treclaim the transaction, then load the desired TAR value into reg, and
> perform trecheckpoint.
> 3) Load TAR facility at kvmppc_restore_tm_pr() when TM active.
> The reason we always loads TAR when restoring TM is that:
> If we don't do this way, when there is a TAR fac unavailable exception
> during TM active:
> case 1: it is the 1st TAR fac unavail exception after tbegin.
> vcpu->arch.tar should be reloaded as checkpoint tar val.
> case 2: it is the 2nd or later TAR fac unavail exception after tbegin.
> vcpu->arch.tar_tm should be reloaded as checkpoint tar val.
> There will be unnecessary difficulty to handle the above 2 cases.
> 
> at the end of emulating treclaim., the correct TAR val need to be loaded
> into reg if FSCR_TAR bit is on.
> at the beginning of emulating trechkpt., TAR needs to be flushed so that
> the right tar val can be copy into tar_tm.

Would it be simpler always to load up TAR when guest_MSR[TM] is 1?

Paul.


Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Christophe LEROY



On 24/01/2018 at 10:35, Aneesh Kumar K.V wrote:



On 01/24/2018 02:57 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:15, Aneesh Kumar K.V wrote:



On 01/24/2018 02:32 PM, Christophe Leroy wrote:

An application running with libhugetlbfs fails to allocate
additional pages to HEAP due to the hugemap being done
unconditionally as topdown mapping:

mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8

[...]
mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8

munmap(0x73d8, 1048576) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0

munmap(0x73d0, 1572864) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0

munmap(0x73d0, 1572864) = 0
[...]

As one can see from the above strace log, mmap() allocates further
pages below the initial one because no space is available on top of it.

This patch fixes it by requesting bottomup mapping as the non
generic hugetlb_get_unmapped_area() does

Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ")
Signed-off-by: Christophe Leroy 
---
  v3: Was a standalone patch before, but conflicts with this series.

  arch/powerpc/mm/hugetlbpage.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c 
b/arch/powerpc/mm/hugetlbpage.c

index 79e1378ee303..368ea6b248ad 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct 
file *file, unsigned long addr,

  return radix__hugetlb_get_unmapped_area(file, addr, len,
 pgoff, flags);
  #endif
-    return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+    return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
  }
  #endif


Why make this change also for PPC64? Can you do this #ifdef 8xx? You 
can ideally move hugetlb_get_unmapped_area to slice.h and then make 
this much simpler for 8xx?




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I 
suggested in my last email on this subject (22/01/2018 9:22) ?



yes. The test ran fine for me


You tried with 0x3000, it works as well on PPC32.

I'd really like you to try with 0x1100 which is in the same slice as 
the 1002-1003 range.


Christophe



kvaneesh@ltctulc6a-p1:[~]$  HUGETLB_MORECORE=yes HUGETLB_MORECORE_HEAPBASE=0x3000 ./a.out

1000-1001 r-xp  fc:00 9044312 /home/kvaneesh/a.out
1001-1002 r--p  fc:00 9044312 /home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 /home/kvaneesh/a.out
3000-3300 rw-p  00:0d 1062697 /anon_hugepage (deleted)
3300-3500 rw-p 0300 00:0d 1062698 /anon_hugepage (deleted)
3500-3700 rw-p 0500 00:0d 1062699 /anon_hugepage (deleted)
77d6-77f1 r-xp  fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 /lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp  fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 /usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp  00:00 0 [vdso]
77fa-77fe r-xp  fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 /lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p  00:00 0 [stack]




Before doing anything specific to the PPC32/8xx, I'd like to be sure 
the issue is definitely only on PPC32.




I am not sure I understand the problem correctly. If there is a free 
space in the required range, both topdown/bottomup search should be able 
to find it. Unless topdown found another free area suitable for hugetlb 
allocation above. My take is we should not change the topdown to 
bottomup without really understanding the failure scenarios.


-aneesh


Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Aneesh Kumar K.V



On 01/24/2018 02:57 PM, Christophe LEROY wrote:



On 24/01/2018 at 10:15, Aneesh Kumar K.V wrote:



On 01/24/2018 02:32 PM, Christophe Leroy wrote:

An application running with libhugetlbfs fails to allocate
additional pages to HEAP due to the hugemap being done
unconditionally as topdown mapping:

mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8

[...]
mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8

munmap(0x73d8, 1048576) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0

munmap(0x73d0, 1572864) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0

munmap(0x73d0, 1572864) = 0
[...]

As one can see from the above strace log, mmap() allocates further
pages below the initial one because no space is available on top of it.

This patch fixes it by requesting bottomup mapping as the non
generic hugetlb_get_unmapped_area() does

Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ")
Signed-off-by: Christophe Leroy 
---
  v3: Was a standalone patch before, but conflicts with this series.

  arch/powerpc/mm/hugetlbpage.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c 
b/arch/powerpc/mm/hugetlbpage.c

index 79e1378ee303..368ea6b248ad 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct 
file *file, unsigned long addr,

  return radix__hugetlb_get_unmapped_area(file, addr, len,
 pgoff, flags);
  #endif
-    return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+    return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
  }
  #endif


Why make this change also for PPC64? Can you do this #ifdef 8xx? You 
can ideally move hugetlb_get_unmapped_area to slice.h and then make 
this much simpler for 8xx?




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I 
suggested in my last email on this subject (22/01/2018 9:22) ?



yes. The test ran fine for me

kvaneesh@ltctulc6a-p1:[~]$  HUGETLB_MORECORE=yes 
HUGETLB_MORECORE_HEAPBASE=0x3000 ./a.out
1000-1001 r-xp  fc:00 9044312 
/home/kvaneesh/a.out
1001-1002 r--p  fc:00 9044312 
/home/kvaneesh/a.out
1002-1003 rw-p 0001 fc:00 9044312 
/home/kvaneesh/a.out
3000-3300 rw-p  00:0d 1062697 
/anon_hugepage (deleted)
3300-3500 rw-p 0300 00:0d 1062698 
/anon_hugepage (deleted)
3500-3700 rw-p 0500 00:0d 1062699 
/anon_hugepage (deleted)
77d6-77f1 r-xp  fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f1-77f2 r--p 001a fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f2-77f3 rw-p 001b fc:00 9250090 
/lib/powerpc64le-linux-gnu/libc-2.23.so
77f4-77f6 r-xp  fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f6-77f7 r--p 0001 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f7-77f8 rw-p 0002 fc:00 10754812 
/usr/lib/libhugetlbfs.so.0
77f8-77fa r-xp  00:00 0 
[vdso]
77fa-77fe r-xp  fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77fe-77ff r--p 0003 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
77ff-7800 rw-p 0004 fc:00 9250107 
/lib/powerpc64le-linux-gnu/ld-2.23.so
7ffd-8000 rw-p  00:00 0 
[stack]





Before doing anything specific to the PPC32/8xx, I'd like to be sure the 
issue is definitely only on PPC32.




I am not sure I understand the problem correctly. If there is a free 
space in the required range, both topdown/bottomup search should be able 
to find it. Unless topdown found another free area suitable for hugetlb 
allocation above. My take is we should not change the topdown to 
bottomup without really understanding the failure scenarios.


-aneesh



Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Christophe LEROY



On 24/01/2018 at 10:15, Aneesh Kumar K.V wrote:



On 01/24/2018 02:32 PM, Christophe Leroy wrote:

An application running with libhugetlbfs fails to allocate
additional pages to HEAP due to the hugemap being done
unconditionally as topdown mapping:

mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8

[...]
mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8

munmap(0x73d8, 1048576) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0

munmap(0x73d0, 1572864) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0

munmap(0x73d0, 1572864) = 0
[...]

As one can see from the above strace log, mmap() allocates further
pages below the initial one because no space is available on top of it.

This patch fixes it by requesting bottomup mapping as the non
generic hugetlb_get_unmapped_area() does

Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ")
Signed-off-by: Christophe Leroy 
---
  v3: Was a standalone patch before, but conflicts with this series.

  arch/powerpc/mm/hugetlbpage.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c 
b/arch/powerpc/mm/hugetlbpage.c

index 79e1378ee303..368ea6b248ad 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct 
file *file, unsigned long addr,

  return radix__hugetlb_get_unmapped_area(file, addr, len,
 pgoff, flags);
  #endif
-    return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+    return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
  }
  #endif


Why make this change also for PPC64? Can you do this #ifdef 8xx? You can 
ideally move hugetlb_get_unmapped_area to slice.h and then make this 
much simpler for 8xx?




Did you try with HUGETLB_MORECORE_HEAPBASE=0x1100 on PPC64 as I 
suggested in my last email on this subject (22/01/2018 9:22) ?


Before doing anything specific to the PPC32/8xx, I'd like to be sure the 
issue is definitely only on PPC32.


Thanks,
Christophe



Re: [PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Aneesh Kumar K.V



On 01/24/2018 02:32 PM, Christophe Leroy wrote:

An application running with libhugetlbfs fails to allocate
additional pages to HEAP due to the hugemap being done
unconditionally as topdown mapping:

mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8
[...]
mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8
munmap(0x73d8, 1048576) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0
munmap(0x73d0, 1572864) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0
munmap(0x73d0, 1572864) = 0
[...]

As one can see from the above strace log, mmap() allocates further
pages below the initial one because no space is available on top of it.

This patch fixes it by requesting bottomup mapping as the non
generic hugetlb_get_unmapped_area() does

Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ")
Signed-off-by: Christophe Leroy 
---
  v3: Was a standalone patch before, but conflicts with this series.

  arch/powerpc/mm/hugetlbpage.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 79e1378ee303..368ea6b248ad 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return radix__hugetlb_get_unmapped_area(file, addr, len,
   pgoff, flags);
  #endif
-   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
  }
  #endif


Why make this change also for PPC64? Can you do this #ifdef 8xx? You can 
ideally move hugetlb_get_unmapped_area to slice.h and then make this 
much simpler for 8xx?


-aneesh



[PATCH v3 5/5] powerpc/mm: Fix growth direction for hugepages mmaps with slice

2018-01-24 Thread Christophe Leroy
An application running with libhugetlbfs fails to allocate
additional pages to HEAP due to the hugemap being done
unconditionally as topdown mapping:

mmap(0x1008, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x73e8
[...]
mmap(0x7400, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d8
munmap(0x73d8, 1048576) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0
munmap(0x73d0, 1572864) = 0
[...]
mmap(0x7400, 1572864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0x18) = 0x73d0
munmap(0x73d0, 1572864) = 0
[...]

As one can see from the above strace log, mmap() allocates further
pages below the initial one because no space is available on top of it.

This patch fixes it by requesting bottomup mapping as the non
generic hugetlb_get_unmapped_area() does

Fixes: d0f13e3c20b6f ("[POWERPC] Introduce address space "slices" ")
Signed-off-by: Christophe Leroy 
---
 v3: Was a standalone patch before, but conflicts with this series.

 arch/powerpc/mm/hugetlbpage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 79e1378ee303..368ea6b248ad 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -558,7 +558,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return radix__hugetlb_get_unmapped_area(file, addr, len,
   pgoff, flags);
 #endif
-   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1);
+   return slice_get_unmapped_area(addr, len, flags, mmu_psize, 0);
 }
 #endif
 
-- 
2.13.3
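
To make the growth-direction problem in this patch easier to picture, here
is a toy user-space model of the two search directions. It is only a sketch
under stated assumptions: struct range, find_bottomup() and find_topdown()
are hypothetical names, the used ranges must be sorted, non-overlapping and
inside [lo, hi), and 0 is used as the failure value (so lo must be non-zero).
It does not reproduce slice_get_unmapped_area().

```c
#include <assert.h>
#include <stddef.h>

struct range {
	unsigned long start, end;	/* half-open interval [start, end) */
};

/* Lowest free address in [lo, hi) that fits len bytes, or 0 if none. */
unsigned long find_bottomup(const struct range *used, size_t n,
			    unsigned long lo, unsigned long hi,
			    unsigned long len)
{
	unsigned long cur = lo;
	size_t i;

	assert(lo != 0 && lo < hi);
	for (i = 0; i < n; i++) {
		if (used[i].start - cur >= len)
			return cur;	/* gap below used[i] is big enough */
		cur = used[i].end;
	}
	return (hi - cur >= len) ? cur : 0;
}

/* Highest free address in [lo, hi) that fits len bytes, or 0 if none. */
unsigned long find_topdown(const struct range *used, size_t n,
			   unsigned long lo, unsigned long hi,
			   unsigned long len)
{
	unsigned long cur = hi;
	size_t i;

	assert(lo != 0 && lo < hi);
	for (i = n; i-- > 0; ) {
		if (cur - used[i].end >= len)
			return cur - len; /* gap above used[i] is big enough */
		cur = used[i].start;
	}
	return (cur - lo >= len) ? cur - len : 0;
}
```

With a program image at the bottom and libraries at the top, the topdown
result sits directly below the libraries, so a heap placed there has no room
to grow upward. The bottomup result sits just above the program image with
the rest of the gap free above it.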



[PATCH v3 4/5] powerpc/mm: Allow up to 64 low slices

2018-01-24 Thread Christophe Leroy
While the implementation of the "slices" address space allows
a significant amount of high slices, it limits the number of
low slices to 16 due to the use of a single u64 low_slices_psize
element in struct mm_context_t

On the 8xx, the minimum slice size is the size of the area
covered by a single PMD entry, ie 4M in 4K pages mode and 64M in
16K pages mode. This means we could have at least 64 slices.

To overcome this limitation, this patch switches the handling of
low_slices_psize to a char array, as already done for
high_slices_psize. This allows increasing the number of low
slices to 64 on the 8xx.
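
The arithmetic behind the limit: each slice needs a 4-bit page size
encoding, so a single u64 holds 16 of them, while a char array holds two per
byte. The sketch below shows nibble-packed access in plain C. It assumes
64M slices (SLICE_LOW_SHIFT of 26) purely for illustration, and psize_get()
and psize_set() are hypothetical helpers, not the functions in
arch/powerpc/mm/slice.c.

```c
#include <assert.h>
#include <stdint.h>

#define SLICE_LOW_SHIFT		26	/* assumption: 64M slices */
#define SLICE_NUM_LOW		(1u << (32 - SLICE_LOW_SHIFT))		/* 64 slices */
#define SLICE_ARRAY_SIZE	(1u << (32 - SLICE_LOW_SHIFT - 1))	/* 32 bytes */

/* Index of the low slice containing a 32-bit address. */
unsigned int low_slice_index(uint32_t addr)
{
	return addr >> SLICE_LOW_SHIFT;
}

/* Read the 4-bit page size encoding of the slice containing addr. */
unsigned int psize_get(const unsigned char *psizes, uint32_t addr)
{
	unsigned int idx = low_slice_index(addr);

	assert(idx < SLICE_NUM_LOW);
	/* Even slices use the low nibble, odd slices the high nibble. */
	return (psizes[idx >> 1] >> ((idx & 1) * 4)) & 0xf;
}

/* Store a 4-bit page size encoding for the slice containing addr. */
void psize_set(unsigned char *psizes, uint32_t addr, unsigned int psize)
{
	unsigned int idx = low_slice_index(addr);
	unsigned int shift = (idx & 1) * 4;

	assert(idx < SLICE_NUM_LOW && psize <= 0xf);
	psizes[idx >> 1] = (unsigned char)((psizes[idx >> 1] & ~(0xfu << shift)) |
					   (psize << shift));
}
```

With these assumed values the array is 32 bytes for 64 slices, which matches
the SLICE_ARRAY_SIZE expression used in the mmu-8xx.h hunk of this patch.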

Signed-off-by: Christophe Leroy 
---
 v2: Using slice_bitmap_xxx() macros instead of bitmap_xxx() functions.
 v3: keep low_slices as a u64, this allows 64 slices which is enough.
 
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 +-
 arch/powerpc/include/asm/mmu-8xx.h   |  7 +++-
 arch/powerpc/include/asm/paca.h  |  2 +-
 arch/powerpc/include/asm/slice.h |  1 -
 arch/powerpc/include/asm/slice_32.h  |  2 ++
 arch/powerpc/include/asm/slice_64.h  |  2 ++
 arch/powerpc/kernel/paca.c   |  3 +-
 arch/powerpc/mm/hash_utils_64.c  | 13 
 arch/powerpc/mm/slb_low.S|  8 +++--
 arch/powerpc/mm/slice.c  | 57 +---
 10 files changed, 56 insertions(+), 42 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index c9448e19847a..b076a2d74c69 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -91,7 +91,8 @@ typedef struct {
struct npu_context *npu_context;
 
 #ifdef CONFIG_PPC_MM_SLICES
-   u64 low_slices_psize;   /* SLB page size encodings */
+/* SLB page size encodings*/
+   unsigned char low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE];
unsigned char high_slices_psize[SLICE_ARRAY_SIZE];
unsigned long slb_addr_limit;
 #else
diff --git a/arch/powerpc/include/asm/mmu-8xx.h 
b/arch/powerpc/include/asm/mmu-8xx.h
index 5f89b6010453..5f37ba06b56c 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -164,6 +164,11 @@
  */
 #define SPRN_M_TW  799
 
+#ifdef CONFIG_PPC_MM_SLICES
+#include 
+#define SLICE_ARRAY_SIZE   (1 << (32 - SLICE_LOW_SHIFT - 1))
+#endif
+
 #ifndef __ASSEMBLY__
 typedef struct {
unsigned int id;
@@ -171,7 +176,7 @@ typedef struct {
unsigned long vdso_base;
 #ifdef CONFIG_PPC_MM_SLICES
u16 user_psize; /* page size index */
-   u64 low_slices_psize;   /* page size encodings */
+   unsigned char low_slices_psize[SLICE_ARRAY_SIZE];
unsigned char high_slices_psize[0];
unsigned long slb_addr_limit;
 #endif
diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 23ac7fc0af23..a3e531fe9ac7 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -141,7 +141,7 @@ struct paca_struct {
 #ifdef CONFIG_PPC_BOOK3S
mm_context_id_t mm_ctx_id;
 #ifdef CONFIG_PPC_MM_SLICES
-   u64 mm_ctx_low_slices_psize;
+   unsigned char mm_ctx_low_slices_psize[BITS_PER_LONG / BITS_PER_BYTE];
unsigned char mm_ctx_high_slices_psize[SLICE_ARRAY_SIZE];
unsigned long mm_ctx_slb_addr_limit;
 #else
diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h
index 2b4b70de7e71..b67ba8faa507 100644
--- a/arch/powerpc/include/asm/slice.h
+++ b/arch/powerpc/include/asm/slice.h
@@ -16,7 +16,6 @@
 #define HAVE_ARCH_UNMAPPED_AREA
 #define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
 
-#define SLICE_LOW_SHIFT28
 #define SLICE_LOW_TOP  (0x1ull)
 #define SLICE_NUM_LOW  (SLICE_LOW_TOP >> SLICE_LOW_SHIFT)
 #define GET_LOW_SLICE_INDEX(addr)  ((addr) >> SLICE_LOW_SHIFT)
diff --git a/arch/powerpc/include/asm/slice_32.h 
b/arch/powerpc/include/asm/slice_32.h
index 7e27c0dfb913..349187c20100 100644
--- a/arch/powerpc/include/asm/slice_32.h
+++ b/arch/powerpc/include/asm/slice_32.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_SLICE_32_H
 #define _ASM_POWERPC_SLICE_32_H
 
+#define SLICE_LOW_SHIFT26  /* 64 slices */
+
 #define SLICE_HIGH_SHIFT   0
 #define SLICE_NUM_HIGH 0ul
 #define GET_HIGH_SLICE_INDEX(addr) (addr & 0)
diff --git a/arch/powerpc/include/asm/slice_64.h 
b/arch/powerpc/include/asm/slice_64.h
index 9d1c97b83010..0959475239c6 100644
--- a/arch/powerpc/include/asm/slice_64.h
+++ b/arch/powerpc/include/asm/slice_64.h
@@ -2,6 +2,8 @@
 #ifndef _ASM_POWERPC_SLICE_64_H
 #define _ASM_POWERPC_SLICE_64_H
 
+#define SLICE_LOW_SHIFT28
+
 #define SLICE_HIGH_SHIFT   40
 #define SLICE_NUM_HIGH (H_PGTABLE_RANGE >> SLICE_HIGH_SHIFT)
 #define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
diff --git a/arch/powerpc/kernel/paca.c b/arch/powerpc/kernel/paca.c
index 

[PATCH v3 3/5] powerpc/32: Fix hugepage allocation on 8xx at hint address

2018-01-24 Thread Christophe Leroy
On the 8xx, the page size is set in the PMD entry and applies to
all pages of the page table pointed to by that PMD entry.

When an app has some regular pages allocated (e.g. see below) and tries
to mmap() a huge page at a hint address covered by the same PMD entry,
the kernel accepts the hint although the 8xx cannot handle different
page sizes in the same PMD entry.

1000-10001000 r-xp  00:0f 2597 /root/malloc
1001-10011000 rwxp  00:0f 2597 /root/malloc

mmap(0x1008, 524288, PROT_READ|PROT_WRITE,
 MAP_PRIVATE|MAP_ANONYMOUS|0x4, -1, 0) = 0x1008

This results in the app remaining forever in do_page_fault()/hugetlb_fault(),
and when interrupting that app, we get the following warning:

[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 
hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W   4.14.6 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W(4.14.6)
[162980.035854] MSR:  00029032   CR: 24044224 XER: 2000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 0010 c6869410 1008 
 77fb4000
[162980.036003] GPR08: 0001 0683c001  ff80 44028228 10018a34 
4008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c c668fe78 c668e000 
c6835ba0 c668fd48
[162980.036003] GPR24:  73ff 7400 0001 77fb4000 100f 
1010 1010
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 8137 54a3463a 80890020 7d24182e 7c841a14 712a0004 
4082ff94
[162980.038014] 2f89 419e0010 712a0ff0 408200e0 <0fe0> 54a9000a 
7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1

In order to fix this, this patch uses the address space "slices"
implemented for BOOK3S/64 and enhanced to support PPC32 by the
preceding patch.

This patch modifies the context.id on the 8xx to be in the range
[1:16] instead of [0:15] in order to identify context.id == 0 as
an uninitialised context, as done on BOOK3S.

This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
selected for the 8xx.

Although we could in theory have as many slices as PMD entries, the
current slices implementation limits the number of low slices to 16.
This limitation does not prevent us from fixing the initial issue, although
it is suboptimal. It will be cured in a subsequent patch.

Fixes: 4b91428699477 ("powerpc/8xx: Implement support of hugepages")
Signed-off-by: Christophe Leroy 
---
 v2: First patch of v1 series split in two parts
 v3: No changes

 arch/powerpc/include/asm/mmu-8xx.h |  6 ++
 arch/powerpc/kernel/setup-common.c |  2 ++
 arch/powerpc/mm/8xx_mmu.c  |  2 +-
 arch/powerpc/mm/hugetlbpage.c  |  2 ++
 arch/powerpc/mm/mmu_context_nohash.c   | 18 --
 arch/powerpc/platforms/Kconfig.cputype |  1 +
 6 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-8xx.h 
b/arch/powerpc/include/asm/mmu-8xx.h
index 5bb3dbede41a..5f89b6010453 100644
--- a/arch/powerpc/include/asm/mmu-8xx.h
+++ b/arch/powerpc/include/asm/mmu-8xx.h
@@ -169,6 +169,12 @@ typedef struct {
unsigned int id;
unsigned int active;
unsigned long vdso_base;
+#ifdef CONFIG_PPC_MM_SLICES
+   u16 user_psize; /* page size index */
+   u64 low_slices_psize;   /* page size encodings */
+   unsigned char high_slices_psize[0];
+   unsigned long slb_addr_limit;
+#endif
 } mm_context_t;
 
 #define PHYS_IMMR_BASE (mfspr(SPRN_IMMR) & 0xfff8)
diff --git a/arch/powerpc/kernel/setup-common.c 
b/arch/powerpc/kernel/setup-common.c
index 8fd3a70047f1..edf98ea92035 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -916,6 +916,8 @@ void __init setup_arch(char **cmdline_p)
 #ifdef CONFIG_PPC64
if (!radix_enabled())
init_mm.context.slb_addr_limit = 

[PATCH v3 2/5] powerpc/mm: Enhance 'slice' for supporting PPC32

2018-01-24 Thread Christophe Leroy
In preparation for the following patch, which will fix an issue on
the 8xx by re-using the 'slices', this patch enhances the
'slices' implementation to support 32-bit CPUs.

On PPC32, the address space is limited to 4Gbytes, hence only the low
slices will be used.

This patch moves the "slices" function prototypes from page_64.h to slice.h.

The high slices use bitmaps. As the bitmap functions are not prepared to
handle bitmaps of size 0, the bitmap_xxx() calls are wrapped into
slice_bitmap_xxx() functions, which are void (do nothing) on PPC32.
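
The wrapping idea can be sketched as follows. This is a userspace
illustration with hypothetical demo32_/demo64_ names, not the helpers
added to slice_32.h and slice_64.h: the 64-bit flavour operates on a real
bitmap, while the 32-bit flavour has an empty body because there are no
high slices to track.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* 64-bit flavour: a real bitmap, set one bit at a time for clarity. */
static inline void demo64_slice_bitmap_set(unsigned long *map, size_t start,
					    size_t nbits)
{
	size_t i;

	assert(map);
	for (i = start; i < start + nbits; i++)
		map[i / DEMO_BITS_PER_LONG] |= 1UL << (i % DEMO_BITS_PER_LONG);
}

/* 32-bit flavour: no high slices, so the call does nothing at all. */
static inline void demo32_slice_bitmap_set(unsigned long *map, size_t start,
					    size_t nbits)
{
	(void)map;
	(void)start;
	(void)nbits;
}
```

Callers in common code can then use a single name regardless of platform,
and a zero-sized bitmap never reaches the generic bitmap functions.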

Signed-off-by: Christophe Leroy 
---
 v2: First patch of v1 series split in two parts; added slice_bitmap_xxx() macros.
 v3: Moved slice related stuff into slice.h and slice_32/64.h
     slice_bitmap_xxx() are now static inline functions and platform dependent
     SLICE_LOW_TOP declared ull on PPC32; with correct casts this allows keeping it at 0x1

 arch/powerpc/include/asm/page.h |  1 +
 arch/powerpc/include/asm/page_64.h  | 59 --
 arch/powerpc/include/asm/slice.h| 63 +
 arch/powerpc/include/asm/slice_32.h | 56 +
 arch/powerpc/include/asm/slice_64.h | 61 +++
 arch/powerpc/mm/slice.c | 38 --
 6 files changed, 203 insertions(+), 75 deletions(-)
 create mode 100644 arch/powerpc/include/asm/slice.h
 create mode 100644 arch/powerpc/include/asm/slice_32.h
 create mode 100644 arch/powerpc/include/asm/slice_64.h

diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index 8da5d4c1cab2..d5f1c41b7dba 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -344,5 +344,6 @@ typedef struct page *pgtable_t;
 
 #include 
 #endif /* __ASSEMBLY__ */
+#include 
 
 #endif /* _ASM_POWERPC_PAGE_H */
diff --git a/arch/powerpc/include/asm/page_64.h 
b/arch/powerpc/include/asm/page_64.h
index 56234c6fcd61..af04acdb873f 100644
--- a/arch/powerpc/include/asm/page_64.h
+++ b/arch/powerpc/include/asm/page_64.h
@@ -86,65 +86,6 @@ extern u64 ppc64_pft_size;
 
 #endif /* __ASSEMBLY__ */
 
-#ifdef CONFIG_PPC_MM_SLICES
-
-#define SLICE_LOW_SHIFT28
-#define SLICE_HIGH_SHIFT   40
-
-#define SLICE_LOW_TOP  (0x1ul)
-#define SLICE_NUM_LOW  (SLICE_LOW_TOP >> SLICE_LOW_SHIFT)
-#define SLICE_NUM_HIGH (H_PGTABLE_RANGE >> SLICE_HIGH_SHIFT)
-
-#define GET_LOW_SLICE_INDEX(addr)  ((addr) >> SLICE_LOW_SHIFT)
-#define GET_HIGH_SLICE_INDEX(addr) ((addr) >> SLICE_HIGH_SHIFT)
-
-#ifndef __ASSEMBLY__
-struct mm_struct;
-
-extern unsigned long slice_get_unmapped_area(unsigned long addr,
-unsigned long len,
-unsigned long flags,
-unsigned int psize,
-int topdown);
-
-extern unsigned int get_slice_psize(struct mm_struct *mm,
-   unsigned long addr);
-
-extern void slice_set_user_psize(struct mm_struct *mm, unsigned int psize);
-extern void slice_set_range_psize(struct mm_struct *mm, unsigned long start,
- unsigned long len, unsigned int psize);
-
-#endif /* __ASSEMBLY__ */
-#else
-#define slice_init()
-#ifdef CONFIG_PPC_BOOK3S_64
-#define get_slice_psize(mm, addr)  ((mm)->context.user_psize)
-#define slice_set_user_psize(mm, psize)\
-do {   \
-   (mm)->context.user_psize = (psize); \
-   (mm)->context.sllp = SLB_VSID_USER | mmu_psize_defs[(psize)].sllp; \
-} while (0)
-#else /* !CONFIG_PPC_BOOK3S_64 */
-#ifdef CONFIG_PPC_64K_PAGES
-#define get_slice_psize(mm, addr)  MMU_PAGE_64K
-#else /* CONFIG_PPC_64K_PAGES */
-#define get_slice_psize(mm, addr)  MMU_PAGE_4K
-#endif /* !CONFIG_PPC_64K_PAGES */
-#define slice_set_user_psize(mm, psize)do { BUG(); } while(0)
-#endif /* CONFIG_PPC_BOOK3S_64 */
-
-#define slice_set_range_psize(mm, start, len, psize)   \
-   slice_set_user_psize((mm), (psize))
-#endif /* CONFIG_PPC_MM_SLICES */
-
-#ifdef CONFIG_HUGETLB_PAGE
-
-#ifdef CONFIG_PPC_MM_SLICES
-#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
-#endif
-
-#endif /* !CONFIG_HUGETLB_PAGE */
-
 #define VM_DATA_DEFAULT_FLAGS \
(is_32bit_task() ? \
 VM_DATA_DEFAULT_FLAGS32 : VM_DATA_DEFAULT_FLAGS64)
diff --git a/arch/powerpc/include/asm/slice.h b/arch/powerpc/include/asm/slice.h
new file mode 100644
index ..2b4b70de7e71
--- /dev/null
+++ b/arch/powerpc/include/asm/slice.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_SLICE_H
+#define _ASM_POWERPC_SLICE_H
+
+#ifdef CONFIG_PPC_MM_SLICES
+
+#ifdef CONFIG_PPC64
+#include 
+#else
+#include 
+#endif
+
+#ifdef CONFIG_HUGETLB_PAGE
+#define HAVE_ARCH_HUGETLB_UNMAPPED_AREA
+#endif
+#define HAVE_ARCH_UNMAPPED_AREA
+#define 

[PATCH v3 1/5] powerpc/mm: Remove intermediate bitmap copy in 'slices'

2018-01-24 Thread Christophe Leroy
bitmap_or() and bitmap_andnot() work properly with dst identical
to src1 or src2. There is no need for an intermediate result bitmap
that is copied back to dst in a second step.
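
To see why the aliasing is harmless, consider a simplified userspace
stand-in for the two operations (the demo_* functions below are
illustrative sketches, not the kernel's implementation): each destination
word is computed from the source words at the same index before being
stored, so dst may be the same array as src1 or src2.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define DEMO_BITS_TO_LONGS(n) \
	(((n) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* dst = src1 | src2, word by word; dst may alias src1 or src2. */
static void demo_bitmap_or(unsigned long *dst, const unsigned long *src1,
			   const unsigned long *src2, size_t nbits)
{
	size_t i;

	assert(dst && src1 && src2);
	for (i = 0; i < DEMO_BITS_TO_LONGS(nbits); i++)
		dst[i] = src1[i] | src2[i];
}

/* dst = src1 & ~src2, word by word; dst may alias src1 or src2. */
static void demo_bitmap_andnot(unsigned long *dst, const unsigned long *src1,
			       const unsigned long *src2, size_t nbits)
{
	size_t i;

	assert(dst && src1 && src2);
	for (i = 0; i < DEMO_BITS_TO_LONGS(nbits); i++)
		dst[i] = src1[i] & ~src2[i];
}
```

Calling demo_bitmap_or(a, a, b, nbits) therefore updates a in place, which
is the calling pattern this patch adopts for dst->high_slices.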

Signed-off-by: Christophe Leroy 
---
 v2: New in v2
 v3: patch moved to the front of the series to avoid an ephemeral
     slice_bitmap_copy() function in the following patch

 arch/powerpc/mm/slice.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/mm/slice.c b/arch/powerpc/mm/slice.c
index 23ec2c5e3b78..98b53d48968f 100644
--- a/arch/powerpc/mm/slice.c
+++ b/arch/powerpc/mm/slice.c
@@ -388,21 +388,17 @@ static unsigned long slice_find_area(struct mm_struct 
*mm, unsigned long len,
 
 static inline void slice_or_mask(struct slice_mask *dst, struct slice_mask 
*src)
 {
-   DECLARE_BITMAP(result, SLICE_NUM_HIGH);
-
dst->low_slices |= src->low_slices;
-   bitmap_or(result, dst->high_slices, src->high_slices, SLICE_NUM_HIGH);
-   bitmap_copy(dst->high_slices, result, SLICE_NUM_HIGH);
+   bitmap_or(dst->high_slices, dst->high_slices, src->high_slices,
+ SLICE_NUM_HIGH);
 }
 
 static inline void slice_andnot_mask(struct slice_mask *dst, struct slice_mask 
*src)
 {
-   DECLARE_BITMAP(result, SLICE_NUM_HIGH);
-
dst->low_slices &= ~src->low_slices;
 
-   bitmap_andnot(result, dst->high_slices, src->high_slices, 
SLICE_NUM_HIGH);
-   bitmap_copy(dst->high_slices, result, SLICE_NUM_HIGH);
+   bitmap_andnot(dst->high_slices, dst->high_slices, src->high_slices,
+ SLICE_NUM_HIGH);
 }
 
 #ifdef CONFIG_PPC_64K_PAGES
-- 
2.13.3