Re: [PATCH] ftrace: filter: Match dot symbols when searching functions on ppc64.

2016-04-14 Thread Michael Ellerman
On Fri, 2016-04-01 at 00:22 -0300, Thiago Jung Bauermann wrote:

> In the ppc64 big endian ABI, function symbols point to function
> descriptors. The symbols which point to the function entry points
> have a dot in front of the function name. Consequently, when the
> ftrace filter mechanism searches for the symbol corresponding to
> an entry point address, it gets the dot symbol.
> 
> As a result, ftrace filter users have to be aware of this ABI detail on
> ppc64 and prepend a dot to the function name when setting the filter.
> 
> The perf probe command insulates the user from this by ignoring the dot
> in front of the symbol name when matching function names to symbols,
> but the sysfs interface does not. This patch makes the ftrace filter
> mechanism do the same when searching symbols.
> 
> Fixes the following failure in ftracetest's kprobe_ftrace.tc:
> 
>   .../kprobe_ftrace.tc: line 9: echo: write error: Invalid argument
> 
> That failure is on this line of kprobe_ftrace.tc:
> 
>   echo _do_fork > set_ftrace_filter
> 
> This is because there's no _do_fork entry in the functions list:
> 
>   # cat available_filter_functions | grep _do_fork
>   ._do_fork
> 
> This change introduces no regressions on the perf and ftracetest
> testsuite results.
> 
> Cc: Steven Rostedt 
> Cc: Ingo Molnar 
> Cc: Michael Ellerman 
> Signed-off-by: Thiago Jung Bauermann 
> ---
>  arch/powerpc/include/asm/ftrace.h |  9 +
>  kernel/trace/ftrace.c | 13 +
>  2 files changed, 22 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/ftrace.h 
> b/arch/powerpc/include/asm/ftrace.h
> index 50ca7585abe2..68f1858796c6 100644
> --- a/arch/powerpc/include/asm/ftrace.h
> +++ b/arch/powerpc/include/asm/ftrace.h
> @@ -58,6 +58,15 @@ struct dyn_arch_ftrace {
>   struct module *mod;
>  };
>  #endif /*  CONFIG_DYNAMIC_FTRACE */
> +
> +#if CONFIG_PPC64 && (!defined(_CALL_ELF) || _CALL_ELF != 2)
> +#define ARCH_HAS_FTRACE_MATCH_ADJUST

I *think* the consenus these days is that we don't add ARCH_HAS #defines in
headers. Instead you should either:
 - add a CONFIG_HAVE_ARCH_FOO and use that
 - use the #define foo foo trick

The latter being that you do:

static inline void arch_ftrace_match_adjust(char **str, char *search)
{
...
}
#define arch_ftrace_match_adjust arch_ftrace_match_adjust

And in ftrace.c:

#ifndef arch_ftrace_match_adjust
static inline void arch_ftrace_match_adjust(char **str, char *search) {}
#endif


Presumably Steve will have a preference for which style you use.

> +static inline void arch_ftrace_match_adjust(char **str, char *search)
> +{
> + if ((*str)[0] == '.' && search[0] != '.')
> + (*str)++;
> +}
> +#endif /* CONFIG_PPC64 && (!defined(_CALL_ELF) || _CALL_ELF != 2) */
>  #endif /* __ASSEMBLY__ */

It seems unfortunate that we need to introduce yet another mechanism to deal
with dot symbols. But I guess none of the existing mechanisms work, eg.
kprobe_lookup_name().


>  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> index b1870fbd2b67..e806c2a3b7a8 100644
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -3444,11 +3444,24 @@ struct ftrace_glob {
>   int type;
>  };
>  
> +#ifndef ARCH_HAS_FTRACE_MATCH_ADJUST
> +/*
> + * If symbols in an architecture don't correspond exactly to the user-visible
> + * name of what they represent, it is possible to define this function to
> + * perform the necessary adjustments.
> +*/
> +static inline void arch_ftrace_match_adjust(char **str, char *search)
> +{
> +}
> +#endif
> +
>  static int ftrace_match(char *str, struct ftrace_glob *g)
>  {
>   int matched = 0;
>   int slen;
>  
> + arch_ftrace_match_adjust(&str, g->search);

I think this would less magical if it didn't modify str directly, instead doing:

str = arch_ftrace_match_adjust(str, g->search);

And arch_ftrace_match_adjust() would return the adjusted char *.

That would mean the generic version would need to return str rather than being
empty.

cheers

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] powerpc: introduce {cmp}xchg for u8 and u16

2016-04-14 Thread Pan Xinhui
Hello, Waiman

On 2016年04月13日 23:53, Waiman Long wrote:
> On 04/13/2016 07:15 AM, Pan Xinhui wrote:
>> Hello Peter,
>>
>> On 2016年04月12日 22:30, Peter Zijlstra wrote:
>>
 I am working on the qspinlock implementation on PPC.
 Your and Waiman's patches are so nice. :)
>>> Thanks!, last time I looked at PPC spinlocks they could not use things
>>> like ticket locks because PPC might be a guest and fairness blows etc..
>>>
>>> You're making the qspinlock-paravirt thing work on PPC, or doing
>>> qspinlock only for bare-metal PPC?
>>>
>> I am making the both work. :)
>> qspinlock works on PPC now. I am preparing the patches and will send them 
>> out in next weeks :)
> 
> What of performance improvement are you seeing in PPC?
> 
well, not good. I wrote one small benchmark which just increase a integer with 
spinlock hold.
the overhead of lock itself is high. But the fairness is good.
I just do the tests in guestOS, and the qspinlock does not make use of 
paravirt, but spinlock does. So the performance gap now is a little big.
looks like I need change the kernel config, and re-test.

I does not measure the system impact in real world now. Let's see the kernel 
build times with two different locks.
If possible, Could you share us how you do the performance test?

>> The paravirt work is a little hard.
>> currently, there are pv_wait() and pv_kick(). but only pv_kick has the 
>> parameter cpu(who will hold the lock as soon as the lock is unlocked).
>> We need parameter cpu(who holds the lock now) in pv_wait,too.
> 
> That can be doable to a certain extent. However, if the current lock holder 
> acquired the lock via the fastpath only. The CPU information is not logged 
> anywhere. For a contended lock, the information should be there.
> 
yes. Maybe we could use hashtable. We could put lock, lock holder in pv_node, 
too. :)
Just my thoughts.

thanks
xinhui

> Cheers, 
> Longman
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] ftrace: filter: Match dot symbols when searching functions on ppc64.

2016-04-14 Thread Steven Rostedt
On Thu, 14 Apr 2016 17:11:35 +1000
Michael Ellerman  wrote:
> > 
> > diff --git a/arch/powerpc/include/asm/ftrace.h 
> > b/arch/powerpc/include/asm/ftrace.h
> > index 50ca7585abe2..68f1858796c6 100644
> > --- a/arch/powerpc/include/asm/ftrace.h
> > +++ b/arch/powerpc/include/asm/ftrace.h
> > @@ -58,6 +58,15 @@ struct dyn_arch_ftrace {
> > struct module *mod;
> >  };
> >  #endif /*  CONFIG_DYNAMIC_FTRACE */
> > +
> > +#if CONFIG_PPC64 && (!defined(_CALL_ELF) || _CALL_ELF != 2)
> > +#define ARCH_HAS_FTRACE_MATCH_ADJUST  
> 
> I *think* the consenus these days is that we don't add ARCH_HAS #defines in
> headers. Instead you should either:
>  - add a CONFIG_HAVE_ARCH_FOO and use that
>  - use the #define foo foo trick
> 
> The latter being that you do:
> 
> static inline void arch_ftrace_match_adjust(char **str, char *search)
> {
>   ...
> }
> #define arch_ftrace_match_adjust arch_ftrace_match_adjust
> 
> And in ftrace.c:
> 
> #ifndef arch_ftrace_match_adjust
> static inline void arch_ftrace_match_adjust(char **str, char *search) {}
> #endif
> 
> 
> Presumably Steve will have a preference for which style you use.

Actually, what I usually do is simply make a "weak" stub function and
let the arch override it.

> 
> > +static inline void arch_ftrace_match_adjust(char **str, char *search)
> > +{
> > +   if ((*str)[0] == '.' && search[0] != '.')
> > +   (*str)++;
> > +}
> > +#endif /* CONFIG_PPC64 && (!defined(_CALL_ELF) || _CALL_ELF != 2) */
> >  #endif /* __ASSEMBLY__ */  
> 
> It seems unfortunate that we need to introduce yet another mechanism to deal
> with dot symbols. But I guess none of the existing mechanisms work, eg.
> kprobe_lookup_name().

What about making a generic conversion to function name, like:

arch_sane_function_name(str);

Where all sane archs simply return the string name and powerpc can
strip the '.' prefix. ;-)

Of course that would require:

sane_str = arch_sane_function(str);
sane_search = arch_sane_function(g->search);

and then compare sane_str with sane_search.

> 
> 
> >  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > index b1870fbd2b67..e806c2a3b7a8 100644
> > --- a/kernel/trace/ftrace.c
> > +++ b/kernel/trace/ftrace.c
> > @@ -3444,11 +3444,24 @@ struct ftrace_glob {
> > int type;
> >  };
> >  
> > +#ifndef ARCH_HAS_FTRACE_MATCH_ADJUST
> > +/*
> > + * If symbols in an architecture don't correspond exactly to the 
> > user-visible
> > + * name of what they represent, it is possible to define this function to
> > + * perform the necessary adjustments.
> > +*/
> > +static inline void arch_ftrace_match_adjust(char **str, char *search)
> > +{
> > +}
> > +#endif
> > +
> >  static int ftrace_match(char *str, struct ftrace_glob *g)
> >  {
> > int matched = 0;
> > int slen;
> >  
> > +   arch_ftrace_match_adjust(&str, g->search);  
> 
> I think this would less magical if it didn't modify str directly, instead 
> doing:
> 
>   str = arch_ftrace_match_adjust(str, g->search);
> 
> And arch_ftrace_match_adjust() would return the adjusted char *.
> 
> That would mean the generic version would need to return str rather than being
> empty.

I agree, because if we need to free the string passed it, the function
would modify the pointer and we wouldn't be able to free it. In those
cases we could do:

tmp_str = arch_ftrace_match_adjust(str, search);
/* use tmp_str and then ignore */
kfree(str);

** Disclaimer **

Note, I just took the red-eye (2 hours of sleep on the plane) and
waiting for my next flight. My focus may be off in this email.

-- Steve
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/5] livepatch: Allow architectures to specify an alternate ftrace location

2016-04-14 Thread Miroslav Benes
On Wed, 13 Apr 2016, Michael Ellerman wrote:

> When livepatch tries to patch a function it takes the function address
> and asks ftrace to install the livepatch handler at that location.
> ftrace will look for an mcount call site at that exact address.
> 
> On powerpc the mcount location is not the first instruction of the
> function, and in fact it's not at a constant offset from the start of
> the function. To accommodate this add a hook which arch code can
> override to customise the behaviour.
> 
> Signed-off-by: Torsten Duwe 
> Signed-off-by: Balbir Singh 
> Signed-off-by: Petr Mladek 
> Signed-off-by: Michael Ellerman 
> ---
>  kernel/livepatch/core.c | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> index d68fbf63b083..b0476bb30f92 100644
> --- a/kernel/livepatch/core.c
> +++ b/kernel/livepatch/core.c
> @@ -298,6 +298,19 @@ unlock:
>   rcu_read_unlock();
>  }
>  
> +/*
> + * Convert a function address into the appropriate ftrace location.
> + *
> + * Usually this is just the address of the function, but on some 
> architectures
> + * it's more complicated so allow them to provide a custom behaviour.
> + */
> +#ifndef klp_get_ftrace_location
> +static unsigned long klp_get_ftrace_location(unsigned long faddr)
> +{
> + return faddr;
> +}
> +#endif

Whoah, what an ugly hack :)

Note to my future self - This is what you want to do if you need a weak 
static inline function.

static inline is probably unnecessary here so __weak function would be 
enough. It would introduce powerpc-specific livepatch.c though because of 
it and this is not worth it.

>  static void klp_disable_func(struct klp_func *func)
>  {
>   struct klp_ops *ops;
> @@ -312,8 +325,14 @@ static void klp_disable_func(struct klp_func *func)
>   return;
>  
>   if (list_is_singular(&ops->func_stack)) {
> + unsigned long ftrace_loc;

This is a nit, but could you move the definition up to have them all in 
one place to be consistent with the rest of the code? The same applies to 
klp_enable_func() below.

> +
> + ftrace_loc = klp_get_ftrace_location(func->old_addr);
> + if (WARN_ON(!ftrace_loc))
> + return;
> +
>   WARN_ON(unregister_ftrace_function(&ops->fops));
> - WARN_ON(ftrace_set_filter_ip(&ops->fops, func->old_addr, 1, 0));
> + WARN_ON(ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0));
>  
>   list_del_rcu(&func->stack_node);
>   list_del(&ops->node);
> @@ -338,6 +357,15 @@ static int klp_enable_func(struct klp_func *func)
>  
>   ops = klp_find_ops(func->old_addr);
>   if (!ops) {
> + unsigned long ftrace_loc;

Here.

> +
> + ftrace_loc = klp_get_ftrace_location(func->old_addr);
> + if (!ftrace_loc) {
> + pr_err("failed to find location for function '%s'\n",
> + func->old_name);
> + return -EINVAL;
> + }
> +
>   ops = kzalloc(sizeof(*ops), GFP_KERNEL);
>   if (!ops)
>   return -ENOMEM;
> @@ -352,7 +380,7 @@ static int klp_enable_func(struct klp_func *func)
>   INIT_LIST_HEAD(&ops->func_stack);
>   list_add_rcu(&func->stack_node, &ops->func_stack);
>  
> - ret = ftrace_set_filter_ip(&ops->fops, func->old_addr, 0, 0);
> + ret = ftrace_set_filter_ip(&ops->fops, ftrace_loc, 0, 0);
>   if (ret) {
>   pr_err("failed to set ftrace filter for function '%s' 
> (%d)\n",
>  func->old_name, ret);
> @@ -363,7 +391,7 @@ static int klp_enable_func(struct klp_func *func)
>   if (ret) {
>   pr_err("failed to register ftrace handler for function 
> '%s' (%d)\n",
>  func->old_name, ret);
> - ftrace_set_filter_ip(&ops->fops, func->old_addr, 1, 0);
> + ftrace_set_filter_ip(&ops->fops, ftrace_loc, 1, 0);
>   goto err;
>   }

Otherwise it is ok.

Miroslav
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 3/5] powerpc/livepatch: Add livepatch header

2016-04-14 Thread Miroslav Benes
On Wed, 13 Apr 2016, Michael Ellerman wrote:

> Add the powerpc specific livepatch definitions. In particular we provide
> a non-default implementation of klp_get_ftrace_location().
> 
> This is required because the location of the mcount call is not constant
> when using -mprofile-kernel (which we always do for live patching).
> 
> Signed-off-by: Torsten Duwe 
> Signed-off-by: Balbir Singh 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/livepatch.h | 54 
> 
>  1 file changed, 54 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/livepatch.h
> 
> diff --git a/arch/powerpc/include/asm/livepatch.h 
> b/arch/powerpc/include/asm/livepatch.h
> new file mode 100644
> index ..ad36e8e34fa1
> --- /dev/null
> +++ b/arch/powerpc/include/asm/livepatch.h
> @@ -0,0 +1,54 @@
> +/*
> + * livepatch.h - powerpc-specific Kernel Live Patching Core
> + *
> + * Copyright (C) 2015-2016, SUSE, IBM Corp.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see .
> + */
> +#ifndef _ASM_POWERPC_LIVEPATCH_H
> +#define _ASM_POWERPC_LIVEPATCH_H
> +
> +#include 
> +#include 
> +
> +#ifdef CONFIG_LIVEPATCH

We don't use these guards in our header files since 335e073faacc ("klp: 
remove CONFIG_LIVEPATCH dependency from klp headers").

> +static inline int klp_check_compiler_support(void)
> +{
> + return 0;
> +}
> +
> +static inline int klp_write_module_reloc(struct module *mod, unsigned long
> + type, unsigned long loc, unsigned long value)
> +{
> + /* This requires infrastructure changes; we need the loadinfos. */
> + return -ENOSYS;
> +}

And this is not needed anymore as Jessica pointed out.

> +static inline void klp_arch_set_pc(struct pt_regs *regs, unsigned long ip)
> +{
> + regs->nip = ip;
> +}
> +
> +#define klp_get_ftrace_location klp_get_ftrace_location
> +static inline unsigned long klp_get_ftrace_location(unsigned long faddr)
> +{
> + /*
> +  * Live patch works only with -mprofile-kernel on PPC. In this case,
> +  * the ftrace location is always within the first 16 bytes.
> +  */
> + return ftrace_location_range(faddr, faddr + 16);
> +}
> +#endif /* CONFIG_LIVEPATCH */
> +
> +#endif /* _ASM_POWERPC_LIVEPATCH_H */
> -- 
> 2.5.0
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 3/5] powerpc/livepatch: Add livepatch header

2016-04-14 Thread Miroslav Benes
On Thu, 14 Apr 2016, Miroslav Benes wrote:

> On Wed, 13 Apr 2016, Michael Ellerman wrote:
> 
> > Add the powerpc specific livepatch definitions. In particular we provide
> > a non-default implementation of klp_get_ftrace_location().
> > 
> > This is required because the location of the mcount call is not constant
> > when using -mprofile-kernel (which we always do for live patching).
> > 
> > Signed-off-by: Torsten Duwe 
> > Signed-off-by: Balbir Singh 
> > Signed-off-by: Michael Ellerman 
> > ---
> >  arch/powerpc/include/asm/livepatch.h | 54 
> > 
> >  1 file changed, 54 insertions(+)
> >  create mode 100644 arch/powerpc/include/asm/livepatch.h
> > 
> > diff --git a/arch/powerpc/include/asm/livepatch.h 
> > b/arch/powerpc/include/asm/livepatch.h
> > new file mode 100644
> > index ..ad36e8e34fa1
> > --- /dev/null
> > +++ b/arch/powerpc/include/asm/livepatch.h
> > @@ -0,0 +1,54 @@
> > +/*
> > + * livepatch.h - powerpc-specific Kernel Live Patching Core
> > + *
> > + * Copyright (C) 2015-2016, SUSE, IBM Corp.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version 2
> > + * of the License, or (at your option) any later version.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, see .
> > + */
> > +#ifndef _ASM_POWERPC_LIVEPATCH_H
> > +#define _ASM_POWERPC_LIVEPATCH_H
> > +
> > +#include 
> > +#include 
> > +
> > +#ifdef CONFIG_LIVEPATCH
> 
> We don't use these guards in our header files since 335e073faacc ("klp: 
> remove CONFIG_LIVEPATCH dependency from klp headers").

...but you're gonna need it in the next patch...
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5] Live patching for powerpc

2016-04-14 Thread Torsten Duwe
On Thu, Apr 14, 2016 at 04:49:50PM +1000, Michael Ellerman wrote:
> On Wed, 2016-04-13 at 15:22 +0200, Jiri Kosina wrote:
> > On Wed, 13 Apr 2016, Miroslav Benes wrote:
> > > > This series adds live patching support for powerpc (ppc64le only ATM).
> > > > 
> > > > It's unchanged since the version I posted on March 24, with the 
> > > > exception that
> > > > I've dropped the first patch, which was a testing-only patch.

Confirmed. And it still works on top of 4.6-rc3, even with the additional 
testing.

> > > > If there's no further comments I'll put this in a topic branch in the 
> > > > next day
> > > > or two and Jiri & I will both merge that into next.

"Go" from my side.

FTR: then I still have a few ppc64 hunks floating around to support certain 
consistency
models...

Torsten

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/5] livepatch: Allow architectures to specify an alternate ftrace location

2016-04-14 Thread Michael Ellerman
On Thu, 2016-04-14 at 14:01 +0200, Miroslav Benes wrote:
> On Wed, 13 Apr 2016, Michael Ellerman wrote:
> > diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
> > index d68fbf63b083..b0476bb30f92 100644
> > --- a/kernel/livepatch/core.c
> > +++ b/kernel/livepatch/core.c
> > @@ -298,6 +298,19 @@ unlock:
> >   rcu_read_unlock();
> >  }
> >  
> > +/*
> > + * Convert a function address into the appropriate ftrace location.
> > + *
> > + * Usually this is just the address of the function, but on some 
> > architectures
> > + * it's more complicated so allow them to provide a custom behaviour.
> > + */
> > +#ifndef klp_get_ftrace_location
> > +static unsigned long klp_get_ftrace_location(unsigned long faddr)
> > +{
> > + return faddr;
> > +}
> > +#endif
> 
> Whoah, what an ugly hack :)
 
Hey it's a "cool trick" ;)

> Note to my future self - This is what you want to do if you need a weak 
> static inline function.
> 
> static inline is probably unnecessary here so __weak function would be 
> enough. It would introduce powerpc-specific livepatch.c though because of 
> it and this is not worth it.

Yeah that was my logic, at least for now. We can always change it in future
to be weak if anyone cares deeply.

> >  static void klp_disable_func(struct klp_func *func)
> >  {
> >   struct klp_ops *ops;
> > @@ -312,8 +325,14 @@ static void klp_disable_func(struct klp_func *func)
> >   return;
> >  
> >   if (list_is_singular(&ops->func_stack)) {
> > + unsigned long ftrace_loc;
> 
> This is a nit, but could you move the definition up to have them all in 
> one place to be consistent with the rest of the code? The same applies to 
> klp_enable_func() below.

Hmm, actually I moved it in there because you pointed out we only needed it
inside the if:

http://lkml.kernel.org/r/alpine.lnx.2.00.1603151113020.20...@pobox.suse.cz

Thinking about it, we need ftrace_loc only in cases where we call 
ftrace_set_filter_ip() right? So we can move klp_get_ftrace_location() 
call to appropriate if branch both in klp_disable_func() and 
klp_enable_func().

But I guess you meant the function call, not the variable declaration.

Personally I think it's better this way, as the variable is in scope for the
shortest possible amount of time, but I can change it if you want me to.

cheers

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5] Live patching for powerpc

2016-04-14 Thread Michael Ellerman
On Thu, 2016-04-14 at 14:57 +0200, Torsten Duwe wrote:
> On Thu, Apr 14, 2016 at 04:49:50PM +1000, Michael Ellerman wrote:
> > On Wed, 2016-04-13 at 15:22 +0200, Jiri Kosina wrote:
> > > On Wed, 13 Apr 2016, Miroslav Benes wrote:
> > > > > This series adds live patching support for powerpc (ppc64le only ATM).
> > > > > 
> > > > > It's unchanged since the version I posted on March 24, with the 
> > > > > exception that
> > > > > I've dropped the first patch, which was a testing-only patch.
> 
> Confirmed. And it still works on top of 4.6-rc3, even with the additional 
> testing.
 
Thanks. Yeah I tested on top of rc3 as well as back on the topic branch 
(4.5-rc).

> > > > > If there's no further comments I'll put this in a topic branch in the 
> > > > > next day
> > > > > or two and Jiri & I will both merge that into next.
> 
> "Go" from my side.
 
Throttle up!

> FTR: then I still have a few ppc64 hunks floating around to support certain 
> consistency
> models...

OK. I'm not quite sure what you mean but post them and we'll see I guess :)

cheers

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 3/5] powerpc/livepatch: Add livepatch header

2016-04-14 Thread Michael Ellerman
On Thu, 2016-04-14 at 14:23 +0200, Miroslav Benes wrote:
> On Thu, 14 Apr 2016, Miroslav Benes wrote:
> > On Wed, 13 Apr 2016, Michael Ellerman wrote:
> > > diff --git a/arch/powerpc/include/asm/livepatch.h 
> > > b/arch/powerpc/include/asm/livepatch.h
> > > new file mode 100644
> > > index ..ad36e8e34fa1
> > > --- /dev/null
> > > +++ b/arch/powerpc/include/asm/livepatch.h
> > > @@ -0,0 +1,54 @@
...
> > > +#ifndef _ASM_POWERPC_LIVEPATCH_H
> > > +#define _ASM_POWERPC_LIVEPATCH_H
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +#ifdef CONFIG_LIVEPATCH
> > 
> > We don't use these guards in our header files since 335e073faacc ("klp: 
> > remove CONFIG_LIVEPATCH dependency from klp headers").
> 
> ...but you're gonna need it in the next patch...

Yeah I know I said at one point those #ifdefs were unneeded, but then it turns
out we did want it on powerpc for other reasons.

cheers

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Live patching for powerpc

2016-04-14 Thread Miroslav Benes
On Wed, 13 Apr 2016, Jessica Yu wrote:

> +++ Miroslav Benes [13/04/16 15:01 +0200]:
> > On Wed, 13 Apr 2016, Michael Ellerman wrote:
> > 
> > > This series adds live patching support for powerpc (ppc64le only ATM).
> > > 
> > > It's unchanged since the version I posted on March 24, with the exception
> > > that
> > > I've dropped the first patch, which was a testing-only patch.
> > > 
> > > If there's no further comments I'll put this in a topic branch in the next
> > > day
> > > or two and Jiri & I will both merge that into next.
> > 
> > Hi,
> > 
> > I'll definitely give it a proper look today or tomorrow, but there is one
> > thing that needs to be solved. The patch set from Jessica reworking
> > relocations for live patching is now merged in our for-next branch. This
> > means that we need to find out if there is something in struct
> > mod_arch_specific for powerpc which needs to be preserved and do it.
> > 
> 
> I took a look around the powerpc module.c code and it looks like the
> mod_arch_specific stuff should be fine, since it is statically allocated
> in the module struct (unlike the situation in s390, where
> mod->arch.syminfo was vmalloc'd and we had to delay the free).
> However I'm not familiar with the powerpc code so I need to dig around
> a bit more to be 100% sure.

I came to the same conclusion. There is only struct bug_entry *bug_table 
in mod_arch_specific but it looks unimportant wrt relocations.

> A second concern I have is that apply_relocate_add() relies on
> sections like .stubs and .toc (for 64-bit) and .init.plt and .plt
> sections (for 32-bit). In order for apply_relocate_add() to work for
> livepatch, we must make sure these sections aren't thrown away and are
> not in init module memory since this memory will be freed at the end
> of module load (see how INIT_OFFSET_MASK is used in kernel/module.c).
> As long as these sections are placed in module core memory, we will be
> OK. I need to think about this a bit more.

I knew I shouldn't have opened arch/powerpc/kernel/module*.c.

We could always hack sh_flags of those sections in 
module_arch_frob_sections() to make them stay.

Miroslav

> 
> Third and unrelated comment: the klp_write_module_reloc stub isn't
> needed anymore :-)
> 
> Thanks,
> Jessica
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 2/5] livepatch: Allow architectures to specify an alternate ftrace location

2016-04-14 Thread Miroslav Benes
On Thu, 14 Apr 2016, Michael Ellerman wrote:

> On Thu, 2016-04-14 at 14:01 +0200, Miroslav Benes wrote:
> > On Wed, 13 Apr 2016, Michael Ellerman wrote:
> 
> > >  static void klp_disable_func(struct klp_func *func)
> > >  {
> > >   struct klp_ops *ops;
> > > @@ -312,8 +325,14 @@ static void klp_disable_func(struct klp_func *func)
> > >   return;
> > >  
> > >   if (list_is_singular(&ops->func_stack)) {
> > > + unsigned long ftrace_loc;
> > 
> > This is a nit, but could you move the definition up to have them all in 
> > one place to be consistent with the rest of the code? The same applies to 
> > klp_enable_func() below.
> 
> Hmm, actually I moved it in there because you pointed out we only needed it
> inside the if:
> 
> http://lkml.kernel.org/r/alpine.lnx.2.00.1603151113020.20...@pobox.suse.cz
> 
> Thinking about it, we need ftrace_loc only in cases where we call 
> ftrace_set_filter_ip() right? So we can move klp_get_ftrace_location() 
> call to appropriate if branch both in klp_disable_func() and 
> klp_enable_func().
> 
> But I guess you meant the function call, not the variable declaration.

Exactly.

> Personally I think it's better this way, as the variable is in scope for the
> shortest possible amount of time, but I can change it if you want me to.

No, it is nothing I would insist on.

Thanks,
Miroslav
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5] Live patching for powerpc

2016-04-14 Thread Jiri Kosina
On Thu, 14 Apr 2016, Torsten Duwe wrote:

> > > > > It's unchanged since the version I posted on March 24, with the 
> > > > > exception that
> > > > > I've dropped the first patch, which was a testing-only patch.
> 
> Confirmed. And it still works on top of 4.6-rc3, even with the 
> additional testing.

Thanks a lot for testing.

The imporant part here is testing on top of 
livepatching.git#for-4.7/arch-independent-klp-relocations as well.

I am pretty sure there will be adjustments needed for the merge, as we'll 
have to figure out which parts of ELF can't be thrown away and need to be 
preserved in order for the relocation entry to be successfully 
constructed.

Michael, I think this is an additional reason why the whole final pile 
will have to go through livepatching.git, as the merge with what we have 
in for-4.7/arch-independent-klp-relocations might not be completely 
trivial.

-- 
Jiri Kosina
SUSE Labs

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5] Live patching for powerpc

2016-04-14 Thread Torsten Duwe
On Thu, Apr 14, 2016 at 11:08:02PM +1000, Michael Ellerman wrote:
> On Thu, 2016-04-14 at 14:57 +0200, Torsten Duwe wrote:
> 
> > FTR: then I still have a few ppc64 hunks floating around to support certain 
> > consistency
> > models...
> 
> OK. I'm not quite sure what you mean but post them and we'll see I guess :)

It's *roughly* the ppc64 equivalent of Josh Poimboeuf's Mar 25
| [RFC PATCH v1.9 14/14] livepatch: update task universe when exiting kernel
which only considers x86.

It's forward ported from an earlier code base; there's some glue missing,
but here it is, for reference.

Signed-off-by: Torsten Duwe 


diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index b034ecd..3e749f4 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ static inline struct thread_info *current_thread_info(void)
   TIF_NEED_RESCHED */
 #define TIF_32BIT  4   /* 32 bit binary */
 #define TIF_RESTORE_TM 5   /* need to restore TM FP/VEC/VSX */
+#define TIF_KLP_NEED_UPDATE6   /* kGraft patching in progress */
 #define TIF_SYSCALL_AUDIT  7   /* syscall auditing active */
 #define TIF_SINGLESTEP 8   /* singlestepping active */
 #define TIF_NOHZ   9   /* in adaptive nohz mode */
@@ -115,8 +116,10 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG(1

Re: [PATCH 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-14 Thread Akshay Adiga

Hi Balbir,

On 04/14/2016 11:10 AM, Balbir Singh wrote:


On 13/04/16 04:06, Akshay Adiga wrote:

This patch brings down global pstate at a slower rate than the local
pstate. As the frequency transition latency from pmin to pmax is
observed to be in few millisecond granurality. It takes a performance
penalty during sudden frequency rampup. Hence by holding global pstates
higher than local pstate makes the subsequent rampups faster.

What domains does local and global refer to?


The *global pstate* is a Chip-level entity as such, so the global entity
(Voltage)  is managed across several cores. But with a DCM (Dual Chip Modules),
its more of a socket-level entity and hence Voltage is managed across both 
chips.

The *local pstate* is a Core-level entity, so the local entity (frequency) is
managed across threads.

I will include this in the commit message. Thanks.




+/*
+ * Quadratic equation which gives the percentage rampdown for time elapsed in
+ * milliseconds. time can be between 0 and MAX_RAMP_DOWN_TIME ( milliseconds )
+ * This equation approximates to y = -4e-6 x^2

Thanks for documenting this, but I think it will also be good to explain why we
use y = -4 e-6*x^2 as opposed to any other magic numbers.


Well it empirically worked out best this way.

On an idle system we want the Global Pstate to ramp-down from max value to min 
over
a span of ~5 secs. Also we want initially ramp-down slowly and ramp-down 
rapidly later
on, hence the equation.

I will try to make this in the comment more informative.

Regards

Akshay Adiga

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0/5] Live patching for powerpc

2016-04-14 Thread Josh Poimboeuf
On Thu, Apr 14, 2016 at 05:20:29PM +0200, Torsten Duwe wrote:
> On Thu, Apr 14, 2016 at 11:08:02PM +1000, Michael Ellerman wrote:
> > On Thu, 2016-04-14 at 14:57 +0200, Torsten Duwe wrote:
> > 
> > > FTR: then I still have a few ppc64 hunks floating around to support 
> > > certain consistency
> > > models...
> > 
> > OK. I'm not quite sure what you mean but post them and we'll see I guess :)
> 
> It's *roughly* the ppc64 equivalent of Josh Poimboeuf's Mar 25
> | [RFC PATCH v1.9 14/14] livepatch: update task universe when exiting kernel
> which only considers x86.
> 
> It's forward ported from an earlier code base; there's some glue missing,
> but here it is, for reference.
> 
> Signed-off-by: Torsten Duwe 

Hi Torsten,

Thanks for sharing.  This is quite fortuitous as Miroslav just today
mentioned to me that we would need something like this.  If you don't
mind, I may pull this patch or some variant of it into v2 of the
consistency model.

> 
> 
> diff --git a/arch/powerpc/include/asm/thread_info.h 
> b/arch/powerpc/include/asm/thread_info.h
> index b034ecd..3e749f4 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -92,6 +92,7 @@ static inline struct thread_info *current_thread_info(void)
>  TIF_NEED_RESCHED */
>  #define TIF_32BIT4   /* 32 bit binary */
>  #define TIF_RESTORE_TM   5   /* need to restore TM 
> FP/VEC/VSX */
> +#define TIF_KLP_NEED_UPDATE  6   /* kGraft patching in progress */
>  #define TIF_SYSCALL_AUDIT7   /* syscall auditing active */
>  #define TIF_SINGLESTEP   8   /* singlestepping active */
>  #define TIF_NOHZ 9   /* in adaptive nohz mode */
> @@ -115,8 +116,10 @@ static inline struct thread_info 
> *current_thread_info(void)
>  #define _TIF_POLLING_NRFLAG  (1<  #define _TIF_32BIT   (1<  #define _TIF_RESTORE_TM  (1< +#define _TIF_KLP_NEED_UPDATE (1<  #define _TIF_SYSCALL_AUDIT   (1<  #define _TIF_SINGLESTEP  (1< +#define _TIF_NOHZ(1<  #define _TIF_SECCOMP (1<  #define _TIF_RESTOREALL  (1<  #define _TIF_NOERROR (1< @@ -124,7 +127,7 @@ static inline struct thread_info 
> *current_thread_info(void)
>  #define _TIF_UPROBE  (1<  #define _TIF_SYSCALL_TRACEPOINT  (1<  #define _TIF_EMULATE_STACK_STORE (1< -#define _TIF_NOHZ(1< +
>  #define _TIF_SYSCALL_DOTRACE (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
>_TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
>_TIF_NOHZ)
> @@ -132,7 +135,8 @@ static inline struct thread_info 
> *current_thread_info(void)
>  #define _TIF_USER_WORK_MASK  (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
>_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
>_TIF_RESTORE_TM)
> -#define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)
> +
> +#define _TIF_PERSYSCALL_MASK 
> (_TIF_RESTOREALL|_TIF_NOERROR|_TIF_KLP_NEED_UPDATE)
>  
>  /* Bits in local_flags */
>  /* Don't move TLF_NAPPING without adjusting the code in entry_32.S */
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index 5bbd1bc..17f8a18 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -151,8 +151,8 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
>  
>   CURRENT_THREAD_INFO(r11, r1)
>   ld  r10,TI_FLAGS(r11)
> - andi.   r11,r10,_TIF_SYSCALL_DOTRACE
> - bne syscall_dotrace /* does not return */
> + andi.   r10,r10,(_TIF_SYSCALL_DOTRACE|_TIF_KLP_NEED_UPDATE)
> + bne-syscall_precall /* does not return */
>   cmpldi  0,r0,NR_syscalls
>   bge-syscall_enosys
>  
> @@ -245,6 +245,17 @@ syscall_error:
>   neg r3,r3
>   std r5,_CCR(r1)
>   b   .Lsyscall_error_cont
> +
> +syscall_precall:
> + andi.   r10,r10,(_TIF_KLP_NEED_UPDATE)
> + beq+syscall_dotrace
> +
> + addir11,r11,TI_FLAGS
> +1:   ldarx   r12,0,r11
> + andcr12,r12,r10
> + stdcx.  r12,0,r11
> + bne-1b
> + subir11,r11,TI_FLAGS
>   
>  /* Traced system call support */
>  syscall_dotrace:

-- 
Josh
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/3] powerpc: Complete FSCR context switch

2016-04-14 Thread Jack Miller
> I'm not sure that works on processes before power8.
>
> There DSCR SPR number 0x11 will always trap and emulate from userspace
> (see arch/powerpc/kernel/traps.c:emulate_instruction()).  That is not
> controlled by FSCR and should work on POWER7 where FSCR is not
> present.  We need to set the inherit bit there too.
>
> DSCR SPR number 0x3 is controlled by fscr, but it's only avaliable on
> POWER8.
>
>> Right now the FSCR switch is conditional on FTR_ARCH_207S which is
>> more exclusive than FTR_DSCR, but I guess the actual FSCR register is
>> universal to PPC64 like the fscr field in the thread struct? If so, I
>> can just move the FSCR save/restore out of the 207 conditional.
>
> FSCR was only introduced in power8, so it needs to be 207 conditional
>

So on P6/P7 (which have FTR_DSCR) set we're potentially mtspr'ing to a
non-existent register? Yuck. Can at least strip that logic out thanks
to the full context switch, I think, even if dscr_inherit actually has
a use.

- Jack
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Live patching for powerpc

2016-04-14 Thread Jessica Yu

+++ Miroslav Benes [14/04/16 15:28 +0200]:

On Wed, 13 Apr 2016, Jessica Yu wrote:


+++ Miroslav Benes [13/04/16 15:01 +0200]:
> On Wed, 13 Apr 2016, Michael Ellerman wrote:
>
> > This series adds live patching support for powerpc (ppc64le only ATM).
> >
> > It's unchanged since the version I posted on March 24, with the exception
> > that
> > I've dropped the first patch, which was a testing-only patch.
> >
> > If there's no further comments I'll put this in a topic branch in the next
> > day
> > or two and Jiri & I will both merge that into next.
>
> Hi,
>
> I'll definitely give it a proper look today or tomorrow, but there is one
> thing that needs to be solved. The patch set from Jessica reworking
> relocations for live patching is now merged in our for-next branch. This
> means that we need to find out if there is something in struct
> mod_arch_specific for powerpc which needs to be preserved and do it.
>

I took a look around the powerpc module.c code and it looks like the
mod_arch_specific stuff should be fine, since it is statically allocated
in the module struct (unlike the situation in s390, where
mod->arch.syminfo was vmalloc'd and we had to delay the free).
However I'm not familiar with the powerpc code so I need to dig around
a bit more to be 100% sure.


I came to the same conclusion. There is only struct bug_entry *bug_table
in mod_arch_specific but it looks unimportant wrt relocations.


Yeah, I think we are fine. As long as none of the values in
mod_arch_specific are "cleared," and I don't see that happening
anywhere.


A second concern I have is that apply_relocate_add() relies on
sections like .stubs and .toc (for 64-bit) and .init.plt and .plt
sections (for 32-bit). In order for apply_relocate_add() to work for
livepatch, we must make sure these sections aren't thrown away and are
not in init module memory since this memory will be freed at the end
of module load (see how INIT_OFFSET_MASK is used in kernel/module.c).
As long as these sections are placed in module core memory, we will be
OK. I need to think about this a bit more.


I knew I shouldn't have opened arch/powerpc/kernel/module*.c.

We could always hack sh_flags of those sections in
module_arch_frob_sections() to make them stay.



I think we are fine here too. The onus would be on the patch build
tool (e.g., kpatch) to set the sh_flags to SHF_ALLOC, like we
already do to keep the klp relocation sections in memory :-)

For the 32-bit module code, I don't believe we would need to preserve
the .init.plt section for livepatch's call to apply_relocate_add(),
since relocations to init sections should've been applied during
module initialization, and we don't patch those types of functions.
Please correct me if my understanding is off.

Jessica

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] documentation: Add disclaimer

2016-04-14 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 09:35:46AM +0100, Peter Zijlstra wrote:
> On Tue, Jan 26, 2016 at 12:11:43PM -0800, Paul E. McKenney wrote:
> > So Peter, would you like to update your patch to include yourself
> > and Will as authors?
> 
> Sure, here goes.
> 
> ---
> Subject: documentation: Add disclaimer
> 
> It appears people are reading this document as a requirements list for
> building hardware. This is not the intent of this document. Nor is it
> particularly suited for this purpose.
> 
> The primary purpose of this document is our collective attempt to define
> a set of primitives that (hopefully) allow us to write correct code on
> the myriad of SMP platforms Linux supports.
> 
> Its a definite work in progress as our understanding of these platforms,
> and memory ordering in general, progresses.
> 
> Nor does being mentioned in this document mean we think its a
> particularly good idea; the data dependency barrier required by Alpha
> being a prime example. Yes we have it, no you're insane to require it
> when building new hardware.
> 
> Signed-off-by: Peter Zijlstra (Intel) 

Rather belatedly queued and pushed to -rcu, apologies for the delay.
One minor edit noted below.

Thanx, Paul

> ---
>  Documentation/memory-barriers.txt | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/memory-barriers.txt 
> b/Documentation/memory-barriers.txt
> index a61be39c7b51..98626125f484 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -4,8 +4,24 @@
> 
>  By: David Howells 
>  Paul E. McKenney 
> +Will Deacon 
> +Peter Zijlstra 
> 
> -Contents:
> +==
> +DISCLAIMER
> +==
> +
> +This document is not a specification; it is intentionally (for the sake of
> +brevity) and unintentionally (due to being human) incomplete. This document 
> is
> +meant as a guide to using the various memory barriers provided by Linux, but
> +in case of any doubt (and there are many) please ask.
> +
> +I repeat, this document is not a specification of what Linux expects from

s/I/To/ because there is more than one author.

> +hardware.
> +
> +
> +CONTENTS
> +
> 
>   (*) Abstract memory access model.
> 
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] documentation: Add disclaimer

2016-04-14 Thread Paul E. McKenney
On Wed, Jan 27, 2016 at 02:57:07PM +, David Howells wrote:
> Peter Zijlstra  wrote:
> 
> > +==
> > +DISCLAIMER
> > +==
> > +
> > +This document is not a specification; it is intentionally (for the sake of
> > +brevity) and unintentionally (due to being human) incomplete. This 
> > document is
> > +meant as a guide to using the various memory barriers provided by Linux, 
> > but
> > +in case of any doubt (and there are many) please ask.
> > +
> > +I repeat, this document is not a specification of what Linux expects from
> > +hardware.
> 
> The purpose of this document is twofold:
> 
>  (1) to specify the minimum functionality that one can rely on for any
>  particular barrier, and
> 
>  (2) to provide a guide as to how to use the barriers that are available.
> 
> Note that an architecture can provide more than the minimum requirement for
> any particular barrier, but if the barrier provides less than that, it is
> incorrect.
> 
> Note also that it is possible that a barrier may be a no-op for an
> architecture because the way that arch works renders an explicit barrier
> unnecessary in that case.
> 
> > +
> 
> Can you bung an extra blank line in here if you have to redo this at all?

Done as part of your patch.  Again, apologies for the delay.

Thanx, Paul

> > +
> > +CONTENTS
> > +
> >  
> >   (*) Abstract memory access model.
> >  
> 
> David
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v5] powerpc/pci: Assign fixed PHB number based on device-tree properties

2016-04-14 Thread Guilherme G. Piccoli
The domain/PHB field of PCI addresses has its value obtained from a
global variable, incremented each time a new domain (represented by
struct pci_controller) is added on the system. The domain addition
process happens during boot or due to PCI device hotplug.

As recent kernels are using predictable naming for network interfaces,
the network stack is more tied to PCI naming. This can be a problem in
hotplug scenarios, because PCI addresses will change if devices are
removed and then re-added. This situation seems unusual, but it can
happen if a user wants to replace a NIC without rebooting the machine,
for example.

This patch changes the way PCI domain values are generated: now, we use
device-tree properties to assign fixed PHB numbers to PCI addresses
when available (meaning pSeries and PowerNV cases). We also use a bitmap
to allow dynamic PHB numbering when device-tree properties are not
used. This bitmap keeps track of used PHB numbers and if a PHB is
released (by hotplug operations for example), it allows the reuse of
this PHB number, avoiding PCI address to change in case of device remove
and re-add soon after. No functional changes were introduced.

Reviewed-by: Gavin Shan 
Signed-off-by: Guilherme G. Piccoli 
---
 arch/powerpc/kernel/pci-common.c | 66 ++--
 1 file changed, 63 insertions(+), 3 deletions(-)

v5:
  * Improved comments.

  * Changed the the Fixed PHB Numbering to set the PHB number bit
  on the bitmap anyway, avoiding issues when system has virtual PHBs.

  * Changed the device-tree check order - now, firstly we check for
  "ibm,opal-phbid" and if it's not available, we try the pSeries case.

v4:
  * Minor change (if/else nesting rearranged).

v3:
  * Made the bitmap static.

  * Rearranged if/else statements of Fixed PHB checking.

  * Improved bitmap checkings, by removing loop and using instead the
  find_first_zero_bit() function.

  * Removed the single-statement function release_phb_number() by
  adding its logic directly into pcibios_free_controller().

  *Added check for bitmap size before clearing bit, avoiding memory
  corruption.

v2:
  * Added the Fixed PHB Numbering mechanism based on device-tree
  properties.

  * Changed list approach to bitmap on the Dynamic PHB Numbering
  mechanism.

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 0f7a60f..ad423c1 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -41,11 +41,17 @@
 #include 
 #include 
 
+/* hose_spinlock protects accesses to the the phb_bitmap. */
 static DEFINE_SPINLOCK(hose_spinlock);
 LIST_HEAD(hose_list);
 
-/* XXX kill that some day ... */
-static int global_phb_number;  /* Global phb counter */
+/* For dynamic PHB numbering on get_phb_number(): max number of PHBs. */
+#defineMAX_PHBS8192
+
+/* For dynamic PHB numbering: used/free PHBs tracking bitmap.
+ * Accesses to this bitmap should be protected by hose_spinlock.
+ */
+static DECLARE_BITMAP(phb_bitmap, MAX_PHBS);
 
 /* ISA Memory physical address */
 resource_size_t isa_mem_base;
@@ -64,6 +70,55 @@ struct dma_map_ops *get_pci_dma_ops(void)
 }
 EXPORT_SYMBOL(get_pci_dma_ops);
 
+/* get_phb_number() function should run under locking
+ * protection, specifically hose_spinlock.
+ */
+static int get_phb_number(struct device_node *dn)
+{
+   const __be64 *prop64;
+   const __be32 *regs;
+   int phb_id = 0;
+
+   /* Try fixed PHB numbering first, by checking archs and reading
+* the respective device-tree properties. Firstly, try PowerNV by
+* reading "ibm,opal-phbid", only present in OPAL environment.
+*/
+   prop64 = of_get_property(dn, "ibm,opal-phbid", NULL);
+   if (prop64) {
+   phb_id = (int)(be64_to_cpup(prop64) & 0x);
+
+   } else if (machine_is(pseries)) {
+   regs = of_get_property(dn, "reg", NULL);
+   if (regs)
+   phb_id = (int)(be32_to_cpu(regs[1]) & 0x);
+   } else {
+   goto dynamic_phb_numbering;
+   }
+
+   /* If we have a huge PHB number obtained from device-tree, no need
+* to worry with the bitmap. Otherwise, we need to be sure we're
+* not trying to use the same PHB number twice.
+*/
+   if (phb_id < MAX_PHBS) {
+   if (test_bit(phb_id, phb_bitmap))
+   goto dynamic_phb_numbering;
+   set_bit(phb_id, phb_bitmap);
+   }
+
+   return phb_id;
+
+   /* If not pSeries nor PowerNV, or if fixed PHB numbering tried to add
+* the same PHB number twice, then fallback to dynamic PHB numbering.
+*/
+dynamic_phb_numbering:
+
+   phb_id = find_first_zero_bit(phb_bitmap, MAX_PHBS);
+   BUG_ON(phb_id >= MAX_PHBS); /* Reached maximum number of PHBs. */
+   set_bit(phb_id, phb_bitmap);
+
+   return phb_id;
+}
+
 struct pci_controller *pcibios_alloc_controller(struct device_n

Re: [PATCH][v3] drivers/memory: Add deep sleep support for IFC

2016-04-14 Thread Leo Li
Hi Raghav,

Are we planning to send a new version of this patch?  Btw, I see that
the current patch covers NOR/GPCM related registers, but the driver is
only built when IFC NAND is enabled right now.  Do we want to change
the Kconfig to make it not depending on NAND?

Regards,
Leo

On Wed, Feb 17, 2016 at 7:19 PM, Scott Wood  wrote:
> On Wed, 2016-02-17 at 14:40 +, Raghav Dogra wrote:
>>
>> > -Original Message-
>> > From: Scott Wood [mailto:o...@buserror.net]
>> > Sent: Tuesday, February 16, 2016 2:05 PM
>> > To: Raghav Dogra ; linuxppc-dev@lists.ozlabs.org
>> > Cc: Prabhakar Kushwaha 
>> > Subject: Re: [PATCH][v3] drivers/memory: Add deep sleep support for IFC
>> >
>> > On Mon, 2016-02-15 at 11:44 +0530, Raghav Dogra wrote:
>> > >
>> > > +
>> > > + ctrl->saved_regs = kzalloc(sizeof(struct fsl_ifc_regs),
>> > > GFP_KERNEL);
>> > > + if (!ctrl->saved_regs)
>> > > + return -ENOMEM;
>> >
>> > Allocate memory at probe time, not here.
>> >
>>
>> But, why allocate memory at the probe when it is not known at that time
>> whether
>> deep sleep state would be required or not? Is that because we want to save
>> time
>> while going to deep sleep?
>
> We also want to avoid potential failures here.  We can also keep the code
> simpler by embedding this into the ctrl struct itself, and not dynamically
> allocating it at all.
>
>> > >
>> > > + ifc_out32(0x0, &ifc->ifc_nand.nand_evter_intr_en);
>> > > + ifc_out32(0x0, &ifc->ifc_nor.nor_evter_intr_en);
>> > > + ifc_out32(0x0, &ifc->ifc_gpcm.gpcm_evter_intr_en);
>> > > +
>> > > + memcpy_fromio(ctrl->saved_regs, ifc, sizeof(struct
>> > > fsl_ifc_regs));
>> > > +
>> > > +/* save the interrupt values */
>> > > + ctrl->saved_regs->cm_evter_intr_en = cm_evter_intr_en;
>> > > + ctrl->saved_regs->ifc_nand.nand_evter_intr_en =
>> > nand_evter_intr_en;
>> > > + ctrl->saved_regs->ifc_nor.nor_evter_intr_en =
>> > > nor_evter_intr_en;
>> > > + ctrl->saved_regs->ifc_gpcm.gpcm_evter_intr_en =
>> > gpcm_evter_intr_en;
>> >
>> > Why didn't you use the memcpy_fromio() to save these, and clear intr_en
>> > later?
>> >
>>
>> I used it whenever I did a write/read on iomem. In this case, both memories
>> are non iomem.
>
> Huh?  &ifc->ifc_nand.nand_evter_intr_en and such are iomem.
>
>> > > +
>> > > +/*
>> > > + * IFC interrupts disabled
>> > > + */
>> > > + ifc_out32(0x0, &ifc->cm_evter_intr_en);
>> > > + ifc_out32(0x0, &ifc->ifc_nand.nand_evter_intr_en);
>> > > + ifc_out32(0x0, &ifc->ifc_nor.nor_evter_intr_en);
>> > > + ifc_out32(0x0, &ifc->ifc_gpcm.gpcm_evter_intr_en);
>> > > +
>> > > +
>> > > + if (ctrl->saved_regs) {
>> > > + for (ifc_bank = 0; ifc_bank < FSL_IFC_BANK_COUNT;
>> > > ifc_bank++) {
>> > > + ifc_out32(savd_regs
>> > > ->cspr_cs[ifc_bank].cspr_ext,
>> > > + &ifc
>> > > ->cspr_cs[ifc_bank].cspr_ext);
>> > > + ifc_out32(savd_regs->cspr_cs[ifc_bank].cspr,
>> > > + &ifc->cspr_cs[ifc_bank].cspr);
>> > > + ifc_out32(savd_regs->amask_cs[ifc_bank].amask,
>> > > + &ifc
>> > > ->amask_cs[ifc_bank].amask);
>> > > + ifc_out32(savd_regs
>> > > ->csor_cs[ifc_bank].csor_ext,
>> > > + &ifc
>> > > ->csor_cs[ifc_bank].csor_ext);
>> > > + ifc_out32(savd_regs->csor_cs[ifc_bank].csor,
>> > > + &ifc->csor_cs[ifc_bank].csor);
>> >
>> > Align continuation lines the way patchwork suggests ("&ifc" aligned with
>> > "savd").
>>
>> Okay, I will take care of this in the next patch.
>>
>> >
>> > Does resume from deep sleep go via U-Boot (which would initialize these
>> > registers) on these chips?
>>
>> Yes, deep sleep resume goes via u-boot and these registers should be
>> initialized
>> By u-boot.
>
> So then we don't need to save/restore them.
>
>>
>
>> > > +
>> > > +/*
>> > > +* IFC controller NOR machine registers */
>> > > + ifc_out32(savd_regs->ifc_nor.nor_evter_en,
>> > > + &ifc->ifc_nor.nor_evter_en);
>> > > + ifc_out32(savd_regs->ifc_nor.norcr, &ifc
>> > > ->ifc_nor.norcr);
>> >
>> > What uses these?
>> >
>>
>> These registers are not used as such, but we would like to retain their
>> value as they
>> can be of help in case of error conditions.
>
> I don't follow.  Neither of those registers reports errors, and the registers
> that *do* report errors are generally w1c and thus you can't save/restore
> them.
>
>> > >
>> > > +
>> > > + ver = ifc_in32(&ctrl->regs->ifc_rev);
>> > > + ncfgr = ifc_in32(&ifc->ifc_nand.ncfgr);
>> > > + if (ver >= FSL_IFC_V1_3_0) {
>> > > +
>> > > + ifc_out32(ncfgr | IFC_NAND_SRAM_INIT_EN,
>> > > + &ifc->ifc_nand.ncfgr);
>> > > + /* wait for  SRAM_INIT bit to be clear or timeout */
>> > > + timeout = IFC_TIMEOUT_MSECS;
>> > > + while ((ifc_in32(&ifc->ifc_nand.ncfgr) &
>> > > +

Re: [PATCH 1/3] powerpc: Complete FSCR context switch

2016-04-14 Thread Michael Neuling
On Thu, 2016-04-14 at 13:39 -0500, Jack Miller wrote:


> > I'm not sure that works on processes before power8.
> > 
> > There DSCR SPR number 0x11 will always trap and emulate from userspace
> > (see arch/powerpc/kernel/traps.c:emulate_instruction()).  That is not
> > controlled by FSCR and should work on POWER7 where FSCR is not
> > present.  We need to set the inherit bit there too.
> > 
> > DSCR SPR number 0x3 is controlled by fscr, but it's only avaliable on
> > POWER8.
> > 

> > > Right now the FSCR switch is conditional on FTR_ARCH_207S which is
> > > more exclusive than FTR_DSCR, but I guess the actual FSCR register is
> > > universal to PPC64 like the fscr field in the thread struct? If so, I
> > > can just move the FSCR save/restore out of the 207 conditional.
> > 
> > FSCR was only introduced in power8, so it needs to be 207 conditional
> > 
> 
> So on P6/P7 (which have FTR_DSCR) set we're potentially mtspr'ing to a
> non-existent register? Yuck. Can at least strip that logic out thanks
> to the full context switch, I think, even if dscr_inherit actually has
> a use.

Yeah, welcome to DSCR... it's horrible. :-(

It's not actually non-existent.  It's just OS only privileged.

Not that that makes much difference, as we still trap and emulate (and
set the inherent bit).

Mikey
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] ftrace: filter: Match dot symbols when searching functions on ppc64.

2016-04-14 Thread Thiago Jung Bauermann
Am Donnerstag, 14 April 2016, 06:58:05 schrieb Steven Rostedt:
> On Thu, 14 Apr 2016 17:11:35 +1000
> 
> Michael Ellerman  wrote:
> > Presumably Steve will have a preference for which style you use.
> 
> Actually, what I usually do is simply make a "weak" stub function and
> let the arch override it.

Ok, in the next version of the patch I'll use the weak function alternative.

> > It seems unfortunate that we need to introduce yet another mechanism to
> > deal with dot symbols. But I guess none of the existing mechanisms
> > work, eg. kprobe_lookup_name().
> 
> What about making a generic conversion to function name, like:
> 
>   arch_sane_function_name(str);
> 
> Where all sane archs simply return the string name and powerpc can
> strip the '.' prefix. ;-)
> 
> Of course that would require:
> 
>   sane_str = arch_sane_function(str);
>   sane_search = arch_sane_function(g->search);
> 
>   and then compare sane_str with sane_search.

I had a look at kprobe_lookup_name but it aims to convert a function name to 
an address while ftrace_match just wants to compare symbol names, so it's 
not applicable to what ftrace_match is trying to do. It also goes in the 
opposite direction of the other places which deal with dot symbols that I 
could find: it adds a dot because it wants to find the dot symbol, while 
everywhere else the code removes the dot from the name. For that reason, 
kprobe_lookup_name wouldn't be able to use a generalized solution for 
working with dot symbol names like arch_sane_function as proposed above.

I did find arch__compare_symbol_names in tools/perf/arch/powerpc/util/sym-
handling.c which could be used as-is (if I moved it to somewhere which is 
common to kernel and userspace code) if ftrace_match only matched the full 
string, but ftrace_match supports doing front, middle and end string matches 
as well. It looks like those additional comparison modes are not used at all 
though.

If we do want to reuse arch__compare_symbol_names instead of creating yet 
another arch-specific symbol comparison mechanism, there are two options:

1. remove ftrace_match and use arch__compare_symbol_names directly;
2. extend arch__compare_symbol_names to support front, middle and end 
matches.

What do you think?

> > >  #ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> > > 
> > > diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
> > > index b1870fbd2b67..e806c2a3b7a8 100644
> > > --- a/kernel/trace/ftrace.c
> > > +++ b/kernel/trace/ftrace.c
> > > @@ -3444,11 +3444,24 @@ struct ftrace_glob {
> > > 
> > >   int type;
> > >  
> > >  };
> > > 
> > > +#ifndef ARCH_HAS_FTRACE_MATCH_ADJUST
> > > +/*
> > > + * If symbols in an architecture don't correspond exactly to the
> > > user-visible + * name of what they represent, it is possible to
> > > define this function to + * perform the necessary adjustments.
> > > +*/
> > > +static inline void arch_ftrace_match_adjust(char **str, char *search)
> > > +{
> > > +}
> > > +#endif
> > > +
> > > 
> > >  static int ftrace_match(char *str, struct ftrace_glob *g)
> > >  {
> > >  
> > >   int matched = 0;
> > >   int slen;
> > > 
> > > + arch_ftrace_match_adjust(&str, g->search);
> > 
> > I think this would less magical if it didn't modify str directly, 
instead doing:
> > str = arch_ftrace_match_adjust(str, g->search);
> > 
> > And arch_ftrace_match_adjust() would return the adjusted char *.
> > 
> > That would mean the generic version would need to return str rather than
> > being empty.
> 
> I agree, because if we need to free the string passed it, the function
> would modify the pointer and we wouldn't be able to free it. In those
> cases we could do:
> 
>   tmp_str = arch_ftrace_match_adjust(str, search);
>   /* use tmp_str and then ignore */
>   kfree(str);

If you decide against either of my alternatives for using 
arch__compare_symbol_names, I'll change arch_ftrace_match_adjust to work as 
you suggested above in the next version of this patch.

> ** Disclaimer **
> 
> Note, I just took the red-eye (2 hours of sleep on the plane) and
> waiting for my next flight. My focus may be off in this email.

Ouch. Thanks for having a look at the patch and responding to my ping!

-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v8 45/45] PCI/hotplug: PowerPC PowerNV PCI hotplug driver

2016-04-14 Thread Alistair Popple
Hi Gavin,

I was reading through this to understand how it all works and noticed a couple
of things, comments below.

- Alistair

On Wed, 17 Feb 2016 14:44:28 Gavin Shan wrote:



> +
> +static void pnv_php_handle_poweron(struct pnv_php_slot *php_slot)
> +{
> + void *fdt, *fdt1, *dt;
> + int confirm = PNV_PHP_POWER_CONFIRMED_SUCCESS;
> + int ret;
> +
> + /* We don't know the FDT blob size. We try to get it through
> +  * maximal memory chunk and then copy it to another chunk that
> +  * fits the real size.
> +  */
> + fdt1 = kzalloc(0x1, GFP_KERNEL);
> + if (!fdt1)
> + goto error;
> +
> + ret = pnv_pci_get_device_tree(php_slot->dn->phandle, fdt1, 0x1);
> + if (ret)
> + goto free_fdt1;
> +
> + fdt = kzalloc(fdt_totalsize(fdt1), GFP_KERNEL);
> + if (!fdt)
> + goto free_fdt1;
> +
> + /* Unflatten device tree blob */
> + memcpy(fdt, fdt1, fdt_totalsize(fdt1));
> + dt = of_fdt_unflatten_tree(fdt, php_slot->dn, NULL);
> + if (!dt) {
> + dev_warn(&php_slot->pdev->dev, "Cannot unflatten FDT\n");
> + goto free_fdt;
> + }
> +
> + /* Initialize and apply the changeset */
> + of_changeset_init(&php_slot->ocs);
> + ret = pnv_php_populate_changeset(&php_slot->ocs, php_slot->dn);
> + if (ret) {
> + dev_warn(&php_slot->pdev->dev, "Error %d populating 
> changeset\n",
> +  ret);
> + goto free_dt;
> + }
> +
> + php_slot->dn->child = NULL;
> + ret = of_changeset_apply(&php_slot->ocs);
> + if (ret) {
> + dev_warn(&php_slot->pdev->dev, "Error %d applying changeset\n",
> +  ret);
> + goto destroy_changeset;
> + }
> +
> + /* Add device node firmware data */
> + pnv_php_add_pdns(php_slot);
> + php_slot->fdt = fdt;
> + php_slot->dt  = dt;
> + goto out;

Doesn't this leak memory from fdt1? I can't see where it gets freed in this
case.

> +destroy_changeset:
> + of_changeset_destroy(&php_slot->ocs);
> +free_dt:
> + kfree(dt);
> + php_slot->dn->child = NULL;
> +free_fdt:
> + kfree(fdt);
> +free_fdt1:
> + kfree(fdt1);
> +error:
> + confirm = PNV_PHP_POWER_CONFIRMED_FAIL;
> +out:
> + /* Confirm status change */
> + php_slot->power_state_confirmed = confirm;
> + wake_up_interruptible(&php_slot->queue);
> +}
> +



> +
> +static void __exit pnv_php_exit(void)
> +{
> + struct device_node *dn;
> +
> + for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
> + pnv_php_unregister(dn);
> + for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
> + pnv_php_unregister(dn);
> +
> + pnv_pci_hotplug_notifier_unregister(&php_msg_nb);

Do you flush the workqueues anywhere? Usually you would stop work being queued 
and call something like flush_workqueue() to ensure no work is still
running/queued before unloading the module.

- Alistair

> +}
> +
> +module_init(pnv_php_init);
> +module_exit(pnv_php_exit);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> 

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel 2/2] powerpc/powernv/ioda2: Delay PE disposal

2016-04-14 Thread Alexey Kardashevskiy

On 04/14/2016 11:40 AM, David Gibson wrote:

On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote:

When SRIOV is disabled, the existing code presumes there is no
virtual function (VF) in use and destroys all associated PEs.
However it is possible to get into the situation when the user
activated SRIOV disabling while a VF is still in use via VFIO.
For example, unbinding a physical function (PF) while there is a guest
running with a VF passed throuhgh via VFIO will trigger the bug.

This defines an IODA2-specific IOMMU group release() callback.
This moves all the disposal code from pnv_ioda_release_vf_PE() to this
new callback so the cleanup happens when the last user of an IOMMU
group released the reference.

As pnv_pci_ioda2_release_dma_pe() was reduced to just calling
iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe()
into pnv_ioda_release_vf_PE().

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/platforms/powernv/pci-ioda.c | 33 +--
  1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index ce9f2bf..8108c54 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe 
*pe, bool enable);
  static void pnv_pci_ioda2_group_release(void *iommu_data)
  {
struct iommu_table_group *table_group = iommu_data;
+   struct pnv_ioda_pe *pe = container_of(table_group,
+   struct pnv_ioda_pe, table_group);
+   struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus);
+   struct pnv_phb *phb = hose->private_data;
+   struct iommu_table *tbl = pe->table_group.tables[0];
+   int64_t rc;

-   table_group->group = NULL;
-}
-
-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct 
pnv_ioda_pe *pe)
-{
-   struct iommu_table*tbl;
-   int64_t   rc;
-
-   tbl = pe->table_group.tables[0];
rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);


Is it safe to go manipulating the PE windows, etc. after SR-IOV is
disabled?


Manipulating windows in this case is just updating 8 bytes in the TVT. At 
this point a VF is expected to be destroyed but PE is expected to remain 
not free so pnv_ioda2_pick_m64_pe() (or pnv_ioda2_reserve_m64_pe()?) won't 
use it.




When SR-IOV is disabled, you need to immediately disable the VF (I'm
guessing that happens somewhere) and stop all access to the VF
"hardware".


drivers/pci/iov.c
===
static void sriov_disable(struct pci_dev *dev)
{
...
for (i = 0; i < iov->num_VFs; i++)
pci_iov_remove_virtfn(dev, i, 0);
...
pcibios_sriov_disable(dev);
===

pcibios_sriov_disable() is where pnv_pci_ioda2_release_dma_pe() is called from.


Only the iommu group structure *has* to stick around
until the reference count drops to zero.  I think other structures and
hardware reconfiguration can be deferred or done immediately,
whichever is more convenient.


I deferred everything because of convenience as iommu_table_group is 
embedded into pnv_ioda struct, not a pointer.





if (rc)
pe_warn(pe, "OPAL error %ld release DMA window\n", rc);

pnv_pci_ioda2_set_bypass(pe, false);
-   if (pe->table_group.group) {
-   iommu_group_put(pe->table_group.group);
-   BUG_ON(pe->table_group.group);
-   }
+
+   BUG_ON(!tbl);
pnv_pci_ioda2_table_free_pages(tbl);
-   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
+   iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node));
+
+   pnv_ioda_deconfigure_pe(phb, pe);
+   pnv_ioda_free_pe(phb, pe->pe_number);
  }

  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
@@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
if (pe->parent_dev != pdev)
continue;

-   pnv_pci_ioda2_release_dma_pe(pdev, pe);
-
/* Remove from list */
mutex_lock(&phb->ioda.pe_list_mutex);
list_del(&pe->list);
mutex_unlock(&phb->ioda.pe_list_mutex);

-   pnv_ioda_deconfigure_pe(phb, pe);
-
-   pnv_ioda_free_pe(phb, pe->pe_number);
+   if (pe->table_group.group)
+   iommu_group_put(pe->table_group.group);
}
  }






--
Alexey
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v8 45/45] PCI/hotplug: PowerPC PowerNV PCI hotplug driver

2016-04-14 Thread Gavin Shan
On Fri, Apr 15, 2016 at 10:47:52AM +1000, Alistair Popple wrote:
>Hi Gavin,
>
>I was reading through this to understand how it all works and noticed a couple
>of things, comments below.
>

Alistair, thanks for your time on review.

>
>On Wed, 17 Feb 2016 14:44:28 Gavin Shan wrote:
>
>
>
>> +
>> +static void pnv_php_handle_poweron(struct pnv_php_slot *php_slot)
>> +{
>> +void *fdt, *fdt1, *dt;
>> +int confirm = PNV_PHP_POWER_CONFIRMED_SUCCESS;
>> +int ret;
>> +
>> +/* We don't know the FDT blob size. We try to get it through
>> + * maximal memory chunk and then copy it to another chunk that
>> + * fits the real size.
>> + */
>> +fdt1 = kzalloc(0x1, GFP_KERNEL);
>> +if (!fdt1)
>> +goto error;
>> +
>> +ret = pnv_pci_get_device_tree(php_slot->dn->phandle, fdt1, 0x1);
>> +if (ret)
>> +goto free_fdt1;
>> +
>> +fdt = kzalloc(fdt_totalsize(fdt1), GFP_KERNEL);
>> +if (!fdt)
>> +goto free_fdt1;
>> +
>> +/* Unflatten device tree blob */
>> +memcpy(fdt, fdt1, fdt_totalsize(fdt1));
>> +dt = of_fdt_unflatten_tree(fdt, php_slot->dn, NULL);
>> +if (!dt) {
>> +dev_warn(&php_slot->pdev->dev, "Cannot unflatten FDT\n");
>> +goto free_fdt;
>> +}
>> +
>> +/* Initialize and apply the changeset */
>> +of_changeset_init(&php_slot->ocs);
>> +ret = pnv_php_populate_changeset(&php_slot->ocs, php_slot->dn);
>> +if (ret) {
>> +dev_warn(&php_slot->pdev->dev, "Error %d populating 
>> changeset\n",
>> + ret);
>> +goto free_dt;
>> +}
>> +
>> +php_slot->dn->child = NULL;
>> +ret = of_changeset_apply(&php_slot->ocs);
>> +if (ret) {
>> +dev_warn(&php_slot->pdev->dev, "Error %d applying changeset\n",
>> + ret);
>> +goto destroy_changeset;
>> +}
>> +
>> +/* Add device node firmware data */
>> +pnv_php_add_pdns(php_slot);
>> +php_slot->fdt = fdt;
>> +php_slot->dt  = dt;
>> +goto out;
>
>Doesn't this leak memory from fdt1? I can't see where it gets freed in this
>case.
>

You're right that @fdt1 should be released here. I'll fix it in
next revision.

>> +destroy_changeset:
>> +of_changeset_destroy(&php_slot->ocs);
>> +free_dt:
>> +kfree(dt);
>> +php_slot->dn->child = NULL;
>> +free_fdt:
>> +kfree(fdt);
>> +free_fdt1:
>> +kfree(fdt1);
>> +error:
>> +confirm = PNV_PHP_POWER_CONFIRMED_FAIL;
>> +out:
>> +/* Confirm status change */
>> +php_slot->power_state_confirmed = confirm;
>> +wake_up_interruptible(&php_slot->queue);
>> +}
>> +
>
>
>
>> +
>> +static void __exit pnv_php_exit(void)
>> +{
>> +struct device_node *dn;
>> +
>> +for_each_compatible_node(dn, NULL, "ibm,ioda-phb")
>> +pnv_php_unregister(dn);
>> +for_each_compatible_node(dn, NULL, "ibm,ioda2-phb")
>> +pnv_php_unregister(dn);
>> +
>> +pnv_pci_hotplug_notifier_unregister(&php_msg_nb);
>
>Do you flush the workqueues anywhere? Usually you would stop work being queued 
>and call something like flush_workqueue() to ensure no work is still
>running/queued before unloading the module.
>

Good question. Yeah, I'll flush the workqueue before the module is going
to be unloaded.

Thanks,
Gavin

>- Alistair
>
>> +}
>> +
>> +module_init(pnv_php_init);
>> +module_exit(pnv_php_exit);
>> +
>> +MODULE_VERSION(DRIVER_VERSION);
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR(DRIVER_AUTHOR);
>> +MODULE_DESCRIPTION(DRIVER_DESC);
>> 
>

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 1/3] powerpc: scan_features() updates incorrect bits

2016-04-14 Thread Anton Blanchard via Linuxppc-dev
The real LE feature entry in the ibm_pa_feature struct has the
wrong number of elements. Instead of checking for byte 5, bit 0,
we check for byte 0, bit 0, and we also incorrectly update cpu user
feature bit 5.

Fixes: 44ae3ab3358e ("powerpc: Free up some CPU feature bits by moving out 
MMU-related features")
Signed-off-by: Anton Blanchard 
Cc: sta...@vger.kernel.org
---
 arch/powerpc/kernel/prom.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 7030b03..9a3a7c6 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -158,7 +158,7 @@ static struct ibm_pa_feature {
{CPU_FTR_NOEXECUTE, 0, 0,   0, 6, 0},
{CPU_FTR_NODSISRALIGN, 0, 0,1, 1, 1},
{0, MMU_FTR_CI_LARGE_PAGE, 0,   1, 2, 0},
-   {CPU_FTR_REAL_LE, PPC_FEATURE_TRUE_LE, 5, 0, 0},
+   {CPU_FTR_REAL_LE, PPC_FEATURE_TRUE_LE, 0, 5, 0, 0},
/*
 * If the kernel doesn't support TM (ie. 
CONFIG_PPC_TRANSACTIONAL_MEM=n),
 * we don't want to turn on CPU_FTR_TM here, so we use CPU_FTR_TM_COMP
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/3] powerpc: Update cpu_user_features2 in scan_features()

2016-04-14 Thread Anton Blanchard via Linuxppc-dev
scan_features() updates cpu_user_features but not cpu_user_features2.

Amongst other things, cpu_user_features2 contains the user TM feature
bits which we must keep in sync with the kernel TM feature bit.

Signed-off-by: Anton Blanchard 
Cc: sta...@vger.kernel.org
---
 arch/powerpc/kernel/prom.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 9a3a7c6..99709bb 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -148,23 +148,24 @@ static struct ibm_pa_feature {
unsigned long   cpu_features;   /* CPU_FTR_xxx bit */
unsigned long   mmu_features;   /* MMU_FTR_xxx bit */
unsigned intcpu_user_ftrs;  /* PPC_FEATURE_xxx bit */
+   unsigned intcpu_user_ftrs2; /* PPC_FEATURE2_xxx bit */
unsigned char   pabyte; /* byte number in ibm,pa-features */
unsigned char   pabit;  /* bit number (big-endian) */
unsigned char   invert; /* if 1, pa bit set => clear feature */
 } ibm_pa_features[] __initdata = {
-   {0, 0, PPC_FEATURE_HAS_MMU, 0, 0, 0},
-   {0, 0, PPC_FEATURE_HAS_FPU, 0, 1, 0},
-   {CPU_FTR_CTRL, 0, 0,0, 3, 0},
-   {CPU_FTR_NOEXECUTE, 0, 0,   0, 6, 0},
-   {CPU_FTR_NODSISRALIGN, 0, 0,1, 1, 1},
-   {0, MMU_FTR_CI_LARGE_PAGE, 0,   1, 2, 0},
-   {CPU_FTR_REAL_LE, PPC_FEATURE_TRUE_LE, 0, 5, 0, 0},
+   {0, 0, PPC_FEATURE_HAS_MMU, 0,  0, 0, 0},
+   {0, 0, PPC_FEATURE_HAS_FPU, 0,  0, 1, 0},
+   {CPU_FTR_CTRL, 0, 0, 0, 0, 3, 0},
+   {CPU_FTR_NOEXECUTE, 0, 0, 0,0, 6, 0},
+   {CPU_FTR_NODSISRALIGN, 0, 0, 0, 1, 1, 1},
+   {0, MMU_FTR_CI_LARGE_PAGE, 0, 0,1, 2, 0},
+   {CPU_FTR_REAL_LE, PPC_FEATURE_TRUE_LE, 0, 0, 5, 0, 0},
/*
 * If the kernel doesn't support TM (ie. 
CONFIG_PPC_TRANSACTIONAL_MEM=n),
 * we don't want to turn on CPU_FTR_TM here, so we use CPU_FTR_TM_COMP
 * which is 0 if the kernel doesn't support TM.
 */
-   {CPU_FTR_TM_COMP, 0, 0, 22, 0, 0},
+   {CPU_FTR_TM_COMP, 0, 0, 0,  22, 0, 0},
 };
 
 static void __init scan_features(unsigned long node, const unsigned char *ftrs,
@@ -195,10 +196,12 @@ static void __init scan_features(unsigned long node, 
const unsigned char *ftrs,
if (bit ^ fp->invert) {
cur_cpu_spec->cpu_features |= fp->cpu_features;
cur_cpu_spec->cpu_user_features |= fp->cpu_user_ftrs;
+   cur_cpu_spec->cpu_user_features2 |= fp->cpu_user_ftrs2;
cur_cpu_spec->mmu_features |= fp->mmu_features;
} else {
cur_cpu_spec->cpu_features &= ~fp->cpu_features;
cur_cpu_spec->cpu_user_features &= ~fp->cpu_user_ftrs;
+   cur_cpu_spec->cpu_user_features2 &= ~fp->cpu_user_ftrs2;
cur_cpu_spec->mmu_features &= ~fp->mmu_features;
}
}
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 3/3] powerpc: Update TM user feature bits in scan_features()

2016-04-14 Thread Anton Blanchard via Linuxppc-dev
We need to update the user TM feature bits (PPC_FEATURE2_HTM and
PPC_FEATURE2_HTM) to mirror what we do with the kernel TM feature
bit.

At the moment, if firmware reports TM is not available we turn off
the kernel TM feature bit but leave the userspace ones on. Userspace
thinks it can execute TM instructions and it dies trying.

This (together with a QEMU patch) fixes PR KVM, which doesn't currently
support TM.

Signed-off-by: Anton Blanchard 
Cc: sta...@vger.kernel.org
---
 arch/powerpc/kernel/prom.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 99709bb..5beffd7 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -161,11 +161,12 @@ static struct ibm_pa_feature {
{0, MMU_FTR_CI_LARGE_PAGE, 0, 0,1, 2, 0},
{CPU_FTR_REAL_LE, PPC_FEATURE_TRUE_LE, 0, 0, 5, 0, 0},
/*
-* If the kernel doesn't support TM (ie. 
CONFIG_PPC_TRANSACTIONAL_MEM=n),
-* we don't want to turn on CPU_FTR_TM here, so we use CPU_FTR_TM_COMP
-* which is 0 if the kernel doesn't support TM.
+* If the kernel doesn't support TM (ie CONFIG_PPC_TRANSACTIONAL_MEM=n),
+* we don't want to turn on TM here, so we use the *_COMP versions
+* which are 0 if the kernel doesn't support TM.
 */
-   {CPU_FTR_TM_COMP, 0, 0, 0,  22, 0, 0},
+   {CPU_FTR_TM_COMP, 0, 0,
+PPC_FEATURE2_HTM_COMP|PPC_FEATURE2_HTM_NOSC_COMP, 22, 0, 0},
 };
 
 static void __init scan_features(unsigned long node, const unsigned char *ftrs,
-- 
2.7.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH kernel 2/2] powerpc/powernv/ioda2: Delay PE disposal

2016-04-14 Thread David Gibson
On Fri, Apr 15, 2016 at 11:29:32AM +1000, Alexey Kardashevskiy wrote:
> On 04/14/2016 11:40 AM, David Gibson wrote:
> >On Fri, Apr 08, 2016 at 04:36:44PM +1000, Alexey Kardashevskiy wrote:
> >>When SRIOV is disabled, the existing code presumes there is no
> >>virtual function (VF) in use and destroys all associated PEs.
> >>However it is possible to get into the situation when the user
> >>activated SRIOV disabling while a VF is still in use via VFIO.
> >>For example, unbinding a physical function (PF) while there is a guest
> >>running with a VF passed throuhgh via VFIO will trigger the bug.
> >>
> >>This defines an IODA2-specific IOMMU group release() callback.
> >>This moves all the disposal code from pnv_ioda_release_vf_PE() to this
> >>new callback so the cleanup happens when the last user of an IOMMU
> >>group released the reference.
> >>
> >>As pnv_pci_ioda2_release_dma_pe() was reduced to just calling
> >>iommu_group_put(), this merges pnv_pci_ioda2_release_dma_pe()
> >>into pnv_ioda_release_vf_PE().
> >>
> >>Signed-off-by: Alexey Kardashevskiy 
> >>---
> >>  arch/powerpc/platforms/powernv/pci-ioda.c | 33 
> >> +--
> >>  1 file changed, 14 insertions(+), 19 deletions(-)
> >>
> >>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> >>b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>index ce9f2bf..8108c54 100644
> >>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >>@@ -1333,27 +1333,25 @@ static void pnv_pci_ioda2_set_bypass(struct 
> >>pnv_ioda_pe *pe, bool enable);
> >>  static void pnv_pci_ioda2_group_release(void *iommu_data)
> >>  {
> >>struct iommu_table_group *table_group = iommu_data;
> >>+   struct pnv_ioda_pe *pe = container_of(table_group,
> >>+   struct pnv_ioda_pe, table_group);
> >>+   struct pci_controller *hose = pci_bus_to_host(pe->parent_dev->bus);
> >>+   struct pnv_phb *phb = hose->private_data;
> >>+   struct iommu_table *tbl = pe->table_group.tables[0];
> >>+   int64_t rc;
> >>
> >>-   table_group->group = NULL;
> >>-}
> >>-
> >>-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct 
> >>pnv_ioda_pe *pe)
> >>-{
> >>-   struct iommu_table*tbl;
> >>-   int64_t   rc;
> >>-
> >>-   tbl = pe->table_group.tables[0];
> >>rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> >
> >Is it safe to go manipulating the PE windows, etc. after SR-IOV is
> >disabled?
> 
> Manipulating windows in this case is just updating 8 bytes in the TVT. At
> this point a VF is expected to be destroyed but PE is expected to remain not
> free so pnv_ioda2_pick_m64_pe() (or pnv_ioda2_reserve_m64_pe()?) won't use
> it.

Ok.

> >When SR-IOV is disabled, you need to immediately disable the VF (I'm
> >guessing that happens somewhere) and stop all access to the VF
> >"hardware".
> 
> drivers/pci/iov.c
> ===
> static void sriov_disable(struct pci_dev *dev)
> {
> ...
> for (i = 0; i < iov->num_VFs; i++)
> pci_iov_remove_virtfn(dev, i, 0);
> ...
> pcibios_sriov_disable(dev);
> ===
> 
> pcibios_sriov_disable() is where pnv_pci_ioda2_release_dma_pe() is called 
> from.
> 
> >Only the iommu group structure *has* to stick around
> >until the reference count drops to zero.  I think other structures and
> >hardware reconfiguration can be deferred or done immediately,
> >whichever is more convenient.
> 
> I deferred everything because of convenience as iommu_table_group is
> embedded into pnv_ioda struct, not a pointer.

Ok.


With those queries answered,

Reviewed-by: David Gibson 

> >>if (rc)
> >>pe_warn(pe, "OPAL error %ld release DMA window\n", rc);
> >>
> >>pnv_pci_ioda2_set_bypass(pe, false);
> >>-   if (pe->table_group.group) {
> >>-   iommu_group_put(pe->table_group.group);
> >>-   BUG_ON(pe->table_group.group);
> >>-   }
> >>+
> >>+   BUG_ON(!tbl);
> >>pnv_pci_ioda2_table_free_pages(tbl);
> >>-   iommu_free_table(tbl, of_node_full_name(dev->dev.of_node));
> >>+   iommu_free_table(tbl, of_node_full_name(pe->parent_dev->dev.of_node));
> >>+
> >>+   pnv_ioda_deconfigure_pe(phb, pe);
> >>+   pnv_ioda_free_pe(phb, pe->pe_number);
> >>  }
> >>
> >>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
> >>@@ -1376,16 +1374,13 @@ static void pnv_ioda_release_vf_PE(struct pci_dev 
> >>*pdev)
> >>if (pe->parent_dev != pdev)
> >>continue;
> >>
> >>-   pnv_pci_ioda2_release_dma_pe(pdev, pe);
> >>-
> >>/* Remove from list */
> >>mutex_lock(&phb->ioda.pe_list_mutex);
> >>list_del(&pe->list);
> >>mutex_unlock(&phb->ioda.pe_list_mutex);
> >>
> >>-   pnv_ioda_deconfigure_pe(phb, pe);
> >>-
> >>-   pnv_ioda_free_pe(phb, pe->pe_number);
> >>+   if (pe->table_group.group)
> >>+   iommu_group_put(pe->table_group.group);
> >>}
> >>  }
> >>
> >
> 
> 

-- 
David Gibson| I'll have my music baroque,

[PATCH v2 0/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-14 Thread Akshay Adiga
The frequency transition latency from pmin to pmax is observed to be in few
millisecond granurality. And it usually happens to take a performance penalty
during sudden frequency rampup requests.

This patch set solves this problem by using a chip-level entity called "global
 pstates". Global pstate manages elements across other dependent core chiplets.
Typically, the element that needs to be managed is the voltage setting.
So by holding global pstates higher than local pstate for some amount of time
( ~5 seconds) the subsequent rampups could be made faster.

(1/2) patch removes the flag from cpufreq_policy->driver_data, so that it can
be used for tracking global pstates.

(2/2) patch adds code for global pstate management.

- The iozone results with this patchset, shows improvements in almost all cases.
- YCSB workload on redis with various  target operations per second shows 
better MaxLatency with this patch.

Changes from v1:
- Fixed coding style
- Added a routine to reset global_pstate_info instead of hacky memset
- Handled case when cpufreq_table_validate_and_show() fails
- changed int queue_gpstate_timer() to void queue_gpstate_timer()


Akshay Adiga (1):
  cpufreq: powernv: Ramp-down global pstate slower than local-pstate

Shilpasri G Bhat (1):
  cpufreq: powernv: Remove flag use-case of policy->driver_data

 drivers/cpufreq/powernv-cpufreq.c | 272 +++---
 1 file changed, 257 insertions(+), 15 deletions(-)

-- 
2.5.5

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 1/2] cpufreq: powernv: Remove flag use-case of policy->driver_data

2016-04-14 Thread Akshay Adiga
From: Shilpasri G Bhat 

commit 1b0289848d5d ("cpufreq: powernv: Add sysfs attributes to show
throttle stats") used policy->driver_data as a flag for one-time creation
of throttle sysfs files. Instead of this use 'kernfs_find_and_get()' to
check if the attribute already exists. This is required as
policy->driver_data is used for other purposes in the later patch.

Signed-off-by: Shilpasri G Bhat 
Signed-off-by: Akshay Adiga 
---
 drivers/cpufreq/powernv-cpufreq.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/cpufreq/powernv-cpufreq.c 
b/drivers/cpufreq/powernv-cpufreq.c
index 39ac78c..e2e2219 100644
--- a/drivers/cpufreq/powernv-cpufreq.c
+++ b/drivers/cpufreq/powernv-cpufreq.c
@@ -455,13 +455,15 @@ static int powernv_cpufreq_target_index(struct 
cpufreq_policy *policy,
 static int powernv_cpufreq_cpu_init(struct cpufreq_policy *policy)
 {
int base, i;
+   struct kernfs_node *kn;
 
base = cpu_first_thread_sibling(policy->cpu);
 
for (i = 0; i < threads_per_core; i++)
cpumask_set_cpu(base + i, policy->cpus);
 
-   if (!policy->driver_data) {
+   kn = kernfs_find_and_get(policy->kobj.sd, throttle_attr_grp.name);
+   if (!kn) {
int ret;
 
ret = sysfs_create_group(&policy->kobj, &throttle_attr_grp);
@@ -470,11 +472,8 @@ static int powernv_cpufreq_cpu_init(struct cpufreq_policy 
*policy)
policy->cpu);
return ret;
}
-   /*
-* policy->driver_data is used as a flag for one-time
-* creation of throttle sysfs files.
-*/
-   policy->driver_data = policy;
+   } else {
+   kernfs_put(kn);
}
return cpufreq_table_validate_and_show(policy, powernv_freqs);
 }
-- 
2.5.5

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v2 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-14 Thread Akshay Adiga
The frequency transition latency from pmin to pmax is observed to be in
few millisecond granurality. And it usually happens to take a performance
penalty during sudden frequency rampup requests.

This patch set solves this problem by using an entity called "global
pstates". The global pstate is a Chip-level entity, so the global entitiy
(Voltage) is managed across the cores. The local pstate is a Core-level
entity, so the local entity (frequency) is managed across threads.

This patch brings down global pstate at a slower rate than the local
pstate. Hence by holding global pstates higher than local pstate makes
the subsequent rampups faster.

A per policy structure is maintained to keep track of the global and
local pstate changes. The global pstate is brought down using a parabolic
equation. The ramp down time to pmin is set to ~5 seconds. To make sure
that the global pstates are dropped at regular interval , a timer is
queued for every 2 seconds during ramp-down phase, which eventually brings
the pstate down to local pstate.

Iozone results show fairly consistent performance boost.
YCSB on redis shows improved Max latencies in most cases.

Iozone write/rewite test were made with filesizes 200704Kb and 401408Kb
with different record sizes . The following table shows IOoperations/sec
with and without patch.

Iozone Results ( in op/sec) ( mean over 3 iterations )
-
file size-  withwithout   %
recordsize-IOtype   patch   patch   change
--
200704-1-SeqWrite   1616532 1615425 0.06
200704-1-Rewrite2423195 2303130 5.21
200704-2-SeqWrite   1628577 1602620 1.61
200704-2-Rewrite2428264 2312154 5.02
200704-4-SeqWrite   1617605 1617182 0.02
200704-4-Rewrite2430524 2351238 3.37
200704-8-SeqWrite   1629478 1600436 1.81
200704-8-Rewrite2415308 2298136 5.09
200704-16-SeqWrite  1619632 1618250 0.08
200704-16-Rewrite   2396650 2352591 1.87
200704-32-SeqWrite  1632544 1598083 2.15
200704-32-Rewrite   2425119 2329743 4.09
200704-64-SeqWrite  1617812 1617235 0.03
200704-64-Rewrite   2402021 2321080 3.48
200704-128-SeqWrite 1631998 1600256 1.98
200704-128-Rewrite  2422389 2304954 5.09
200704-256 SeqWrite 1617065 1616962 0.00
200704-256-Rewrite  2432539 2301980 5.67
200704-512-SeqWrite 1632599 1598656 2.12
200704-512-Rewrite  2429270 2323676 4.54
200704-1024-SeqWrite1618758 1616156 0.16
200704-1024-Rewrite 2431631 2315889 4.99
401408-1-SeqWrite   1631479 1608132 1.45
401408-1-Rewrite2501550 2459409 1.71
401408-2-SeqWrite   1617095 1626069 -0.55
401408-2-Rewrite2507557 2443621 2.61
401408-4-SeqWrite   1629601 1611869 1.10
401408-4-Rewrite2505909 2462098 1.77
401408-8-SeqWrite   1617110 1626968 -0.60
401408-8-Rewrite2512244 2456827 2.25
401408-16-SeqWrite  1632609 1609603 1.42
401408-16-Rewrite   2500792 2451405 2.01
401408-32-SeqWrite  1619294 1628167 -0.54
401408-32-Rewrite   2510115 2451292 2.39
401408-64-SeqWrite  1632709 1603746 1.80
401408-64-Rewrite   2506692 2433186 3.02
401408-128-SeqWrite 1619284 1627461 -0.50
401408-128-Rewrite  2518698 2453361 2.66
401408-256-SeqWrite 1634022 1610681 1.44
401408-256-Rewrite  2509987 2446328 2.60
401408-512-SeqWrite 1617524 1628016 -0.64
401408-512-Rewrite  2504409 2442899 2.51
401408-1024-SeqWrite1629812 1611566 1.13
401408-1024-Rewrite 2507620  24429682.64

Tested with YCSB workload (50% update + 50% read) over redis for 1 million
records and 1 million operation. Each test was carried out with target
operations per second and persistence disabled.

Max-latency (in us)( mean over 5 iterations )
---