Re: question about eeh_add_virt_device

2017-08-21 Thread Russell Currey
On Sun, 2017-08-13 at 16:54 +0200, Julia Lawall wrote:
> Hello,

Hello, sorry for the delayed response.

> 
> At the suggestion of Christoph Hellwig, I am working on inlining the
> functions stored in the err_handler field of a pci_driver structure into
> the pci_driver structure itself.  A number of functions in the file
> arch/powerpc/kernel/eeh_driver.c have code like:
> 
> if (!driver->err_handler ||
> !driver->err_handler->error_detected) {
> eeh_pcid_put(dev);
> return NULL;
> }
> 
> This I would just convert to:
> 
> if (!driver->error_detected) {
> eeh_pcid_put(dev);
> return NULL;
> }
> 
> But I am not sure what is best to do about eeh_add_virt_device, which
> contains:
> 
> if (driver->err_handler)
>   return NULL;
> 
> Should I try to find a subfield of the err_handler that is guaranteed to
> be there if anything is there?  Or could the test just be dropped, leaving
> a direct return NULL?

I believe the test can be dropped.

- Russell

> 
> thanks,
> julia
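
[Editorial note: the refactoring Julia describes, moving callbacks out of the err_handler sub-structure and into pci_driver itself, can be sketched as below. Struct and function names are simplified illustrations, not the real PCI API.]

```c
#include <assert.h>
#include <stddef.h>

/* Before: handlers live behind an err_handler pointer, so both the
 * pointer and the individual callback must be NULL-checked. */
struct err_handlers_old {
    int (*error_detected)(void);
};

struct pci_driver_old {
    const struct err_handlers_old *err_handler;
};

/* After: callbacks are inlined into the driver struct, so a single
 * NULL check on the callback suffices. */
struct pci_driver_new {
    int (*error_detected)(void);
};

static int dummy_detect(void) { return 1; }

static int have_handler_old(const struct pci_driver_old *drv)
{
    return drv->err_handler && drv->err_handler->error_detected;
}

static int have_handler_new(const struct pci_driver_new *drv)
{
    return drv->error_detected != NULL;
}
```

This also shows why the bare `if (driver->err_handler)` test has no direct equivalent after inlining: the container either exists as part of the driver struct or the individual callbacks are simply NULL.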



Re: [PATCH v2 2/5] powerpc: pseries: vio: match parent nodes with of_find_node_by_path

2017-08-21 Thread Michael Ellerman
Rob Herring  writes:

> In preparation to remove the full path from device_node.full_name, use
> of_find_node_by_path instead of open coding with strcmp.
>
> Signed-off-by: Rob Herring 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> ---
> v2:
> - rebased to linux-next and removed spurious change from patch 1.
>
>  arch/powerpc/platforms/pseries/vio.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/platforms/pseries/vio.c 
> b/arch/powerpc/platforms/pseries/vio.c
> index aa5ca74316fa..5754572deb23 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1357,9 +1357,9 @@ struct vio_dev *vio_register_device_node(struct 
> device_node *of_node)
>*/
>   parent_node = of_get_parent(of_node);
>   if (parent_node) {
> - if (!strcmp(parent_node->full_name, "/ibm,platform-facilities"))
> + if (parent_node == of_find_node_by_path("/ibm,platform-facilities"))
>   family = PFO;
> - else if (!strcmp(parent_node->full_name, "/vdevice"))
> + else if (parent_node == of_find_node_by_path("/vdevice"))
>   family = VDEVICE;

This leaks references to the looked up nodes.

Both these nodes are defined in PAPR (our hypervisor spec), and both of
them must have a device_type, either "ibm,platform-facilities" or
"vdevice".

Looking at the commit that added the code I don't see any particular
reason it used the comparison against full_name, rather than using the
device_type.

So I'm inclined to do this instead:

diff --git a/arch/powerpc/platforms/pseries/vio.c 
b/arch/powerpc/platforms/pseries/vio.c
index 8a47f168476b..f26f906e6021 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1357,9 +1357,9 @@ struct vio_dev *vio_register_device_node(struct 
device_node *of_node)
 */
parent_node = of_get_parent(of_node);
if (parent_node) {
-   if (!strcmp(parent_node->full_name, "/ibm,platform-facilities"))
+   if (!strcmp(parent_node->type, "ibm,platform-facilities"))
family = PFO;
-   else if (!strcmp(parent_node->full_name, "/vdevice"))
+   else if (!strcmp(parent_node->type, "vdevice"))
family = VDEVICE;
else {
pr_warn("%s: parent(%s) of %s not recognized.\n",


I've checked that both QEMU and kvmtool add the device_type, and I'm fairly
confident that PowerVM does too. Anyway I'll test it on all the machines
I can find.
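
[Editorial note: the leak Michael points out follows from the OF convention that of_find_node_by_path() returns its node with an elevated reference count, which the caller must drop with of_node_put(). A toy refcounting model, with illustrative names rather than the real OF API:]

```c
#include <assert.h>
#include <string.h>

/* Toy device_node with a reference count, mimicking the convention that
 * a lookup returns the node with its refcount raised and the caller is
 * responsible for dropping it again. */
struct node_sketch {
    const char *type;
    int refcount;
};

static struct node_sketch *find_node_sketch(struct node_sketch *n)
{
    n->refcount++;          /* caller now owns a reference */
    return n;
}

static void node_put_sketch(struct node_sketch *n)
{
    n->refcount--;
}

/* Comparing device_type directly takes no extra reference at all,
 * which is why the strcmp(type, ...) variant cannot leak. */
static int is_vdevice_sketch(const struct node_sketch *n)
{
    return strcmp(n->type, "vdevice") == 0;
}
```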

cheers


Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

2017-08-21 Thread Michael Ellerman
Frederic Barrat  writes:

> Hi Ben,
>
> On 24/07/2017 at 06:28, Benjamin Herrenschmidt wrote:
>> Instead of comparing the whole CPU mask every time, let's
>> keep a counter of how many bits are set in the mask. Thus
>> testing for a local mm only requires testing if that counter
>> is 1 and the current CPU bit is set in the mask.
>
>
> I'm trying to see if we could merge this patch with what I'm trying to 
> do to mark a context as requiring global TLBIs.
> In http://patchwork.ozlabs.org/patch/796775/
> I'm introducing a 'flags' per memory context, using one bit to say if 
> the context needs global TLBIs.
> The two could co-exist, just checking... Are you thinking of using the 
> actual active_cpus count down the road, or is it just a matter of 
> knowing whether there is more than one active cpu?

Currently it's just an optimisation to save comparing the full cpumask
every time to detect if we can go local vs broadcast.

So if you increment it then it will mean we do broadcast, which is what
you need.

It's possible in future we might try to do something more complicated,
like send targeted IPIs etc. But if we ever do that we can adapt CXL
then.
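
[Editorial note: the counter optimisation under discussion can be modelled in a few lines of standalone C. Names here are illustrative, not the kernel's cpumask API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Instead of comparing the whole CPU mask, keep a count of set bits and
 * test (count == 1 && our bit set). Limited to 64 CPUs for the sketch. */
struct mm_sketch {
    unsigned long cpumask;   /* one bit per CPU */
    int active_cpus;         /* number of bits set in cpumask */
};

static void mm_sketch_add_cpu(struct mm_sketch *mm, int cpu)
{
    unsigned long bit = 1UL << cpu;

    if (!(mm->cpumask & bit)) {
        mm->cpumask |= bit;
        mm->active_cpus++;
    }
}

/* O(1) check replacing a full cpumask comparison against
 * cpumask_of(smp_processor_id()). */
static bool mm_is_thread_local_sketch(const struct mm_sketch *mm, int this_cpu)
{
    return mm->active_cpus == 1 && (mm->cpumask & (1UL << this_cpu));
}
```

Incrementing the counter (as Frederic's global-TLBI marker effectively would) makes the local test fail, forcing broadcast invalidations.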

cheers


Re: [PATCH v3 1/3] powerpc/mm: Add marker for contexts requiring global TLB invalidations

2017-08-21 Thread Alistair Popple
For what it's worth this worked fine when testing with the NPU as well.

Tested-by: Alistair Popple 

On Thu, 3 Aug 2017 05:16:35 PM Balbir Singh wrote:
> On Thu, Aug 3, 2017 at 6:29 AM, Frederic Barrat
>  wrote:
> > Introduce a new 'flags' attribute per context and define its first bit
> > to be a marker requiring all TLBIs for that context to be broadcasted
> > globally. Once that marker is set on a context, it cannot be removed.
> >
> > Such a marker is useful for memory contexts used by devices behind the
> > NPU and CAPP/PSL. The NPU and the PSL keep their own translation cache
> > so they need to see all the TLBIs for those contexts.
> >
> > Rename mm_is_thread_local() to mm_is_invalidation_local() to better
> > describe what it's doing.
> 
> mm_is_tlb_local? Just nitpicking
> 
> >
> > Signed-off-by: Frederic Barrat 
> > ---
> >  arch/powerpc/include/asm/book3s/64/mmu.h | 18 ++
> >  arch/powerpc/include/asm/tlb.h   | 27 +++
> >  arch/powerpc/mm/mmu_context_book3s64.c   |  1 +
> >  arch/powerpc/mm/tlb-radix.c  |  8 
> >  arch/powerpc/mm/tlb_hash64.c |  3 ++-
> >  5 files changed, 48 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> > index 5b4023c616f7..03d4515ecfa6 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> > @@ -79,8 +79,12 @@ struct spinlock;
> >  /* Maximum possible number of NPUs in a system. */
> >  #define NV_MAX_NPUS 8
> >
> > +/* Bits definition for the context flags */
> > +#define MM_GLOBAL_TLBIE0   /* TLBI must be global */
> > +
> >  typedef struct {
> > mm_context_id_t id;
> > +   unsigned long flags;
> > u16 user_psize; /* page size index */
> >
> > /* NPU NMMU context */
> > @@ -165,5 +169,19 @@ extern void radix_init_pseries(void);
> >  static inline void radix_init_pseries(void) { };
> >  #endif
> >
> > +/*
> > + * Mark the memory context as requiring global TLBIs, when used by
> > + * GPUs or CAPI accelerators managing their own TLB or ERAT.
> > + */
> > +static inline void mm_context_set_global_tlbi(mm_context_t *ctx)
> > +{
> > > +   set_bit(MM_GLOBAL_TLBIE, &ctx->flags);
> > +}
> > +
> > +static inline bool mm_context_get_global_tlbi(mm_context_t *ctx)
> > +{
> > > +   return test_bit(MM_GLOBAL_TLBIE, &ctx->flags);
> > +}
> > +
> >  #endif /* __ASSEMBLY__ */
> >  #endif /* _ASM_POWERPC_BOOK3S_64_MMU_H_ */
> > diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
> > index 609557569f65..f06dcac82097 100644
> > --- a/arch/powerpc/include/asm/tlb.h
> > +++ b/arch/powerpc/include/asm/tlb.h
> > @@ -69,10 +69,29 @@ static inline int mm_is_core_local(struct mm_struct *mm)
> >   topology_sibling_cpumask(smp_processor_id()));
> >  }
> >
> > -static inline int mm_is_thread_local(struct mm_struct *mm)
> > +static inline int mm_is_invalidation_local(struct mm_struct *mm)
> >  {
> > -   return cpumask_equal(mm_cpumask(mm),
> > - cpumask_of(smp_processor_id()));
> > +   int rc;
> > +
> > +   rc = cpumask_equal(mm_cpumask(mm),
> > +   cpumask_of(smp_processor_id()));
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +   if (rc) {
> > +   /*
> > +* Check if context requires global TLBI.
> > +*
> > +* We need to make sure the PTE update is happening
> > +* before reading the context global flag. Otherwise,
> > +* reading the flag may be re-ordered and happen
> > +* first, and we could end up in a situation where the
> > +* old PTE was seen by the NPU/PSL/device, but the
> > +* TLBI is local.
> > +*/
> > +   mb();
> 
> smp_mb()?
> 
> > > +   rc = !mm_context_get_global_tlbi(&mm->context);
> > +   }
> 
> Otherwise looks good!
> 
> Acked-by: Balbir Singh 
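
[Editorial note: the sticky-flag pattern in this patch, set once, never cleared, read with ordering against the preceding PTE update, can be sketched in portable C11 atomics. Names are hypothetical stand-ins, not the kernel bitops.]

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for the per-context flags word. */
#define MM_GLOBAL_TLBIE_BIT 0UL

struct ctx_sketch {
    atomic_ulong flags;
};

/* Once set, the marker stays set for the lifetime of the context. */
static void ctx_set_global_tlbi(struct ctx_sketch *ctx)
{
    atomic_fetch_or(&ctx->flags, 1UL << MM_GLOBAL_TLBIE_BIT);
}

static bool ctx_get_global_tlbi(struct ctx_sketch *ctx)
{
    /* The seq_cst load stands in for the mb() in the patch: the PTE
     * update must be visible before the flag is read, so a device that
     * saw the old PTE cannot be paired with a local-only TLBI. */
    return atomic_load(&ctx->flags) & (1UL << MM_GLOBAL_TLBIE_BIT);
}
```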



Re: [PATCH v3 2/3] cxl: Mark context requiring global TLBIs

2017-08-21 Thread Alistair Popple
On Thu, 3 Aug 2017 05:22:31 PM Balbir Singh wrote:
> On Thu, Aug 3, 2017 at 6:29 AM, Frederic Barrat
>  wrote:
> > The PSL and XSL need to see all TLBIs pertinent to the memory contexts
> > used on the adapter. For the hash memory model, it is done by making
> > all TLBIs global as soon as the cxl driver is in use. For radix, we
> > need something similar, but we can refine and only convert to global
> > the invalidations for contexts actually used by the device.
> >
> > So mark the contexts being attached to the cxl adapter as requiring
> > global TLBIs.
> >
> 
> Looking at these bits, I'm wondering whether the previous code should check
> for CONFIG_CXL in mm_is_invalidation_local? That would pretty much cover
> BOOK3S_64

That won't work as we also need this for other nest MMU users which might
not have CONFIG_CXL set - for example the NPU, which will use
mm_context_set_global_tlbi() for processes on the GPU.

Regards,

Alistair

> > Signed-off-by: Frederic Barrat 
> > ---
> >  drivers/misc/cxl/api.c| 12 ++--
> >  drivers/misc/cxl/cxllib.c |  7 +++
> >  drivers/misc/cxl/file.c   | 12 ++--
> >  3 files changed, 27 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
> > index 1a138c83f877..d3f3fdede755 100644
> > --- a/drivers/misc/cxl/api.c
> > +++ b/drivers/misc/cxl/api.c
> > @@ -332,8 +332,17 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
> > cxl_context_mm_count_get(ctx);
> >
> > /* decrement the use count */
> > -   if (ctx->mm)
> > +   if (ctx->mm) {
> > mmput(ctx->mm);
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > > +   mm_context_set_global_tlbi(&ctx->mm->context);
> 
> Do we have CXL for non PPC_BOOK3S_64?
> 
> > +   /*
> > +* Barrier guarantees that the device will
> > +* receive all TLBIs from that point on
> > +*/
> > +   wmb();
> 
> smp_wmb();
> 
> > +#endif
> > +   }
> > }
> 
> The other comments are the same as this (in the snip'd code)
> 
> Balbir



[PATCH resend] timers: Fix excessive granularity of new timers after a nohz idle

2017-08-21 Thread Nicholas Piggin
When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.

However there are several problems:

- If an existing timer is modified, the base is forwarded only after
  the index is calculated.

- The base is not forwarded by add_timer_on.

- There is a window after a timer is restarted from a nohz idle, after
  it is marked not-idle and before the timer tick on this CPU, where a
  timer may be added but the ancient base does not get forwarded.

These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which causes the RCU lockup detector to trigger,
among other things.

Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.
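
[Editorial note: a toy model of the fix, with illustrative names rather than the kernel's timer wheel, showing why forwarding an idle base keeps a new relative timeout accurate.]

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the was_idle fix: if the base has been idle since it
 * last ran or was forwarded, catch its clock up to jiffies before a new
 * timer's bucket index is computed. */
struct base_sketch {
    unsigned long clk;      /* last jiffy the base processed */
    bool is_idle;
    bool was_idle;          /* idle at any point since last run/forward */
};

static void forward_base_sketch(struct base_sketch *b, unsigned long jnow)
{
    if (!b->was_idle)
        return;             /* common case: the tick keeps clk current */
    b->was_idle = b->is_idle;
    if ((long)(jnow - b->clk) < 2)
        return;
    b->clk = jnow;          /* the real code also clamps to next_expiry */
}

/* Effective sleep for a new relative timeout: expiry is computed from
 * base->clk, so an un-forwarded stale base inflates it by (jnow - clk). */
static unsigned long effective_timeout_sketch(struct base_sketch *b,
                                              unsigned long jnow,
                                              unsigned long timeout)
{
    forward_base_sketch(b, jnow);
    return (b->clk + timeout) - jnow;
}
```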

There is still a problem where the mod_timer optimization, when a timer is
modified with the same expiry time, can result in excessive granularity
relative to the new shorter interval. That is not addressed by this
patch because checking base->was_idle would increase overhead and it's
a rather special case (you could reason that the caller should not
expect change in absolute expiry time due to such an operation). So
that is noted as a comment.

As well as fixing the visible RCU softlockup failures, I tested an
idle system (with no lockup watchdogs running) and traced all
non-deferrable timer expiries for 1000s, and analysed wakeup latency
relative to requested latency.  1.0 means we slept for as many jiffies
as requested, 2.0 means we slept 2x the time (this suffers jiffies
round-up skew at low absolute times):

 max avg  std
upstream   506.01.20 4.68
patched  2.01.08 0.15

This was noticed due to the lockup detector Kconfig changes dropping it
out of people's .configs. When the lockup detectors are enabled, no CPU
can go idle for longer than 4 seconds, which limits the granularity
errors. Sub-optimal timer behaviour is observable on a smaller scale:

 max avg  std
upstream 9.01.05 0.19
patched  2.01.04 0.11

Tested-by: David Miller 
Signed-off-by: Nicholas Piggin 
---

Hi Andrew,

I would have preferred to get comments from the timer maintainers, but
they've been busy or away for the past couple of weeks. Perhaps you
would consider carrying it until then?

Thanks,
Nick

 kernel/time/timer.c | 44 
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf18854..dd7be9fe6839 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@ struct timer_base {
boolmigration_enabled;
boolnohz_active;
boolis_idle;
+   boolwas_idle; /* was it idle since last run/fwded */
DECLARE_BITMAP(pending_map, WHEEL_SIZE);
struct hlist_head   vectors[WHEEL_SIZE];
 } cacheline_aligned;
@@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-   unsigned long jnow = READ_ONCE(jiffies);
+   unsigned long jnow;
 
/*
-* We only forward the base when it's idle and we have a delta between
-* base clock and jiffies.
+* We only forward the base when we are idle or have just come out
+* of idle (was_idle logic), and have a delta between base clock
+* and jiffies. In the common case, run_timers will take care of it.
 */
-   if (!base->is_idle || (long) (jnow - base->clk) < 2)
+   if (likely(!base->was_idle))
+   return;
+
+   jnow = READ_ONCE(jiffies);
+   base->was_idle = base->is_idle;
+   if ((long)(jnow - base->clk) < 2)
return;
 
/*
@@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires, bool pending_only)
 * same array bucket then just return:
 */
if (timer_pending(timer)) {
+   /*
+* The downside of this optimization is that it can result in
+* larger granularity than you would get from adding a new
+* timer with this expiry. Would a timer flag for networking
+* be appropriate, then we can try to keep expiry of general
+* timers within ~1/8th of their interval?
+*/
if (timer->expires == expires)
return 1;
 
@@ -948,6 +962,7 @@ __mod_timer(struct timer_list *timer, unsigned long 
expires, bool pending_only)
 * dequeue/enqueue dance.
 */
+   base = lock_timer_base(timer, &flags);
+   forward_timer_base(base);
 
clk = 

[PATCH] powerpc/64s: Fix replay interrupt return label name

2017-08-21 Thread Michael Ellerman
In __replay_interrupt() we take the address of a local label so we can
return to it later. However the assembler turns the local label into a
symbol with a name like ".L1^B42" - where "^B" is literally "\002".
This does not make for pleasant stack traces. Fix it by giving the
label a sensible name.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/exceptions-64s.S | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 1a4c809e841b..b97a40e18cd3 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1671,7 +1671,7 @@ _GLOBAL(__replay_interrupt)
 * we don't give a damn about, so we don't bother storing them.
 */
mfmsr   r12
-   LOAD_REG_ADDR(r11, 1f)
+   LOAD_REG_ADDR(r11, replay_interrupt_return)
mfcrr9
ori r12,r12,MSR_EE
cmpwi   r3,0x900
@@ -1689,7 +1689,7 @@ FTR_SECTION_ELSE
cmpwi   r3,0xa00
beq doorbell_super_common_msgclr
 ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE)
-1:
+replay_interrupt_return:
blr
 
 _ASM_NOKPROBE_SYMBOL(__replay_interrupt)
-- 
2.7.4



Re: [PATCH v2 17/20] perf: Add a speculative page fault sw event

2017-08-21 Thread Michael Ellerman
Anshuman Khandual  writes:

> On 08/18/2017 03:35 AM, Laurent Dufour wrote:
>> Add a new software event to count succeeded speculative page faults.
>> 
>> Signed-off-by: Laurent Dufour 
>
> Should be merged with the next patch.

No it shouldn't.

cheers


Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-21 Thread Michael Ellerman
Tejun Heo  writes:

> On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
>> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
>> is built.
>> 
>> Unfortunately, on powerpc, the Firmware is only able to provide the
>> node of a CPU if the CPU is present. So, in our case (possible CPU)
>> CPU ids are known, but as the CPU is not present, the node id is
>> unknown and all the unplugged CPUs are attached to node 0.
>
> This is something powerpc needs to fix.

There is no way for us to fix it.

At boot, for possible but not present CPUs, we have no way of knowing
the CPU <-> node mapping, firmware simply doesn't tell us.

> Workqueue isn't the only one making this assumption. mm as a whole
> assumes that CPU <-> node mapping is stable regardless of hotplug
> events.

At least in this case I don't think the mapping changes, it's just we
don't know the mapping at boot.

Currently we have to report possible but not present CPUs as belonging
to node 0, because otherwise we trip this helpful piece of code:

for_each_possible_cpu(cpu) {
node = cpu_to_node(cpu);
if (WARN_ON(node == NUMA_NO_NODE)) {
pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
/* happens iff arch is bonkers, let's just proceed */
return;
}

But if we remove that, we could then accurately report NUMA_NO_NODE at
boot, and then update the mapping when the CPU is hotplugged.
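
[Editorial note: the scheme Michael sketches, report NUMA_NO_NODE honestly at boot and fill in the mapping at hotplug time, can be modelled with hypothetical helpers:]

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)
#define NR_CPUS_SKETCH 8

/* Boot: present CPUs get their real node from firmware; possible-but-
 * absent CPUs start as NUMA_NO_NODE instead of a bogus node 0. */
static int cpu_node[NR_CPUS_SKETCH] = {
    0, 1, NUMA_NO_NODE, NUMA_NO_NODE,
    NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE, NUMA_NO_NODE,
};

static int cpu_to_node_sketch(int cpu)
{
    return cpu_node[cpu];
}

/* Hotplug: firmware reveals the node only once the CPU turns up, and
 * the mapping is recorded then. */
static void cpu_present_set_node(int cpu, int node)
{
    cpu_node[cpu] = node;
}
```

Consumers like wq_numa_init() would then need to tolerate NUMA_NO_NODE for not-yet-present CPUs rather than warning and bailing out.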

cheers


Re: [PATCH v2 00/20] Speculative page faults

2017-08-21 Thread Paul E. McKenney
On Mon, Aug 21, 2017 at 11:58:03AM +0530, Anshuman Khandual wrote:
> On 08/18/2017 03:34 AM, Laurent Dufour wrote:
> > This is a port on kernel 4.13 of the work done by Peter Zijlstra to
> > handle page fault without holding the mm semaphore [1].
> > 
> > The idea is to try to handle user space page faults without holding the
> > mmap_sem. This should allow better concurrency for massively threaded
> > processes, since the page fault handler will not wait for other threads'
> > memory layout changes to be done, assuming that the change is done in
> > another part of the process's memory space. This type of page fault is
> > named a speculative page fault. If the speculative page fault fails
> > because a concurrent change is detected or because the underlying PMD or
> > PTE tables are not yet allocated, its processing fails and a classic
> > page fault is then tried.
> > 
> > The speculative page fault (SPF) has to look for the VMA matching the fault
> > address without holding the mmap_sem, so the VMA list is now managed using
> > SRCU allowing lockless walking. The only impact would be the deferred file
> > derefencing in the case of a file mapping, since the file pointer is
> > released once the SRCU cleaning is done.  This patch relies on the change
> > done recently by Paul McKenney in SRCU which now runs a callback per CPU
> > instead of per SRCU structure [1].
> > 
> > The VMA's attributes checked during the speculative page fault processing
> > have to be protected against parallel changes. This is done by using a per
> > VMA sequence lock. This sequence lock allows the speculative page fault
> > handler to fast check for parallel changes in progress and to abort the
> > speculative page fault in that case.
> > 
> > Once the VMA is found, the speculative page fault handler would check for
> > the VMA's attributes to verify that the page fault has to be handled
> > correctly or not. Thus the VMA is protected through a sequence lock which
> > allows fast detection of concurrent VMA changes. If such a change is
> > detected, the speculative page fault is aborted and a *classic* page fault
> > is tried.  VMA sequence locks are added when VMA attributes which are
> > checked during the page fault are modified.
> > 
> > When the PTE is fetched, the VMA is checked to see if it has been changed,
> > so once the page table is locked, the VMA is valid, so any other changes
> > leading to touching this PTE will need to lock the page table, so no
> > parallel change is possible at this time.
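
[Editorial note: the per-VMA sequence-lock protocol described above can be sketched as follows. This is a hypothetical single-threaded illustration of the retry logic, not the kernel's seqcount API.]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sequence counter: writers bump it to odd before changing the VMA
 * and to even after; a speculative reader samples it, does its work,
 * and succeeds only if the count is unchanged and even. */
struct vma_sketch {
    unsigned seq;
    unsigned long vm_flags;
};

static unsigned vma_read_begin(const struct vma_sketch *v)
{
    return v->seq;
}

/* True means a concurrent change was (or may have been) in progress,
 * so the speculative fault must abort and take the classic path. */
static bool vma_read_retry(const struct vma_sketch *v, unsigned start)
{
    return (start & 1) || v->seq != start;
}

static void vma_write_begin(struct vma_sketch *v) { v->seq++; }
static void vma_write_end(struct vma_sketch *v)   { v->seq++; }
```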
> > 
> > Compared to the Peter's initial work, this series introduces a spin_trylock
> > when dealing with speculative page fault. This is required to avoid dead
> > lock when handling a page fault while a TLB invalidate is requested by an
> > other CPU holding the PTE. Another change due to a lock dependency issue
> > with mapping->i_mmap_rwsem.
> > 
> > In addition some VMA field values which are used once the PTE is unlocked
> > at the end of the page fault path are saved into the vm_fault structure to
> > use the values matching the VMA at the time the PTE was locked.
> > 
> > This series builds on top of v4.13-rc5 and is functional on x86 and
> > PowerPC.
> > 
> > Tests have been made using a large commercial in-memory database on a
> > PowerPC system with 752 CPU using RFC v5. The results are very encouraging
> > since the loading of the 2TB database was faster by 14% with the
> > speculative page fault.
> > 
> 
> You specifically mention loading as most of the page faults will
> happen at that time and then the working set will settle down with
> very few page faults thereafter? That means unless there is
> another wave of page faults we won't notice performance improvement
> during the runtime.
> 
> > Using ebizzy test [3], which spreads a lot of threads, the result are good
> > when running on both a large or a small system. When using kernbench, the
> 
> The performance improvements are greater as there is a lot of creation
> and destruction of anon mappings which generates constant flow of page
> faults to be handled.
> 
> > result are quite similar which expected as not so much multi threaded
> > processes are involved. But there is no performance degradation neither
> > which is good.
> 
> If we compile with 'make -j N' there would be a lot of threads but I
> guess the problem is SPF does not support handling file mapping IIUC
> which limits the performance improvement for some workloads.
> 
> > 
> > --
> > Benchmarks results
> > 
> > Note these test have been made on top of 4.13-rc3 with the following patch
> > from Paul McKenney applied: 
> >  "srcu: Provide ordering for CPU not involved in grace period" [5]
> 
> Is this patch an improvement for SRCU which we are using for walking VMAs.

It is a tweak to an earlier patch that parallelizes SRCU callback
handling.

Thanx, Paul

> > Ebizzy:
> > ---
> > The test is counting the number of records 

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Paul E. McKenney
On Mon, Aug 21, 2017 at 10:52:58AM +1000, Nicholas Piggin wrote:
> On Sun, 20 Aug 2017 14:14:29 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
> > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
> > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > Nicholas Piggin  wrote:
> > > >   
> > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > "Paul E. McKenney"  wrote:  
> > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > Thomas, John, am I misinterpreting the timer trace event messages?  
> > > > > >   
> > > > > 
> > > > > So I did some digging, and what you find is that rcu_sched seems to
> > > > > do a simple schedule_timeout(1) and just goes out to lunch for many
> > > > > seconds.
> > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > one of these events, it usually removes the timer with 
> > > > > del_timer_sync).
> > > > > 
> > > > > So this patch seems to fix it. Testing, comments welcome.  
> > > > 
> > > > Okay this had a problem of trying to forward the timer from a timer
> > > > callback function.
> > > > 
> > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > only when needed, it probably saves a load).  
> > > 
> > > Giving this one a whirl!  
> > 
> > No joy here, but then again there are other reasons to believe that I
> > am seeing a different bug than Dave and Jonathan are.
> > 
> > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > not statistically different from what I see without either patch.
> > 
> > But no statistical difference compared to without patch, and I still
> > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > Hmmm...  I am also seeing that without any of your patches.  Might
> > be hypervisor preemption, I guess.
> 
> Okay it makes the warnings go away for me, but I'm just booting then
> leaving the system idle. You're doing some CPU hotplug activity?

Yes, along with rcutorture, so a very different workload.

Thanx, Paul



[PATCHv2 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

2017-08-21 Thread Brian King
This second version also sets up jiffies_at_alloc in scsi_init_request.
This has been tested without the second patch in the series and
I've confirmed I now see the following in the logs after booting:

[  121.718088] sd 1:2:0:0: timing out command, waited 120s
[  121.798081] sd 1:2:1:0: timing out command, waited 120s

Without this patch I was never seeing these messages, indicating
the retry timer code wasn't working. Also, after seeing these
messages, I've confirmed there are no longer any hung tasks
in the kernel with sysrq-w, while before, without this patch,
I would see hung tasks for the scsi_report_opcodes calls which
were getting retried forever.

8<

Move the initialization of scsi_cmd->jiffies_at_alloc to allocation
time rather than prep time. Also ensure that jiffies_at_alloc
is preserved when we go through prep. This lets us send retries
through prep again and not break the overall retry timer logic
in scsi_softirq_done.

Suggested-by: Bart Van Assche 
Signed-off-by: Brian King 
---

Index: linux-2.6.git/drivers/scsi/scsi_lib.c
===
--- linux-2.6.git.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6.git/drivers/scsi/scsi_lib.c
@@ -1154,6 +1154,7 @@ void scsi_init_command(struct scsi_devic
void *buf = cmd->sense_buffer;
void *prot = cmd->prot_sdb;
unsigned int unchecked_isa_dma = cmd->flags & SCMD_UNCHECKED_ISA_DMA;
+   unsigned long jiffies_at_alloc = cmd->jiffies_at_alloc;
 
/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
@@ -1164,7 +1165,7 @@ void scsi_init_command(struct scsi_devic
cmd->prot_sdb = prot;
cmd->flags = unchecked_isa_dma;
INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
-   cmd->jiffies_at_alloc = jiffies;
+   cmd->jiffies_at_alloc = jiffies_at_alloc;
 
scsi_add_cmd_to_list(cmd);
 }
@@ -2016,6 +2017,7 @@ static int scsi_init_request(struct blk_
if (!cmd->sense_buffer)
return -ENOMEM;
cmd->req.sense = cmd->sense_buffer;
+   cmd->jiffies_at_alloc = jiffies;
 
if (scsi_host_get_prot(shost)) {
sg = (void *)cmd + sizeof(struct scsi_cmnd) +
@@ -2119,6 +2121,7 @@ static int scsi_init_rq(struct request_q
if (!cmd->sense_buffer)
goto fail;
cmd->req.sense = cmd->sense_buffer;
+   cmd->jiffies_at_alloc = jiffies;
 
if (scsi_host_get_prot(shost) >= SHOST_DIX_TYPE0_PROTECTION) {
cmd->prot_sdb = kmem_cache_zalloc(scsi_sdb_cache, gfp);



Re: [PATCH 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

2017-08-21 Thread Brian King
Scratch this one... Version 2 on the way with the corresponding changes
in scsi_init_request...

-Brian


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



[PATCH 2/2] scsi: Preserve retry counter through scsi_prep_fn

2017-08-21 Thread Brian King
Save / restore the retry counter in scsi_cmd in scsi_init_command.
This allows us to go back through scsi_init_command for retries
and not forget we are doing a retry.

Signed-off-by: Brian King 
---

Index: linux-2.6.git/drivers/scsi/scsi_lib.c
===
--- linux-2.6.git.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6.git/drivers/scsi/scsi_lib.c
@@ -1155,6 +1155,7 @@ void scsi_init_command(struct scsi_devic
void *prot = cmd->prot_sdb;
unsigned int unchecked_isa_dma = cmd->flags & SCMD_UNCHECKED_ISA_DMA;
unsigned long jiffies_at_alloc = cmd->jiffies_at_alloc;
+   int retries = cmd->retries;
 
/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
@@ -1166,6 +1167,7 @@ void scsi_init_command(struct scsi_devic
cmd->flags = unchecked_isa_dma;
INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
cmd->jiffies_at_alloc = jiffies_at_alloc;
+   cmd->retries = retries;
 
scsi_add_cmd_to_list(cmd);
 }



[PATCH 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

2017-08-21 Thread Brian King
Move the initialization of scsi_cmd->jiffies_at_alloc to allocation
time rather than prep time. Also ensure that jiffies_at_alloc
is preserved when we go through prep. This lets us send retries
through prep again and not break the overall retry timer logic
in scsi_softirq_done.

Suggested-by: Bart Van Assche 
Signed-off-by: Brian King 
---

Index: linux-2.6.git/drivers/scsi/scsi_lib.c
===
--- linux-2.6.git.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6.git/drivers/scsi/scsi_lib.c
@@ -1154,6 +1154,7 @@ void scsi_init_command(struct scsi_devic
void *buf = cmd->sense_buffer;
void *prot = cmd->prot_sdb;
unsigned int unchecked_isa_dma = cmd->flags & SCMD_UNCHECKED_ISA_DMA;
+   unsigned long jiffies_at_alloc = cmd->jiffies_at_alloc;
 
/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
@@ -1164,7 +1165,7 @@ void scsi_init_command(struct scsi_devic
cmd->prot_sdb = prot;
cmd->flags = unchecked_isa_dma;
INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
-   cmd->jiffies_at_alloc = jiffies;
+   cmd->jiffies_at_alloc = jiffies_at_alloc;
 
scsi_add_cmd_to_list(cmd);
 }
@@ -2119,6 +2120,7 @@ static int scsi_init_rq(struct request_q
if (!cmd->sense_buffer)
goto fail;
cmd->req.sense = cmd->sense_buffer;
+   cmd->jiffies_at_alloc = jiffies;
 
if (scsi_host_get_prot(shost) >= SHOST_DIX_TYPE0_PROTECTION) {
cmd->prot_sdb = kmem_cache_zalloc(scsi_sdb_cache, gfp);



[PATCH 0/2] Allow scsi_prep_fn to occur for retried commands

2017-08-21 Thread Brian King
The following two patches address the hang issue being observed
with Bart's patch on powerpc. The first patch moves the initialization
of jiffies_at_alloc from scsi_init_command to scsi_init_rq, and ensures
we don't zero jiffies_at_alloc in scsi_init_command. The second patch
saves / restores the retry counter in scsi_init_command which lets us
go through scsi_init_command for retries and not forget why we were
there. 

These patches have only been boot tested on my Power machine with ipr
to ensure they fix the issue I was seeing.

-Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



[PATCH V9 2/2] powerpc/nodes: Ensure enough nodes avail for operations

2017-08-21 Thread Michael Bringmann
To: linuxppc-dev@lists.ozlabs.org

From: Michael Bringmann 

To: linux-ker...@vger.kernel.org
Cc: Michael Ellerman 
Cc: Michael Bringmann 
Cc: John Allen 
Cc: Nathan Fontenot 

powerpc/nodes: On systems like PowerPC which allow 'hot-add' of CPU
or memory resources, it may occur that the new resources are to be
inserted into nodes that were not used for these resources at bootup.
In the kernel, any node that is used must be defined and initialized
at boot.

This patch extracts the value of the lowest domain level (number of
allocable resources) from the "rtas" device tree property
"ibm,max-associativity-domains" to use as the maximum number of nodes
to setup as possibly available in the system.  This new setting will
override the instruction,

nodes_and(node_possible_map, node_possible_map, node_online_map);

presently seen in the function arch/powerpc/mm/numa.c:initmem_init().

If the property is not present at boot, no operation will be performed
to define or enable additional nodes.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/mm/numa.c |   44 
 1 file changed, 44 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 3fd4536..3ae6510 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -893,6 +893,48 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
NODE_DATA(nid)->node_spanned_pages = spanned_pages;
 }
 
+static void __init node_associativity_setup(void)
+{
+   struct device_node *rtas;
+   printk(KERN_INFO "%s:%d\n", __FUNCTION__, __LINE__);
+
+   rtas = of_find_node_by_path("/rtas");
+   if (rtas) {
+   const __be32 *prop;
+   u32 len, entries, levelval, i;
+   printk(KERN_INFO "%s:%d\n", __FUNCTION__, __LINE__);
+
+   prop = of_get_property(rtas, "ibm,max-associativity-domains", &len);
+   if (!prop || len < sizeof(unsigned int)) {
+   printk(KERN_INFO "%s:%d\n", __FUNCTION__, __LINE__);
+   goto endit;
+   }
+
+   entries = of_read_number(prop++, 1);
+
+   if (len < (entries * sizeof(unsigned int))) {
+   printk(KERN_INFO "%s:%d\n", __FUNCTION__, __LINE__);
+   goto endit;
+   }
+
+   for (i = 0; i < entries; i++)
+   levelval = of_read_number(prop++, 1);
+
+   printk(KERN_INFO "Numa nodes avail: %d (%d) \n", (int) levelval, (int) entries);
+
+   for (i = 0; i < levelval; i++) {
+   if (!node_possible(i)) {
+   setup_node_data(i, 0, 0);
+   node_set(i, node_possible_map);
+   }
+   }
+   }
+
+endit:
+   if (rtas)
+   of_node_put(rtas);
+}
+
 void __init initmem_init(void)
 {
int nid, cpu;
@@ -912,6 +954,8 @@ void __init initmem_init(void)
 */
nodes_and(node_possible_map, node_possible_map, node_online_map);
 
+   node_associativity_setup();
+
for_each_online_node(nid) {
unsigned long start_pfn, end_pfn;
 



[PATCH V9 1/2] powerpc/numa: Update CPU topology when VPHN enabled

2017-08-21 Thread Michael Bringmann

powerpc/numa: Correct the currently broken capability to set the
topology for shared CPUs in LPARs.  At boot time for shared CPU
LPARs, the topology for each shared CPU is set to node zero; however,
this is now updated correctly using the Virtual Processor Home Node
(VPHN) capabilities information provided by the pHyp.

Also, update initialization checks for device-tree attributes to
independently recognize PRRN or VPHN usage.

Signed-off-by: Michael Bringmann 
---
 arch/powerpc/include/asm/topology.h  |   14 ++
 arch/powerpc/mm/numa.c   |   64 +++---
 arch/powerpc/platforms/pseries/dlpar.c   |2 +
 arch/powerpc/platforms/pseries/hotplug-cpu.c |2 +
 4 files changed, 75 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/topology.h 
b/arch/powerpc/include/asm/topology.h
index dc4e159..85d6428 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,6 +98,20 @@ static inline int prrn_is_enabled(void)
 }
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
+#if defined(CONFIG_HOTPLUG_CPU) || defined(CONFIG_NEED_MULTIPLE_NODES)
+#if defined(CONFIG_PPC_SPLPAR)
+extern int timed_topology_update(int nsecs);
+#else
+#define timed_topology_update(nsecs) 0
+#endif /* CONFIG_PPC_SPLPAR */
+#endif /* CONFIG_HOTPLUG_CPU || CONFIG_NEED_MULTIPLE_NODES */
+
+#if defined(CONFIG_PPC_SPLPAR)
+extern void shared_topology_update(void);
+#else
+#define shared_topology_update() 0
+#endif /* CONFIG_PPC_SPLPAR */
+
 #include 
 
 #ifdef CONFIG_SMP
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index b95c584..3fd4536 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -906,7 +907,7 @@ void __init initmem_init(void)
 
/*
 * Reduce the possible NUMA nodes to the online NUMA nodes,
-* since we do not support node hotplug. This ensures that  we
+* since we do not support node hotplug. This ensures that we
 * lower the maximum NUMA node ID to what is actually present.
 */
nodes_and(node_possible_map, node_possible_map, node_online_map);
@@ -1148,11 +1149,32 @@ struct topology_update_data {
int new_nid;
 };
 
+#define TOPOLOGY_DEF_TIMER_SECS 60
+
 static u8 vphn_cpu_change_counts[NR_CPUS][MAX_DISTANCE_REF_POINTS];
 static cpumask_t cpu_associativity_changes_mask;
 static int vphn_enabled;
 static int prrn_enabled;
 static void reset_topology_timer(void);
+static int topology_timer_secs = TOPOLOGY_DEF_TIMER_SECS;
+static int topology_inited;
+static int topology_update_needed;
+
+/*
+ * Change polling interval for associativity changes.
+ */
+int timed_topology_update(int nsecs)
+{
+   if (nsecs > 0)
+   topology_timer_secs = nsecs;
+   else
+   topology_timer_secs = TOPOLOGY_DEF_TIMER_SECS;
+
+   if (vphn_enabled)
+   reset_topology_timer();
+
+   return 0;
+}
 
 /*
  * Store the current values of the associativity change counters in the
@@ -1246,6 +1268,12 @@ static long vphn_get_associativity(unsigned long cpu,
"hcall_vphn() experienced a hardware fault "
"preventing VPHN. Disabling polling...\n");
stop_topology_update();
+   break;
+   case H_SUCCESS:
+   printk(KERN_INFO
+   "VPHN hcall succeeded. Reset polling...\n");
+   timed_topology_update(0);
+   break;
}
 
return rc;
@@ -1323,8 +1351,11 @@ int numa_update_cpu_topology(bool cpus_locked)
struct device *dev;
int weight, new_nid, i = 0;
 
-   if (!prrn_enabled && !vphn_enabled)
+   if (!prrn_enabled && !vphn_enabled) {
+   if (!topology_inited)
+   topology_update_needed = 1;
return 0;
+   }
 
weight = cpumask_weight(&cpu_associativity_changes_mask);
if (!weight)
@@ -1363,6 +1394,8 @@ int numa_update_cpu_topology(bool cpus_locked)
cpumask_andnot(&cpu_associativity_changes_mask,
&cpu_associativity_changes_mask,
cpu_sibling_mask(cpu));
+   pr_info("Assoc chg gives same node %d for cpu%d\n",
+   new_nid, cpu);
cpu = cpu_last_thread_sibling(cpu);
continue;
}
@@ -1379,6 +1412,9 @@ int numa_update_cpu_topology(bool cpus_locked)
cpu = cpu_last_thread_sibling(cpu);
}
 
+   if (i)
+   updates[i-1].next = NULL;
+
pr_debug("Topology update for the following CPUs:\n");
if (cpumask_weight(&updated_cpus)) {
for (ud = &updates[0]; ud; ud = ud->next) {
@@ -1433,6 +1469,7 @@ int 

[PATCH V9 0/2] powerpc/dlpar: Correct display of hot-add/hot-remove CPUs and memory

2017-08-21 Thread Michael Bringmann

On Power systems with shared configurations of CPUs and memory, there
are some issues with association of additional CPUs and memory to nodes
when hot-adding resources.  These patches address some of those problems.

powerpc/numa: Correct the currently broken capability to set the
topology for shared CPUs in LPARs.  At boot time for shared CPU
LPARs, the topology for each shared CPU is set to node zero; however,
this is now updated correctly using the Virtual Processor Home Node
(VPHN) capabilities information provided by the pHyp. The VPHN handling
in Linux is disabled, if PRRN handling is present.

powerpc/nodes: On systems like PowerPC which allow 'hot-add' of CPU
or memory resources, it may occur that the new resources are to be
inserted into nodes that were not used for these resources at bootup.
In the kernel, any node that is used must be defined and initialized
at boot.

This patch extracts the value of the lowest domain level (number of
allocable resources) from the "rtas" device tree property
"ibm,max-associativity-domains" to use as the maximum number of nodes
to setup as possibly available in the system.  This new setting will
override the instruction,

nodes_and(node_possible_map, node_possible_map, node_online_map);

presently seen in the function arch/powerpc/mm/numa.c:initmem_init().

If the property is not present at boot, no operation will be performed
to define or enable additional nodes.

Signed-off-by: Michael Bringmann 

Michael Bringmann (2):
  powerpc/numa: Update CPU topology when VPHN enabled
  powerpc/nodes: Ensure enough nodes avail for operations
---
Changes in V9:
  -- Calculate number of nodes via property "ibm,max-associativity-domains"



Re: [PATCH] mtd: nand: Rename nand.h into rawnand.h

2017-08-21 Thread Boris Brezillon
On Fri, 4 Aug 2017 17:29:09 +0200,
Boris Brezillon wrote:

> We are planning to share more code between different NAND based
> devices (SPI NAND, OneNAND and raw NANDs), but before doing that
> we need to move the existing include/linux/mtd/nand.h file into
> include/linux/mtd/rawnand.h so we can later create a nand.h header
> containing all common structure and function prototypes.
> 
> Signed-off-by: Boris Brezillon 
> Signed-off-by: Peter Pan 
> Cc: Jonathan Corbet 
> Cc: Sekhar Nori 
> Cc: Kevin Hilman 
> Cc: Jason Cooper 
> Cc: Andrew Lunn 
> Cc: Sebastian Hesselbarth 
> Cc: Gregory Clement 
> Cc: Hartley Sweeten 
> Cc: Alexander Sverdlin 
> Cc: Shawn Guo 
> Cc: Sascha Hauer 
> Cc: Fabio Estevam 
> Cc: Imre Kaloz 
> Cc: Krzysztof Halasa 
> Cc: Eric Miao 
> Cc: Haojian Zhuang 
> Cc: Aaro Koskinen 
> Cc: Tony Lindgren 
> Cc: Alexander Clouter 
> Cc: Daniel Mack 
> Cc: Robert Jarzmik 
> Cc: Marek Vasut 
> Cc: Kukjin Kim 
> Cc: Krzysztof Kozlowski 
> Cc: Simtec Linux Team 
> Cc: Steven Miao 
> Cc: Mikael Starvik 
> Cc: Jesper Nilsson 
> Cc: Ralf Baechle 
> Cc: Yoshinori Sato 
> Cc: Rich Felker 
> Cc: Wenyou Yang 
> Cc: Josh Wu 
> Cc: Kamal Dasu 
> Cc: Masahiro Yamada 
> Cc: Han Xu 
> Cc: Harvey Hunt 
> Cc: Vladimir Zapolskiy 
> Cc: Sylvain Lemieux 
> Cc: Matthias Brugger 
> Cc: Wan ZongShun 
> Cc: Neil Armstrong 
> Cc: Ezequiel Garcia 
> Cc: Maxim Levitsky 
> Cc: Marc Gonzalez 
> Cc: Stefan Agner 
> Cc: Greg Kroah-Hartman 
> Cc: Mauro Carvalho Chehab 
> Cc: linux-...@vger.kernel.org
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-o...@vger.kernel.org
> Cc: linux-samsung-...@vger.kernel.org
> Cc: adi-buildroot-de...@lists.sourceforge.net
> Cc: linux-cris-ker...@axis.com
> Cc: linux-m...@linux-mips.org
> Cc: linux...@vger.kernel.org
> Cc: bcm-kernel-feedback-l...@broadcom.com
> Cc: linux-media...@lists.infradead.org
> Cc: linux-ox...@lists.tuxfamily.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: de...@driverdev.osuosl.org

Created the nand/rename-header-file immutable branch which I then
merged in nand/next.

This way, anyone can pull the nand/rename-header-file branch in case a
conflict arise on one of the file modified by this patch.

The following changes since commit 5771a8c08880cdca3bfb4a3fc6d309d6bba20877:

  Linux v4.13-rc1 (2017-07-15 15:22:10 -0700)

are available in the git repository at:

  git://git.infradead.org/l2-mtd.git nand/rename-header-file

for you to fetch changes up to d4092d76a4a4e57b65910899948a83cc8646c5a5:

  mtd: nand: Rename nand.h into rawnand.h (2017-08-13 10:11:49 +0200)


Boris Brezillon (1):
  mtd: nand: Rename nand.h into rawnand.h

 Documentation/driver-api/mtdnand.rst| 8 
 MAINTAINERS | 2 +-
 arch/arm/mach-davinci/board-da850-evm.c | 2 +-
 arch/arm/mach-davinci/board-dm355-evm.c | 2 +-
 arch/arm/mach-davinci/board-dm355-leopard.c | 2 +-
 arch/arm/mach-davinci/board-dm365-evm.c | 2 +-
 arch/arm/mach-davinci/board-dm644x-evm.c| 2 +-
 arch/arm/mach-davinci/board-dm646x-evm.c| 2 +-
 arch/arm/mach-davinci/board-sffsdr.c| 2 +-
 arch/arm/mach-dove/dove-db-setup.c  | 2 +-
 arch/arm/mach-ep93xx/snappercl15.c  | 2 +-
 arch/arm/mach-ep93xx/ts72xx.c   | 2 +-
 arch/arm/mach-imx/mach-qong.c   | 2 +-
 arch/arm/mach-ixp4xx/ixdp425-setup.c| 2 +-
 arch/arm/mach-mmp/aspenite.c| 2 +-
 arch/arm/mach-omap1/board-fsample.c | 2 +-
 arch/arm/mach-omap1/board-h2.c  | 2 +-
 arch/arm/mach-omap1/board-h3.c  | 2 +-
 arch/arm/mach-omap1/board-nand.c| 2 +-
 arch/arm/mach-omap1/board-perseus2.c  

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread David Miller
From: Nicholas Piggin 
Date: Tue, 22 Aug 2017 00:19:28 +1000

> Thanks here's an updated version with a couple more bugs fixed. If
> you could try testing, that would be much appreciated.

I'm not getting RCU stalls on sparc64 any longer with this patch.

I'm really happy you guys were able to figure out what was going
wrong. :-)

Feel free to add my Tested-by:



Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

2017-08-21 Thread Martin K. Petersen

Brian,

>> Thanks for the detailed analysis. This is very helpful. Have you
>> considered to change the ipr driver such that it terminates REPORT
>> SUPPORTED OPERATION CODES commands with the appropriate check
>> condition code instead of DID_ERROR?
>
> Yes. That data is actually in the sense buffer, but since I'm also
> setting DID_ERROR, scsi_decide_disposition isn't using it. I've got a
> patch to do just as you suggest, to stop setting DID_ERROR when there
> is more detailed error data available, but it will need some
> additional testing before I submit, as it will impact much more than
> just this case.

I agree. In this case where a command is not supported, a check
condition would be a better way to signal the failure to the SCSI
midlayer.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH] ipr: Set no_report_opcodes for RAID arrays

2017-08-21 Thread Martin K. Petersen

Brian,

> Since ipr RAID arrays do not support the MAINTENANCE_IN /
> MI_REPORT_SUPPORTED_OPERATION_CODES, set no_report_opcodes to prevent
> it from being sent.

Applied to 4.13/scsi-fixes. Thank you!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [v3,3/3] powerpc/mm/cxl: Add the fault handling cpu to mm cpumask

2017-08-21 Thread Benjamin Herrenschmidt
On Mon, 2017-08-21 at 23:39 +1000, Michael Ellerman wrote:
> On Thu, 2017-07-27 at 06:24:55 UTC, "Aneesh Kumar K.V" wrote:
> > We use mm cpumask for serializing against lockless page table walk. Anybody
> > who is doing a lockless page table walk is expected to disable irq and only
> > cpus in the mm cpumask are expected to do the lockless walk. This ensures that
> > a THP split can send IPI to only cpus in the mm cpumask, to make sure there
> > are no parallel lockless page table walk.
> > 
> > Add the CAPI fault handling cpu to the mm cpumask so that we can do the 
> > lockless
> > page table walk while inserting hash page table entries.
> > 
> > Reviewed-by: Frederic Barrat 
> > Signed-off-by: Aneesh Kumar K.V 
> 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/0f4bc0932e51817105fdee77a46680

Do you need barriers between the cpumask set and the walking ?

Cheers,
Ben.



Re: [PATCH 4/6] dpaa_eth: add NETIF_F_RXHASH

2017-08-21 Thread David Miller
From: Madalin Bucur 
Date: Fri, 18 Aug 2017 11:56:26 +0300

> + if (net_dev->features & NETIF_F_RXHASH && priv->keygen_in_use &&
> + !fman_port_get_hash_result_offset(priv->mac_dev->port[RX],
> +   &hash_offset))
> + skb_set_hash(skb, be32_to_cpu(*(u32 *)(vaddr + hash_offset)),
> +  // if L4 exists, it was used in the hash generation
> +  be32_to_cpu(fd->status) & FM_FD_STAT_L4CV ?
> + PKT_HASH_TYPE_L4 : PKT_HASH_TYPE_L3);

We do not use "//" c++ style comments in the kernel.

And putting a comment right in the middle of a set of arguments being
passed to a function makes the code harder to read, so please do not
do this.


Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

2017-08-21 Thread Benjamin Herrenschmidt
On Mon, 2017-08-21 at 19:27 +0200, Frederic Barrat wrote:
> Hi Ben,
> 
> > On 24/07/2017 at 06:28, Benjamin Herrenschmidt wrote:
> > Instead of comparing the whole CPU mask every time, let's
> > keep a counter of how many bits are set in the mask. Thus
> > testing for a local mm only requires testing if that counter
> > is 1 and the current CPU bit is set in the mask.
> 
> 
> I'm trying to see if we could merge this patch with what I'm trying to 
> do to mark a context as requiring global TLBIs.
> In http://patchwork.ozlabs.org/patch/796775/
> I'm introducing a 'flags' per memory context, using one bit to say if 
> the context needs global TLBIs.
> The 2 could co-exist, just checking... Do you think about using the 
> actual active_cpus count down the road, or is it just a matter of 
> knowing if there are more than one active cpus?

Or you could just increment my counter. Just make sure you increment
it at most once per CXL context and decrement when the context is gone.

Cheers,
Ben.



Re: [PATCH 5/6] powerpc/mm: Optimize detection of thread local mm's

2017-08-21 Thread Frederic Barrat

Hi Ben,

On 24/07/2017 at 06:28, Benjamin Herrenschmidt wrote:

Instead of comparing the whole CPU mask every time, let's
keep a counter of how many bits are set in the mask. Thus
testing for a local mm only requires testing if that counter
is 1 and the current CPU bit is set in the mask.



I'm trying to see if we could merge this patch with what I'm trying to 
do to mark a context as requiring global TLBIs.

In http://patchwork.ozlabs.org/patch/796775/
I'm introducing a 'flags' per memory context, using one bit to say if 
the context needs global TLBIs.
The 2 could co-exist, just checking... Do you think about using the 
actual active_cpus count down the road, or is it just a matter of 
knowing if there are more than one active cpus?


Thanks,

  Fred




Signed-off-by: Benjamin Herrenschmidt 
---
  arch/powerpc/include/asm/book3s/64/mmu.h |  3 +++
  arch/powerpc/include/asm/mmu_context.h   |  9 +
  arch/powerpc/include/asm/tlb.h   | 11 ++-
  arch/powerpc/mm/mmu_context_book3s64.c   |  2 ++
  4 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 1a220cdff923..c3b00e8ff791 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -83,6 +83,9 @@ typedef struct {
mm_context_id_t id;
u16 user_psize; /* page size index */

+   /* Number of bits in the mm_cpumask */
+   atomic_t active_cpus;
+
/* NPU NMMU context */
struct npu_context *npu_context;

diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index ff1aeb2cd19f..cf8f50cd4030 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -96,6 +96,14 @@ static inline void switch_mm_pgdir(struct task_struct *tsk,
   struct mm_struct *mm) { }
  #endif

+#ifdef CONFIG_PPC_BOOK3S_64
+static inline void inc_mm_active_cpus(struct mm_struct *mm)
+{
+   atomic_inc(&mm->context.active_cpus);
+}
+#else
+static inline void inc_mm_active_cpus(struct mm_struct *mm) { }
+#endif

  /*
   * switch_mm is the entry point called from the architecture independent
@@ -110,6 +118,7 @@ static inline void switch_mm_irqs_off(struct mm_struct 
*prev,
/* Mark this context has been used on the new CPU */
if (!cpumask_test_cpu(smp_processor_id(), mm_cpumask(next))) {
cpumask_set_cpu(smp_processor_id(), mm_cpumask(next));
+   inc_mm_active_cpus(next);
smp_mb();
new_on_cpu = true;
}
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 609557569f65..a7eabff27a0f 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -69,13 +69,22 @@ static inline int mm_is_core_local(struct mm_struct *mm)
  topology_sibling_cpumask(smp_processor_id()));
  }

+#ifdef CONFIG_PPC_BOOK3S_64
+static inline int mm_is_thread_local(struct mm_struct *mm)
+{
+   if (atomic_read(&mm->context.active_cpus) > 1)
+   return false;
+   return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
+}
+#else /* CONFIG_PPC_BOOK3S_64 */
  static inline int mm_is_thread_local(struct mm_struct *mm)
  {
return cpumask_equal(mm_cpumask(mm),
  cpumask_of(smp_processor_id()));
  }
+#endif /* !CONFIG_PPC_BOOK3S_64 */

-#else
+#else /* CONFIG_SMP */
  static inline int mm_is_core_local(struct mm_struct *mm)
  {
return 1;
diff --git a/arch/powerpc/mm/mmu_context_book3s64.c 
b/arch/powerpc/mm/mmu_context_book3s64.c
index 8159f5219137..de17d3e714aa 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -174,6 +174,8 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
  #ifdef CONFIG_SPAPR_TCE_IOMMU
mm_iommu_init(mm);
  #endif
+   atomic_set(&mm->context.active_cpus, 0);
+
return 0;
  }





[PATCH v2 5/5] of/fdt: only store the device node basename in full_name

2017-08-21 Thread Rob Herring
With dependencies on a statically allocated full path name converted to
use %pOF format specifier, we can store just the basename of node, and
the unflattening of the FDT can be simplified.

This commit will affect the remaining users of full_name. After
analyzing these users, the remaining cases should only change some print
messages. The main users of full_name are providing a name for struct
resource. The resource names shouldn't be important other than providing
/proc/iomem names.

We no longer distinguish between pre and post 0x10 dtb formats as either
a full path or basename will work. However, less than 0x10 formats have
been broken since the conversion to use libfdt (and no one has cared).
The conversion of the unflattening code to be non-recursive also broke
pre 0x10 formats as the populate_node function would return 0 in that
case.

Signed-off-by: Rob Herring 
---
v2:
- rebase to linux-next

 drivers/of/fdt.c | 69 +---
 1 file changed, 11 insertions(+), 58 deletions(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index ce30c9a588a4..27c535af0be8 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -266,74 +266,32 @@ static void populate_properties(const void *blob,
*pprev = NULL;
 }
 
-static unsigned int populate_node(const void *blob,
- int offset,
- void **mem,
- struct device_node *dad,
- unsigned int fpsize,
- struct device_node **pnp,
- bool dryrun)
+static bool populate_node(const void *blob,
+ int offset,
+ void **mem,
+ struct device_node *dad,
+ struct device_node **pnp,
+ bool dryrun)
 {
struct device_node *np;
const char *pathp;
unsigned int l, allocl;
-   int new_format = 0;
 
pathp = fdt_get_name(blob, offset, );
if (!pathp) {
*pnp = NULL;
-   return 0;
+   return false;
}
 
allocl = ++l;
 
-   /* version 0x10 has a more compact unit name here instead of the full
-* path. we accumulate the full path size using "fpsize", we'll rebuild
-* it later. We detect this because the first character of the name is
-* not '/'.
-*/
-   if ((*pathp) != '/') {
-   new_format = 1;
-   if (fpsize == 0) {
-   /* root node: special case. fpsize accounts for path
-* plus terminating zero. root node only has '/', so
-* fpsize should be 2, but we want to avoid the first
-* level nodes to have two '/' so we use fpsize 1 here
-*/
-   fpsize = 1;
-   allocl = 2;
-   l = 1;
-   pathp = "";
-   } else {
-   /* account for '/' and path size minus terminal 0
-* already in 'l'
-*/
-   fpsize += l;
-   allocl = fpsize;
-   }
-   }
-
np = unflatten_dt_alloc(mem, sizeof(struct device_node) + allocl,
__alignof__(struct device_node));
if (!dryrun) {
char *fn;
of_node_init(np);
np->full_name = fn = ((char *)np) + sizeof(*np);
-   if (new_format) {
-   /* rebuild full path for new format */
-   if (dad && dad->parent) {
-   strcpy(fn, dad->full_name);
-#ifdef DEBUG
-   if ((strlen(fn) + l + 1) != allocl) {
-   pr_debug("%s: p: %d, l: %d, a: %d\n",
-   pathp, (int)strlen(fn),
-   l, allocl);
-   }
-#endif
-   fn += strlen(fn);
-   }
-   *(fn++) = '/';
-   }
+
memcpy(fn, pathp, l);
 
if (dad != NULL) {
@@ -355,7 +313,7 @@ static unsigned int populate_node(const void *blob,
}
 
*pnp = np;
-   return fpsize;
+   return true;
 }
 
 static void reverse_nodes(struct device_node *parent)
@@ -399,7 +357,6 @@ static int unflatten_dt_nodes(const void *blob,
struct device_node *root;
int offset = 0, depth = 0, initial_depth = 0;
 #define FDT_MAX_DEPTH  64
-   unsigned int fpsizes[FDT_MAX_DEPTH];
struct device_node *nps[FDT_MAX_DEPTH];
void *base = mem;
bool dryrun = !base;
@@ -418,7 +375,6 @@ static int 

[PATCH v2 4/5] powerpc: pseries: only store the device node basename in full_name

2017-08-21 Thread Rob Herring
With dependencies on full_name containing the entire device node path
removed, stop storing the full_name in nodes created by
dlpar_configure_connector() and pSeries_reconfig_add_node().

Signed-off-by: Rob Herring 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
v2:
- Rework dlpar_parse_cc_node() to a single allocation and strcpy instead
  of kasprintf
  
 arch/powerpc/platforms/pseries/dlpar.c| 30 +++---
 arch/powerpc/platforms/pseries/reconfig.c |  2 +-
 2 files changed, 8 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c 
b/arch/powerpc/platforms/pseries/dlpar.c
index 783f36364690..31a0615aeea1 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -75,28 +75,17 @@ static struct property *dlpar_parse_cc_property(struct 
cc_workarea *ccwa)
return prop;
 }
 
-static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa,
-  const char *path)
+static struct device_node *dlpar_parse_cc_node(struct cc_workarea *ccwa)
 {
struct device_node *dn;
-   char *name;
-
-   /* If parent node path is "/" advance path to NULL terminator to
-* prevent double leading slashs in full_name.
-*/
-   if (!path[1])
-   path++;
+   const char *name = (const char *)ccwa + be32_to_cpu(ccwa->name_offset);
 
-   dn = kzalloc(sizeof(*dn), GFP_KERNEL);
+   dn = kzalloc(sizeof(*dn) + strlen(name) + 1, GFP_KERNEL);
if (!dn)
return NULL;
 
-   name = (char *)ccwa + be32_to_cpu(ccwa->name_offset);
-   dn->full_name = kasprintf(GFP_KERNEL, "%s/%s", path, name);
-   if (!dn->full_name) {
-   kfree(dn);
-   return NULL;
-   }
+   dn->full_name = (char *)(dn + 1);
+   strcpy((char *)dn->full_name, name);
 
of_node_set_flag(dn, OF_DYNAMIC);
of_node_init(dn);
@@ -148,7 +137,6 @@ struct device_node *dlpar_configure_connector(__be32 
drc_index,
struct property *last_property = NULL;
struct cc_workarea *ccwa;
char *data_buf;
-   const char *parent_path = parent->full_name;
int cc_token;
int rc = -1;
 
@@ -182,7 +170,7 @@ struct device_node *dlpar_configure_connector(__be32 
drc_index,
break;
 
case NEXT_SIBLING:
-   dn = dlpar_parse_cc_node(ccwa, parent_path);
+   dn = dlpar_parse_cc_node(ccwa);
if (!dn)
goto cc_error;
 
@@ -192,10 +180,7 @@ struct device_node *dlpar_configure_connector(__be32 
drc_index,
break;
 
case NEXT_CHILD:
-   if (first_dn)
-   parent_path = last_dn->full_name;
-
-   dn = dlpar_parse_cc_node(ccwa, parent_path);
+   dn = dlpar_parse_cc_node(ccwa);
if (!dn)
goto cc_error;
 
@@ -226,7 +211,6 @@ struct device_node *dlpar_configure_connector(__be32 
drc_index,
 
case PREV_PARENT:
last_dn = last_dn->parent;
-   parent_path = last_dn->parent->full_name;
break;
 
case CALL_AGAIN:
diff --git a/arch/powerpc/platforms/pseries/reconfig.c 
b/arch/powerpc/platforms/pseries/reconfig.c
index 296c188fd5ca..f24d8159c9e1 100644
--- a/arch/powerpc/platforms/pseries/reconfig.c
+++ b/arch/powerpc/platforms/pseries/reconfig.c
@@ -33,7 +33,7 @@ static int pSeries_reconfig_add_node(const char *path, struct property *proplist
if (!np)
goto out_err;
 
-   np->full_name = kstrdup(path, GFP_KERNEL);
+   np->full_name = kstrdup(kbasename(path), GFP_KERNEL);
if (!np->full_name)
goto out_err;
 
-- 
2.11.0



[PATCH v2 3/5] powerpc: pseries: remove dlpar_attach_node dependency on full path

2017-08-21 Thread Rob Herring
In preparation to stop storing the full node path in full_name, remove the
dependency on full_name from dlpar_attach_node(). Callers of
dlpar_attach_node() already have the parent device_node, so just pass the
parent node into dlpar_attach_node instead of the path. This avoids doing
a lookup of the parent node by the path.

Signed-off-by: Rob Herring 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
v2:
- rebased to linux-next

 arch/powerpc/platforms/pseries/dlpar.c   | 6 ++
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +-
 arch/powerpc/platforms/pseries/mobility.c| 2 +-
 arch/powerpc/platforms/pseries/pseries.h | 2 +-
 4 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 80b84c9c8509..783f36364690 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -254,13 +254,11 @@ struct device_node *dlpar_configure_connector(__be32 drc_index,
return first_dn;
 }
 
-int dlpar_attach_node(struct device_node *dn)
+int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
 {
int rc;
 
-   dn->parent = pseries_of_derive_parent(dn->full_name);
-   if (IS_ERR(dn->parent))
-   return PTR_ERR(dn->parent);
+   dn->parent = parent;
 
rc = of_attach_node(dn);
if (rc) {
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 0a93093fbcef..b357f1ae0b0a 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -463,7 +463,7 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
return -EINVAL;
}
 
-   rc = dlpar_attach_node(dn);
+   rc = dlpar_attach_node(dn, parent);
if (rc) {
saved_rc = rc;
pr_warn("Failed to attach node %s, rc: %d, drc index: %x\n",
diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 2da4851eff99..210ce632d63e 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -229,7 +229,7 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index)
if (!dn)
return -ENOENT;
 
-   rc = dlpar_attach_node(dn);
+   rc = dlpar_attach_node(dn, parent_dn);
if (rc)
dlpar_free_cc_nodes(dn);
 
diff --git a/arch/powerpc/platforms/pseries/pseries.h b/arch/powerpc/platforms/pseries/pseries.h
index 1361a9db534b..4470a3194311 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -46,7 +46,7 @@ extern void dlpar_free_cc_nodes(struct device_node *);
 extern void dlpar_free_cc_property(struct property *);
 extern struct device_node *dlpar_configure_connector(__be32,
struct device_node *);
-extern int dlpar_attach_node(struct device_node *);
+extern int dlpar_attach_node(struct device_node *, struct device_node *);
 extern int dlpar_detach_node(struct device_node *);
 extern int dlpar_acquire_drc(u32 drc_index);
 extern int dlpar_release_drc(u32 drc_index);
-- 
2.11.0



[PATCH v2 1/5] powerpc: Convert to using %pOF instead of full_name

2017-08-21 Thread Rob Herring
Now that we have a custom printf format specifier, convert users of
full_name to use %pOF instead. This is preparation to remove storing
of the full path string for each node.

Signed-off-by: Rob Herring 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: Anatolij Gustschin 
Cc: Scott Wood 
Cc: Kumar Gala 
Cc: Arnd Bergmann 
Cc: linuxppc-dev@lists.ozlabs.org
Reviewed-by: Tyrel Datwyler 
---
v2:
- drop spurious vio.c vio_register_device_node() change

 arch/powerpc/kernel/btext.c  |   2 +-
 arch/powerpc/kernel/cacheinfo.c  |  34 ---
 arch/powerpc/kernel/io-workarounds.c |   4 +-
 arch/powerpc/kernel/isa-bridge.c |  32 +++
 arch/powerpc/kernel/legacy_serial.c  |  12 +--
 arch/powerpc/kernel/of_platform.c|   2 +-
 arch/powerpc/kernel/pci-common.c |  15 ++-
 arch/powerpc/kernel/pci_32.c |   4 +-
 arch/powerpc/kernel/pci_64.c |   4 +-
 arch/powerpc/kernel/pci_of_scan.c|  24 ++---
 arch/powerpc/kernel/setup-common.c   |   2 +-
 arch/powerpc/platforms/4xx/cpm.c |   8 +-
 arch/powerpc/platforms/4xx/gpio.c|   3 +-
 arch/powerpc/platforms/4xx/msi.c |   3 +-
 arch/powerpc/platforms/4xx/pci.c | 116 ++-
 arch/powerpc/platforms/4xx/soc.c |   5 +-
 arch/powerpc/platforms/4xx/uic.c |  14 +--
 arch/powerpc/platforms/512x/mpc512x_shared.c |  12 +--
 arch/powerpc/platforms/52xx/efika.c  |   8 +-
 arch/powerpc/platforms/52xx/media5200.c  |   2 +-
 arch/powerpc/platforms/52xx/mpc52xx_gpt.c|   4 +-
 arch/powerpc/platforms/52xx/mpc52xx_pci.c|   8 +-
 arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c   |   3 +-
 arch/powerpc/platforms/83xx/mpc832x_rdb.c|   2 +-
 arch/powerpc/platforms/85xx/p1022_ds.c   |   8 +-
 arch/powerpc/platforms/85xx/xes_mpc85xx.c|   4 +-
 arch/powerpc/platforms/amigaone/setup.c  |   6 +-
 arch/powerpc/platforms/cell/axon_msi.c   |  36 +++
 arch/powerpc/platforms/cell/interrupt.c  |   4 +-
 arch/powerpc/platforms/cell/iommu.c  |  24 ++---
 arch/powerpc/platforms/cell/ras.c|   4 +-
 arch/powerpc/platforms/cell/spider-pci.c |   4 +-
 arch/powerpc/platforms/cell/spider-pic.c |   4 +-
 arch/powerpc/platforms/cell/spu_manage.c |  26 ++---
 arch/powerpc/platforms/chrp/pci.c|  18 ++--
 arch/powerpc/platforms/embedded6xx/linkstation.c |   6 +-
 arch/powerpc/platforms/embedded6xx/mvme5100.c|   2 +-
 arch/powerpc/platforms/embedded6xx/storcenter.c  |   2 +-
 arch/powerpc/platforms/maple/pci.c   |  10 +-
 arch/powerpc/platforms/pasemi/pci.c  |   2 +-
 arch/powerpc/platforms/powermac/feature.c|  14 +--
 arch/powerpc/platforms/powermac/low_i2c.c|  50 +-
 arch/powerpc/platforms/powermac/pci.c|   6 +-
 arch/powerpc/platforms/powermac/pfunc_base.c |  24 ++---
 arch/powerpc/platforms/powermac/pfunc_core.c |   6 +-
 arch/powerpc/platforms/powermac/pic.c|   8 +-
 arch/powerpc/platforms/powermac/setup.c  |   2 +-
 arch/powerpc/platforms/powernv/opal-async.c  |   4 +-
 arch/powerpc/platforms/powernv/opal-xscom.c  |   8 +-
 arch/powerpc/platforms/powernv/pci-ioda.c|  15 ++-
 arch/powerpc/platforms/powernv/rng.c |   6 +-
 arch/powerpc/platforms/pseries/dlpar.c   |   3 +-
 arch/powerpc/platforms/pseries/event_sources.c   |   6 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   4 +-
 arch/powerpc/platforms/pseries/ibmebus.c |   5 +-
 arch/powerpc/platforms/pseries/iommu.c   |  58 ++--
 arch/powerpc/platforms/pseries/msi.c |  12 +--
 arch/powerpc/platforms/pseries/pci_dlpar.c   |   2 +-
 arch/powerpc/platforms/pseries/vio.c |   6 +-
 arch/powerpc/sysdev/axonram.c|   4 +-
 arch/powerpc/sysdev/dcr.c|   4 +-
 arch/powerpc/sysdev/fsl_85xx_cache_sram.c|  12 +--
 arch/powerpc/sysdev/fsl_gtm.c|  14 +--
 arch/powerpc/sysdev/fsl_msi.c|  16 ++--
 arch/powerpc/sysdev/fsl_pci.c|  47 +
 arch/powerpc/sysdev/fsl_rio.c|  36 +++
 arch/powerpc/sysdev/fsl_rmu.c|  12 +--
 arch/powerpc/sysdev/mpic.c   |   4 +-
 arch/powerpc/sysdev/mpic_msgr.c  |   2 +-
 arch/powerpc/sysdev/mpic_msi.c   |   2 +-
 arch/powerpc/sysdev/mpic_timer.c |  19 ++--
 arch/powerpc/sysdev/msi_bitmap.c |  

[PATCH v2 2/5] powerpc: pseries: vio: match parent nodes with of_find_node_by_path

2017-08-21 Thread Rob Herring
In preparation to remove the full path from device_node.full_name, use
of_find_node_by_path instead of open coding with strcmp.

Signed-off-by: Rob Herring 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Michael Ellerman 
Cc: linuxppc-dev@lists.ozlabs.org
---
v2:
- rebased to linux-next and removed spurious change from patch 1.

 arch/powerpc/platforms/pseries/vio.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
index aa5ca74316fa..5754572deb23 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -1357,9 +1357,9 @@ struct vio_dev *vio_register_device_node(struct device_node *of_node)
 */
parent_node = of_get_parent(of_node);
if (parent_node) {
-   if (!strcmp(parent_node->full_name, "/ibm,platform-facilities"))
+   if (parent_node == of_find_node_by_path("/ibm,platform-facilities"))
family = PFO;
-   else if (!strcmp(parent_node->full_name, "/vdevice"))
+   else if (parent_node == of_find_node_by_path("/vdevice"))
family = VDEVICE;
else {
pr_warn("%s: parent(%pOF) of %s not recognized.\n",
-- 
2.11.0



[PATCH v2 0/5] Removing full paths from DT full_name

2017-08-21 Thread Rob Herring
This series is the last steps to remove storing the full path for every
DT node. Instead, we can create full path strings dynamically as needed
with printf %pOF specifiers (commit ce4fecf1fe15). There are a number of
remaining direct users of full_name after this series. I don't believe
there should be any functional impact for those users with the change to
only the node name (+unit-address). The majority are for struct
resource.name. This should only affect /proc/iomem display.

Michael, Please apply patches 1-3 for 4.14. Patches 4 and 5 are intended for 
4.15. Testing of those in particular would be helpful.

Rob

Rob Herring (5):
  powerpc: Convert to using %pOF instead of full_name
  powerpc: pseries: vio: match parent nodes with of_find_node_by_path
  powerpc: pseries: remove dlpar_attach_node dependency on full path
  powerpc: pseries: only store the device node basename in full_name
  of/fdt: only store the device node basename in full_name

 arch/powerpc/kernel/btext.c  |   2 +-
 arch/powerpc/kernel/cacheinfo.c  |  34 ---
 arch/powerpc/kernel/io-workarounds.c |   4 +-
 arch/powerpc/kernel/isa-bridge.c |  32 +++
 arch/powerpc/kernel/legacy_serial.c  |  12 +--
 arch/powerpc/kernel/of_platform.c|   2 +-
 arch/powerpc/kernel/pci-common.c |  15 ++-
 arch/powerpc/kernel/pci_32.c |   4 +-
 arch/powerpc/kernel/pci_64.c |   4 +-
 arch/powerpc/kernel/pci_of_scan.c|  24 ++---
 arch/powerpc/kernel/setup-common.c   |   2 +-
 arch/powerpc/platforms/4xx/cpm.c |   8 +-
 arch/powerpc/platforms/4xx/gpio.c|   3 +-
 arch/powerpc/platforms/4xx/msi.c |   3 +-
 arch/powerpc/platforms/4xx/pci.c | 116 ++-
 arch/powerpc/platforms/4xx/soc.c |   5 +-
 arch/powerpc/platforms/4xx/uic.c |  14 +--
 arch/powerpc/platforms/512x/mpc512x_shared.c |  12 +--
 arch/powerpc/platforms/52xx/efika.c  |   8 +-
 arch/powerpc/platforms/52xx/media5200.c  |   2 +-
 arch/powerpc/platforms/52xx/mpc52xx_gpt.c|   4 +-
 arch/powerpc/platforms/52xx/mpc52xx_pci.c|   8 +-
 arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c   |   3 +-
 arch/powerpc/platforms/83xx/mpc832x_rdb.c|   2 +-
 arch/powerpc/platforms/85xx/p1022_ds.c   |   8 +-
 arch/powerpc/platforms/85xx/xes_mpc85xx.c|   4 +-
 arch/powerpc/platforms/amigaone/setup.c  |   6 +-
 arch/powerpc/platforms/cell/axon_msi.c   |  36 +++
 arch/powerpc/platforms/cell/interrupt.c  |   4 +-
 arch/powerpc/platforms/cell/iommu.c  |  24 ++---
 arch/powerpc/platforms/cell/ras.c|   4 +-
 arch/powerpc/platforms/cell/spider-pci.c |   4 +-
 arch/powerpc/platforms/cell/spider-pic.c |   4 +-
 arch/powerpc/platforms/cell/spu_manage.c |  26 ++---
 arch/powerpc/platforms/chrp/pci.c|  18 ++--
 arch/powerpc/platforms/embedded6xx/linkstation.c |   6 +-
 arch/powerpc/platforms/embedded6xx/mvme5100.c|   2 +-
 arch/powerpc/platforms/embedded6xx/storcenter.c  |   2 +-
 arch/powerpc/platforms/maple/pci.c   |  10 +-
 arch/powerpc/platforms/pasemi/pci.c  |   2 +-
 arch/powerpc/platforms/powermac/feature.c|  14 +--
 arch/powerpc/platforms/powermac/low_i2c.c|  50 +-
 arch/powerpc/platforms/powermac/pci.c|   6 +-
 arch/powerpc/platforms/powermac/pfunc_base.c |  24 ++---
 arch/powerpc/platforms/powermac/pfunc_core.c |   6 +-
 arch/powerpc/platforms/powermac/pic.c|   8 +-
 arch/powerpc/platforms/powermac/setup.c  |   2 +-
 arch/powerpc/platforms/powernv/opal-async.c  |   4 +-
 arch/powerpc/platforms/powernv/opal-xscom.c  |   8 +-
 arch/powerpc/platforms/powernv/pci-ioda.c|  15 ++-
 arch/powerpc/platforms/powernv/rng.c |   6 +-
 arch/powerpc/platforms/pseries/dlpar.c   |  39 ++--
 arch/powerpc/platforms/pseries/event_sources.c   |   6 +-
 arch/powerpc/platforms/pseries/hotplug-cpu.c |   6 +-
 arch/powerpc/platforms/pseries/ibmebus.c |   5 +-
 arch/powerpc/platforms/pseries/iommu.c   |  58 ++--
 arch/powerpc/platforms/pseries/mobility.c|   2 +-
 arch/powerpc/platforms/pseries/msi.c |  12 +--
 arch/powerpc/platforms/pseries/pci_dlpar.c   |   2 +-
 arch/powerpc/platforms/pseries/pseries.h |   2 +-
 arch/powerpc/platforms/pseries/reconfig.c|   2 +-
 arch/powerpc/platforms/pseries/vio.c |  10 +-
 arch/powerpc/sysdev/axonram.c|   4 +-
 arch/powerpc/sysdev/dcr.c|   4 +-
 arch/powerpc/sysdev/fsl_85xx_cache_sram.c|  12 +--
 arch/powerpc/sysdev/fsl_gtm.c|  14 +--
 arch/powerpc/sysdev/fsl_msi.c|  16 

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Jonathan Cameron
On Tue, 22 Aug 2017 00:19:28 +1000
Nicholas Piggin  wrote:

> On Mon, 21 Aug 2017 11:18:33 +0100
> Jonathan Cameron  wrote:
> 
> > On Mon, 21 Aug 2017 16:06:05 +1000
> > Nicholas Piggin  wrote:
> >   
> > > On Mon, 21 Aug 2017 10:52:58 +1000
> > > Nicholas Piggin  wrote:
> > > 
> > > > On Sun, 20 Aug 2017 14:14:29 -0700
> > > > "Paul E. McKenney"  wrote:
> > > >   
> > > > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote: 
> > > > >
> > > > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:
> > > > > >   
> > > > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > > > Nicholas Piggin  wrote:
> > > > > > >   
> > > > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > > > "Paul E. McKenney"  wrote:  
> > > > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney 
> > > > > > > > > wrote:
> > > > > > > > > 
> > > > > > > > > Thomas, John, am I misinterpreting the timer trace event 
> > > > > > > > > messages?
> > > > > > > > 
> > > > > > > > So I did some digging, and what you find is that rcu_sched 
> > > > > > > > seems to do a
> > > > > > > > simple schedule_timeout(1) and just goes out to lunch for many 
> > > > > > > > seconds.
> > > > > > > > The process_timeout timer never fires (when it finally does 
> > > > > > > > wake after
> > > > > > > > one of these events, it usually removes the timer with 
> > > > > > > > del_timer_sync).
> > > > > > > > 
> > > > > > > > So this patch seems to fix it. Testing, comments welcome.   
> > > > > > > >
> > > > > > > 
> > > > > > > Okay this had a problem of trying to forward the timer from a 
> > > > > > > timer
> > > > > > > callback function.
> > > > > > > 
> > > > > > > This was my other approach which also fixes the RCU warnings, but 
> > > > > > > it's
> > > > > > > a little more complex. I reworked it a bit so the mod_timer fast 
> > > > > > > path
> > > > > > > hopefully doesn't have much more overhead (actually by reading 
> > > > > > > jiffies
> > > > > > > only when needed, it probably saves a load).  
> > > > > > 
> > > > > > Giving this one a whirl!  
> > > > > 
> > > > > No joy here, but then again there are other reasons to believe that I
> > > > > am seeing a different bug than Dave and Jonathan are.
> > > > > 
> > > > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > > > not statistically different from what I see without either patch.
> > > > > 
> > > > > But no statistical difference compared to without patch, and I still
> > > > > see the "rcu_sched kthread starved" messages.  For whatever it is 
> > > > > worth,
> > > > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > > > be hypervisor preemption, I guess.
> > > > 
> > > > Okay it makes the warnings go away for me, but I'm just booting then
> > > > leaving the system idle. You're doing some CPU hotplug activity?  
> > > 
> > > Okay found a bug in the patch (it was not forwarding properly before
> > > adding the first timer after an idle) and a few other concerns.
> > > 
> > > There's still a problem of a timer function doing a mod timer from
> > > within expire_timers. It can't forward the base, which might currently
> > > be quite a way behind. I *think* after we close these gaps and get
> > > timely wakeups for timers on there, it should not get too far behind
> > > for standard timers.
> > > 
> > > Deferrable is a different story. Firstly it has no idle tracking so we
> > > never forward it. Even if we wanted to, we can't do it reliably because
> > > it could contain timers way behind the base. They are "deferrable", so
> > > you get what you pay for, but this still means there's a window where
> > > you can add a deferrable timer and get a far later expiry than you
> > > asked for despite the CPU never going idle after you added it.
> > > 
> > > All these problems would seem to go away if mod_timer just queued up
> > > the timer to a single list on the base then pushed them into the
> > > wheel during your wheel processing softirq... Although maybe you end
> > > up with excessive passes over big queue of timers. Anyway that
> > > wouldn't be suitable for 4.13 even if it could work.
> > > 
> > > I'll send out an updated minimal fix after some more testing...
> > 
> > Hi All,
> > 
> > I'm back in the office with hardware access on our D05 64 core ARM64
> > boards.
> > 
> > I think we still have by far the quickest test cases for this so
> > feel free to ping me anything you want tested quickly (we were
> > looking at an average of less than 10 minutes to trigger
> > with machine 

Re: [PATCH 2/2] blk-mq: don't use WORK_CPU_UNBOUND

2017-08-21 Thread Tejun Heo
On Mon, Aug 21, 2017 at 03:49:51PM +0200, Laurent Vivier wrote:
> cpumask is the list of CPUs present when the queue is built.
> If a new CPU is hotplugged, this list is not updated,
> and when the scheduler asks for a CPU id, blk_mq_hctx_next_cpu()
> can return WORK_CPU_UNBOUND.
> And in this case __blk_mq_run_hw_queue() can be executed by the new CPU
> (that is not present in cpumask) and raises the following warning:
> 
> WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
> cpu_online(hctx->next_cpu));
> 
> To fix this problem, this patch modifies blk_mq_hctx_next_cpu()
> to only return a CPU id present in cpumask.

So, this one isn't needed either.

Thanks.

-- 
tejun


Re: [PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-21 Thread Tejun Heo
On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
> is built.
> 
> Unfortunately, on powerpc, the Firmware is only able to provide the
> node of a CPU if the CPU is present. So, in our case (possible CPU)
> CPU ids are known, but as the CPU is not present, the node id is
> unknown and all the unplugged CPUs are attached to node 0.

This is something powerpc needs to fix.  Workqueue isn't the only one
making this assumption.  mm as a whole assumes that CPU <-> node
mapping is stable regardless of hotplug events.  Powerpc people know
about the issue and AFAIK are working on it.

Thanks.

-- 
tejun


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Nicholas Piggin
On Mon, 21 Aug 2017 11:18:33 +0100
Jonathan Cameron  wrote:

> On Mon, 21 Aug 2017 16:06:05 +1000
> Nicholas Piggin  wrote:
> 
> > On Mon, 21 Aug 2017 10:52:58 +1000
> > Nicholas Piggin  wrote:
> >   
> > > On Sun, 20 Aug 2017 14:14:29 -0700
> > > "Paul E. McKenney"  wrote:
> > > 
> > > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:  
> > > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
> > > > >   
> > > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > > Nicholas Piggin  wrote:
> > > > > > 
> > > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > > "Paul E. McKenney"  wrote:
> > > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney 
> > > > > > > > wrote:
> > > > > > > > 
> > > > > > > > Thomas, John, am I misinterpreting the timer trace event 
> > > > > > > > messages?  
> > > > > > > 
> > > > > > > So I did some digging, and what you find is that rcu_sched seems 
> > > > > > > to do a
> > > > > > > simple schedule_timeout(1) and just goes out to lunch for many 
> > > > > > > seconds.
> > > > > > > The process_timeout timer never fires (when it finally does wake 
> > > > > > > after
> > > > > > > one of these events, it usually removes the timer with 
> > > > > > > del_timer_sync).
> > > > > > > 
> > > > > > > So this patch seems to fix it. Testing, comments welcome.
> > > > > > 
> > > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > > callback function.
> > > > > > 
> > > > > > This was my other approach which also fixes the RCU warnings, but 
> > > > > > it's
> > > > > > a little more complex. I reworked it a bit so the mod_timer fast 
> > > > > > path
> > > > > > hopefully doesn't have much more overhead (actually by reading 
> > > > > > jiffies
> > > > > > only when needed, it probably saves a load).
> > > > > 
> > > > > Giving this one a whirl!
> > > > 
> > > > No joy here, but then again there are other reasons to believe that I
> > > > am seeing a different bug than Dave and Jonathan are.
> > > > 
> > > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > > not statistically different from what I see without either patch.
> > > > 
> > > > But no statistical difference compared to without patch, and I still
> > > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > > be hypervisor preemption, I guess.  
> > > 
> > > Okay it makes the warnings go away for me, but I'm just booting then
> > > leaving the system idle. You're doing some CPU hotplug activity?
> > 
> > Okay found a bug in the patch (it was not forwarding properly before
> > adding the first timer after an idle) and a few other concerns.
> > 
> > There's still a problem of a timer function doing a mod timer from
> > within expire_timers. It can't forward the base, which might currently
> > be quite a way behind. I *think* after we close these gaps and get
> > timely wakeups for timers on there, it should not get too far behind
> > for standard timers.
> > 
> > Deferrable is a different story. Firstly it has no idle tracking so we
> > never forward it. Even if we wanted to, we can't do it reliably because
> > it could contain timers way behind the base. They are "deferrable", so
> > you get what you pay for, but this still means there's a window where
> > you can add a deferrable timer and get a far later expiry than you
> > asked for despite the CPU never going idle after you added it.
> > 
> > All these problems would seem to go away if mod_timer just queued up
> > the timer to a single list on the base then pushed them into the
> > wheel during your wheel processing softirq... Although maybe you end
> > up with excessive passes over big queue of timers. Anyway that
> > wouldn't be suitable for 4.13 even if it could work.
> > 
> > I'll send out an updated minimal fix after some more testing...  
> 
> Hi All,
> 
> I'm back in the office with hardware access on our D05 64 core ARM64
> boards.
> 
> I think we still have by far the quickest test cases for this so
> feel free to ping me anything you want tested quickly (we were
> looking at an average of less than 10 minutes to trigger
> with machine idling).
> 
> Nick, I'm currently running your previous version and we are over an
> hour so far without any instances of the issue, so it looks like a
> considerable improvement.  I'll see if I can line a couple of boards
> up for an overnight run if you have your updated version out by then.
> 
> Be great to finally put this one to bed.

Hi Jonathan,

Thanks 

[PATCH 2/2] blk-mq: don't use WORK_CPU_UNBOUND

2017-08-21 Thread Laurent Vivier
cpumask is the list of CPUs present when the queue is built.
If a new CPU is hotplugged, this list is not updated,
and when the scheduler asks for a CPU id, blk_mq_hctx_next_cpu()
can return WORK_CPU_UNBOUND.
And in this case __blk_mq_run_hw_queue() can be executed by the new CPU
(that is not present in cpumask) and raises the following warning:

WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
cpu_online(hctx->next_cpu));

To fix this problem, this patch modifies blk_mq_hctx_next_cpu()
to only return a CPU id present in cpumask.

Signed-off-by: Laurent Vivier 
---
 block/blk-mq.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4603b115e234..bdac1e654814 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1126,9 +1126,6 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  */
 static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
 {
-   if (hctx->queue->nr_hw_queues == 1)
-   return WORK_CPU_UNBOUND;
-
if (--hctx->next_cpu_batch <= 0) {
int next_cpu;
 
-- 
2.13.5



[PATCH 1/2] powerpc/workqueue: update list of possible CPUs

2017-08-21 Thread Laurent Vivier
In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
is built.

Unfortunately, on powerpc, the Firmware is only able to provide the
node of a CPU if the CPU is present. So, in our case (possible CPU)
CPU ids are known, but as the CPU is not present, the node id is
unknown and all the unplugged CPUs are attached to node 0.

When a CPU is hotplugged, there can be an inconsistency between
the node id known by the workqueue and the real node id of
the CPU.

This patch adds a hotplug handler that updates the node entry
for the hotplugged CPU in the wq_numa_possible_cpumask array.

Signed-off-by: Laurent Vivier 
---
 arch/powerpc/kernel/smp.c |  1 +
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c| 21 +
 3 files changed, 23 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d3320562c70..abc533146514 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -964,6 +964,7 @@ void start_secondary(void *unused)
 
set_numa_node(numa_cpu_lookup_table[cpu]);
set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
+   wq_numa_add_possible_cpu(cpu);
 
smp_wmb();
notify_cpu_starting(cpu);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9dc0482..4ef030dae22c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -644,6 +644,7 @@ int workqueue_online_cpu(unsigned int cpu);
 int workqueue_offline_cpu(unsigned int cpu);
 #endif
 
+void wq_numa_add_possible_cpu(unsigned int cpu);
 int __init workqueue_init_early(void);
 int __init workqueue_init(void);
 
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0c3a96..d1a99e25d5da 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5486,6 +5486,27 @@ static inline void wq_watchdog_init(void) { }
 
 #endif /* CONFIG_WQ_WATCHDOG */
 
+void wq_numa_add_possible_cpu(unsigned int cpu)
+{
+   int node;
+
+   if (num_possible_nodes() <= 1)
+   return;
+
+   if (wq_disable_numa)
+   return;
+
+   if (!wq_numa_enabled)
+   return;
+
+   node = cpu_to_node(cpu);
+   if (node == NUMA_NO_NODE)
+   return;
+
+   cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+EXPORT_SYMBOL_GPL(wq_numa_add_possible_cpu);
+
 static void __init wq_numa_init(void)
 {
cpumask_var_t *tbl;
-- 
2.13.5



Re: [v3, 3/3] powerpc/mm/cxl: Add the fault handling cpu to mm cpumask

2017-08-21 Thread Michael Ellerman
On Thu, 2017-07-27 at 06:24:55 UTC, "Aneesh Kumar K.V" wrote:
> We use the mm cpumask for serializing against lockless page table walks.
> Anybody doing a lockless page table walk is expected to disable irqs, and
> only cpus in the mm cpumask are expected to do the lockless walk. This
> ensures that a THP split can send IPIs to only the cpus in the mm cpumask,
> making sure there are no parallel lockless page table walks.
> 
> Add the CAPI fault handling cpu to the mm cpumask so that we can do the
> lockless page table walk while inserting hash page table entries.
> 
> Reviewed-by: Frederic Barrat 
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/0f4bc0932e51817105fdee77a46680

cheers


Re: [v3,2/3] powerpc/mm: Don't send IPI to all cpus on THP updates

2017-08-21 Thread Michael Ellerman
On Thu, 2017-07-27 at 06:24:54 UTC, "Aneesh Kumar K.V" wrote:
> Now that we made sure that lockless walk of linux page table is mostly
> limited to current task (current->mm->pgdir) we can update the THP update
> sequence to only send IPI to cpus on which this task has run. This helps
> in reducing the IPI overload on systems with large number of CPUs.
> 
> W.r.t kvm even though kvm is walking page table with vcpu->arch.pgdir, it is
> done only on secondary cpus and in that case we have primary cpu added to
> task's mm cpumask. Sending an IPI to primary will force the secondary to do
> a vm exit and hence this mm cpumask usage is safe here.
> 
> W.r.t CAPI, we still end up walking linux page table with capi context MM. For
> now the pte lookup serialization sends an IPI to all cpus when CAPI is in use. We
> can further improve this by adding the CAPI interrupt handling cpu to task
> mm cpumask. That will be done in a later patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/fa4531f753f1c80d21b5eb86ec5c02

cheers


Re: [v3,1/3] powerpc/mm: Rename find_linux_pte_or_hugepte

2017-08-21 Thread Michael Ellerman
On Thu, 2017-07-27 at 06:24:53 UTC, "Aneesh Kumar K.V" wrote:
> Add newer helpers to make the function usage simpler. It is always recommended
> to use find_current_mm_pte() for walking the page table. If we cannot use
> find_current_mm_pte(), it should be documented why the said usage of
> __find_linux_pte() is safe against a parallel THP split.
> 
> For now we have KVM code using __find_linux_pte(). This is because kvm code
> ends up calling __find_linux_pte() in real mode with MSR_EE=0 but with PACA
> soft_enabled = 1. We may want to fix that later and make sure we keep the
> MSR_EE and PACA soft_enabled in sync. When we do that we can switch kvm to
> use find_linux_pte().
> 
> Signed-off-by: Aneesh Kumar K.V 

Applied to powerpc next and topic/ppc-kvm, thanks.

https://git.kernel.org/powerpc/c/94171b19c3f1f4d9d4c0e3aaa1aa16

cheers


Re: [v2] macintosh: rack-meter: make of_device_ids const.

2017-08-21 Thread Michael Ellerman
On Mon, 2017-06-19 at 05:44:25 UTC, Arvind Yadav wrote:
> of_device_ids are not supposed to change at runtime. All functions
> working with of_device_ids provided by <linux/of.h> work with const
> of_device_ids. So mark the non-const structs as const.
> 
> File size before:
>text  data bss dec hex filename
> 407   576   0 983 3d7 drivers/macintosh/rack-meter.o
> 
> File size after constify rackmeter_match.
>text  data bss dec hex filename
> 807   176   0 983 3d7 drivers/macintosh/rack-meter.o
> 
> Signed-off-by: Arvind Yadav 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/516fa8d0e19d854253460ea7692551

cheers


[PATCH] i2c: busses: make i2c_adapter_quirks const

2017-08-21 Thread Bhumika Goyal
Make these const as they are only stored as a reference in the quirks
field of an i2c_adapter structure, which is const.

Done using Coccinelle:
@match disable optional_qualifier@
identifier s;
@@
static struct i2c_adapter_quirks s = {...};

@ref@
position p;
identifier match.s;
@@
s@p

@good1@
identifier y;
position ref.p;
identifier match.s;
@@
struct i2c_adapter y = {...,.quirks=&s@p,...};

@good2@
struct i2c_adapter y;
identifier match.s;
position ref.p;
@@
y.quirks = &s@p

@bad depends on !good1 && !good2@
position ref.p;
identifier match.s;
@@
s@p

@depends on forall !bad disable optional_qualifier@
identifier match.s;
@@
static
+ const
struct i2c_adapter_quirks s;

Signed-off-by: Bhumika Goyal 
---
 drivers/i2c/busses/i2c-at91.c | 2 +-
 drivers/i2c/busses/i2c-cpm.c  | 2 +-
 drivers/i2c/busses/i2c-mlxcpld.c  | 2 +-
 drivers/i2c/busses/i2c-opal.c | 2 +-
 drivers/i2c/busses/i2c-powermac.c | 2 +-
 drivers/i2c/busses/i2c-qup.c  | 2 +-
 drivers/i2c/busses/i2c-tegra.c| 2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/i2c/busses/i2c-at91.c b/drivers/i2c/busses/i2c-at91.c
index 38dd61d..bfd1fdf 100644
--- a/drivers/i2c/busses/i2c-at91.c
+++ b/drivers/i2c/busses/i2c-at91.c
@@ -809,7 +809,7 @@ static int at91_twi_xfer(struct i2c_adapter *adap, struct 
i2c_msg *msg, int num)
  * The hardware can handle at most two messages concatenated by a
  * repeated start via it's internal address feature.
  */
-static struct i2c_adapter_quirks at91_twi_quirks = {
+static const struct i2c_adapter_quirks at91_twi_quirks = {
.flags = I2C_AQ_COMB | I2C_AQ_COMB_WRITE_FIRST | I2C_AQ_COMB_SAME_ADDR,
.max_comb_1st_msg_len = 3,
 };
diff --git a/drivers/i2c/busses/i2c-cpm.c b/drivers/i2c/busses/i2c-cpm.c
index d89bde2..8a8ca94 100644
--- a/drivers/i2c/busses/i2c-cpm.c
+++ b/drivers/i2c/busses/i2c-cpm.c
@@ -413,7 +413,7 @@ static u32 cpm_i2c_func(struct i2c_adapter *adap)
 };
 
 /* CPM_MAX_READ is also limiting writes according to the code! */
-static struct i2c_adapter_quirks cpm_i2c_quirks = {
+static const struct i2c_adapter_quirks cpm_i2c_quirks = {
.max_num_msgs = CPM_MAXBD,
.max_read_len = CPM_MAX_READ,
.max_write_len = CPM_MAX_READ,
diff --git a/drivers/i2c/busses/i2c-mlxcpld.c b/drivers/i2c/busses/i2c-mlxcpld.c
index d271e6a..4c28fa2 100644
--- a/drivers/i2c/busses/i2c-mlxcpld.c
+++ b/drivers/i2c/busses/i2c-mlxcpld.c
@@ -433,7 +433,7 @@ static u32 mlxcpld_i2c_func(struct i2c_adapter *adap)
.functionality  = mlxcpld_i2c_func
 };
 
-static struct i2c_adapter_quirks mlxcpld_i2c_quirks = {
+static const struct i2c_adapter_quirks mlxcpld_i2c_quirks = {
.flags = I2C_AQ_COMB_WRITE_THEN_READ,
.max_read_len = MLXCPLD_I2C_DATA_REG_SZ - MLXCPLD_I2C_MAX_ADDR_LEN,
.max_write_len = MLXCPLD_I2C_DATA_REG_SZ,
diff --git a/drivers/i2c/busses/i2c-opal.c b/drivers/i2c/busses/i2c-opal.c
index 11e2a1f..0aabb7e 100644
--- a/drivers/i2c/busses/i2c-opal.c
+++ b/drivers/i2c/busses/i2c-opal.c
@@ -204,7 +204,7 @@ static u32 i2c_opal_func(struct i2c_adapter *adapter)
  * For two messages, we basically support simple smbus transactions of a
  * write-then-anything.
  */
-static struct i2c_adapter_quirks i2c_opal_quirks = {
+static const struct i2c_adapter_quirks i2c_opal_quirks = {
.flags = I2C_AQ_COMB | I2C_AQ_COMB_WRITE_FIRST | I2C_AQ_COMB_SAME_ADDR,
.max_comb_1st_msg_len = 4,
 };
diff --git a/drivers/i2c/busses/i2c-powermac.c 
b/drivers/i2c/busses/i2c-powermac.c
index ef9c858..f2a2067 100644
--- a/drivers/i2c/busses/i2c-powermac.c
+++ b/drivers/i2c/busses/i2c-powermac.c
@@ -197,7 +197,7 @@ static u32 i2c_powermac_func(struct i2c_adapter * adapter)
.functionality  = i2c_powermac_func,
 };
 
-static struct i2c_adapter_quirks i2c_powermac_quirks = {
+static const struct i2c_adapter_quirks i2c_powermac_quirks = {
.max_num_msgs = 1,
 };
 
diff --git a/drivers/i2c/busses/i2c-qup.c b/drivers/i2c/busses/i2c-qup.c
index 1902d8a..08f8e01 100644
--- a/drivers/i2c/busses/i2c-qup.c
+++ b/drivers/i2c/busses/i2c-qup.c
@@ -1396,7 +1396,7 @@ static u32 qup_i2c_func(struct i2c_adapter *adap)
  * the end of the read, the length of the read is specified as one byte
  * which limits the possible read to 256 (QUP_READ_LIMIT) bytes.
  */
-static struct i2c_adapter_quirks qup_i2c_quirks = {
+static const struct i2c_adapter_quirks qup_i2c_quirks = {
.max_read_len = QUP_READ_LIMIT,
 };
 
diff --git a/drivers/i2c/busses/i2c-tegra.c b/drivers/i2c/busses/i2c-tegra.c
index a238844..60292d2 100644
--- a/drivers/i2c/busses/i2c-tegra.c
+++ b/drivers/i2c/busses/i2c-tegra.c
@@ -793,7 +793,7 @@ static void tegra_i2c_parse_dt(struct tegra_i2c_dev 
*i2c_dev)
 };
 
 /* payload size is only 12 bit */
-static struct i2c_adapter_quirks tegra_i2c_quirks = {
+static const struct i2c_adapter_quirks tegra_i2c_quirks = {
.max_read_len = 4096,
.max_write_len = 4096,
 };
-- 
1.9.1



Re: [linux-next][bisected c64e09ce] sysctl command hung indefinitely

2017-08-21 Thread Stephen Rothwell
Hi Michal,

On Mon, 21 Aug 2017 09:50:01 +0200 Michal Hocko  wrote:
>
> commit 69885605ee3ba681deb54021e3df645f46589ba1
> Author: Michal Hocko 
> Date:   Mon Aug 21 09:46:04 2017 +0200
> 
> mmotm: mm-page_alloc-rip-out-zonelist_order_zone-fix
> 
> Abdul has noticed that reading sysctl vm.numa_zonelist_order
> will never terminate. This is because of
> http://lkml.kernel.org/r/20170714080006.7250-2-mho...@kernel.org
> where the reading side doesn't update ppos and so the reader will
> never get 0. Return back to proc_dostring which does all the necessary
> stuff.
> 
> Reported-by: Abdul Haleem 
> Signed-off-by: Michal Hocko 

I have added that to linux-next for tomorrow.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH v7 3/4] lib/cmdline.c Remove quotes symmetrically.

2017-08-21 Thread Michal Suchánek
On Mon, 21 Aug 2017 18:27:12 +0800
Baoquan He  wrote:

> On 08/17/17 at 10:14pm, Michal Suchanek wrote:
> > Remove quotes from argument value only if there is a quote on both
> > sides.
> > 
> > Signed-off-by: Michal Suchanek   
> 
> Sounds reasonable. Just for curiosity, do we have chance to pass in
> option with a single '"'?

No, we don't. Perhaps it would work if it was at the end of the
commandline.

next_arg checks that quoting is closed.

It was possible but undocumented with previous behavior - you would
place the quote in the middle and the closing quote at start or end so
that next_arg would remove the closing quote but not the one in the
middle.

It would be also possible with shell-like backslash escaping but that
is not implemented.

Thanks

Michal


Re: [PATCH v2 1/1] Split VGA default device handler out of VGA arbiter

2017-08-21 Thread Lorenzo Pieralisi
On Thu, Aug 17, 2017 at 09:30:28PM +1000, Daniel Axtens wrote:
> A system without PCI legacy resources (e.g. ARM64) may find that no
> default/boot VGA device has been marked, because the VGA arbiter
> checks for legacy resource decoding before marking a card as default.

I do not understand this paragraph, in particular:

- "A system without PCI legacy resources (e.g. ARM64)". What does this
  mean ? I take this as "ARM64 does not support IO space"; if a PCI host
  bridge supports IO space, there is nothing from an architectural
  point of view that prevents an MMIO based IO space implementation to
  work on ARM64. It is PCI bridge specific, not arch specific.
- "the VGA arbiter checks for legacy resources". Do you mean it checks
  if a VGA class device enabled IO/MEM address spaces (and if the
  bridges upstream forwards those address spaces), correct ?

The assumption relying on ARM64 not supporting IO space (if that's what
you mean by "a system without PCI legacy resources") is not correct.

Thanks,
Lorenzo

> Split the small bit of code that does default VGA handling out from
> the arbiter. Add a Kconfig option to allow the kernel to be built
> with just the default handling, or the arbiter and default handling.
> 
> Add handling for devices that should be marked as default but aren't
> handled by the vga arbiter by adding a late initcall and a class
> enable hook. If there is no default from vgaarb then the first card
> that is enabled, has a driver bound, and can decode memory or I/O
> will be marked as default.
> 
> Signed-off-by: Daniel Axtens 
> 
> ---
> 
> v2: Tested on:
>  - x86_64 laptop
>  - arm64 D05 board with hibmc card
>  - qemu powerpc with tcg and bochs std-vga
> 
> I know this adds another config option and that's a bit sad, but
> we can't include it unconditionally as it depends on PCI.
> Suggestions welcome.
> ---
>  arch/ia64/pci/fixup.c|   2 +-
>  arch/powerpc/kernel/pci-common.c |   2 +-
>  arch/x86/pci/fixup.c |   2 +-
>  arch/x86/video/fbdev.c   |   2 +-
>  drivers/gpu/vga/Kconfig  |  12 +++
>  drivers/gpu/vga/Makefile |   1 +
>  drivers/gpu/vga/vga_default.c| 159 
> +++
>  drivers/gpu/vga/vga_switcheroo.c |   2 +-
>  drivers/gpu/vga/vgaarb.c |  41 +-
>  drivers/pci/pci-sysfs.c  |   2 +-
>  include/linux/vga_default.h  |  44 +++
>  include/linux/vgaarb.h   |  14 
>  12 files changed, 225 insertions(+), 58 deletions(-)
>  create mode 100644 drivers/gpu/vga/vga_default.c
>  create mode 100644 include/linux/vga_default.h
> 
> diff --git a/arch/ia64/pci/fixup.c b/arch/ia64/pci/fixup.c
> index 41caa99add51..b35d1cf4501a 100644
> --- a/arch/ia64/pci/fixup.c
> +++ b/arch/ia64/pci/fixup.c
> @@ -5,7 +5,7 @@
>  
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  #include 
> diff --git a/arch/powerpc/kernel/pci-common.c 
> b/arch/powerpc/kernel/pci-common.c
> index 341a7469cab8..4fd890a51d18 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -31,7 +31,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  
>  #include 
>  #include 
> diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
> index 11e407489db0..b1254bc09a45 100644
> --- a/arch/x86/pci/fixup.c
> +++ b/arch/x86/pci/fixup.c
> @@ -5,7 +5,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  #include 
>  #include 
>  
> diff --git a/arch/x86/video/fbdev.c b/arch/x86/video/fbdev.c
> index 9fd24846d094..62cfa74ea86e 100644
> --- a/arch/x86/video/fbdev.c
> +++ b/arch/x86/video/fbdev.c
> @@ -9,7 +9,7 @@
>  #include 
>  #include 
>  #include 
> -#include 
> +#include 
>  
>  int fb_is_primary_device(struct fb_info *info)
>  {
> diff --git a/drivers/gpu/vga/Kconfig b/drivers/gpu/vga/Kconfig
> index 29437eabe095..81d4105aecf6 100644
> --- a/drivers/gpu/vga/Kconfig
> +++ b/drivers/gpu/vga/Kconfig
> @@ -1,3 +1,14 @@
> +config VGA_DEFAULT
> + bool "VGA Default Device Support" if EXPERT
> + default y
> + depends on PCI
> + help
> +   Some programs find it helpful to know what VGA device is the default.
> +   On platforms like x86 this means the device used by the BIOS to show
> +   early boot messages. On other platforms this may be an arbitrary PCI
> +   graphics card. Select this to have a default device recorded within
> +   the kernel and exposed to userspace through sysfs.
> +
>  config VGA_ARB
>   bool "VGA Arbitration" if EXPERT
>   default y
> @@ -22,6 +33,7 @@ config VGA_SWITCHEROO
>   depends on X86
>   depends on ACPI
>   select VGA_ARB
> + select VGA_DEFAULT
>   help
> Many laptops released in 2008/9/10 have two GPUs with a multiplexer
> to switch between them. This adds support for dynamic switching when
> diff --git a/drivers/gpu/vga/Makefile b/drivers/gpu/vga/Makefile
> index 

Re: [PATCH v7 3/4] lib/cmdline.c Remove quotes symmetrically.

2017-08-21 Thread Baoquan He
On 08/17/17 at 10:14pm, Michal Suchanek wrote:
> Remove quotes from argument value only if there is a quote on both sides.
> 
> Signed-off-by: Michal Suchanek 

Sounds reasonable. Just for curiosity, do we have chance to pass in
option with a single '"'?

> ---
>  arch/powerpc/kernel/fadump.c | 6 ++
>  lib/cmdline.c| 7 ++-
>  2 files changed, 4 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index a1614d9b8a21..d7da4ce9f7ae 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -489,10 +489,8 @@ static void __init fadump_update_params(struct 
> param_info *param_info,
>   *tgt++ = ' ';
>  
>   /* next_arg removes one leading and one trailing '"' */
> - if (*tgt == '"')
> - shortening += 1;
> - if (*(tgt + vallen + shortening) == '"')
> - shortening += 1;
> + if ((*tgt == '"') && (*(tgt + vallen + shortening) == '"'))
> + shortening += 2;
>  
>   /* remove one leading and one trailing quote if both are present */
>   if ((val[0] == '"') && (val[vallen - 1] == '"')) {
> diff --git a/lib/cmdline.c b/lib/cmdline.c
> index 4c0888c4a68d..01e701b2afe8 100644
> --- a/lib/cmdline.c
> +++ b/lib/cmdline.c
> @@ -227,14 +227,11 @@ char *next_arg(char *args, char **param, char **val)
>   *val = args + equals + 1;
>  
>   /* Don't include quotes in value. */
> - if (**val == '"') {
> + if ((**val == '"') && (args[i-1] == '"')) {
>   (*val)++;
> - if (args[i-1] == '"')
> - args[i-1] = '\0';
> + args[i-1] = '\0';
>   }
>   }
> - if (quoted && args[i-1] == '"')
> - args[i-1] = '\0';
>  
>   if (args[i]) {
>   args[i] = '\0';
> -- 
> 2.10.2
> 


Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Jonathan Cameron
On Mon, 21 Aug 2017 16:06:05 +1000
Nicholas Piggin  wrote:

> On Mon, 21 Aug 2017 10:52:58 +1000
> Nicholas Piggin  wrote:
> 
> > On Sun, 20 Aug 2017 14:14:29 -0700
> > "Paul E. McKenney"  wrote:
> >   
> > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
> > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
> > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > Nicholas Piggin  wrote:
> > > > >   
> > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > "Paul E. McKenney"  wrote:  
> > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > > 
> > > > > > > Thomas, John, am I misinterpreting the timer trace event 
> > > > > > > messages?
> > > > > > 
> > > > > > So I did some digging, and what you find is that rcu_sched seems to 
> > > > > > do a
> > > > > > simple schedule_timeout(1) and just goes out to lunch for many 
> > > > > > seconds.
> > > > > > The process_timeout timer never fires (when it finally does wake 
> > > > > > after
> > > > > > one of these events, it usually removes the timer with 
> > > > > > del_timer_sync).
> > > > > > 
> > > > > > So this patch seems to fix it. Testing, comments welcome.  
> > > > > 
> > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > callback function.
> > > > > 
> > > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > > only when needed, it probably saves a load).  
> > > > 
> > > > Giving this one a whirl!  
> > > 
> > > No joy here, but then again there are other reasons to believe that I
> > > am seeing a different bug than Dave and Jonathan are.
> > > 
> > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > not statistically different from what I see without either patch.
> > > 
> > > But no statistical difference compared to without patch, and I still
> > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > be hypervisor preemption, I guess.
> > 
> > Okay it makes the warnings go away for me, but I'm just booting then
> > leaving the system idle. You're doing some CPU hotplug activity?  
> 
> Okay found a bug in the patch (it was not forwarding properly before
> adding the first timer after an idle) and a few other concerns.
> 
> There's still a problem of a timer function doing a mod timer from
> within expire_timers. It can't forward the base, which might currently
> be quite a way behind. I *think* after we close these gaps and get
> timely wakeups for timers on there, it should not get too far behind
> for standard timers.
> 
> Deferrable is a different story. Firstly it has no idle tracking so we
> never forward it. Even if we wanted to, we can't do it reliably because
> it could contain timers way behind the base. They are "deferrable", so
> you get what you pay for, but this still means there's a window where
> you can add a deferrable timer and get a far later expiry than you
> asked for despite the CPU never going idle after you added it.
> 
> All these problems would seem to go away if mod_timer just queued up
> the timer to a single list on the base then pushed them into the
> wheel during your wheel processing softirq... Although maybe you end
> up with excessive passes over big queue of timers. Anyway that
> wouldn't be suitable for 4.13 even if it could work.
> 
> I'll send out an updated minimal fix after some more testing...

Hi All,

I'm back in the office with hardware access on our D05 64 core ARM64
boards.

I think we still have by far the quickest test cases for this so
feel free to ping me anything you want tested quickly (we were
looking at an average of less than 10 minutes to trigger
with machine idling).

Nick, I'm currently running your previous version and we are over an
hour so even without any instances of the issue so it looks like a
considerable improvement.  I'll see if I can line a couple of boards
up for an overnight run if you have your updated version out by then.

Be great to finally put this one to bed.

Thanks,

Jonathan

> 
> Thanks,
> Nick



Re: [PATCH v2 17/20] perf: Add a speculative page fault sw event

2017-08-21 Thread Anshuman Khandual
On 08/18/2017 03:35 AM, Laurent Dufour wrote:
> Add a new software event to count succeeded speculative page faults.
> 
> Signed-off-by: Laurent Dufour 

Should be merged with the next patch.



Re: [PATCH v2 18/20] perf tools: Add support for the SPF perf event

2017-08-21 Thread Anshuman Khandual
On 08/18/2017 03:35 AM, Laurent Dufour wrote:
> Add support for the new speculative faults event.
> 
> Signed-off-by: Laurent Dufour 
> ---
>  tools/include/uapi/linux/perf_event.h | 1 +
>  tools/perf/util/evsel.c   | 1 +
>  tools/perf/util/parse-events.c| 4 
>  tools/perf/util/parse-events.l| 1 +
>  tools/perf/util/python.c  | 1 +
>  5 files changed, 8 insertions(+)
> 
> diff --git a/tools/include/uapi/linux/perf_event.h 
> b/tools/include/uapi/linux/perf_event.h
> index b1c0b187acfe..3043ec0988e9 100644
> --- a/tools/include/uapi/linux/perf_event.h
> +++ b/tools/include/uapi/linux/perf_event.h
> @@ -111,6 +111,7 @@ enum perf_sw_ids {
>   PERF_COUNT_SW_EMULATION_FAULTS  = 8,
>   PERF_COUNT_SW_DUMMY = 9,
>   PERF_COUNT_SW_BPF_OUTPUT= 10,
> + PERF_COUNT_SW_SPF_DONE  = 11,

Right, just one event for the success case. 'DONE' is redundant, only
'SPF' should be fine IMHO.



Re: [linux-next][bisected c64e09ce] sysctl command hung indefinitely

2017-08-21 Thread Abdul Haleem
On Mon, 2017-08-21 at 09:50 +0200, Michal Hocko wrote:
> On Sat 19-08-17 15:49:31, Abdul Haleem wrote:
> > Hi Michal,
> > 
> > 'sysctl -a' command never completes on my PowerPC machine with latest
> > next kernel (As of next-20170811)
> > 
> > Machine Type : Power8 bare-metal
> > Kernel version : 4.13.0-rc4-next-20170817
> > gcc version : 4.8.5
> > 
> > 
> > command output
> > --
> > $ sysctl -a
> > [...
> > vm.hugetlb_shm_group = 0
> > vm.laptop_mode = 0
> > vm.legacy_va_layout = 0
> > vm.lowmem_reserve_ratio = 256   256 32
> > vm.max_map_count = 65530
> > vm.min_free_kbytes = 6637
> > vm.min_slab_ratio = 5
> > vm.min_unmapped_ratio = 1
> > vm.mmap_min_addr = 4096
> > vm.mmap_rnd_bits = 14
> > vm.mmap_rnd_compat_bits = 7
> > vm.nr_hugepages = 0
> > vm.nr_hugepages_mempolicy = 0
> > vm.nr_overcommit_hugepages = 0
> > vm.nr_pdflush_threads = 0
> > vm.numa_zonelist_order = Node
> > vm.numa_zonelist_order = e
> > vm.numa_zonelist_order = ode
> > vm.numa_zonelist_order = 
> > vm.numa_zonelist_order = de
> > vm.numa_zonelist_order = Node
> > vm.numa_zonelist_order = e
> > vm.numa_zonelist_order = ode
> > vm.numa_zonelist_order = 
> > vm.numa_zonelist_order = de
> > vm.numa_zonelist_order = Node
> > vm.numa_zonelist_order = e
> > vm.numa_zonelist_order = ode
> > vm.numa_zonelist_order = 
> > vm.numa_zonelist_order = de
> > vm.numa_zonelist_order = Node
> > vm.numa_zonelist_order = e
> > vm.numa_zonelist_order = ode
> > ]
> > 
> > The last string 'vm.numa_zonelist_order = ' keeps flooding the stdout
> > and command never exit.
> > 
> > A bisection resulted commit c64e09ce mm, page_alloc: rip out
> > ZONELIST_ORDER_ZONE
> 
> Yeah, my read implementation is broken. I do not update the ppos and so
> the reader doesn't know it should exit. Mel was suggesting to keep the
> proc_dostring but I thought I was more clever. Sigh...
> 
> This should do the trick. Andrew, could you fold it into
> mm-page_alloc-rip-out-zonelist_order_zone.patch please?
> 
> Thanks for the report Abdul!
> 
> ---
> commit 69885605ee3ba681deb54021e3df645f46589ba1
> Author: Michal Hocko 
> Date:   Mon Aug 21 09:46:04 2017 +0200
> 
> mmotm: mm-page_alloc-rip-out-zonelist_order_zone-fix
> 
> Abdul has noticed that reading sysctl vm.numa_zonelist_order
> will never terminate. This is because of
> http://lkml.kernel.org/r/20170714080006.7250-2-mho...@kernel.org
> where the reading side doesn't update ppos and so the reader will
> never get 0. Return back to proc_dostring which does all the necessary
> stuff.
> 
> Reported-by: Abdul Haleem 
> Signed-off-by: Michal Hocko 
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fda9afbd14d9..e7e92c8f4883 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h

Thanks for the fix Michal, the patch fixes the problem.

Reported-and-Tested-by: Abdul Haleem 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [linux-next][bisected c64e09ce] sysctl command hung indefinitely

2017-08-21 Thread Michal Hocko
On Sat 19-08-17 15:49:31, Abdul Haleem wrote:
> Hi Michal,
> 
> 'sysctl -a' command never completes on my PowerPC machine with latest
> next kernel (As of next-20170811)
> 
> Machine Type : Power8 bare-metal
> Kernel version : 4.13.0-rc4-next-20170817
> gcc version : 4.8.5
> 
> 
> command output
> --
> $ sysctl -a
> [...
> vm.hugetlb_shm_group = 0
> vm.laptop_mode = 0
> vm.legacy_va_layout = 0
> vm.lowmem_reserve_ratio = 256 256 32
> vm.max_map_count = 65530
> vm.min_free_kbytes = 6637
> vm.min_slab_ratio = 5
> vm.min_unmapped_ratio = 1
> vm.mmap_min_addr = 4096
> vm.mmap_rnd_bits = 14
> vm.mmap_rnd_compat_bits = 7
> vm.nr_hugepages = 0
> vm.nr_hugepages_mempolicy = 0
> vm.nr_overcommit_hugepages = 0
> vm.nr_pdflush_threads = 0
> vm.numa_zonelist_order = Node
> vm.numa_zonelist_order = e
> vm.numa_zonelist_order = ode
> vm.numa_zonelist_order = 
> vm.numa_zonelist_order = de
> vm.numa_zonelist_order = Node
> vm.numa_zonelist_order = e
> vm.numa_zonelist_order = ode
> vm.numa_zonelist_order = 
> vm.numa_zonelist_order = de
> vm.numa_zonelist_order = Node
> vm.numa_zonelist_order = e
> vm.numa_zonelist_order = ode
> vm.numa_zonelist_order = 
> vm.numa_zonelist_order = de
> vm.numa_zonelist_order = Node
> vm.numa_zonelist_order = e
> vm.numa_zonelist_order = ode
> ]
> 
> The last string 'vm.numa_zonelist_order = ' keeps flooding the stdout
> and command never exit.
> 
> A bisection resulted commit c64e09ce mm, page_alloc: rip out
> ZONELIST_ORDER_ZONE

Yeah, my read implementation is broken. I do not update the ppos and so
the reader doesn't know it should exit. Mel was suggesting to keep the
proc_dostring but I thought I was more clever. Sigh...

This should do the trick. Andrew, could you fold it into
mm-page_alloc-rip-out-zonelist_order_zone.patch please?

Thanks for the report Abdul!

---
commit 69885605ee3ba681deb54021e3df645f46589ba1
Author: Michal Hocko 
Date:   Mon Aug 21 09:46:04 2017 +0200

mmotm: mm-page_alloc-rip-out-zonelist_order_zone-fix

Abdul has noticed that reading sysctl vm.numa_zonelist_order
will never terminate. This is because of
http://lkml.kernel.org/r/20170714080006.7250-2-mho...@kernel.org
where the reading side doesn't update ppos and so the reader will
never get 0. Return back to proc_dostring which does all the necessary
stuff.

Reported-by: Abdul Haleem 
Signed-off-by: Michal Hocko 

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fda9afbd14d9..e7e92c8f4883 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -894,6 +894,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table 
*, int,
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+extern char numa_zonelist_order[];
+#define NUMA_ZONELIST_ORDER_LEN	16
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0cbce40f5426..655686d546cb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1553,6 +1553,8 @@ static struct ctl_table vm_table[] = {
 #ifdef CONFIG_NUMA
{
.procname   = "numa_zonelist_order",
+   .data   = &numa_zonelist_order,
+   .maxlen = NUMA_ZONELIST_ORDER_LEN,
.mode   = 0644,
.proc_handler   = numa_zonelist_order_handler,
},
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5cba36337893..b55e97decac4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4859,6 +4859,8 @@ static __init int setup_numa_zonelist_order(char *s)
 }
 early_param("numa_zonelist_order", setup_numa_zonelist_order);
 
+char numa_zonelist_order[] = "Node";
+
 /*
  * sysctl handler for numa_zonelist_order
  */
@@ -4869,12 +4871,8 @@ int numa_zonelist_order_handler(struct ctl_table *table, 
int write,
char *str;
int ret;
 
-   if (!write) {
-   int len = sizeof("Node");
-   if (copy_to_user(buffer, "Node", len))
-   return -EFAULT;
-   return len;
-   }
+   if (!write)
+   return proc_dostring(table, write, buffer, length, ppos);
str = memdup_user_nul(buffer, 16);
if (IS_ERR(str))
return PTR_ERR(str);
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 19/20] x86/mm: Add speculative pagefault handling

2017-08-21 Thread Anshuman Khandual
On 08/18/2017 03:35 AM, Laurent Dufour wrote:
> From: Peter Zijlstra 
> 
> Try a speculative fault before acquiring mmap_sem, if it returns with
> VM_FAULT_RETRY continue with the mmap_sem acquisition and do the
> traditional fault.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> 
> [Clearing of FAULT_FLAG_ALLOW_RETRY is now done in
>  handle_speculative_fault()]
> [Retry with usual fault path in the case VM_ERROR is returned by
>  handle_speculative_fault(). This allows signal to be delivered]
> Signed-off-by: Laurent Dufour 
> ---
>  arch/x86/include/asm/pgtable_types.h |  7 +++
>  arch/x86/mm/fault.c  | 19 +++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h 
> b/arch/x86/include/asm/pgtable_types.h
> index bf9638e1ee42..4fd2693a037e 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -234,6 +234,13 @@ enum page_cache_mode {
>  #define PGD_IDENT_ATTR0x001  /* PRESENT (no other 
> attributes) */
>  #endif
>  
> +/*
> + * Advertise that we call the Speculative Page Fault handler.
> + */
> +#ifdef CONFIG_X86_64
> +#define __HAVE_ARCH_CALL_SPF
> +#endif
> +
>  #ifdef CONFIG_X86_32
>  # include 
>  #else
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 2a1fa10c6a98..4c070b9a4362 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1365,6 +1365,24 @@ __do_page_fault(struct pt_regs *regs, unsigned long 
> error_code,
>   if (error_code & PF_INSTR)
>   flags |= FAULT_FLAG_INSTRUCTION;
>  
> +#ifdef __HAVE_ARCH_CALL_SPF
> + if (error_code & PF_USER) {
> + fault = handle_speculative_fault(mm, address, flags);
> +
> + /*
> +  * We also check against VM_FAULT_ERROR because we have to
> +  * raise a signal by calling later mm_fault_error() which
> +  * requires the vma pointer to be set. So in that case,
> +  * we fall through the normal path.

Can't mm_fault_error() be called inside handle_speculative_fault()?
Falling through the normal page fault path again just to raise a
signal seems overkill. Looking into mm_fault_error(), it seems they
are different for x86 and powerpc.

X86:

mm_fault_error(struct pt_regs *regs, unsigned long error_code,
   unsigned long address, struct vm_area_struct *vma,
   unsigned int fault)

powerpc:

mm_fault_error(struct pt_regs *regs, unsigned long addr, int fault)

Even in case of X86, I guess we would have reference to the faulting
VMA (after the SRCU search) which can be used to call this function
directly.



Re: [PATCH v2 20/20] powerpc/mm: Add speculative page fault

2017-08-21 Thread Anshuman Khandual
On 08/18/2017 03:35 AM, Laurent Dufour wrote:
> This patch enable the speculative page fault on the PowerPC
> architecture.
> 
> This will try a speculative page fault without holding the mmap_sem,
> if it returns with WM_FAULT_RETRY, the mmap_sem is acquired and the

s/WM_FAULT_RETRY/VM_FAULT_RETRY/

> traditional page fault processing is done.
> 
> Support is only provide for BOOK3S_64 currently because:
> - require CONFIG_PPC_STD_MMU because checks done in
>   set_access_flags_filter()

What checks are done in set_access_flags_filter() ? We are just
adding the code block in do_page_fault().


> - require BOOK3S because we can't support for book3e_hugetlb_preload()
>   called by update_mmu_cache()
> 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  5 +
>  arch/powerpc/mm/fault.c  | 30 
> +++-
>  2 files changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 818a58fc3f4f..897f8b9f67e6 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -313,6 +313,11 @@ extern unsigned long pci_io_base;
>  /* Advertise support for _PAGE_SPECIAL */
>  #define __HAVE_ARCH_PTE_SPECIAL
>  
> +/* Advertise that we call the Speculative Page Fault handler */
> +#if defined(CONFIG_PPC_BOOK3S_64)
> +#define __HAVE_ARCH_CALL_SPF
> +#endif
> +
>  #ifndef __ASSEMBLY__
>  
>  /*
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 4c422632047b..7b3cc4c30eab 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -291,9 +291,36 @@ int do_page_fault(struct pt_regs *regs, unsigned long 
> address,
>   if (is_write && is_user)
>   store_update_sp = store_updates_sp(regs);
>  
> - if (is_user)
> + if (is_user) {
>   flags |= FAULT_FLAG_USER;
>  
> +#if defined(__HAVE_ARCH_CALL_SPF)
> + /* let's try a speculative page fault without grabbing the
> +  * mmap_sem.
> +  */
> +
> + /*
> +  * flags is set later based on the VMA's flags, for the common
> +  * speculative service, we need some flags to be set.
> +  */
> + if (is_write)
> + flags |= FAULT_FLAG_WRITE;
> +
> + fault = handle_speculative_fault(mm, address, flags);
> + if (!(fault & VM_FAULT_RETRY || fault & VM_FAULT_ERROR)) {
> + perf_sw_event(PERF_COUNT_SW_SPF_DONE, 1,
> +   regs, address);
> + goto done;

Why should we retry with the classical page fault on VM_FAULT_ERROR?
We should always return VM_FAULT_RETRY in case there is a clear
collision somewhere which requires a retry with the classical method,
and return VM_FAULT_ERROR in cases where we know that it cannot
be retried and must fail for good. Should not handle_speculative_fault()
be changed to accommodate this?

> + }
> +
> + /*
> +  * Resetting flags since the following code assumes
> +  * FAULT_FLAG_WRITE is not set.
> +  */
> + flags &= ~FAULT_FLAG_WRITE;
> +#endif /* defined(__HAVE_ARCH_CALL_SPF) */

Setting and resetting of FAULT_FLAG_WRITE seems confusing. Why do you
say that some flags need to be set for the handle_speculative_fault()
function? Could you elaborate on this?



Re: [PATCH v2 00/20] Speculative page faults

2017-08-21 Thread Anshuman Khandual
On 08/18/2017 03:34 AM, Laurent Dufour wrote:
> This is a port on kernel 4.13 of the work done by Peter Zijlstra to
> handle page fault without holding the mm semaphore [1].
> 
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes since the page fault handler will not wait for other threads'
> memory layout changes to be done, assuming that this change is done in
> another part of the process's memory space. This type of page fault is
> named a speculative page fault. If the speculative page fault fails
> because a concurrency is detected or because the underlying PMD or PTE
> tables are not yet allocated, its processing fails and a classic page
> fault is then tried.
> 
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, so the VMA list is now managed using
> SRCU allowing lockless walking. The only impact would be the deferred file
> dereferencing in the case of a file mapping, since the file pointer is
> released once the SRCU cleaning is done.  This patch relies on the change
> done recently by Paul McKenney in SRCU which now runs a callback per CPU
> instead of per SRCU structure [1].
> 
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to fast check for parallel changes in progress and to abort the
> speculative page fault in that case.
> 
> Once the VMA is found, the speculative page fault handler would check for
> the VMA's attributes to verify whether the page fault can be handled
> correctly or not. Thus the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried.  VMA sequence locks are added when VMA attributes which are
> checked during the page fault are modified.
> 
> When the PTE is fetched, the VMA is checked to see if it has been changed.
> Once the page table is locked, the VMA is valid, and any other change
> touching this PTE will need to lock the page table, so no parallel change
> is possible at this time.
> 
> Compared to Peter's initial work, this series introduces a spin_trylock
> when dealing with speculative page faults. This is required to avoid a
> deadlock when handling a page fault while a TLB invalidate is requested
> by another CPU holding the PTE. Another change is due to a lock dependency
> issue with mapping->i_mmap_rwsem.
> 
> In addition, some VMA field values which are used once the PTE is unlocked
> at the end of the page fault path are saved into the vm_fault structure so
> that the values matching the VMA at the time the PTE was locked are used.
> 
> This series builds on top of v4.13-rc5 and is functional on x86 and
> PowerPC.
> 
> Tests have been made using a large commercial in-memory database on a
> PowerPC system with 752 CPUs using RFC v5. The results are very encouraging
> since the loading of the 2TB database was faster by 14% with the
> speculative page fault.
> 

You specifically mention loading because most of the page faults will
happen at that time, and then the working set will settle down with
very few page faults thereafter? That means unless there is another
wave of page faults we won't notice a performance improvement during
the runtime.

> Using the ebizzy test [3], which spawns a lot of threads, the results are good
> when running on both a large and a small system. When using kernbench, the

The performance improvements are greater as there is a lot of creation
and destruction of anon mappings, which generates a constant flow of page
faults to be handled.

> results are quite similar, which is expected as not that many multi-threaded
> processes are involved. But there is no performance degradation either,
> which is good.

If we compile with 'make -j N' there would be a lot of threads, but I
guess the problem is that SPF does not support handling file mappings,
IIUC, which limits the performance improvement for some workloads.

> 
> --
> Benchmarks results
> 
> Note these tests have been made on top of 4.13-rc3 with the following patch
> from Paul McKenney applied: 
>  "srcu: Provide ordering for CPU not involved in grace period" [5]

Is this patch an improvement for SRCU, which we are using for walking VMAs?

> 
> Ebizzy:
> ---
> The test counts the number of records per second it can manage; the
> higher the better. I ran it like this: 'ebizzy -mTRp'. To get consistent
> results I repeated the test 100 times and measured the average result, mean
> deviation, max and min.
> 
> - 16 CPUs x86 VM
> Records/s 4.13-rc54.13-rc5-spf
> Average   11350.2921760.36
> Mean deviation396.56  881.40
> Max   13773 

Re: RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

2017-08-21 Thread Nicholas Piggin
On Mon, 21 Aug 2017 10:52:58 +1000
Nicholas Piggin  wrote:

> On Sun, 20 Aug 2017 14:14:29 -0700
> "Paul E. McKenney"  wrote:
> 
> > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:  
> > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:
> > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > Nicholas Piggin  wrote:
> > > > 
> > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > "Paul E. McKenney"  wrote:
> > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > Thomas, John, am I misinterpreting the timer trace event messages?  
> > > > > > 
> > > > > 
> > > > > So I did some digging, and what you find is that rcu_sched seems
> > > > > to do a simple schedule_timeout(1) and just goes out to lunch for
> > > > > many seconds. The process_timeout timer never fires (when it
> > > > > finally does wake after one of these events, it usually removes
> > > > > the timer with del_timer_sync).
> > > > > 
> > > > > So this patch seems to fix it. Testing, comments welcome.
> > > > 
> > > > Okay this had a problem of trying to forward the timer from a timer
> > > > callback function.
> > > > 
> > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > only when needed, it probably saves a load).
> > > 
> > > Giving this one a whirl!
> > 
> > No joy here, but then again there are other reasons to believe that I
> > am seeing a different bug than Dave and Jonathan are.
> > 
> > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > not statistically different from what I see without either patch.
> > 
> > But no statistical difference compared to without patch, and I still
> > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > Hmmm...  I am also seeing that without any of your patches.  Might
> > be hypervisor preemption, I guess.  
> 
> Okay it makes the warnings go away for me, but I'm just booting then
> leaving the system idle. You're doing some CPU hotplug activity?

Okay found a bug in the patch (it was not forwarding properly before
adding the first timer after an idle) and a few other concerns.

There's still a problem of a timer function doing a mod_timer from
within expire_timers. It can't forward the base, which might currently
be quite a way behind. I *think* after we close these gaps and get
timely wakeups for timers on there, it should not get too far behind
for standard timers.

Deferrable is a different story. Firstly it has no idle tracking so we
never forward it. Even if we wanted to, we can't do it reliably because
it could contain timers way behind the base. They are "deferrable", so
you get what you pay for, but this still means there's a window where
you can add a deferrable timer and get a far later expiry than you
asked for despite the CPU never going idle after you added it.

All these problems would seem to go away if mod_timer just queued up
the timer to a single list on the base then pushed them into the
wheel during your wheel processing softirq... Although maybe you end
up with excessive passes over a big queue of timers. Anyway that
wouldn't be suitable for 4.13 even if it could work.

I'll send out an updated minimal fix after some more testing...

Thanks,
Nick