[PATCH -v2 6/6] locking,netfilter: Fix nf_conntrack_lock()

2016-05-26 Thread Peter Zijlstra
nf_conntrack_lock{,_all}() is borken as it misses a bunch of memory
barriers to order the whole global vs local locks scheme.

Even x86 (and other TSO archs) are affected.

Signed-off-by: Peter Zijlstra (Intel) 
---
 net/netfilter/nf_conntrack_core.c |   18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -83,6 +83,12 @@ void nf_conntrack_lock(spinlock_t *lock)
spin_lock(lock);
while (unlikely(nf_conntrack_locks_all)) {
spin_unlock(lock);
+
+   /* Order the nf_conntrack_locks_all load vs the
+    * spin_unlock_wait() loads below, to ensure that
+    * nf_conntrack_locks_all_lock is indeed held.
+    */
+   smp_rmb(); /* spin_lock(locks_all) */
spin_unlock_wait(&nf_conntrack_locks_all_lock);
spin_lock(lock);
}
@@ -128,6 +134,12 @@ static void nf_conntrack_all_lock(void)
spin_lock(&nf_conntrack_locks_all_lock);
nf_conntrack_locks_all = true;
 
+   /* Order the above store against the spin_unlock_wait() loads
+    * below, such that if nf_conntrack_lock() observes
+    * nf_conntrack_locks_all we must observe nf_conntrack_locks[]
+    * held.
+    */
+   smp_mb(); /* spin_lock(locks_all) */
+
for (i = 0; i < CONNTRACK_LOCKS; i++) {
spin_unlock_wait(&nf_conntrack_locks[i]);
}
@@ -135,7 +147,11 @@ static void nf_conntrack_all_lock(void)
 
 static void nf_conntrack_all_unlock(void)
 {
-   nf_conntrack_locks_all = false;
+   /* All prior stores must be complete before we clear locks_all.
+    * Otherwise nf_conntrack_lock() might observe the false value but
+    * not the entire critical section.
+    */
+   smp_store_release(&nf_conntrack_locks_all, false);
spin_unlock(&nf_conntrack_locks_all_lock);
 }
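
For readers skimming the thread: a minimal sketch of the two-level scheme
the barriers above order. The names are simplified stand-ins (local_lock
for nf_conntrack_locks[i], all_lock for nf_conntrack_locks_all_lock,
locks_all for nf_conntrack_locks_all); it restates the patched logic and
is not a drop-in replacement.

#include <linux/compiler.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(local_lock);
static DEFINE_SPINLOCK(all_lock);
static bool locks_all;

static void take_local(void)            /* cf. nf_conntrack_lock() */
{
        spin_lock(&local_lock);
        while (unlikely(READ_ONCE(locks_all))) {
                spin_unlock(&local_lock);
                /* the locks_all load must not pass the unlock_wait loads */
                smp_rmb();
                spin_unlock_wait(&all_lock);
                spin_lock(&local_lock);
        }
}

static void take_all(void)              /* cf. nf_conntrack_all_lock() */
{
        spin_lock(&all_lock);
        locks_all = true;
        /* publish locks_all before sampling the per-bucket locks */
        smp_mb();
        spin_unlock_wait(&local_lock);
}

static void release_all(void)           /* cf. nf_conntrack_all_unlock() */
{
        /* all prior stores complete before locks_all is cleared */
        smp_store_release(&locks_all, false);
        spin_unlock(&all_lock);
}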
 




[PATCH -v2 4/6] locking, arch: Update spin_unlock_wait()

2016-05-26 Thread Peter Zijlstra
This patch updates/fixes all spin_unlock_wait() implementations.

The update is in semantics; where it previously was only a control
dependency, we now upgrade to a full load-acquire to match the
store-release from the spin_unlock() we waited on. This ensures that
when spin_unlock_wait() returns, we're guaranteed to observe the full
critical section we waited on.

This fixes a number of spin_unlock_wait() users that (not
unreasonably) rely on this.

I also fixed a number of ticket lock versions to only wait on the
current lock holder, instead of for a full unlock, as this is
sufficient.

Furthermore; again for ticket locks; I added an smp_rmb() in between
the initial ticket load and the spin loop testing the current value
because I could not convince myself the address dependency is
sufficient, esp. if the loads are of different sizes.

I'm more than happy to remove this smp_rmb() again if people are
certain the address dependency does indeed work as expected.
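
To illustrate the semantic change, a hedged two-CPU sketch (editor's
illustration, not from the patch; assume the waiter has already observed
the owner holding the lock):

#include <linux/bug.h>
#include <linux/compiler.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(l);
static int x;

static void owner(void)                 /* CPU0 */
{
        spin_lock(&l);
        x = 1;                          /* critical section store */
        spin_unlock(&l);                /* store-release */
}

static void waiter(void)                /* CPU1 */
{
        spin_unlock_wait(&l);           /* now a load-acquire */
        /* with only a control dependency this load could legally
         * return 0; with the acquire upgrade it must return 1 */
        BUG_ON(READ_ONCE(x) != 1);
}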

Cc: r...@twiddle.net
Cc: vgu...@synopsys.com
Cc: li...@armlinux.org.uk
Cc: real...@gmail.com
Cc: r...@codeaurora.org
Cc: tony.l...@intel.com
Cc: james.ho...@imgtec.com
Cc: r...@linux-mips.org
Cc: dhowe...@redhat.com
Cc: j...@parisc-linux.org
Cc: m...@ellerman.id.au
Cc: schwidef...@de.ibm.com
Cc: ys...@users.sourceforge.jp
Cc: da...@davemloft.net
Cc: cmetc...@mellanox.com
Cc: ch...@zankel.net
Signed-off-by: Peter Zijlstra (Intel) 
---
 arch/alpha/include/asm/spinlock.h|7 +--
 arch/arc/include/asm/spinlock.h  |7 +--
 arch/arm/include/asm/spinlock.h  |   18 --
 arch/blackfin/include/asm/spinlock.h |3 +--
 arch/hexagon/include/asm/spinlock.h  |8 ++--
 arch/ia64/include/asm/spinlock.h |2 ++
 arch/m32r/include/asm/spinlock.h |7 +--
 arch/metag/include/asm/spinlock.h|   11 +--
 arch/mips/include/asm/spinlock.h |   18 --
 arch/mn10300/include/asm/spinlock.h  |6 +-
 arch/parisc/include/asm/spinlock.h   |9 +++--
 arch/powerpc/include/asm/spinlock.h  |6 --
 arch/s390/include/asm/spinlock.h |3 +--
 arch/sh/include/asm/spinlock.h   |7 +--
 arch/sparc/include/asm/spinlock_32.h |6 --
 arch/sparc/include/asm/spinlock_64.h |9 ++---
 arch/tile/lib/spinlock_32.c  |4 
 arch/tile/lib/spinlock_64.c  |4 
 arch/xtensa/include/asm/spinlock.h   |7 +--
 include/asm-generic/qspinlock.h  |3 +--
 include/linux/spinlock_up.h  |9 ++---
 21 files changed, 117 insertions(+), 37 deletions(-)

--- a/arch/alpha/include/asm/spinlock.h
+++ b/arch/alpha/include/asm/spinlock.h
@@ -13,8 +13,11 @@
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 #define arch_spin_is_locked(x) ((x)->lock != 0)
-#define arch_spin_unlock_wait(x) \
-   do { cpu_relax(); } while ((x)->lock)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+   smp_cond_load_acquire(&lock->lock, !VAL);
+}
 
 static inline int arch_spin_value_unlocked(arch_spinlock_t lock)
 {
--- a/arch/arc/include/asm/spinlock.h
+++ b/arch/arc/include/asm/spinlock.h
@@ -15,8 +15,11 @@
 
 #define arch_spin_is_locked(x) ((x)->slock != __ARCH_SPIN_LOCK_UNLOCKED__)
 #define arch_spin_lock_flags(lock, flags)  arch_spin_lock(lock)
-#define arch_spin_unlock_wait(x) \
-   do { while (arch_spin_is_locked(x)) cpu_relax(); } while (0)
+
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+   smp_cond_load_acquire(&lock->slock, !VAL);
+}
 
 #ifdef CONFIG_ARC_HAS_LLSC
 
--- a/arch/arm/include/asm/spinlock.h
+++ b/arch/arm/include/asm/spinlock.h
@@ -50,8 +50,22 @@ static inline void dsb_sev(void)
  * memory.
  */
 
-#define arch_spin_unlock_wait(lock) \
-   do { while (arch_spin_is_locked(lock)) cpu_relax(); } while (0)
+static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
+{
+   u16 owner = READ_ONCE(lock->tickets.owner);
+
+   smp_rmb();
+   for (;;) {
+   arch_spinlock_t tmp = READ_ONCE(*lock);
+
+   if (tmp.tickets.owner == tmp.tickets.next ||
+   tmp.tickets.owner != owner)
+   break;
+
+   wfe();
+   }
+   smp_acquire__after_ctrl_dep();
+}
 
 #define arch_spin_lock_flags(lock, flags) arch_spin_lock(lock)
 
--- a/arch/blackfin/include/asm/spinlock.h
+++ b/arch/blackfin/include/asm/spinlock.h
@@ -48,8 +48,7 @@ static inline void arch_spin_unlock(arch
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
-   while (arch_spin_is_locked(lock))
-   cpu_relax();
+   smp_cond_load_acquire(&lock->lock, !VAL);
 }
 
 static inline int arch_read_can_lock(arch_rwlock_t *rw)
--- a/arch/hexagon/include/asm/spinlock.h
+++ b/arch/hexagon/include/asm/spinlock.h
@@ -176,8 +176,12 @@ static inline unsigned int arch_spin_try
  * SMP spinlocks are intended to allow only a single CPU at the lock
  */
 #define arch_spin_lock_flags(lock, flags) 

[PATCH -v2 3/6] locking: Introduce smp_acquire__after_ctrl_dep

2016-05-26 Thread Peter Zijlstra
Introduce smp_acquire__after_ctrl_dep(); this construct is not
uncommon, but the lack of this barrier is.

Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/compiler.h |   15 ++-
 ipc/sem.c|   14 ++
 2 files changed, 12 insertions(+), 17 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,6 +305,15 @@ static __always_inline void __write_once
 })
 
 /**
+ * smp_acquire__after_ctrl_dep() - Provide ACQUIRE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional RMB
+ * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
+ * aka. (load)-ACQUIRE.
+ */
+#define smp_acquire__after_ctrl_dep()  smp_rmb()
+
+/**
  * cmpwait - compare and wait for a variable to change
  * @ptr: pointer to the variable to wait on
  * @val: the value it should change from
@@ -331,10 +340,6 @@ static __always_inline void __write_once
  *
  * Due to C lacking lambda expressions we load the value of *ptr into a
  * pre-named variable @VAL to be used in @cond.
- *
- * The control dependency provides a LOAD->STORE order, the additional RMB
- * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
- * aka. ACQUIRE.
  */
 #ifndef smp_cond_load_acquire
 #define smp_cond_load_acquire(ptr, cond_expr) ({   \
@@ -346,7 +351,7 @@ static __always_inline void __write_once
break;  \
cmpwait(__PTR, VAL);\
}   \
-   smp_rmb(); /* ctrl + rmb := acquire */  \
+   smp_acquire__after_ctrl_dep();  \
VAL;\
 })
 #endif
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -260,16 +260,6 @@ static void sem_rcu_free(struct rcu_head
 }
 
 /*
- * spin_unlock_wait() and !spin_is_locked() are not memory barriers, they
- * are only control barriers.
- * The code must pair with spin_unlock(&sem->lock) or
- * spin_unlock(&sem_perm.lock), thus just the control barrier is insufficient.
- *
- * smp_rmb() is sufficient, as writes cannot pass the control barrier.
- */
-#define ipc_smp_acquire__after_spin_is_unlocked()  smp_rmb()
-
-/*
  * Wait until all currently ongoing simple ops have completed.
  * Caller must own sem_perm.lock.
  * New simple ops cannot start, because simple ops first check
@@ -292,7 +282,7 @@ static void sem_wait_array(struct sem_ar
sem = sma->sem_base + i;
spin_unlock_wait(&sem->lock);
}
-   ipc_smp_acquire__after_spin_is_unlocked();
+   smp_acquire__after_ctrl_dep();
 }
 
 /*
@@ -350,7 +340,7 @@ static inline int sem_lock(struct sem_ar
 *  complex_count++;
 *  spin_unlock(sem_perm.lock);
 */
-   ipc_smp_acquire__after_spin_is_unlocked();
+   smp_acquire__after_ctrl_dep();
 
/*
 * Now repeat the test of complex_count:





Re: [PATCH] soc/tegra: pmc: Fix "scheduling while atomic"

2016-05-26 Thread Jon Hunter

On 26/05/16 12:42, Dmitry Osipenko wrote:
> On 26.05.2016 11:42, Jon Hunter wrote:
>>
>> On 25/05/16 19:51, Dmitry Osipenko wrote:
>>> On 25.05.2016 18:09, Jon Hunter wrote:
>>
>> ...
>>
 If you are able to reproduce this on v3.18, then it would be good if
 you
 could trace the CCF calls around this WARNING to see what is causing
 the
 contention.
>>>
>>> I managed to reproduce it with some CCF "tracing".
>>> Full kmsg log is here: https://bpaste.net/show/d8ab7b7534b7
>>>
>>> Looks like CPU freq governor thread yields during clk_set_rate() and
>>> then CPU idle kicks in, taking the same mutex.
>>
>> On the surface that sounds odd to me, but without understanding the
>> details, I guess I don't know if this is a valid thing to be doing or
>> even how that actually works!
>>
> 
> The reason of that happening should be that I'm using clk PRE/POST rate
> change notifiers in my DVFS driver that takes other mutexes and they
> could be locked, causing schedule. I haven't mentioned it before, sorry.

OK, but I am not sure how these "other mutexes" would be relevant here
without any more details.

> From drivers/clk/clk.c:
> 
> static struct task_struct *prepare_owner;
> 
> ...
> 
> /***   locking ***/
> static void clk_prepare_lock(void)
> {
> if (!mutex_trylock(&prepare_lock)) {
> if (prepare_owner == current) {
> prepare_refcnt++;
> return;
> }
> mutex_lock(&prepare_lock);
> }
> 
> You can see that it would lock the mutex if prepare_owner != current, in
> my case it's idle thread != interactive gov. thread.

Right, but that would imply that someone else is actively doing
something with a clock. However, if we are entering LP2, then that
implies that all CPUs are idle and so I still don't understand the
scenario where this would be locked in that case. Maybe there is
something I am overlooking here?

>>> However, cpufreq_interactive governor is android specific governor and
>>> isn't in upstream kernel yet. Quick googling shows that recent
>>> "upstreaming" patch uses same cpufreq_interactive_speedchange_task:
>>> https://lkml.org/lkml/2016/5/20/41
>>
>> Do you know if this version they are upstreaming could also yield during
>> the clk_set_rate()?
>>
> 
> I think it should be assumed that any clk_set_rate() potentially could.
> Please correct me if I'm wrong.
> 
>>> I'm not aware of other possibility to reproduce this issue, it needs
>>> some CCF interaction from a separate task. So the current upstream
>>> kernel shouldn't be affected, I guess.
>>
>> What still does not make sense to me is why any frequency changes have
>> not completed before we attempt to enter the LP2 state?
>>
> Why not? I don't see any CPUIDLE <-> CPUFREQ interlocking. Do you think
> it could be harmful somehow?

Like I said before, I still don't understand that scenario that is
causing this and without being able to fully understand it, I have no
idea what the exact problem we are trying to fix here is.

Cheers
Jon

-- 
nvpublic
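
For context, the pattern quoted above makes the prepare mutex recursive
by tracking the owning task; a reduced sketch of the acquire side
(editor's illustration, trimmed from drivers/clk/clk.c):

#include <linux/bug.h>
#include <linux/mutex.h>
#include <linux/sched.h>

static DEFINE_MUTEX(prepare_lock);
static struct task_struct *prepare_owner;
static int prepare_refcnt;

static void prepare_lock_sketch(void)
{
        if (!mutex_trylock(&prepare_lock)) {
                if (prepare_owner == current) {
                        prepare_refcnt++;       /* same task: recurse */
                        return;
                }
                mutex_lock(&prepare_lock);      /* other task owns it: sleep */
        }
        WARN_ON_ONCE(prepare_owner != NULL);
        prepare_owner = current;
        prepare_refcnt = 1;
}

This is why the idle thread blocks in Dmitry's trace: it is a different
task from the governor thread that holds prepare_lock mid-clk_set_rate().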



Re: [PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

2016-05-26 Thread Tetsuo Handa
Michal Hocko wrote:
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5bb2f7698ad7..0e33e912f7e4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
>   task_unlock(victim);
>  
>   /*
> +  * skip expensive iterations over all tasks if we know that there
> +  * are no users outside of threads in the same thread group
> +  */
> + if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
> + goto oom_reap;

Is this really safe? Isn't it possible that victim thread's thread group has
more than atomic_read(&mm->mm_users) threads which are past exit_mm() and
blocked
at exit_task_work() which are before __exit_signal() from release_task() from
exit_notify()?

> +
> + /*
>* Kill all user processes sharing victim->mm in other thread groups, if
>* any.  They don't get access to memory reserves, though, to avoid
>* depletion of all memory.  This prevents mm->mmap_sem livelock when an



[PATCH V2] fs: ubifs: Replace kmem_cache_alloc/memset with kmem_cache_zalloc

2016-05-26 Thread Salah Triki
Code is more readable when using kmem_cache_zalloc instead of 
kmem_cache_alloc/memset.

Signed-off-by: Salah Triki 
---
 fs/ubifs/super.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 7034995..f509200 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -262,12 +262,10 @@ static struct inode *ubifs_alloc_inode(struct super_block *sb)
 {
struct ubifs_inode *ui;
 
-   ui = kmem_cache_alloc(ubifs_inode_slab, GFP_NOFS);
+   ui = kmem_cache_zalloc(ubifs_inode_slab, GFP_NOFS);
if (!ui)
return NULL;
 
-   memset((void *)ui + sizeof(struct inode), 0,
-  sizeof(struct ubifs_inode) - sizeof(struct inode));
mutex_init(&ui->ui_mutex);
spin_lock_init(&ui->ui_lock);
return &ui->vfs_inode;
-- 
1.9.1
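
One subtlety worth double-checking here (editor's note, not raised in the
patch): the two forms are not byte-for-byte equivalent.

/* old: zero only the ubifs-specific tail, leaving struct inode alone */
memset((void *)ui + sizeof(struct inode), 0,
       sizeof(struct ubifs_inode) - sizeof(struct inode));

/* new: kmem_cache_zalloc() zeroes the whole object, struct inode
 * included; since ubifs_inode_slab is created with a constructor that
 * runs inode_init_once(&ui->vfs_inode), zeroing can wipe
 * ctor-initialized state, so the equivalence deserves scrutiny */
ui = kmem_cache_zalloc(ubifs_inode_slab, GFP_NOFS);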




[PATCH -v2 1/6] locking: Replace smp_cond_acquire with smp_cond_load_acquire

2016-05-26 Thread Peter Zijlstra
This new form allows using hardware assisted waiting.

Requested-by: Will Deacon 
Suggested-by: Linus Torvalds 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/compiler.h   |   25 +++--
 kernel/locking/qspinlock.c |   12 ++--
 kernel/sched/core.c|8 
 kernel/sched/sched.h   |2 +-
 kernel/smp.c   |2 +-
 5 files changed, 31 insertions(+), 18 deletions(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,21 +305,34 @@ static __always_inline void __write_once
 })
 
 /**
- * smp_cond_acquire() - Spin wait for cond with ACQUIRE ordering
+ * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
+ * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
  *
  * Equivalent to using smp_load_acquire() on the condition variable but employs
  * the control dependency of the wait to reduce the barrier on many platforms.
  *
+ * Due to C lacking lambda expressions we load the value of *ptr into a
+ * pre-named variable @VAL to be used in @cond.
+ *
  * The control dependency provides a LOAD->STORE order, the additional RMB
  * provides LOAD->LOAD order, together they provide LOAD->{LOAD,STORE} order,
  * aka. ACQUIRE.
  */
-#define smp_cond_acquire(cond) do {\
-   while (!(cond)) \
-   cpu_relax();\
-   smp_rmb(); /* ctrl + rmb := acquire */  \
-} while (0)
+#ifndef smp_cond_load_acquire
+#define smp_cond_load_acquire(ptr, cond_expr) ({   \
+   typeof(ptr) __PTR = (ptr);  \
+   typeof(*ptr) VAL;   \
+   for (;;) {  \
+   VAL = READ_ONCE(*__PTR);\
+   if (cond_expr)  \
+   break;  \
+   cpu_relax();\
+   }   \
+   smp_rmb(); /* ctrl + rmb := acquire */  \
+   VAL;\
+})
+#endif
 
 #endif /* __KERNEL__ */
 
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -358,7 +358,7 @@ void queued_spin_lock_slowpath(struct qs
 * sequentiality; this is because not all clear_pending_set_locked()
 * implementations imply full barriers.
 */
-   smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));
+   smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));
 
/*
 * take ownership and clear the pending bit.
@@ -434,7 +434,7 @@ void queued_spin_lock_slowpath(struct qs
 *
 * The PV pv_wait_head_or_lock function, if active, will acquire
 * the lock and return a non-zero value. So we have to skip the
-* smp_cond_acquire() call. As the next PV queue head hasn't been
+* smp_cond_load_acquire() call. As the next PV queue head hasn't been
 * designated yet, there is no way for the locked value to become
 * _Q_SLOW_VAL. So both the set_locked() and the
 * atomic_cmpxchg_relaxed() calls will be safe.
@@ -445,7 +445,7 @@ void queued_spin_lock_slowpath(struct qs
if ((val = pv_wait_head_or_lock(lock, node)))
goto locked;
 
-   smp_cond_acquire(!((val = atomic_read(&lock->val)) & _Q_LOCKED_PENDING_MASK));
+   val = smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_PENDING_MASK));
 
 locked:
/*
@@ -465,9 +465,9 @@ void queued_spin_lock_slowpath(struct qs
break;
}
/*
-* The smp_cond_acquire() call above has provided the necessary
-* acquire semantics required for locking. At most two
-* iterations of this loop may be ran.
+* The smp_cond_load_acquire() call above has provided the
+* necessary acquire semantics required for locking. At most
+* two iterations of this loop may be ran.
 */
old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
if (old == val)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1843,7 +1843,7 @@ static void ttwu_queue(struct task_struc
  * chain to provide order. Instead we do:
  *
  *   1) smp_store_release(X->on_cpu, 0)
- *   2) smp_cond_acquire(!X->on_cpu)
+ *   2) smp_cond_load_acquire(!X->on_cpu)
  *
  * Example:
  *
@@ -1854,7 +1854,7 @@ static void ttwu_queue(struct task_struc
  *   sched-out X
  *   smp_store_release(X->on_cpu, 0);
  *
- *smp_cond_acquire(!X->on_cpu);
+ *smp_cond_load_acquire(&X->on_cpu, !VAL);
  *X->state = WAKING
  *set_task_cpu(X,2)
  *
@@ -1880,7 +1880,7 @@ static void ttwu_queue(struct 
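
A hedged usage sketch of the new primitive, mirroring the ttwu comment
above (editor's illustration):

#include <linux/compiler.h>
#include <linux/sched.h>

static void wait_for_sched_out(struct task_struct *p)
{
        /*
         * Spin (with hardware assist where available) until p->on_cpu
         * is 0, with ACQUIRE order; pairs with the
         * smp_store_release(&p->on_cpu, 0) done at sched-out. VAL is
         * the macro-provided name for the freshly loaded value.
         */
        smp_cond_load_acquire(&p->on_cpu, !VAL);
}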

[PATCH -v2 2/6] locking: Introduce cmpwait()

2016-05-26 Thread Peter Zijlstra
Provide the cmpwait() primitive, which will 'spin' wait for a variable
to change and use it to implement smp_cond_load_acquire().

This primitive can be implemented with hardware assist on some
platforms (ARM64, x86).

Suggested-by: Will Deacon 
Signed-off-by: Peter Zijlstra (Intel) 
---
 include/linux/compiler.h |   19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -305,6 +305,23 @@ static __always_inline void __write_once
 })
 
 /**
+ * cmpwait - compare and wait for a variable to change
+ * @ptr: pointer to the variable to wait on
+ * @val: the value it should change from
+ *
+ * A simple construct that waits for a variable to change from a known
+ * value; some architectures can do this in hardware.
+ */
+#ifndef cmpwait
+#define cmpwait(ptr, val) do { \
+   typeof (ptr) __ptr = (ptr); \
+   typeof (val) __val = (val); \
+   while (READ_ONCE(*__ptr) == __val)  \
+   cpu_relax();\
+} while (0)
+#endif
+
+/**
  * smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
  * @ptr: pointer to the variable to wait on
  * @cond: boolean expression to wait for
@@ -327,7 +344,7 @@ static __always_inline void __write_once
VAL = READ_ONCE(*__PTR);\
if (cond_expr)  \
break;  \
-   cpu_relax();\
+   cmpwait(__PTR, VAL);\
}   \
smp_rmb(); /* ctrl + rmb := acquire */  \
VAL;\
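
A hedged usage sketch (editor's illustration; 'state' is a made-up
variable). Callers re-check the condition in a loop, since an
event-based implementation may return before *ptr actually changed:

#include <linux/compiler.h>

static int state;

static int wait_for_change(int seen)
{
        int v;

        while ((v = READ_ONCE(state)) == seen)
                cmpwait(&state, seen);

        return v;
}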





[PATCH -v2 0/6] spin_unlock_wait borkage

2016-05-26 Thread Peter Zijlstra
This version rewrites all spin_unlock_wait() implementations to provide the
acquire order against the spin_unlock() release we wait on, ensuring we can
fully observe the critical section we waited on.

It pulls in the smp_cond_acquire() rewrite because it introduces a lot more
users of it. All simple spin_unlock_wait implementations end up being an
smp_cond_load_acquire().

And while this still introduces smp_acquire__after_ctrl_dep() it keeps its
usage contained to a few sites.

Please consider.. carefully.




[PATCH -v2 5/6] locking: Update spin_unlock_wait users

2016-05-26 Thread Peter Zijlstra
With the modified semantics of spin_unlock_wait() a number of
explicit barriers can be removed. Also update the comment for the
do_exit() use case, as that was somewhat stale/obscure.

Signed-off-by: Peter Zijlstra (Intel) 
---
 ipc/sem.c  |1 -
 kernel/exit.c  |8 ++--
 kernel/task_work.c |1 -
 3 files changed, 6 insertions(+), 4 deletions(-)

--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -282,7 +282,6 @@ static void sem_wait_array(struct sem_ar
sem = sma->sem_base + i;
spin_unlock_wait(&sem->lock);
}
-   smp_acquire__after_ctrl_dep();
 }
 
 /*
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -776,10 +776,14 @@ void do_exit(long code)
 
exit_signals(tsk);  /* sets PF_EXITING */
/*
-* tsk->flags are checked in the futex code to protect against
-* an exiting task cleaning up the robust pi futexes.
+* Ensure that all new tsk->pi_lock acquisitions must observe
+* PF_EXITING. Serializes against futex.c:attach_to_pi_owner().
 */
smp_mb();
+   /*
+* Ensure that we must observe the pi_state in exit_mm() ->
+* mm_release() -> exit_pi_state_list().
+*/
raw_spin_unlock_wait(&tsk->pi_lock);
 
if (unlikely(in_atomic())) {
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -108,7 +108,6 @@ void task_work_run(void)
 * fail, but it can play with *work and other entries.
 */
raw_spin_unlock_wait(&task->pi_lock);
-   smp_mb();
 
do {
next = work->next;
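
The handshake the new do_exit() comments describe, spelled out as an
editor's sketch (the futex side paraphrases attach_to_pi_owner(); it is
not that function verbatim):

#include <linux/errno.h>
#include <linux/sched.h>

static void exit_side(struct task_struct *tsk)
{
        tsk->flags |= PF_EXITING;               /* done in exit_signals() */
        smp_mb();                               /* order vs later pi_lock takers */
        raw_spin_unlock_wait(&tsk->pi_lock);    /* acquire: attached pi_state
                                                 * is now visible to exit_mm() */
}

static int futex_side(struct task_struct *p)
{
        raw_spin_lock_irq(&p->pi_lock);
        if (p->flags & PF_EXITING) {            /* guaranteed visible after mb */
                raw_spin_unlock_irq(&p->pi_lock);
                return -ESRCH;                  /* refuse to attach to dying task */
        }
        /* safe to attach pi_state: the exiting task will observe it */
        raw_spin_unlock_irq(&p->pi_lock);
        return 0;
}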





Re: [PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-26 Thread Eric Dumazet
On Thu, 2016-05-26 at 16:59 +0300, Vladimir Davydov wrote:
> On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> > On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > > ...
> > > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > > +struct pipe_buffer *buf)
> > > > > > > +{
> > > > > > > + struct page *page = buf->page;
> > > > > > > +
> > > > > > > + if (page_count(page) == 1) {
> > > > > > 
> > > > > > This looks racy : some cpu could have temporarily elevated page 
> > > > > > count.
> > > > > 
> > > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > > supposed to be called under pipe_lock. So, if we see a 
> > > > > pipe_buffer->page
> > > > > with refcount of 1 in ->steal, that means that we are the only its 
> > > > > user
> > > > > and it can't be spliced to another pipe.
> > > > > 
> > > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > > kmemcg related checks along the way, so it should be fine.
> > > > 
> > > > So you guarantee that no other cpu might have done
> > > > get_page_unless_zero() right before this test ?
> > > 
> > > Each pipe_buffer holds a reference to its page. If we find page's
> > > refcount to be 1 here, then it can be referenced only by our
> > > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > > because we hold pipe_lock, which rules out splice, and otherwise it's
> > > impossible to reach the page as it is not on lru. That said, I think I
> > > can guarantee that this should be safe.
> > 
> > I don't know kmemcg internal and pipe stuff so my comment might be
> > totally crap.
> > 
> > No one can guarantee that no CPU holds a reference to a page.
> > Look at get_page_unless_zero usecases.
> > 
> > 1. balloon_page_isolate
> > 
> > It can take a reference on a random page and then verify that the page
> > is a balloon page. Otherwise, just put it.
> > 
> > 2. page_idle_get_page
> > 
> > It has a PageLRU check but it's racy, so it can take a reference
> > on a random page and then verify within zone->lru_lock. If it's
> > not an LRU page, just put it.
> 
> Well, I see your concern now - even if a page is not on lru and we
> locked all structs pointing to it, it can always get accessed by pfn in
> a completely unrelated thread, like in examples you gave above. That's a
> fair point.
> 
> However, I still think that it's OK in case of pipe buffers. What can
> happen if somebody takes a transient reference to a pipe buffer page? At
> worst, we'll see page_count > 1 due to temporary ref and abort stealing,
> falling back on copying instead. That's OK, because stealing is not
> guaranteed. Can a function that takes a transient ref to page by pfn
> mistakenly assume that this is a page it's interested in? I don't think
> so, because this page has no marks on it except special _mapcount value,
> which should only be set on kmemcg pages.

Well, all this information deserves to be in the changelog.

Maybe in 6 months, this will be incredibly useful for bug hunting.

pipes can be used to exchange data (or pages) between processes in
different domains.

If kmemcg is not precise, this could be used by some attackers to force
some processes to consume all their budget and eventually not be able to
allocate new pages.
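
For reference, the check being debated, in its generic form (editor's
sketch modeled on generic_pipe_buf_steal(); the kmemcg variant under
review adds uncharge handling on top):

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/pipe_fs_i.h>

static int pipe_buf_steal_sketch(struct pipe_inode_info *pipe,
                                 struct pipe_buffer *buf)
{
        struct page *page = buf->page;

        /*
         * Called under pipe_lock. A transient pfn-walker reference can
         * only make page_count() larger, pushing us onto the copy path;
         * stealing is an optimization, never a guarantee.
         */
        if (page_count(page) == 1) {
                lock_page(page);
                return 0;               /* stolen: caller owns the page */
        }

        return 1;                       /* busy: caller falls back to copying */
}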






Re: [PATCH 01/23] all: syscall wrappers: add documentation

2016-05-26 Thread Catalin Marinas
On Wed, May 25, 2016 at 02:28:21PM -0700, David Miller wrote:
> From: Arnd Bergmann 
> Date: Wed, 25 May 2016 23:01:06 +0200
> 
> > On Wednesday, May 25, 2016 1:50:39 PM CEST David Miller wrote:
> >> From: Arnd Bergmann 
> >> Date: Wed, 25 May 2016 22:47:33 +0200
> >> 
> >> > If we use the normal calling conventions, we could remove these overrides
> >> > along with the respective special-case handling in glibc. None of them
> >> > look particularly performance-sensitive, but I could be wrong there.
> >> 
> >> You could set the lowest bit in the system call entry pointer to indicate
> >> the upper-half clears should be elided.
> > 
> > Right, but that would introduce an extra conditional branch in the syscall
> > hotpath, and likely eliminate the gains from passing the loff_t arguments
> > in a single register instead of a pair.
> 
> Ok, then, how much are you really gaining from avoiding a 'shift' and
> an 'or' to build the full 64-bit value?  3 cycles?  Maybe 4?

It's possible a few more cycles overall. Whether this is noticeable, I
can't really tell without some benchmarks (e.g. a getpid wrapper zeroing
top 32-bit of all possible 6 arguments, called in a loop).

On arm64 with ILP32 we have three types of syscalls w.r.t. parameter
width (I guess that's true for all other compat implementations):

1. User syscall definition with 32-bit arguments, kernel handling 32-bit
   arguments

2. User 32-bit arguments, kernel 64-bit arguments

3. User 64-bit arguments, kernel 64-bit arguments

For (1), the AArch64 ABI (AAPCS) allows us to ignore the garbage in the
top 32-bit of a 64-bit register as long as the callee has 32-bit
arguments (IOW, the generated code will use 32-bit Wn instead of 64-bit
Xn registers). In this case, zeroing the top 32-bit of all 6 arguments
is unnecessary.

In the 2nd case, we need sign or zero extension of 32-bit arguments. For
sign extension we would still need a wrapper as the generic one can only
zero-extend without knowing the underlying type. How many cases do we
have where sign extension is required (off_t is a signed type but does
it actually make sense as a negative value)? The __SC_WRAP and
COMPAT_SYSCALL_WRAP macros introduced by patches 3-5 in this series
handle such conversion for both signed and unsigned arguments.

We don't have such a problem with AArch32 tasks since the architecture
guarantees zeroing or preserving the top half of all registers.

For (3), with the current ILP32 approach we wouldn't need any wrapper.
If we are to pass the argument as two 32-bit values, we would need both
the user (glibc) to split the argument and the kernel to re-construct
it. This would be in addition to any default top 32-bit zeroing on
kernel entry.

The overhead may be lost in the noise (we need some data) but IIRC our
decision was mostly based on a cleaner user implementation for point (3)
above. Since an AArch64/ILP32 process can freely use 64-bit registers,
we found it nicer to be able to pass such value directly to the kernel.
Reusing the s390 macros should reduce the amount of new code added to
the kernel.


While writing the above, I realised the current ILP32 patches still miss
on converting pointers passed from user space (unless I got myself
confused in macros). The new __SC_WRAP() and COMPAT_SYSCALL_WRAPx()
macros take care of zero or sign extension via __SC_COMPAT_CAST().
However, we have two more existing cases which I don't see covered:

a) Native syscalls taking a pointer argument and invoked directly from
   ILP32. For example, sys_read() takes a pointer but I don't see any
   __SC_WRAP added by patch 5

b) Current compat syscalls taking a pointer argument. For example,
   compat_sys_vmsplice() gets the iov32 pointer and the compiler assumes
   it is a 64-bit variable. I don't see where the upper half is zeroed

We can solve (a) by adding more __SC_WRAP annotations in the generic
unistd.h. For (b), we would need an __SC_DELOUSE with a bit of penalty
on AArch32/compat support where it isn't needed. So maybe davem has a
point on the overall impact of always zeroing the upper half of the
arguments ;) (both from a performance and maintainability perspective).
I guess this part of the ABI is still up for discussion.

-- 
Catalin
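
A hedged sketch of case (2) above (editor's illustration; the wrapper
name is hypothetical, the real series generates these via
__SC_WRAP()/COMPAT_SYSCALL_WRAPx()):

#include <linux/compat.h>
#include <linux/linkage.h>

asmlinkage long sys_lseek(unsigned int fd, off_t offset, unsigned int whence);

static long ilp32_sys_lseek(unsigned int fd, compat_off_t offset,
                            unsigned int whence)
{
        /*
         * compat_off_t is s32: the implicit conversion sign-extends it
         * to the 64-bit off_t, discarding whatever garbage the 32-bit
         * caller left in the upper half of the register.
         */
        return sys_lseek(fd, offset, whence);
}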



Re: [PATCH 6/6] mm, oom: fortify task_will_free_mem

2016-05-26 Thread Tetsuo Handa
Michal Hocko wrote:
> +/*
> + * Checks whether the given task is dying or exiting and likely to
> + * release its address space. This means that all threads and processes
> + * sharing the same mm have to be killed or exiting.
> + */
> +static inline bool task_will_free_mem(struct task_struct *task)
> +{
> + struct mm_struct *mm = NULL;
> + struct task_struct *p;
> +
> + /*
> +  * If the process has passed exit_mm we have to skip it because
> +  * we have lost a link to other tasks sharing this mm, we do not
> +  * have anything to reap and the task might then get stuck waiting
> +  * for parent as zombie and we do not want it to hold TIF_MEMDIE
> +  */
> + p = find_lock_task_mm(task);
> + if (!p)
>   return false;
>  
> + if (!__task_will_free_mem(p)) {
> + task_unlock(p);
> + return false;
> + }
> +
> + /*
> +  * Check whether there are other processes sharing the mm - they all have
> +  * to be killed or exiting.
> +  */
> + if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
> + mm = p->mm;
> + /* pin the mm to not get freed and reused */
> + atomic_inc(&mm->mm_count);
> + }
> + task_unlock(p);
> +
> + if (mm) {
> + rcu_read_lock();
> + for_each_process(p) {
> + bool vfork;
> +
> + /*
> +  * skip over vforked tasks because they are mostly
> +  * independent and will drop the mm soon
> +  */
> + task_lock(p);
> + vfork = p->vfork_done;
> + task_unlock(p);
> + if (vfork)
> + continue;
> +
> + if (!__task_will_free_mem(p))
> + break;
> + }
> + rcu_read_unlock();
> + mmdrop(mm);

Did you forget "if (something) return false;" here?

> + }
> +

If task_will_free_mem() == true is always followed by mark_oom_victim()
and wake_oom_reaper(), can we call them from here?
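
Roughly this kind of thing, I mean (only a sketch of the suggestion, using
the names from this series; whether the locking context allows it is
exactly my question):

    /* at the end of task_will_free_mem(), instead of a bare return */
    mark_oom_victim(task);
    wake_oom_reaper(task);
    return true;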

>   return true;
>  }


How should I use kernel-defined i2c structs in this driver

2016-05-26 Thread Andrey Utkin
Could anybody please give a hint - which kernel-defined i2c objects, and how
many of them, do I need to define and use to replace these driver-defined
functions i2c_read() and i2c_write()?
https://github.com/bluecherrydvr/linux/blob/release/tw5864/1.16/drivers/media/pci/tw5864/tw5864-config.c
In short, there are 4 chips with different addresses, with which this code
communicates via the main chip's dedicated registers.
Do I need a single i2c_adapter or several?
Do I need i2c_client entities?
Where should I put what is named "devid" here? (Rough sketch of what I
imagine below.)
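
For instance, would a single adapter registration along these lines be the
right direction? (untested sketch; all the tw5864_* names are mine)

    static int tw5864_i2c_xfer(struct i2c_adapter *adap,
                               struct i2c_msg *msgs, int num)
    {
        /* would bang the main chip's dedicated I2C registers here */
        return num;
    }

    static u32 tw5864_i2c_func(struct i2c_adapter *adap)
    {
        return I2C_FUNC_I2C;
    }

    static const struct i2c_algorithm tw5864_i2c_algo = {
        .master_xfer   = tw5864_i2c_xfer,
        .functionality = tw5864_i2c_func,
    };

    static struct i2c_adapter tw5864_i2c_adap = {
        .owner = THIS_MODULE,
        .name  = "tw5864",
        .algo  = &tw5864_i2c_algo,
    };

    /* then i2c_add_adapter(&tw5864_i2c_adap) in probe, and one address
     * per external chip in the i2c_msg, rather than one adapter each? */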

Thanks in advance.


Re: [PATCH] fs: ubifs: Replace kmem_cache_alloc/memset with kmem_cache_zalloc

2016-05-26 Thread Richard Weinberger
Am 26.05.2016 um 00:04 schrieb Salah Triki:
> Use kmem_cache_zalloc instead of kmem_cache_alloc/memset.
> 
> Signed-off-by: Salah Triki 
> ---
>  fs/ubifs/super.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 7034995..f509200 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -262,12 +262,10 @@ static struct inode *ubifs_alloc_inode(struct 
> super_block *sb)
>  {
>   struct ubifs_inode *ui;
>  
> - ui = kmem_cache_alloc(ubifs_inode_slab, GFP_NOFS);
> + ui = kmem_cache_zalloc(ubifs_inode_slab, GFP_NOFS);
>   if (!ui)
>   return NULL;
>  
> - memset((void *)ui + sizeof(struct inode), 0,
> -sizeof(struct ubifs_inode) - sizeof(struct inode));

Your patch fails to explain why it is needed.
Also note that the patch slightly changes the semantics.
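
For reference, the semantic difference as I read it: the old memset
deliberately skipped the embedded struct inode at the front of the object,
while kmem_cache_zalloc() zeroes the whole thing:

    /* old: zero only the ubifs-specific tail, leave struct inode alone */
    memset((void *)ui + sizeof(struct inode), 0,
           sizeof(struct ubifs_inode) - sizeof(struct inode));

    /* new: the leading struct inode is now zeroed as well */
    ui = kmem_cache_zalloc(ubifs_inode_slab, GFP_NOFS);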

Thanks,
//richard


Re: [RFC 3/3] md: dm-crypt: Introduce the bulk mode method when sending request

2016-05-26 Thread Mike Snitzer
Comments inlined.

In general the most concerning bit is the need for memory allocation in
the IO path (see comment/question below near call to sg_alloc_table).
In DM targets we make heavy use of .ctr preallocated memory and/or
per-bio-data to avoid memory allocations in the IO path.
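
For instance, the usual pattern looks like this (a sketch; "struct
crypt_sg_ctx" is a made-up name, and the per-bio-data field name may
differ between kernel versions):

    /* in the target's .ctr: reserve per-bio space up front */
    ti->per_io_data_size = sizeof(struct crypt_sg_ctx);

    /* in the IO path: no allocation at all */
    struct crypt_sg_ctx *sg_ctx =
            dm_per_bio_data(bio, sizeof(struct crypt_sg_ctx));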

On Wed, May 25 2016 at  2:12am -0400,
Baolin Wang  wrote:

> In now dm-crypt code, it is ineffective to map one segment (always one
> sector) of one bio with just only one scatterlist at one time for hardware
> crypto engine. Especially for some encryption mode (like ecb or xts mode)
> cooperating with the crypto engine, they just need one initial IV or null
> IV instead of different IV for each sector. In this situation We can consider
> to use multiple scatterlists to map the whole bio and send all scatterlists
> of one bio to crypto engine to encrypt or decrypt, which can improve the
> hardware engine's efficiency.
> 
> With this optimization, On my test setup (beaglebone black board) using 64KB
> I/Os on an eMMC storage device I saw about 60% improvement in throughput for
> encrypted writes, and about 100% improvement for encrypted reads. But this
> is not fit for other modes which need different IV for each sector.
> 
> Signed-off-by: Baolin Wang 
> ---
>  drivers/md/dm-crypt.c |  188 
> +
>  1 file changed, 176 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 4f3cb35..1c86ea7 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -33,6 +33,7 @@
>  #include 
>  
>  #define DM_MSG_PREFIX "crypt"
> +#define DM_MAX_SG_LIST   1024
>  
>  /*
>   * context holding the current state of a multi-part conversion
> @@ -46,6 +47,8 @@ struct convert_context {
>   sector_t cc_sector;
>   atomic_t cc_pending;
>   struct skcipher_request *req;
> + struct sg_table sgt_in;
> + struct sg_table sgt_out;
>  };
>  
>  /*
> @@ -803,6 +806,108 @@ static struct crypt_iv_operations crypt_iv_tcw_ops = {
>   .post  = crypt_iv_tcw_post
>  };
>  
> +/*
> + * Check how many sg entry numbers are needed when map one bio
> + * with scatterlists in advance.
> + */
> +static unsigned int crypt_sg_entry(struct bio *bio_t)
> +{
> + struct request_queue *q = bdev_get_queue(bio_t->bi_bdev);
> + int cluster = blk_queue_cluster(q);
> + struct bio_vec bvec, bvprv = { NULL };
> + struct bvec_iter biter;
> + unsigned long nbytes = 0, sg_length = 0;
> + unsigned int sg_cnt = 0, first_bvec = 0;
> +
> + if (bio_t->bi_rw & REQ_DISCARD) {
> + if (bio_t->bi_vcnt)
> + return 1;
> + return 0;
> + }
> +
> + if (bio_t->bi_rw & REQ_WRITE_SAME)
> + return 1;
> +
> + bio_for_each_segment(bvec, bio_t, biter) {
> + nbytes = bvec.bv_len;
> +
> + if (!cluster) {
> + sg_cnt++;
> + continue;
> + }
> +
> + if (!first_bvec) {
> + first_bvec = 1;
> + goto new_segment;
> + }
> +
> + if (sg_length + nbytes > queue_max_segment_size(q))
> + goto new_segment;
> +
> + if (!BIOVEC_PHYS_MERGEABLE(&bvprv, &bvec))
> + goto new_segment;
> +
> + if (!BIOVEC_SEG_BOUNDARY(q, &bvprv, &bvec))
> + goto new_segment;
> +
> + sg_length += nbytes;
> + continue;
> +
> +new_segment:
> + memcpy(&bvprv, &bvec, sizeof(struct bio_vec));
> + sg_length = nbytes;
> + sg_cnt++;
> + }
> +
> + return sg_cnt;
> +}
> +
> +static int crypt_convert_alloc_table(struct crypt_config *cc,
> +  struct convert_context *ctx)
> +{
> + struct bio *bio_in = ctx->bio_in;
> + struct bio *bio_out = ctx->bio_out;
> + unsigned int mode = skcipher_is_bulk_mode(any_tfm(cc));

please use: bool bulk_mode = ...

> + unsigned int sg_in_max, sg_out_max;
> + int ret = 0;
> +
> + if (!mode)
> + goto out2;

Please use more descriptive label names than out[1-3]

> +
> + /*
> +  * Need to calculate how many sg entry need to be used
> +  * for this bio.
> +  */
> + sg_in_max = crypt_sg_entry(bio_in) + 1;

The return from crypt_sg_entry() is pretty awkward, given you just go on
to add 1; as is the bounds checking. The magic value of 2 needs to be
made clearer.

> + if (sg_in_max > DM_MAX_SG_LIST || sg_in_max <= 2)
> + goto out2;
> +
> + ret = sg_alloc_table(&ctx->sgt_in, sg_in_max, GFP_KERNEL);

Is it safe to be using GFP_KERNEL here?  AFAIK this is in the IO mapping
path and we try to avoid memory allocations at all costs -- due to the
risk of deadlock when issuing IO to stacked block devices (dm-crypt
could be part of a much more elaborate IO stack).
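
If an allocation really cannot be avoided here, the usual fallback is a
mempool sized at .ctr time, with GFP_NOIO at most in the IO path. A
sketch (cc->sg_pool is hypothetical):

    /* .ctr */
    cc->sg_pool = mempool_create_kmalloc_pool(16,
                    DM_MAX_SG_LIST * sizeof(struct scatterlist));

    /* IO path: never GFP_KERNEL */
    sgl = mempool_alloc(cc->sg_pool, GFP_NOIO);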

> + if (ret)
> + goto out2;
> +
> + if (bio_data_dir(bio_in) == 


[PATCH] oom_reaper: close race with exiting task

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

Tetsuo has reported:
[   74.453958] Out of memory: Kill process 443 (oleg's-test) score 855 or 
sacrifice child
[   74.456234] Killed process 443 (oleg's-test) total-vm:493248kB, 
anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
[   74.459219] sh invoked oom-killer: 
gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[   74.462813] sh cpuset=/ mems_allowed=0
[   74.465221] CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
[   74.467037] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 07/31/2013
[   74.470207]  0286 a17a86b0 88001e673a18 
812c7cbd
[   74.473422]   88001e673bd0 88001e673ab8 
811b9e94
[   74.475704]  88001e66cbe0 88001e673ab8 0246 

[   74.477990] Call Trace:
[   74.479170]  [] dump_stack+0x85/0xc8
[   74.480872]  [] dump_header+0x5b/0x394
[   74.481837] oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, 
file-rss:0kB, shmem-rss:0kB

In other words:

__oom_reap_task                 exit_mm
  atomic_inc_not_zero
                                  tsk->mm = NULL
                                  mmput
                                    atomic_dec_and_test # > 0
                                  exit_oom_victim # New victim will be
                                                  # selected

                                # no TIF_MEMDIE task so we can select
                                # a new one
  unmap_page_range # to release the memory

The race exists even without the oom_reaper because anybody who pins the
address space and gets preempted might race with exit_mm but oom_reaper
made this race more probable.

We can address the oom_reaper part by using oom_lock for __oom_reap_task
because this would guarantee that a new oom victim will not be selected
if the oom reaper might race with the exit path. This doesn't solve the
original issue, though, because somebody else still might be pinning
mm_users and so __mmput won't be called to release the memory but that
is not really reliably solvable because the task will get away from the
oom sight as soon as it is unhashed from the task_list and so we cannot
guarantee a new victim won't be selected.

Fixes: aac453635549 ("mm, oom: introduce oom reaper")
Reported-by: Tetsuo Handa 
Signed-off-by: Michal Hocko 
---

Hi,
I haven't marked this for stable because the race is quite unlikely I
believe. I have noted the original commit, though, for those who might
want to backport it and consider this follow up fix as well.

I guess this would be good to go in the current merge window, unless I
have missed something subtle. It would be great if Tetsuo could try to
reproduce and confirm this really solves his issue.

Thanks!

 mm/oom_kill.c | 25 +
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..d0f42cc88f6a 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -450,6 +450,22 @@ static bool __oom_reap_task(struct task_struct *tsk)
bool ret = true;
 
/*
+* We have to make sure to not race with the victim exit path
+* and cause premature new oom victim selection:
+* __oom_reap_task              exit_mm
+*   atomic_inc_not_zero
+*                                mmput
+*                                  atomic_dec_and_test
+*                                exit_oom_victim
+*                              [...]
+*                              out_of_memory
+*                                select_bad_process
+*                                  # no TIF_MEMDIE task, select new victim
+*  unmap_page_range # frees some memory
+*/
+   mutex_lock(&oom_lock);
+
+   /*
 * Make sure we find the associated mm_struct even when the particular
 * thread has already terminated and cleared its mm.
 * We might have race with exit path so consider our work done if there
@@ -457,19 +473,19 @@ static bool __oom_reap_task(struct task_struct *tsk)
 */
p = find_lock_task_mm(tsk);
if (!p)
-   return true;
+   goto unlock_oom;
 
mm = p->mm;
if (!atomic_inc_not_zero(&mm->mm_users)) {
task_unlock(p);
-   return true;
+   goto unlock_oom;
}
 
task_unlock(p);
 
if (!down_read_trylock(>mmap_sem)) {
ret = false;
-   goto out;
+   goto unlock_oom;
}
 
tlb_gather_mmu(&tlb, mm, 0, -1);
@@ -511,7 +527,8 @@ static bool __oom_reap_task(struct task_struct *tsk)
 * to release its memory.
 */
set_bit(MMF_OOM_REAPED, &mm->flags);
-out:
+unlock_oom:
+   mutex_unlock(&oom_lock);
/*
 * Drop our reference but make sure the mmput slow path is called from a
   

Re: [PATCH RESEND 7/8] pipe: account to kmemcg

2016-05-26 Thread Vladimir Davydov
On Thu, May 26, 2016 at 04:04:55PM +0900, Minchan Kim wrote:
> On Wed, May 25, 2016 at 01:30:11PM +0300, Vladimir Davydov wrote:
> > On Tue, May 24, 2016 at 01:04:33PM -0700, Eric Dumazet wrote:
> > > On Tue, 2016-05-24 at 19:13 +0300, Vladimir Davydov wrote:
> > > > On Tue, May 24, 2016 at 05:59:02AM -0700, Eric Dumazet wrote:
> > > > ...
> > > > > > +static int anon_pipe_buf_steal(struct pipe_inode_info *pipe,
> > > > > > +  struct pipe_buffer *buf)
> > > > > > +{
> > > > > > +   struct page *page = buf->page;
> > > > > > +
> > > > > > +   if (page_count(page) == 1) {
> > > > > 
> > > > > This looks racy : some cpu could have temporarily elevated page count.
> > > > 
> > > > All pipe operations (pipe_buf_operations->get, ->release, ->steal) are
> > > > supposed to be called under pipe_lock. So, if we see a pipe_buffer->page
> > > > with refcount of 1 in ->steal, that means that we are the only its user
> > > > and it can't be spliced to another pipe.
> > > > 
> > > > In fact, I just copied the code from generic_pipe_buf_steal, adding
> > > > kmemcg related checks along the way, so it should be fine.
> > > 
> > > So you guarantee that no other cpu might have done
> > > get_page_unless_zero() right before this test ?
> > 
> > Each pipe_buffer holds a reference to its page. If we find page's
> > refcount to be 1 here, then it can be referenced only by our
> > pipe_buffer. And the refcount cannot be increased by a parallel thread,
> > because we hold pipe_lock, which rules out splice, and otherwise it's
> > impossible to reach the page as it is not on lru. That said, I think I
> > guarantee that this should be safe.
> 
> I don't know kmemcg internal and pipe stuff so my comment might be
> totally crap.
> 
> No one cannot guarantee any CPU cannot held a reference of a page.
> Look at get_page_unless_zero usecases.
> 
> 1. balloon_page_isolate
> 
> It can hold a reference in random page and then verify the page
> is balloon page. Otherwise, just put.
> 
> 2. page_idle_get_page
> 
> It has PageLRU check but it's racy so it can hold a reference
> of randome page and then verify within zone->lru_lock. If it's
> not LRU page, just put.

Well, I see your concern now - even if a page is not on lru and we
locked all structs pointing to it, it can always get accessed by pfn in
a completely unrelated thread, like in examples you gave above. That's a
fair point.

However, I still think that it's OK in case of pipe buffers. What can
happen if somebody takes a transient reference to a pipe buffer page? At
worst, we'll see page_count > 1 due to temporary ref and abort stealing,
falling back on copying instead. That's OK, because stealing is not
guaranteed. Can a function that takes a transient ref to page by pfn
mistakenly assume that this is a page it's interested in? I don't think
so, because this page has no marks on it except special _mapcount value,
which should only be set on kmemcg pages.
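
I.e. the check degrades gracefully. Sketching the generic_pipe_buf_steal
style logic from memory:

    if (page_count(page) == 1) {
        /* only our pipe_buffer references the page: steal it */
        lock_page(page);
        return 0;
    }
    /* transient reference seen: refuse, the caller copies instead */
    return 1;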

Thanks,
Vladimir


Re: [RFC PATCH 1/3] watchdog: Add Meson GXBB Watchdog Driver

2016-05-26 Thread Guenter Roeck

On 05/26/2016 12:51 AM, Neil Armstrong wrote:

Add watchdog specific driver for Amlogic Meson GXBB SoC.



Wondering - why RFC?


Signed-off-by: Neil Armstrong 
---
  drivers/watchdog/Makefile |   1 +
  drivers/watchdog/meson_gxbb_wdt.c | 287 ++
  2 files changed, 288 insertions(+)
  create mode 100644 drivers/watchdog/meson_gxbb_wdt.c

diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile
index 9bde095..7679d93 100644
--- a/drivers/watchdog/Makefile
+++ b/drivers/watchdog/Makefile
@@ -68,6 +68,7 @@ obj-$(CONFIG_QCOM_WDT) += qcom-wdt.o
  obj-$(CONFIG_BCM_KONA_WDT) += bcm_kona_wdt.o
  obj-$(CONFIG_TEGRA_WATCHDOG) += tegra_wdt.o
  obj-$(CONFIG_MESON_WATCHDOG) += meson_wdt.o
+obj-$(CONFIG_MESON_WATCHDOG) += meson_gxbb_wdt.o
  obj-$(CONFIG_MEDIATEK_WATCHDOG) += mtk_wdt.o
  obj-$(CONFIG_DIGICOLOR_WATCHDOG) += digicolor_wdt.o
  obj-$(CONFIG_LPC18XX_WATCHDOG) += lpc18xx_wdt.o
diff --git a/drivers/watchdog/meson_gxbb_wdt.c 
b/drivers/watchdog/meson_gxbb_wdt.c
new file mode 100644
index 000..0c7c0d9
--- /dev/null
+++ b/drivers/watchdog/meson_gxbb_wdt.c
@@ -0,0 +1,287 @@
+/*
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright (c) 2016 BayLibre, SAS.
+ * Author: Neil Armstrong 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ * The full GNU General Public License is included in this distribution
+ * in the file called COPYING.
+ *
+ * BSD LICENSE
+ *
+ * Copyright (c) 2016 BayLibre, SAS.
+ * Author: Neil Armstrong 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *   * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *   * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ *   * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define DEFAULT_TIMEOUT10  /* seconds */
+
+#define GXBB_WDT_CTRL_REG  0x0
+#define GXBB_WDT_CTRL1_REG 0x4
+#define GXBB_WDT_TCNT_REG  0x8
+#define GXBB_WDT_RSET_REG  0xc
+
+#define GXBB_WDT_CTRL_EE_RESET_NOW BIT(26)
+
+#define GXBB_WDT_CTRL_CLKDIV_ENBIT(25)
+#define GXBB_WDT_CTRL_CLK_EN   BIT(24)
+#define GXBB_WDT_CTRL_IRQ_EN   BIT(23)
+#define GXBB_WDT_CTRL_EE_RESET BIT(21)
+#define GXBB_WDT_CTRL_XTAL_SEL (0)
+#define GXBB_WDT_CTRL_CLK81_SELBIT(19)
+#define GXBB_WDT_CTRL_EN   BIT(18)
+#define GXBB_WDT_CTRL_DIV_MASK (BIT(18)-1)
+
+#define GXBB_WDT_CTRL1_GPIO_PULSE  BIT(17)
+#define GXBB_WDT_CTRL1_GPIO_POL_RESET_0BIT(16)
+#define GXBB_WDT_CTRL1_GPIO_POL_RESET_1(0)
+#define GXBB_WDT_CTRL1_GPIO_PULSE_CNT  (BIT(16)-1)
+


Spaces around operators, please.
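
E.g., for the masks above:

    #define GXBB_WDT_CTRL_DIV_MASK (BIT(18) - 1)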


+#define GXBB_WDT_TCNT_SETUP_MASK   (BIT(16)-1)
+#define GXBB_WDT_TCNT_CNT_SHIFT16
+
+struct meson_gxbb_wdt {
+   void __iomem *reg_base;
+   struct watchdog_device wdt_dev;
+   struct clk *clk;
+};
+
+int meson_gxbb_wdt_start(struct watchdog_device *wdt_dev)


Functions should all be static.
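
I.e.:

    static int meson_gxbb_wdt_start(struct watchdog_device *wdt_dev)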



Re: [RFC][PATCH 3/3] locking,netfilter: Fix nf_conntrack_lock()

2016-05-26 Thread Peter Zijlstra
On Tue, May 24, 2016 at 10:41:43PM +0200, Manfred Spraul wrote:
> >--- a/net/netfilter/nf_conntrack_core.c
> >+++ b/net/netfilter/nf_conntrack_core.c
> >@@ -74,7 +74,18 @@ void nf_conntrack_lock(spinlock_t *lock)
> > spin_lock(lock);
> > while (unlikely(nf_conntrack_locks_all)) {
> > spin_unlock(lock);
> >+/*
> >+ * Order the nf_contrack_locks_all load vs the spin_unlock_wait()
> >+ * loads below, to ensure locks_all is indeed held.
> >+ */
> >+smp_rmb(); /* spin_lock(locks_all) */
> > spin_unlock_wait(&nf_conntrack_locks_all_lock);

> I don't understand the comment, and I don't understand why the smp_rmb() is
> required: What do you want to protect against?

I order against the spin_unlock_wait() load of
nf_conntrack_locks_all_lock being done _before_ the
nf_conntrack_locks_all load.

This is possible because the spin_unlock() in between will stop the
nf_conntrack_locks_all load from falling down, but it doesn't stop the
&nf_conntrack_locks_all_lock load from being done early.

Inverting these two loads could result in not waiting when we should.


nf_conntrack_all_lock()         nf_conntrack_lock()

                                  [L] all_locks == unlocked
  [S] spin_lock(&all_lock);
  [S] all = true;
                                  [L] all == true

And we'll not wait for the current all_lock section to finish.
Resulting in an all_lock and lock section at the same time.


Re: [PATCH 16/23] arm64: ilp32: introduce binfmt_ilp32.c

2016-05-26 Thread Zhangjian (Bamvor)

Hi, Yury,

The coredump is now usable on our platform. It was missing the following definitions:
+#define compat_elf_greg_t  elf_greg_t
+#define compat_elf_gregset_t   elf_gregset_t

The missing definitions led to the wrong registers being saved in the core
dump. After applying this patch, gdb can debug core dump files.

Here is the full patch:
From 102624840aa5dacdd1bbfe3b390290f52f530ea2 Mon Sep 17 00:00:00 2001
From: Bamvor Jian Zhang 
Date: Thu, 26 May 2016 21:00:16 +0800
Subject: [PATCH hulk-4.1-next] arm64: ilp32: fix coredump issue

ILP32 uses AArch64 registers and a 32-bit signal struct, which means it
cannot make use of the existing compat_elf_prstatus/elf_prstatus
and compat_elf_prpsinfo/elf_prpsinfo.

This patch fixes the issue by introducing separate
compat_elf_greg_t and compat_elf_gregset_t definitions for AArch64 ILP32
and AArch32 el0.

Tested on Huawei hardware in big-endian mode.

Signed-off-by: Bamvor Jian Zhang 
---
 arch/arm64/include/asm/elf.h | 14 +++---
 arch/arm64/kernel/binfmt_elf32.c |  3 +++
 arch/arm64/kernel/binfmt_ilp32.c |  8 +++-
 arch/arm64/kernel/ptrace.c   | 20 ++--
 4 files changed, 27 insertions(+), 18 deletions(-)

diff --git a/arch/arm64/include/asm/elf.h b/arch/arm64/include/asm/elf.h
index 0106d18..9019441 100644
--- a/arch/arm64/include/asm/elf.h
+++ b/arch/arm64/include/asm/elf.h
@@ -154,18 +154,18 @@ extern int arch_setup_additional_pages(struct 
linux_binprm *bprm,
   int uses_interp);

 /* 1GB of VA */
-#define STACK_RND_MASK (is_compat_task() ? \
-   0x7ff >> (PAGE_SHIFT - 12) : \
-   0x3ffff >> (PAGE_SHIFT - 12))
+#define STACK_RND_MASK (is_compat_task() ? \
+   0x7ff >> (PAGE_SHIFT - 12) : \
+   0x3ffff >> (PAGE_SHIFT - 12))

 #ifdef CONFIG_COMPAT

-#define COMPAT_ELF_ET_DYN_BASE (2 * TASK_SIZE_32 / 3)
+#define COMPAT_ELF_ET_DYN_BASE (2 * TASK_SIZE_32 / 3)

 /* AArch32 registers. */
-#define COMPAT_ELF_NGREG   18
-typedef unsigned int   compat_elf_greg_t;
-typedef compat_elf_greg_t  compat_elf_gregset_t[COMPAT_ELF_NGREG];
+#define COMPAT_ELF_NGREG   18
+typedef unsigned int   compat_a32_elf_greg_t;
+typedef compat_a32_elf_greg_t  compat_a32_elf_gregset_t[COMPAT_ELF_NGREG];

 #endif /* CONFIG_COMPAT */

diff --git a/arch/arm64/kernel/binfmt_elf32.c b/arch/arm64/kernel/binfmt_elf32.c
index 7b9b445..f75253c 100644
--- a/arch/arm64/kernel/binfmt_elf32.c
+++ b/arch/arm64/kernel/binfmt_elf32.c
@@ -31,4 +31,7 @@ struct linux_binprm;
 extern int aarch32_setup_vectors_page(struct linux_binprm *bprm,
  int uses_interp);

+#define compat_elf_greg_t  compat_a32_elf_greg_t
+#define compat_elf_gregset_t   compat_a32_elf_gregset_t
+
 #include "../../../fs/compat_binfmt_elf.c"
diff --git a/arch/arm64/kernel/binfmt_ilp32.c b/arch/arm64/kernel/binfmt_ilp32.c
index b827a9a..01baf83 100644
--- a/arch/arm64/kernel/binfmt_ilp32.c
+++ b/arch/arm64/kernel/binfmt_ilp32.c
@@ -2,7 +2,9 @@
  * Support for ILP32 Linux/aarch64 ELF binaries.
  */

-#include 
+#include 
+#include 
+#include 
 #include 

 #undef ELF_CLASS
@@ -30,9 +32,13 @@
  * The machine-dependent core note format types are defined in 
elfcore-compat.h,
  * which requires asm/elf.h to define compat_elf_gregset_t et al.
  */
+#define compat_elf_greg_t  elf_greg_t
+#define compat_elf_gregset_t   elf_gregset_t
 #define elf_prstatus   compat_elf_prstatus
 #define elf_prpsinfo   compat_elf_prpsinfo

+#include 
+
 /*
  * Compat version of cputime_to_compat_timeval, perhaps this
  * should be an inline in .
diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 5c86135..9784c77 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -913,8 +913,8 @@ static const struct user_regset aarch32_regsets[] = {
[REGSET_COMPAT_GPR] = {
.core_note_type = NT_PRSTATUS,
.n = COMPAT_ELF_NGREG,
-   .size = sizeof(compat_elf_greg_t),
-   .align = sizeof(compat_elf_greg_t),
+   .size = sizeof(compat_a32_elf_greg_t),
+   .align = sizeof(compat_a32_elf_greg_t),
.get = compat_gpr_get,
.set = compat_gpr_set
},
@@ -947,7 +947,7 @@ static int compat_ptrace_read_user(struct task_struct *tsk, 
compat_ulong_t off,
tmp = tsk->mm->start_data;
else if (off == COMPAT_PT_TEXT_END_ADDR)
tmp = tsk->mm->end_code;
-   else if (off < sizeof(compat_elf_gregset_t))
+   else if (off < sizeof(compat_a32_elf_gregset_t))
return copy_regset_to_user(tsk, _aarch32_view,
   REGSET_COMPAT_GPR, off,
   sizeof(compat_ulong_t), ret);
@@ -968,7 +968,7 @@ static int 

[PATCH] clk: rockchip: fix cclk error handling

2016-05-26 Thread Xing Zheng
Probably due to a copy-paste error, the error handling checks clk
where it should check cclk.

Reported-by: Lin Huang 
Signed-off-by: Xing Zheng 
---

 drivers/clk/rockchip/clk-cpu.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/clk/rockchip/clk-cpu.c b/drivers/clk/rockchip/clk-cpu.c
index 4bb130c..05b3d73 100644
--- a/drivers/clk/rockchip/clk-cpu.c
+++ b/drivers/clk/rockchip/clk-cpu.c
@@ -321,9 +321,9 @@ struct clk *rockchip_clk_register_cpuclk(const char *name,
}
 
cclk = clk_register(NULL, >hw);
-   if (IS_ERR(clk)) {
+   if (IS_ERR(cclk)) {
pr_err("%s: could not register cpuclk %s\n", __func__,  name);
-   ret = PTR_ERR(clk);
+   ret = PTR_ERR(cclk);
goto free_rate_table;
}
 
-- 
1.7.9.5


Re: [RFC PATCH v1 10/18] x86/efi: Access EFI related tables in the clear

2016-05-26 Thread Tom Lendacky
On 05/25/2016 02:30 PM, Matt Fleming wrote:
> On Tue, 24 May, at 09:54:31AM, Tom Lendacky wrote:
>>
>> I looked into this and this would be a large change also to parse tables
>> and build lists.  It occurred to me that this could all be taken care of
>> if the early_memremap calls were changed to early_ioremap calls. Looking
>> in the git log I see that they were originally early_ioremap calls but
>> were changed to early_memremap calls with this commit:
>>
>> commit abc93f8eb6e4 ("efi: Use early_mem*() instead of early_io*()")
>>
>> Looking at the early_memremap code and the early_ioremap code they both
>> call __early_ioremap so I don't see how this change makes any
>> difference (especially since FIXMAP_PAGE_NORMAL and FIXMAP_PAGE_IO are
>> identical in this case).
>>
>> Is it safe to change these back to early_ioremap calls (at least on
>> x86)?
> 
> I really don't want to begin mixing early_ioremap() calls and
> early_memremap() calls for any of the EFI code if it can be avoided.

I definitely wouldn't mix them, it would be all or nothing.

> 
> There is slow but steady progress to move more and more of the
> architecture specific EFI code out into generic code. Swapping
> early_memremap() for early_ioremap() would be a step backwards,
> because FIXMAP_PAGE_NORMAL and FIXMAP_PAGE_IO are not identical on
> ARM/arm64.

Maybe adding something similar to __acpi_map_table would be more
appropriate?

> 
> Could you point me at the patch that in this series that fixes up
> early_ioremap() to work with mem encrypt/decrypt? I took another
> (quick) look through but couldn't find it.

The patch in question is patch 6/18 where PAGE_KERNEL is changed to
include the _PAGE_ENC attribute (the encryption mask). This now
makes FIXMAP_PAGE_NORMAL contain the encryption mask while
FIXMAP_PAGE_IO does not. In this way, anything mapped using the
early_ioremap call won't be mapped encrypted.

Thanks,
Tom

> 
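As an aside for readers following the thread: a minimal sketch, assuming
simplified macro shapes, of the mechanism Tom describes. The real patch
touches the x86 pgtable headers; using sme_me_mask as the name of the
encryption mask is an assumption here, not taken from the series:

	/* Hedged sketch, not the actual SME patch. sme_me_mask is assumed
	 * to be the encryption bit mask, 0 when memory encryption is off. */
	#define _PAGE_ENC		(sme_me_mask)

	/* PAGE_KERNEL now carries the encryption bit... */
	#define PAGE_KERNEL		__pgprot(__PAGE_KERNEL | _PAGE_ENC)
	/* ...while the I/O variant deliberately does not. */
	#define PAGE_KERNEL_IO		__pgprot(__PAGE_KERNEL_IO)

	/* Consequence for the fixmap paths discussed above
	 * (cf. asm-generic/fixmap.h):
	 * early_memremap() maps with FIXMAP_PAGE_NORMAL => encrypted,
	 * early_ioremap()  maps with FIXMAP_PAGE_IO     => not encrypted. */
	#define FIXMAP_PAGE_NORMAL	PAGE_KERNEL
	#define FIXMAP_PAGE_IO		PAGE_KERNEL_IO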


Re: [RFC PATCH 1/2] ACPI / button: Send "open" state after boot/resume

2016-05-26 Thread Benjamin Tissoires
[Jumping in the discussion at Bastien's request]

On Thu, May 19, 2016 at 3:21 PM, Rafael J. Wysocki  wrote:
> On Thu, May 19, 2016 at 3:50 AM, Zheng, Lv  wrote:
>> Hi,
>>
>>> From: rjwyso...@gmail.com [mailto:rjwyso...@gmail.com] On Behalf Of
>>> Rafael J. Wysocki
>>> Subject: Re: [RFC PATCH 1/2] ACPI / button: Send "open" state after
>>> boot/resume
>
> [cut]
>
>>> > That's because of systemd implementation.
>>> > It contains code logic that:
>>> > When the lid state is closed, a re-checking mechanism is installed.
>>> > So if we do not send any notification after boot/resume and the old lid 
>>> > state
>>> is "closed".
>>> > systemd determines to suspend in the re-checking mechanism.
>>>
>>> If that really is the case, it is plain silly and I don't think we can
>>> do anything in the kernel to help here.
>>
>> [Lv Zheng]
>> The problem is:
>> If we just removed the 2 lines sending wrong lid state after boot/resume.
>> Problem couldn't be solved.
>> It could only be solved by changing both the systemd and the kernel 
>> (deleting the 2 lines).
>
> There are two things here, there's a kernel issue (sending the fake
> input events) and there's a user-visible problem.  Yes, it may not be
> possible to fix the user-visible problem by fixing the kernel issue
> alone, but pretty much by definition we can only fix the kernel issue
> in the kernel.
>
> However, it looks like it may not be possible to fix the user-visible
> problem without fixing the kernel issue in the first place, so maybe
> we should do that and attach the additional user space patch to the
> bug entries in question?
>
> [cut]
>
>>> > I intentionally kept the _LID evaluation right after boot/resume.
>>> > Because I validated Windows behavior.
>>> > It seems Windows evaluates _LID right after boot.
>>> > So I kept _LID evaluated right after boot to prevent compliance issues.
>>>
>>> I don't quite see what compliance issues could result from skipping
>>> the _LID evaluation after boot.
>>
>> [Lv Zheng]
>> I'm not sure if there is a platform putting named object initialization code 
>> in _LID.
>> If you don't like it, we can stop evaluating _LID in the next version.
>
> Well, unless there is a well-documented reason for doing this, I'd at
> least try to see what happens if we don't.
>
> Doing things for unspecified reasons is not a very good idea overall IMO.

I found an issue on the surface 3 which explains why the initial state
of the _LID switch is wrong.
In gpiolib-acpi, we initialize an operation region for the LID switch
to be controlled by a GPIO. This GPIO triggers an _E4C method when
changed (see https://bugzilla.kernel.org/attachment.cgi?id=187171 in
GPO0). This GPIO event actually sets the correct initial state of LIDB,
which is forwarded by _LID.

Now, on the Surface 3, there is another GPIO event (_E10) which, when
triggered at boot, seems to put the sensor hub (over i2c-hid) into
better shape:
I used to receive a5 a5 a5 a5 a5.. (garbage) after enabling S0 on the
sensor and when requesting data from it. Now I get a nice
[  +0.000137] i2c_hid i2c-MSHW0102:00: report (len=17): 11 00 01 02 05 00 00 00 
00 00 00 00 00 00 18 fc 00
which seems more sensible from a HID point of view.

The patch is the following:

---
>From 2c76d14a5ad089d0321a029edde3f91f3bc93ae3 Mon Sep 17 00:00:00 2001
From: Benjamin Tissoires 
Date: Thu, 26 May 2016 15:29:10 +0200
Subject: [PATCH] gpiolib-acpi: make sure we trigger the events at least once
 on boot

The Surface 3 has its _LID state controlled by an ACPI operation region
triggered by a GPIO event:

 OperationRegion (GPOR, GeneralPurposeIo, Zero, One)
 Field (GPOR, ByteAcc, NoLock, Preserve)
 {
 Connection (
 GpioIo (Shared, PullNone, 0x, 0x, IoRestrictionNone,
 "\\_SB.GPO0", 0x00, ResourceConsumer, ,
 )
 {   // Pin list
 0x004C
 }
 ),
 HELD,   1
 }

 Method (_E4C, 0, Serialized)  // _Exx: Edge-Triggered GPE
 {
 If ((HELD == One))
 {
 ^^LID.LIDB = One
 }
 Else
 {
 ^^LID.LIDB = Zero
 Notify (LID, 0x80) // Status Change
 }

 Notify (^^PCI0.SPI1.NTRG, One) // Device Check
 }

Currently, the state of LIDB is wrong until the user actually closes or
opens the cover. We need to trigger the GPIO event once to update the
internal ACPI state.

Coincidentally, this also enables the integrated HID sensor hub which also
requires an ACPI gpio operation region to start initialization.

Signed-off-by: Benjamin Tissoires 
---
 drivers/gpio/gpiolib-acpi.c | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/gpio/gpiolib-acpi.c b/drivers/gpio/gpiolib-acpi.c
index 2dc5258..71775a0 100644
--- a/drivers/gpio/gpiolib-acpi.c
+++ b/drivers/gpio/gpiolib-acpi.c
@@ -175,7 +175,7 @@ static acpi_status acpi_gpiochip_request_interrupt(struct 
acpi_resource *ares,
irq_handler_t handler = NULL;
struct gpio_desc *desc;
unsigned long 
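(The diff is truncated here in the archive.) A hedged sketch of the idea
the patch implements — after the GPIO interrupt is requested, fire the
handler once if the pin is already active, so the ACPI operation region
(_E4C, and thus LIDB) observes the boot-time pin state. Only 'desc'
appears in the visible prologue; 'irq' and 'event' are assumptions:

	/* Hedged sketch, not the literal truncated hunk: synthesize one
	 * event at request time so the initial pin state is observed. */
	if (gpiod_get_value_cansleep(desc))
		handler(irq, event);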

Re: [PATCH] watchdog: Add a device managed API for watchdog_register_device()

2016-05-26 Thread Guenter Roeck

On 05/26/2016 01:31 AM, Neil Armstrong wrote:

This helps in reducing code in .remove callbacks and sometimes
dropping .remove callbacks entirely.

Signed-off-by: Neil Armstrong 
---
  Documentation/driver-model/devres.txt |  3 +++
  drivers/watchdog/watchdog_core.c  | 36 +++
  include/linux/watchdog.h  |  3 +++
  3 files changed, 42 insertions(+)

diff --git a/Documentation/driver-model/devres.txt 
b/Documentation/driver-model/devres.txt
index c63eea0..589296b 100644
--- a/Documentation/driver-model/devres.txt
+++ b/Documentation/driver-model/devres.txt
@@ -357,3 +357,6 @@ SLAVE DMA ENGINE

  SPI
devm_spi_register_master()
+
+WATCHDOG
+  devm_watchdog_register_device()
diff --git a/drivers/watchdog/watchdog_core.c b/drivers/watchdog/watchdog_core.c
index 981a668..a664d80 100644
--- a/drivers/watchdog/watchdog_core.c
+++ b/drivers/watchdog/watchdog_core.c
@@ -329,6 +329,42 @@ void watchdog_unregister_device(struct watchdog_device 
*wdd)

  EXPORT_SYMBOL_GPL(watchdog_unregister_device);

+static void devm_watchdog_unregister_device(struct device *dev, void *res)
+{
+   watchdog_unregister_device(*(struct watchdog_device **)res);
+}
+
+/**
+ * devm_watchdog_register_device() - resource managed 
watchdog_register_device()
+ * @dev: device that is registering this watchdog device
+ * @wdd: watchdog device
+ *
+ * Managed watchdog_register_device(). For watchdog device registered by this
+ * function,  watchdog_unregister_device() is automatically called on driver
+ * detach. See watchdog_register_device() for more information.
+ */
+int devm_watchdog_register_device(struct device *dev,
+   struct watchdog_device *wdd)
+{
+   struct watchdog_device **rcwdd;
+   int ret;
+
+   rcwdd = devres_alloc(devm_watchdog_unregister_device, sizeof(*rcwdd),
+GFP_KERNEL);
+   if (!rcwdd)
+   return -ENOMEM;
+
+   ret = watchdog_register_device(wdd);
+   if (!ret) {
+   *rcwdd = wdd;
+   devres_add(dev, rcwdd);
+   } else
+   devres_free(rcwdd);
+


Please use { } in the else path as well.

Thanks,
Guenter


+   return ret;
+}
+EXPORT_SYMBOL_GPL(devm_watchdog_register_device);
+
  static int __init watchdog_deferred_registration(void)
  {
mutex_lock(_deferred_reg_mutex);
diff --git a/include/linux/watchdog.h b/include/linux/watchdog.h
index 51732d6c..6b75e38 100644
--- a/include/linux/watchdog.h
+++ b/include/linux/watchdog.h
@@ -180,4 +180,7 @@ extern int watchdog_init_timeout(struct watchdog_device 
*wdd,
  extern int watchdog_register_device(struct watchdog_device *);
  extern void watchdog_unregister_device(struct watchdog_device *);

+/* devres register variant */
+int devm_watchdog_register_device(struct device *dev, struct watchdog_device 
*);
+
  #endif  /* ifndef _LINUX_WATCHDOG_H */





Re: [PATCH v4] extcon: gpio: Add the support for Device tree bindings

2016-05-26 Thread Chanwoo Choi
Hi Venkat,

There is some miscommunication. In my previous reply, I didn't mean
that the authorship of the patch[1] should change from me to you.

I'd like you to keep the original author (me) for the patch[1]
without changing the author.
[1] https://lkml.org/lkml/2015/10/21/8
- [PATCH v3] extcon: gpio: Add the support for Device tree bindings

You can use the patch[1] as the base patch and add new features on top
of it. Also, if you're OK with it, you can modify the extcon-gpio.c
driver as per Rob's comment[2].

Note that Rob Herring gave me some comments[2]:
[2] https://lkml.org/lkml/2015/10/21/906


Thanks,
Chanwoo Choi


On Thu, May 26, 2016 at 8:47 PM, Venkat Reddy Talla
 wrote:
> Add the support for Device tree bindings of extcon-gpio driver.
> The extcon-gpio device tree node must include the both 'extcon-id' and
> 'gpios' property.
>
> For example:
> usb_cable: extcon-gpio-0 {
> compatible = "extcon-gpio";
> extcon-id = ;
> gpios = < 1 GPIO_ACTIVE_HIGH>;
> }
> ta_cable: extcon-gpio-1 {
> compatible = "extcon-gpio";
> extcon-id = ;
> gpios = < 2 GPIO_ACTIVE_LOW>;
> debounce-ms = <50>; /* 50 millisecond */
> wakeup-source;
> }
> _usb {
> extcon = <_cable>;
> };
>  {
> extcon = <_cable>;
> };
>
> Signed-off-by: Venkat Reddy Talla 
> Signed-off-by: Chanwoo Choi 
> Signed-off-by: MyungJoo Ham 
>
> ---
> changes from v3:
> - add description for wakeup-source in documentation
> - change dts property extcon-gpio name to gpios
> - use of_get_named_gpio_flags to get gpio number and flags
> Changes from v2:
> - Add the more description for 'extcon-id' property in documentation
> Changes from v1:
> - Create the include/dt-bindings/extcon/extcon.h including the identification
> of external connector. These definitions are used in dts file.
> - Fix error if CONFIG_OF is disabled.
> - Add signed-off tag by Myungjoo Ham
> ---
> ---
>  .../devicetree/bindings/extcon/extcon-gpio.txt |  48 +
>  drivers/extcon/extcon-gpio.c   | 109 
> +
>  include/dt-bindings/extcon/extcon.h|  47 +
>  include/linux/extcon/extcon-gpio.h |   8 +-
>  4 files changed, 189 insertions(+), 23 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/extcon/extcon-gpio.txt
>  create mode 100644 include/dt-bindings/extcon/extcon.h
>
> diff --git a/Documentation/devicetree/bindings/extcon/extcon-gpio.txt 
> b/Documentation/devicetree/bindings/extcon/extcon-gpio.txt
> new file mode 100644
> index 000..81f7932
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/extcon/extcon-gpio.txt
> @@ -0,0 +1,48 @@
> +GPIO Extcon device
> +
> +This is a virtual device used to generate the specific cable states from the
> +GPIO pin.
> +
> +Required properties:
> +- compatible: Must be "extcon-gpio".
> +- extcon-id: The extcon framework supports various types of external
> +  connector and checks whether a connector is attached or detached. Each
> +  external connector has a unique number to identify it, and this property
> +  holds the unique number which indicates the specific external connector.
> +  When an external connector is attached or detached, the GPIO pin detects
> +  the changed state. See include/dt-bindings/extcon/extcon.h, which defines
> +  the unique numbers for the external connectors supported by the extcon
> +  framework.
> +- gpios: GPIO pin to detect the external connector. See gpio binding.
> +
> +Optional properties:
> +- debounce-ms: the debounce delay for the GPIO pin, in milliseconds.
> +- wakeup-source: Boolean, extcon can wake up the system from suspend.
> +  If the GPIO provided in the extcon-gpio DT node is registered as an
> +  interrupt, then extcon can wake up the system from suspend when the
> +  wakeup-source property is present in the DT node. If the GPIO is
> +  registered as an interrupt but wakeup-source is absent, then system
> +  wake-up on extcon events is not supported.
> +
> +Example: Examples of extcon-gpio node as listed below:
> +
> +   usb_cable: extcon-gpio-0 {
> +   compatible = "extcon-gpio";
> +   extcon-id = ;
> +   extcon-gpio = < 1 GPIO_ACTIVE_HIGH>;
> +   }
> +
> +   ta_cable: extcon-gpio-1 {
> +   compatible = "extcon-gpio";
> +   extcon-id = ;
> +   extcon-gpio = < 2 GPIO_ACTIVE_LOW>;
> +   debounce-ms = <50>; /* 50 millisecond */
> +   wakeup-source;
> +   }
> +
> +   _usb {
> +   extcon = <_cable>;
> +   };
> +
> +{
> +   extcon = <_cable>;
> +   };
> diff --git a/drivers/extcon/extcon-gpio.c b/drivers/extcon/extcon-gpio.c
> index d023789..592f395 100644
> --- a/drivers/extcon/extcon-gpio.c
> +++ b/drivers/extcon/extcon-gpio.c
> @@ -1,11 +1,9 @@
>  /*
>   * 

Re: [STLinux Kernel] [PATCH 02/11] drivers: clk: st: Add clock propagation for audio clocks

2016-05-26 Thread Rob Herring
On Thu, May 26, 2016 at 8:05 AM, loic pallardy  wrote:
>
>
> On 05/26/2016 02:46 PM, Rob Herring wrote:
>>
>> On Thu, May 26, 2016 at 4:49 AM, Gabriel Fernandez
>>  wrote:
>>>
>>> On 25 May 2016 at 19:24, Rob Herring  wrote:


 On Wed, May 18, 2016 at 10:41:23AM +0200, Gabriel Fernandez wrote:
>
> This patch allows fine tuning of the quads FS for audio clocks
> accuracy.
>
> Signed-off-by: Olivier Bideau 
> Signed-off-by: Gabriel Fernandez 
> ---
>   .../devicetree/bindings/clock/st/st,flexgen.txt|  1 +
>   drivers/clk/st/clk-flexgen.c   | 24
> ++
>   2 files changed, 25 insertions(+)
>
> diff --git a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
> b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
> index b7ee5c7..15b33c7 100644
> --- a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
> +++ b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
> @@ -60,6 +60,7 @@ This binding uses the common clock binding[2].
>   Required properties:
>   - compatible : shall be:
> "st,flexgen"
> +  "st,stih407-clkgend0", "st,flexgen" (enable clock propagation on
> parent)


 What do "d0" and "d2" refer to?

 This seems to indicate you have too much clock detail in the DT (with
 individual clocks described) or not enough with genericish compatible
 strings. What happens for the next clock you need to adjust the flags
 on? You should be able to make these adjustments without DT updates.
 Perhaps you need a wider fixing of clock compatible strings.

 Rob
>>>
>>>
>>> Sorry i sent my response in html...
>>>
>>> Hi Rob,
>>>
>>> Thanks for reviewing.
>>>
>>> Can i remove
>>> "
>>> st,stih407-clkgend0" & "
>>> st,stih407-clkgend2" compatible strings and add properties instead?
>>> I only need to activate 2 features and then we can keep generic
>>> compatible strings.
>>
>>
> Hi Rob,
>>
>> That is no different and suffers the same point I raised. It requires
>> updating the DT for any clock configuration change or enhancement.
>>
> Agree with you, DT update is needed as soon as a clock configuration should
> be changed. This is due to STiH clock driver design based on DT description
> of SoC clock tree.
>
> This clock driver was accepted 2 years ago. At the time being there was
> discussion about clock tree description location: driver or DT.
> Bad choice was done for this driver...
>
> If we decide to redesign STiH clock driver using in-driver clock tree
> description, this will modify STiH clock DT nodes description and so break
> DT backward compatibility.
>
> What's from your pov the best option?

You can break it once or every time you need a clock change. I'd go
with the former. Maybe more specific compatible strings throughout
alone would be enough rather than a flag day changing the binding.

Rob
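
For context, "clock propagation on parent" in the binding text quoted
above maps to the common clock framework's CLK_SET_RATE_PARENT flag; a
hedged sketch of how a clock opts into it (the name and ops values below
are assumptions, not the flexgen driver's literal code):

	/* Opting a clock into propagating rate requests to its parent
	 * (here, the quad FS), so audio rates can be tuned accurately. */
	struct clk_init_data init = {
		.name	= "clk_s_d0_flexgen",	/* assumed clock name */
		.ops	= &flexgen_ops,		/* assumed ops table  */
		.flags	= CLK_SET_RATE_PARENT,	/* propagate set_rate */
	};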



Re: [PATCH 2/3] of/numa: fix a memory@ dt node can only contains one memory block

2016-05-26 Thread Rob Herring
On Thu, May 26, 2016 at 10:43:58AM +0800, Zhen Lei wrote:
> For a normal memory@ devicetree node, its reg property can contain
> multiple memory blocks.
> 
> Because we don't know how many memory blocks it may contain, we try
> from index=0 and increment by 1 until an error is returned (the end).
> 
> Signed-off-by: Zhen Lei 
> ---
>  drivers/of/of_numa.c | 30 --
>  1 file changed, 20 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/of/of_numa.c b/drivers/of/of_numa.c
> index 21d831f..2c5f249 100644
> --- a/drivers/of/of_numa.c
> +++ b/drivers/of/of_numa.c
> @@ -63,7 +63,7 @@ static int __init of_numa_parse_memory_nodes(void)
>   struct device_node *np = NULL;
>   struct resource rsrc;
>   u32 nid;
> - int r = 0;
> + int i, r = 0;
> 
>   for (;;) {
>   np = of_find_node_by_type(np, "memory");
> @@ -82,17 +82,27 @@ static int __init of_numa_parse_memory_nodes(void)
>   /* some other error */
>   break;
> 
> - r = of_address_to_resource(np, 0, &rsrc);
> - if (r) {
> - pr_err("NUMA: bad reg property in memory node\n");
> - break;
> + for (i = 0; ; i++) {
> + r = of_address_to_resource(np, i, &rsrc);
> + if (r) {
> + /* reached the end of of_address */
> + if (i > 0) {
> + r = 0;
> + break;
> + }
> +
> + pr_err("NUMA: bad reg property in memory 
> node\n");
> + goto finished;
> + }
> +
> + r = numa_add_memblk(nid, rsrc.start,
> + rsrc.end - rsrc.start + 1);
> + if (r)
> + goto finished;
>   }
> -
> - r = numa_add_memblk(nid, rsrc.start,
> - rsrc.end - rsrc.start + 1);
> - if (r)
> - break;
>   }
> +
> +finished:
>   of_node_put(np);

This function can be simplified down to:

	for_each_node_by_type(np, "memory") {
		r = of_property_read_u32(np, "numa-node-id", &nid);
		if (r == -EINVAL)
			/*
			 * property doesn't exist if -EINVAL, continue
			 * looking for more memory nodes with
			 * "numa-node-id" property
			 */
			continue;
		else if (r)
			/* some other error */
			break;

		r = of_address_to_resource(np, 0, &rsrc);
		for (i = 0; !r; i++, r = of_address_to_resource(np, i, &rsrc)) {
			r = numa_add_memblk(nid, rsrc.start,
					    rsrc.end - rsrc.start + 1);
		}
	}
	of_node_put(np);

	return r;


Perhaps with a "if (!i && r) pr_err()" for an error message at the end.

Rob



Re: [PATCH v2] net: alx: use custom skb allocator

2016-05-26 Thread Eric Dumazet
On Thu, 2016-05-26 at 16:41 +0800, Feng Tang wrote:
> On Wed, May 25, 2016 at 07:53:41PM -0400, David Miller wrote:

> > 
> > But now that we have at least two instances of this code we really
> > need to put a common version somewhere. :-/
> 
> I agree, and furthermore I noticed there are some similar routines
> in the 4 individual Atheros drivers atlx/alx/atl1c/atl1e, which may
> be unified by a simple framework for them all. Maybe the driver
> maintainer from Atheros could take a look, as they can reach all
> the real HWs :)

Note that you could also use the napi_get_frags() API that other drivers
use, and you attach page frags to the skb.

(I.e., use small skb->head allocations where the stack will pull the
headers, but use your own page based allocations for the frames)

This might allow you to use the page reuse trick that some Intel drivers
use.

Look for igb_can_reuse_rx_page() for an example.

Thanks.
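
As a hedged illustration of what Eric outlines (not alx's actual code;
the function and its parameters are hypothetical), the usual
napi_get_frags() receive pattern looks roughly like this:

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	static void alx_rx_frag(struct napi_struct *napi, struct page *page,
				unsigned int offset, unsigned int len)
	{
		/* The core hands us a small skb whose head the stack will
		 * use for pulled headers; we attach our own page-based
		 * buffer as a fragment. */
		struct sk_buff *skb = napi_get_frags(napi);

		if (!skb)
			return;		/* sketch: drop on allocation failure */

		skb_fill_page_desc(skb, 0, page, offset, len);
		skb->len += len;
		skb->data_len += len;
		skb->truesize += PAGE_SIZE;	/* account the whole page */

		napi_gro_frags(napi);		/* hand the frame to GRO */
	}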



Re: [STLinux Kernel] [PATCH 02/11] drivers: clk: st: Add clock propagation for audio clocks

2016-05-26 Thread loic pallardy



On 05/26/2016 02:46 PM, Rob Herring wrote:

On Thu, May 26, 2016 at 4:49 AM, Gabriel Fernandez
 wrote:

On 25 May 2016 at 19:24, Rob Herring  wrote:


On Wed, May 18, 2016 at 10:41:23AM +0200, Gabriel Fernandez wrote:

This patch allows fine tuning of the quads FS for audio clocks
accuracy.

Signed-off-by: Olivier Bideau 
Signed-off-by: Gabriel Fernandez 
---
  .../devicetree/bindings/clock/st/st,flexgen.txt|  1 +
  drivers/clk/st/clk-flexgen.c   | 24 ++
  2 files changed, 25 insertions(+)

diff --git a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt 
b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
index b7ee5c7..15b33c7 100644
--- a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
+++ b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
@@ -60,6 +60,7 @@ This binding uses the common clock binding[2].
  Required properties:
  - compatible : shall be:
"st,flexgen"
+  "st,stih407-clkgend0", "st,flexgen" (enable clock propagation on parent)


What do "d0" and "d2" refer to?

This seems to indicate you have too much clock detail in the DT (with
individual clocks described) or not enough with genericish compatible
strings. What happens for the next clock you need to adjust the flags
on? You should be able to make these adjustments without DT updates.
Perhaps you need a wider fixing of clock compatible strings.

Rob


Sorry i sent my response in html...

Hi Rob,

Thanks for reviewing.

Can i remove
"
st,stih407-clkgend0" & "
st,stih407-clkgend2" compatible strings and add properties instead?
I only need to activate 2 features and then we can keep generic
compatible strings.



Hi Rob,

That is no different and suffers the same point I raised. It requires
updating the DT for any clock configuration change or enhancement.

I agree with you: a DT update is needed as soon as a clock configuration
has to change. This is due to the STiH clock driver design being based
on a DT description of the SoC clock tree.


This clock driver was accepted 2 years ago. At the time, there was
discussion about where the clock tree description should live: in the
driver or in DT.

The wrong choice was made for this driver...

If we decide to redesign STiH clock driver using in-driver clock tree 
description, this will modify STiH clock DT nodes description and so 
break DT backward compatibility.


What's the best option from your point of view?

Regards,
Loic


Rob

___
Kernel mailing list
ker...@stlinux.com
http://www.stlinux.com/mailman/listinfo/kernel



[PATCH v3] ASoC: rockchip: Add machine driver for MAX98357A/RT5514/DA7219

2016-05-26 Thread Xing Zheng
There are multiple codec devices on the RK3399 platform; we can use
this patch to support and control these codecs.

Signed-off-by: Xing Zheng 
---

Changes in v3:
- rename DOC to rockchip,rk3399-max98357a-rt5514-da7219.txt
- rename compatible to rockchip,rk3399-max98357a-rt5514-da7219
- rename source code to rk3399_max98357a_rt5514_da7219.c

Changes in v2:
- use the FS 256 to set mclks of the max98357a and rt5514 dynamically
- add more sample rate for da7219

 .../rockchip,rk3399-max98357a-rt5514-da7219.txt|   15 +
 sound/soc/rockchip/Kconfig |   11 +
 sound/soc/rockchip/Makefile|2 +
 .../soc/rockchip/rk3399_max98357a_rt5514_da7219.c  |  333 
 4 files changed, 361 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/sound/rockchip,rk3399-max98357a-rt5514-da7219.txt
 create mode 100644 sound/soc/rockchip/rk3399_max98357a_rt5514_da7219.c

diff --git 
a/Documentation/devicetree/bindings/sound/rockchip,rk3399-max98357a-rt5514-da7219.txt
 
b/Documentation/devicetree/bindings/sound/rockchip,rk3399-max98357a-rt5514-da7219.txt
new file mode 100644
index 000..3ae603e
--- /dev/null
+++ 
b/Documentation/devicetree/bindings/sound/rockchip,rk3399-max98357a-rt5514-da7219.txt
@@ -0,0 +1,15 @@
+ROCKCHIP with MAX98357A/RT5514/DA7219 codecs
+
+Required properties:
+- compatible: "rockchip,rk3399-max98357a-rt5514-da7219"
+- rockchip,cpu: The phandle of the Rockchip I2S controller that's
+  connected to the codecs
+- rockchip,codec: The phandle of the MAX98357A/RT5514/DA7219 codecs
+
+Example:
+
+sound {
+   compatible = "rockchip,rk3399-max98357a-rt5514-da7219";
+   rockchip,cpu = <  >;
+   rockchip,codec = <  >;
+};
diff --git a/sound/soc/rockchip/Kconfig b/sound/soc/rockchip/Kconfig
index f1e0c70..e5bd1f9 100644
--- a/sound/soc/rockchip/Kconfig
+++ b/sound/soc/rockchip/Kconfig
@@ -41,3 +41,14 @@ config SND_SOC_ROCKCHIP_RT5645
help
  Say Y or M here if you want to add support for SoC audio on Rockchip
  boards using the RT5645/RT5650 codec, such as Veyron.
+
+config SND_SOC_RK3399_MAX98357A_RT5514_DA7219
+   tristate "ASoC support for Rockchip RK3399 boards using the 
MAX98357A/RT5514/DA7219"
+   depends on SND_SOC_ROCKCHIP && GPIOLIB
+   select SND_SOC_ROCKCHIP_I2S
+   select SND_SOC_MAX98357A
+   select SND_SOC_RT5514
+   select SND_SOC_DA7219
+   help
+ Say Y or M here if you want to add support for SoC audio on Rockchip 
RK3399
+ boards using the MAX98357A/RT5514/DA7219.
diff --git a/sound/soc/rockchip/Makefile b/sound/soc/rockchip/Makefile
index c0bf560..0d9ca0a 100644
--- a/sound/soc/rockchip/Makefile
+++ b/sound/soc/rockchip/Makefile
@@ -7,6 +7,8 @@ obj-$(CONFIG_SND_SOC_ROCKCHIP_SPDIF) += snd-soc-rockchip-spdif.o
 
 snd-soc-rockchip-max98090-objs := rockchip_max98090.o
 snd-soc-rockchip-rt5645-objs := rockchip_rt5645.o
+snd-soc-rk3399-max98357a-rt5514-da7219-objs := rk3399_max98357a_rt5514_da7219.o
 
 obj-$(CONFIG_SND_SOC_ROCKCHIP_MAX98090) += snd-soc-rockchip-max98090.o
 obj-$(CONFIG_SND_SOC_ROCKCHIP_RT5645) += snd-soc-rockchip-rt5645.o
+obj-$(CONFIG_SND_SOC_RK3399_MAX98357A_RT5514_DA7219) += 
snd-soc-rk3399-max98357a-rt5514-da7219.o
diff --git a/sound/soc/rockchip/rk3399_max98357a_rt5514_da7219.c 
b/sound/soc/rockchip/rk3399_max98357a_rt5514_da7219.c
new file mode 100644
index 000..ec88600
--- /dev/null
+++ b/sound/soc/rockchip/rk3399_max98357a_rt5514_da7219.c
@@ -0,0 +1,333 @@
+/*
+ * Rockchip machine ASoC driver for boards using MAX98357A/RT5514/DA7219
+ *
+ * Copyright (c) 2016, ROCKCHIP CORPORATION.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see .
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "rockchip_i2s.h"
+#include "../codecs/da7219.h"
+#include "../codecs/da7219-aad.h"
+#include "../codecs/rt5514.h"
+
+#define DRV_NAME "rk3399-max98357a-rt5514-da7219"
+
+#define SOUND_FS   256
+
+static struct snd_soc_jack rockchip_sound_jack;
+
+static const struct snd_soc_dapm_widget rockchip_dapm_widgets[] = {
+   SND_SOC_DAPM_HP("Headphones", NULL),
+   SND_SOC_DAPM_SPK("Speakers", NULL),
+   SND_SOC_DAPM_MIC("Headset Mic", NULL),
+   SND_SOC_DAPM_MIC("Int Mic", NULL),
+};
+
+static const struct snd_soc_dapm_route rockchip_dapm_routes[] = {
+   /* 
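
(The driver source is truncated here in the archive.) A hedged sketch of
the 256*Fs MCLK derivation the changelog describes, using the SOUND_FS
define shown above; the callback name and the exact clock plumbing are
assumptions, not the truncated driver's literal code:

	/* Derive MCLK as 256 * sample rate in hw_params and push it to
	 * the CPU DAI; error handling trimmed for brevity. */
	static int rockchip_sound_hw_params(struct snd_pcm_substream *substream,
					    struct snd_pcm_hw_params *params)
	{
		struct snd_soc_pcm_runtime *rtd = substream->private_data;
		unsigned int mclk = params_rate(params) * SOUND_FS; /* 256*Fs */

		return snd_soc_dai_set_sysclk(rtd->cpu_dai, 0, mclk,
					      SND_SOC_CLOCK_OUT);
	}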

[PATCH] ASoC: mediatek: Change the order of MCLK clock configuration

2016-05-26 Thread PC Liao
Because the MCLK is currently enabled too late and disabled too early
relative to the codec, this patch changes where the MCLK clocks are
configured, moving them from DAI startup/shutdown to runtime PM.

Signed-off-by: PC Liao 
---
 sound/soc/mediatek/mtk-afe-pcm.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/sound/soc/mediatek/mtk-afe-pcm.c b/sound/soc/mediatek/mtk-afe-pcm.c
index f1c58a2..440ae06 100644
--- a/sound/soc/mediatek/mtk-afe-pcm.c
+++ b/sound/soc/mediatek/mtk-afe-pcm.c
@@ -360,8 +360,6 @@ static int mtk_afe_i2s_startup(struct snd_pcm_substream 
*substream,
if (dai->active)
return 0;
 
-   mtk_afe_dais_enable_clks(afe, afe->clocks[MTK_CLK_I2S1_M], NULL);
-   mtk_afe_dais_enable_clks(afe, afe->clocks[MTK_CLK_I2S2_M], NULL);
regmap_update_bits(afe->regmap, AUDIO_TOP_CON0,
   AUD_TCON0_PDN_22M | AUD_TCON0_PDN_24M, 0);
return 0;
@@ -380,8 +378,6 @@ static void mtk_afe_i2s_shutdown(struct snd_pcm_substream 
*substream,
regmap_update_bits(afe->regmap, AUDIO_TOP_CON0,
   AUD_TCON0_PDN_22M | AUD_TCON0_PDN_24M,
   AUD_TCON0_PDN_22M | AUD_TCON0_PDN_24M);
-   mtk_afe_dais_disable_clks(afe, afe->clocks[MTK_CLK_I2S1_M], NULL);
-   mtk_afe_dais_disable_clks(afe, afe->clocks[MTK_CLK_I2S2_M], NULL);
 }
 
 static int mtk_afe_i2s_prepare(struct snd_pcm_substream *substream,
@@ -1132,6 +1128,8 @@ static int mtk_afe_runtime_suspend(struct device *dev)
regmap_update_bits(afe->regmap, AUDIO_TOP_CON0,
   AUD_TCON0_PDN_AFE, AUD_TCON0_PDN_AFE);
 
+   clk_disable_unprepare(afe->clocks[MTK_CLK_I2S1_M]);
+   clk_disable_unprepare(afe->clocks[MTK_CLK_I2S2_M]);
clk_disable_unprepare(afe->clocks[MTK_CLK_BCK0]);
clk_disable_unprepare(afe->clocks[MTK_CLK_BCK1]);
clk_disable_unprepare(afe->clocks[MTK_CLK_TOP_PDN_AUD]);
@@ -1164,6 +1162,12 @@ static int mtk_afe_runtime_resume(struct device *dev)
ret = clk_prepare_enable(afe->clocks[MTK_CLK_BCK1]);
if (ret)
goto err_bck0;
+   ret = clk_prepare_enable(afe->clocks[MTK_CLK_I2S1_M]);
+   if (ret)
+   goto err_i2s1_m;
+   ret = clk_prepare_enable(afe->clocks[MTK_CLK_I2S2_M]);
+   if (ret)
+   goto err_i2s2_m;
 
/* enable AFE clk */
regmap_update_bits(afe->regmap, AUDIO_TOP_CON0, AUD_TCON0_PDN_AFE, 0);
@@ -1179,6 +1183,10 @@ static int mtk_afe_runtime_resume(struct device *dev)
regmap_update_bits(afe->regmap, AFE_DAC_CON0, 0x1, 0x1);
return 0;
 
+err_i2s2_m:
+   clk_disable_unprepare(afe->clocks[MTK_CLK_I2S1_M]);
+err_i2s1_m:
+   clk_disable_unprepare(afe->clocks[MTK_CLK_BCK1]);
 err_bck0:
clk_disable_unprepare(afe->clocks[MTK_CLK_BCK0]);
 err_top_aud:
-- 
1.7.9.5



Re: [PATCH v2 0/10] Add RK3399 eDP support and fix some bugs to analogix_dp driver.

2016-05-26 Thread Javier Martinez Canillas
Hello Yakir,

On 05/26/2016 05:34 AM, Yakir Yang wrote:
> Hi Javier,
> 
> On 05/24/2016 01:01 PM, Yakir Yang wrote:
>> Hi all,
>>
>> This series has been posted for about a month, still no comments, help here :(
> This series works correctly on the Rockchip platform, and most of the
> patches don't touch the common analogix_dp driver (except for the
> hotplug fix). So I guess the Exynos platform should also be happy
> with these changes.
> 
> But I am not sure about that. So, could you help check this on an
> Exynos Chromebook? I would be very grateful.
>

Of course, I'll test. Could you please provide me a branch that I can
pull directly to avoid cherry-picking all the patches from the list? 
 
> Thanks,
> - Yakir
>>

Best regards,
-- 
Javier Martinez Canillas
Open Source Group
Samsung Research America


Re: [PATCH 02/11] drivers: clk: st: Add clock propagation for audio clocks

2016-05-26 Thread Rob Herring
On Thu, May 26, 2016 at 4:49 AM, Gabriel Fernandez
 wrote:
> On 25 May 2016 at 19:24, Rob Herring  wrote:
>>
>> On Wed, May 18, 2016 at 10:41:23AM +0200, Gabriel Fernandez wrote:
>> > This patch allows fine tuning of the quads FS for audio clocks
>> > accuracy.
>> >
>> > Signed-off-by: Olivier Bideau 
>> > Signed-off-by: Gabriel Fernandez 
>> > ---
>> >  .../devicetree/bindings/clock/st/st,flexgen.txt|  1 +
>> >  drivers/clk/st/clk-flexgen.c   | 24 
>> > ++
>> >  2 files changed, 25 insertions(+)
>> >
>> > diff --git a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt 
>> > b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
>> > index b7ee5c7..15b33c7 100644
>> > --- a/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
>> > +++ b/Documentation/devicetree/bindings/clock/st/st,flexgen.txt
>> > @@ -60,6 +60,7 @@ This binding uses the common clock binding[2].
>> >  Required properties:
>> >  - compatible : shall be:
>> >"st,flexgen"
>> > +  "st,stih407-clkgend0", "st,flexgen" (enable clock propagation on parent)
>>
>> What do "d0" and "d2" refer to?
>>
>> This seems to indicate you have too much clock detail in the DT (with
>> individual clocks described) or not enough with genericish compatible
>> strings. What happens for the mext clock you need to adjust the flags
>> on? You should be able to make these adjustments without DT updates.
>> Perhaps you need a wider fixing of clock compatible strings.
>>
>> Rob
>
> Sorry i sent my response in html...
>
> Hi Rob,
>
> Thanks for reviewing.
>
> Can I remove the "st,stih407-clkgend0" & "st,stih407-clkgend2"
> compatible strings and add properties instead?
> I only need to activate 2 features and then we can keep generic
> compatible strings.

That is no different and suffers from the same problem I raised. It requires
updating the DT for any clock configuration change or enhancement.

Rob


Re: Builtin microcode does nothing..

2016-05-26 Thread Borislav Petkov
On Thu, May 26, 2016 at 01:52:22PM +0200, Gabriel C wrote:
>  With this patch on top of your tip branch all is fine.
>  Tested on both servers and both are up and running.

Thanks.

In the meantime, I've simplified the code even more and testing here
with your .config looks good. I'll prepare another patchset next week
and post it here in case you guys feel bored and want to test it too.

:-)

Thanks again!

-- 
Regards/Gruss,
Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 
(AG Nürnberg)
-- 


[PATCH 4/6] mm, oom: skip over vforked tasks

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

vforked tasks are not really sitting on any memory of their own, so
killing them doesn't help much. Parents wait killable for the vforked
task, so it is better to choose the parent, which is the real mm owner.
Teach oom_badness to ignore all tasks which haven't passed mm_release.
oom_kill_process should ignore them as well because they will drop the
mm soon and they will not block the oom_reaper because they cannot
touch any memory.

Signed-off-by: Michal Hocko 
---
 mm/oom_kill.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eeccb4d7e7f5..d1cbaaa1a666 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -176,11 +176,13 @@ unsigned long oom_badness(struct task_struct *p, struct 
mem_cgroup *memcg,
 
/*
 * Do not even consider tasks which are explicitly marked oom
-* unkillable or have been already oom reaped.
+* unkillable or have been already oom reaped or they are in
+* the middle of vfork
 */
adj = (long)p->signal->oom_score_adj;
if (adj == OOM_SCORE_ADJ_MIN ||
-   test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
+   test_bit(MMF_OOM_REAPED, &p->mm->flags) ||
+   p->vfork_done) {
task_unlock(p);
return 0;
}
@@ -839,6 +841,13 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
for_each_process(p) {
if (!process_shares_mm(p, mm))
continue;
+   /*
+* vforked tasks are ignored because they will drop the mm soon
+* hopefully and even if not they will not mind being oom
+* reaped because they cannot touch any memory.
+*/
+   if (p->vfork_done)
+   continue;
if (same_thread_group(p, victim))
continue;
if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-- 
2.8.1


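A minimal userspace sketch of the vfork() window this patch keys off
(hypothetical demo, not part of the patch): the parent blocks inside
vfork() until the child execs or exits, i.e. until the child has passed
mm_release() and completed vfork_done:

#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
        pid_t pid = vfork();

        if (pid < 0) {
                perror("vfork");
                return 1;
        }
        if (pid == 0) {
                /* child: borrows the parent's mm until exec or exit */
                execlp("true", "true", (char *)NULL);
                _exit(127);     /* exec failed */
        }
        /* parent: vfork() only returned because the child dropped the mm */
        waitpid(pid, NULL, 0);
        printf("child %d released the shared mm\n", (int)pid);
        return 0;
}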

[PATCH 1/6] mm, oom: do not loop over all tasks if there are no external tasks sharing mm

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

oom_kill_process makes sure to kill all processes outside of the thread
group which are sharing the mm. This requires iterating over all tasks.
This is however not a common case, so we can optimize it a bit and take
that path only if we know that there are external users of the mm
struct outside of the thread group.

Signed-off-by: Michal Hocko 
---
 mm/oom_kill.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5bb2f7698ad7..0e33e912f7e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -820,6 +820,13 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
task_unlock(victim);
 
/*
+* skip expensive iterations over all tasks if we know that there
+* are no users outside of threads in the same thread group
+*/
+   if (atomic_read(&mm->mm_users) <= get_nr_threads(victim))
+   goto oom_reap;
+
+   /*
 * Kill all user processes sharing victim->mm in other thread groups, if
 * any.  They don't get access to memory reserves, though, to avoid
 * depletion of all memory.  This prevents mm->mmap_sem livelock when an
@@ -848,6 +855,7 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
}
rcu_read_unlock();
 
+oom_reap:
if (can_oom_reap)
wake_oom_reaper(victim);
 
-- 
2.8.1


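A minimal sketch of the situation the check above targets (hypothetical
userspace demo): clone(CLONE_VM) without CLONE_THREAD creates a separate
thread group that shares the caller's mm, so mm_users exceeds the
victim's thread count:

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs as a separate process but on the caller's mm. */
static int shared_mm_child(void *arg)
{
        (void)arg;
        sleep(1);       /* keep the external mm reference alive briefly */
        return 0;
}

int main(void)
{
        const size_t stack_sz = 64 * 1024;
        char *stack = malloc(stack_sz);
        pid_t pid;

        if (!stack)
                return 1;
        /* CLONE_VM without CLONE_THREAD: mm_users goes up, the thread
         * count of the original group does not */
        pid = clone(shared_mm_child, stack + stack_sz, CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) {
                perror("clone");
                free(stack);
                return 1;
        }
        waitpid(pid, NULL, 0);
        free(stack);
        return 0;
}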

[PATCH 2/6] proc, oom_adj: extract oom_score_adj setting into a helper

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

Currently we have two proc interfaces to set oom_score_adj. The legacy
/proc/<pid>/oom_adj and /proc/<pid>/oom_score_adj, which both have their
specific handlers. A big part of the logic is duplicated, so extract the
common code into a __set_oom_adj helper. The legacy knob still handles
some details slightly differently, so make sure those are preserved -
e.g. the legacy mode ignores oom_score_adj_min and warns about the usage.

This patch shouldn't introduce any functional changes.

Signed-off-by: Michal Hocko 
---
 fs/proc/base.c | 133 +
 1 file changed, 59 insertions(+), 74 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index be73f4d0cb01..23679673bf5a 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1041,6 +1041,63 @@ static ssize_t oom_adj_read(struct file *file, char 
__user *buf, size_t count,
return simple_read_from_buffer(buf, count, ppos, buffer, len);
 }
 
+static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
+{
+   struct task_struct *task;
+   unsigned long flags;
+   int err = 0;
+
+   task = get_proc_task(file_inode(file));
+   if (!task) {
+   err = -ESRCH;
+   goto out;
+   }
+
+   task_lock(task);
+   if (!task->mm) {
+   err = -EINVAL;
+   goto err_task_lock;
+   }
+
+   if (!lock_task_sighand(task, &flags)) {
+   err = -ESRCH;
+   goto err_task_lock;
+   }
+
+   if (legacy) {
+   if (oom_adj < task->signal->oom_score_adj &&
+   !capable(CAP_SYS_RESOURCE)) {
+   err = -EACCES;
+   goto err_sighand;
+   }
+   /*
+* /proc/pid/oom_adj is provided for legacy purposes, ask users 
to use
+* /proc/pid/oom_score_adj instead.
+*/
+   pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please 
use /proc/%d/oom_score_adj instead.\n",
+ current->comm, task_pid_nr(current), 
task_pid_nr(task),
+ task_pid_nr(task));
+   } else {
+   if ((short)oom_adj < task->signal->oom_score_adj_min &&
+   !capable(CAP_SYS_RESOURCE)) {
+   err = -EACCES;
+   goto err_sighand;
+   }
+   }
+
+   task->signal->oom_score_adj = oom_adj;
+   if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
+   task->signal->oom_score_adj_min = (short)oom_adj;
+   trace_oom_score_adj_update(task);
+err_sighand:
+   unlock_task_sighand(task, &flags);
+err_task_lock:
+   task_unlock(task);
+   put_task_struct(task);
+out:
+   return err;
+}
+
 /*
  * /proc/pid/oom_adj exists solely for backwards compatibility with previous
  * kernels.  The effective policy is defined by oom_score_adj, which has a
@@ -1054,10 +,8 @@ static ssize_t oom_adj_read(struct file *file, char 
__user *buf, size_t count,
 static ssize_t oom_adj_write(struct file *file, const char __user *buf,
 size_t count, loff_t *ppos)
 {
-   struct task_struct *task;
char buffer[PROC_NUMBUF];
int oom_adj;
-   unsigned long flags;
int err;
 
memset(buffer, 0, sizeof(buffer));
@@ -1077,23 +1132,6 @@ static ssize_t oom_adj_write(struct file *file, const 
char __user *buf,
goto out;
}
 
-   task = get_proc_task(file_inode(file));
-   if (!task) {
-   err = -ESRCH;
-   goto out;
-   }
-
-   task_lock(task);
-   if (!task->mm) {
-   err = -EINVAL;
-   goto err_task_lock;
-   }
-
-   if (!lock_task_sighand(task, &flags)) {
-   err = -ESRCH;
-   goto err_task_lock;
-   }
-
/*
 * Scale /proc/pid/oom_score_adj appropriately ensuring that a maximum
 * value is always attainable.
@@ -1103,27 +1141,8 @@ static ssize_t oom_adj_write(struct file *file, const 
char __user *buf,
else
oom_adj = (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
 
-   if (oom_adj < task->signal->oom_score_adj &&
-   !capable(CAP_SYS_RESOURCE)) {
-   err = -EACCES;
-   goto err_sighand;
-   }
-
-   /*
-* /proc/pid/oom_adj is provided for legacy purposes, ask users to use
-* /proc/pid/oom_score_adj instead.
-*/
-   pr_warn_once("%s (%d): /proc/%d/oom_adj is deprecated, please use 
/proc/%d/oom_score_adj instead.\n",
- current->comm, task_pid_nr(current), task_pid_nr(task),
- task_pid_nr(task));
+   err = __set_oom_adj(file, oom_adj, true);
 
-   task->signal->oom_score_adj = oom_adj;
-   trace_oom_score_adj_update(task);
-err_sighand:
-   unlock_task_sighand(task, &flags);
-err_task_lock:
-   

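For reference, a small userspace sketch of the legacy oom_adj ->
oom_score_adj scaling that the helper keeps (the constants mirror
include/uapi/linux/oom.h; scale_oom_adj is a hypothetical name):

#include <stdio.h>

#define OOM_DISABLE             (-17)
#define OOM_ADJUST_MAX          15
#define OOM_SCORE_ADJ_MAX       1000

/* Mirrors the scaling in oom_adj_write(): special-case the maximum so
 * that it is exactly attainable, otherwise scale linearly. */
static int scale_oom_adj(int oom_adj)
{
        if (oom_adj == OOM_ADJUST_MAX)
                return OOM_SCORE_ADJ_MAX;
        return (oom_adj * OOM_SCORE_ADJ_MAX) / -OOM_DISABLE;
}

int main(void)
{
        int adj;

        /* e.g. -17 -> -1000, 0 -> 0, 15 -> 1000 */
        for (adj = OOM_DISABLE; adj <= OOM_ADJUST_MAX; adj++)
                printf("oom_adj %3d -> oom_score_adj %5d\n",
                       adj, scale_oom_adj(adj));
        return 0;
}

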
[PATCH 6/6] mm, oom: fortify task_will_free_mem

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

task_will_free_mem is rather weak. It doesn't really tell whether
the task has a chance to drop its mm. 98748bd72200 ("oom: consider
multi-threaded tasks in task_will_free_mem") made a first step
towards making it more robust for multi-threaded applications, so now
we know that the whole process is going down. This builds on top of it
for more complex scenarios where the mm is shared between different
processes (CLONE_VM without CLONE_THREAD resp. CLONE_SIGHAND).

Make sure that all processes sharing the mm are killed or exiting. This
will allow us to replace try_oom_reaper by wake_oom_reaper. Therefore
all paths which bypass the oom killer are now reapable and so they
shouldn't lock up the oom killer.

Drop the mm checks for the bypass because those are not really
guaranteeing anything as the condition might change at any time
after task_lock is dropped.

Signed-off-by: Michal Hocko 
---
 include/linux/oom.h | 72 +
 mm/memcontrol.c |  4 +--
 mm/oom_kill.c   | 65 ---
 3 files changed, 74 insertions(+), 67 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 83469522690a..412c4ecb42b1 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -70,9 +70,9 @@ static inline bool oom_task_origin(const struct task_struct 
*p)
 extern void mark_oom_victim(struct task_struct *tsk);
 
 #ifdef CONFIG_MMU
-extern void try_oom_reaper(struct task_struct *tsk);
+extern void wake_oom_reaper(struct task_struct *tsk);
 #else
-static inline void try_oom_reaper(struct task_struct *tsk)
+static inline void wake_oom_reaper(struct task_struct *tsk)
 {
 }
 #endif
@@ -105,7 +105,7 @@ extern void oom_killer_enable(void);
 
 extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
-static inline bool task_will_free_mem(struct task_struct *task)
+static inline bool __task_will_free_mem(struct task_struct *task)
 {
struct signal_struct *sig = task->signal;
 
@@ -117,13 +117,75 @@ static inline bool task_will_free_mem(struct task_struct 
*task)
if (sig->flags & SIGNAL_GROUP_COREDUMP)
return false;
 
-   if (!(task->flags & PF_EXITING))
+   if (!(task->flags & PF_EXITING || fatal_signal_pending(task)))
return false;
 
/* Make sure that the whole thread group is going down */
-   if (!thread_group_empty(task) && !(sig->flags & SIGNAL_GROUP_EXIT))
+   if (!thread_group_empty(task) &&
+   !(sig->flags & SIGNAL_GROUP_EXIT || fatal_signal_pending(task)))
+   return false;
+
+   return true;
+}
+
+/*
+ * Checks whether the given task is dying or exiting and likely to
+ * release its address space. This means that all threads and processes
+ * sharing the same mm have to be killed or exiting.
+ */
+static inline bool task_will_free_mem(struct task_struct *task)
+{
+   struct mm_struct *mm = NULL;
+   struct task_struct *p;
+
+   /*
+* If the process has passed exit_mm we have to skip it because
+* we have lost a link to other tasks sharing this mm, we do not
+* have anything to reap and the task might then get stuck waiting
+* for parent as zombie and we do not want it to hold TIF_MEMDIE
+*/
+   p = find_lock_task_mm(task);
+   if (!p)
return false;
 
+   if (!__task_will_free_mem(p)) {
+   task_unlock(p);
+   return false;
+   }
+
+   /*
+* Check whether there are other processes sharing the mm - they all 
have
+* to be killed or exiting.
+*/
+   if (atomic_read(&p->mm->mm_users) > get_nr_threads(p)) {
+   mm = p->mm;
+   /* pin the mm to not get freed and reused */
+   atomic_inc(&mm->mm_count);
+   }
+   task_unlock(p);
+
+   if (mm) {
+   rcu_read_lock();
+   for_each_process(p) {
+   bool vfork;
+
+   /*
+* skip over vforked tasks because they are mostly
+* independent and will drop the mm soon
+*/
+   task_lock(p);
+   vfork = p->vfork_done;
+   task_unlock(p);
+   if (vfork)
+   continue;
+
+   if (!__task_will_free_mem(p))
+   break;
+   }
+   rcu_read_unlock();
+   mmdrop(mm);
+   }
+
return true;
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f6477a9dbe7a..878a4308164c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,9 +1273,9 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup 
*memcg, gfp_t gfp_mask,
 * select it.  The goal is to allow it to allocate so that it may
 * quickly exit and free its memory.
 */
-   if 

[PATCH 5/6] mm, oom: kill all tasks sharing the mm

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
process sharing the same mm is unkillable via OOM_SCORE_ADJ_MIN. After
"mm, oom_adj: make sure processes sharing mm have same view of
oom_score_adj" all such processes share the same value, so we shouldn't
see such a task at all (oom_badness would rule them out).
Moreover, after "mm, oom: skip over vforked tasks" we cannot even
encounter a vfork task, so we can allow both SIG_KILL and the oom
reaper. A potential race is highly unlikely but possible. It would
happen if __set_oom_adj raced with select_bad_process, and then it is
OK to consider the old value, or with fork, when it should be
acceptable as well.
Let's add a little note to the log so that people can tell us that
this really happens in real life and that it matters.

Signed-off-by: Michal Hocko 
---
 mm/oom_kill.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d1cbaaa1a666..008c5b4732de 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -850,8 +850,7 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
continue;
if (same_thread_group(p, victim))
continue;
-   if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
-   p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+   if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p)) {
/*
 * We cannot use oom_reaper for the mm shared by this
 * process because it wouldn't get killed and so the
@@ -860,6 +859,11 @@ void oom_kill_process(struct oom_control *oc, struct 
task_struct *p,
can_oom_reap = false;
continue;
}
+   if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+   pr_warn("%s pid=%d shares mm with oom disabled %s 
pid=%d. Seems like misconfiguration, killing anyway!"
+   " Report at linux...@kvack.org\n",
+   victim->comm, task_pid_nr(victim),
+   p->comm, task_pid_nr(p));
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
-- 
2.8.1



[PATCH 0/6] Handle oom bypass more gracefully

2016-05-26 Thread Michal Hocko
Hi,
the following 6 patches should put some order into the very rare cases
of an mm shared between processes and finally make the paths which
bypass the oom killer oom-reapable and thus much more reliable.  Even
though an mm shared outside of a thread group is rare (either use_mm by
kernel threads or exotic clone(CLONE_VM) without CLONE_THREAD resp.
CLONE_SIGHAND), it makes the current oom killer logic quite hard to
follow and evaluate. It is possible to select an oom victim which
shares the mm with an unkillable process, to bypass the oom killer even
when other processes sharing the mm are still alive, and other weird
cases.

Patch 1 optimizes oom_kill_process to skip the costly process
iteration when the current oom victim is not sharing its mm with other
processes. Patch 2 is a cleanup of the oom_score_adj handling and
preparatory work. Patch 3 enforces oom_score_adj to be consistent
between processes sharing the mm, to behave consistently with regular
thread groups. Patch 4 tries to handle vforked tasks better in the oom
path, patch 5 ensures that all tasks sharing the mm are killed and
finally patch 6 should guarantee that task_will_free_mem will always
imply a reapable bypass of the oom killer.

The patchset is based on the current mmotm tree (mmotm-2016-05-23-16-51).
I would really appreciate a deep review as this area is full of land
mines but I hope I've made the code much cleaner with less kludges.

I am CCing Oleg (sorry I know you hate this code) but I would feel much
better if you double checked my assumptions about locking and vfork
behavior.

Michal Hocko (6):
  mm, oom: do not loop over all tasks if there are no external tasks 
sharing mm
  proc, oom_adj: extract oom_score_adj setting into a helper
  mm, oom_adj: make sure processes sharing mm have same view of 
oom_score_adj
  mm, oom: skip over vforked tasks
  mm, oom: kill all tasks sharing the mm
  mm, oom: fortify task_will_free_mem

 fs/proc/base.c  | 168 +---
 include/linux/mm.h  |   2 +
 include/linux/oom.h |  72 --
 mm/memcontrol.c |   4 +-
 mm/oom_kill.c   |  96 ++
 5 files changed, 196 insertions(+), 146 deletions(-)


[PATCH 3/6] mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj

2016-05-26 Thread Michal Hocko
From: Michal Hocko 

oom_score_adj is shared within a thread group (via struct signal) but
this is not sufficient to cover processes sharing an mm (CLONE_VM
without CLONE_THREAD resp. CLONE_SIGHAND), so we can easily end up in a
situation where some processes update their oom_score_adj and confuse
the oom killer. In the worst case some of those processes might hide
from the oom killer altogether via OOM_SCORE_ADJ_MIN while others are
eligible. The OOM killer would then pick the eligible ones but wouldn't
be allowed to kill the others sharing the same mm, so the mm - and thus
the memory - wouldn't be released.

It would be ideal to have the oom_score_adj per mm_struct because that
is the natural entity the OOM killer considers. But this will not work
because some programs are doing
vfork()
set_oom_adj()
exec()

We can achieve the same effect, though: the oom_score_adj write handler can set
oom_score_adj for all processes sharing the same mm if the task is not
in the middle of vfork. As a result all the processes will share the
same oom_score_adj.

Note that we have to serialize all the oom_score_adj writers now to
guarantee they do not interleave and generate inconsistent results.

Signed-off-by: Michal Hocko 
---
 fs/proc/base.c | 35 +++
 include/linux/mm.h |  2 ++
 mm/oom_kill.c  |  2 +-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 23679673bf5a..e3ee4fb1930c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1043,10 +1043,13 @@ static ssize_t oom_adj_read(struct file *file, char 
__user *buf, size_t count,
 
 static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 {
+   static DEFINE_MUTEX(oom_adj_mutex);
+   struct mm_struct *mm = NULL;
struct task_struct *task;
unsigned long flags;
int err = 0;
 
+   mutex_lock(&oom_adj_mutex);
task = get_proc_task(file_inode(file));
if (!task) {
err = -ESRCH;
@@ -1085,6 +1088,20 @@ static int __set_oom_adj(struct file *file, int oom_adj, 
bool legacy)
}
}
 
+   /*
+* If we are not in the vfork window and share the mm with other
+* processes, we have to propagate the score; otherwise we would
+* have schizophrenic requirements for the same mm. We can use a
+* racy check because we only risk taking the slow path.
+*/
+   if (!task->vfork_done &&
+   atomic_read(&task->mm->mm_users) > get_nr_threads(task)) {
+   mm = task->mm;
+
+   /* pin the mm so it doesn't go away and get reused */
+   atomic_inc(&mm->mm_count);
+   }
+
task->signal->oom_score_adj = oom_adj;
if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
task->signal->oom_score_adj_min = (short)oom_adj;
@@ -1094,7 +,25 @@ static int __set_oom_adj(struct file *file, int oom_adj, 
bool legacy)
 err_task_lock:
task_unlock(task);
put_task_struct(task);
+
+   if (mm) {
+   struct task_struct *p;
+
+   rcu_read_lock();
+   for_each_process(p) {
+   task_lock(p);
+   if (!p->vfork_done && process_shares_mm(p, mm)) {
+   p->signal->oom_score_adj = oom_adj;
+   if (!legacy && has_capability_noaudit(current, 
CAP_SYS_RESOURCE))
+   p->signal->oom_score_adj_min = 
(short)oom_adj;
+   }
+   task_unlock(p);
+   }
+   rcu_read_unlock();
+   mmdrop(mm);
+   }
 out:
+   mutex_unlock(&oom_adj_mutex);
return err;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 05102822912c..b44d3d792a00 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2248,6 +2248,8 @@ static inline int in_gate_area(struct mm_struct *mm, 
unsigned long addr)
 }
 #endif /* __HAVE_ARCH_GATE_AREA */
 
+extern bool process_shares_mm(struct task_struct *p, struct mm_struct *mm);
+
 #ifdef CONFIG_SYSCTL
 extern int sysctl_drop_caches;
 int drop_caches_sysctl_handler(struct ctl_table *, int,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 0e33e912f7e4..eeccb4d7e7f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,7 +416,7 @@ bool oom_killer_disabled __read_mostly;
  * task's threads: if one of those is using this mm then this task was also
  * using it.
  */
-static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
+bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 {
struct task_struct *t;
 
-- 
2.8.1

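A minimal userspace sketch of the vfork()/set_oom_adj()/exec() pattern
the commit message refers to (hypothetical demo; note that POSIX only
guarantees exec and _exit after vfork(), yet real programs do exactly
this, which is what the vfork_done check has to cope with):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: adjust the caller's own oom_score_adj. */
static void set_oom_score_adj(const char *val)
{
        int fd = open("/proc/self/oom_score_adj", O_WRONLY);

        if (fd < 0)
                return;
        if (write(fd, val, strlen(val)) < 0)
                perror("write");
        close(fd);
}

int main(void)
{
        pid_t pid = vfork();

        if (pid == 0) {
                /* still sharing the parent's mm at this point */
                set_oom_score_adj("-1000");     /* meant for the exec'd image */
                execlp("sleep", "sleep", "60", (char *)NULL);
                _exit(127);
        }
        return pid < 0;
}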


Re: [PATCH] Staging: android: ion: fixed a kzalloc coding style issue.

2016-05-26 Thread Luis de Bethencourt
On 26/05/16 09:01, Shubham Bansal wrote:
> Fixed a coding style issue. Issue reported by checkpatch.pl.
> 
> Signed-off-by: Shubham Bansal 
> ---
>  drivers/staging/android/ion/ion.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/staging/android/ion/ion.c 
> b/drivers/staging/android/ion/ion.c
> index 8536567..2217ccb 100644
> --- a/drivers/staging/android/ion/ion.c
> +++ b/drivers/staging/android/ion/ion.c
> @@ -184,7 +184,7 @@ static struct ion_buffer *ion_buffer_create(struct 
> ion_heap *heap,
>   struct scatterlist *sg;
>   int i, ret;
>  
> - buffer = kzalloc(sizeof(struct ion_buffer), GFP_KERNEL);
> + buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
>   if (!buffer)
>   return ERR_PTR(-ENOMEM);
>  
> 


This patch looks good.

It is not a big issue, but keep it in mind for the future. When fixing
issues reported by checkpatch, it is nice to paste the checkpatch
warning output in the commit message.

Look at this example:
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=e7a142aaa09fa5db015fd176d6943f888665829f

Thanks,
Luis


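For reference, a userspace sketch of why checkpatch prefers
sizeof(*buffer) over sizeof(struct ion_buffer): the size expression
follows the pointer's type automatically, so it stays correct if the
type is ever renamed or changed (the struct here is a hypothetical
stand-in):

#include <stdlib.h>

/* Hypothetical stand-in for struct ion_buffer. */
struct demo_buffer {
        int id;
        size_t len;
};

int main(void)
{
        struct demo_buffer *buffer;

        /* userspace analogue of kzalloc(sizeof(*buffer), GFP_KERNEL) */
        buffer = calloc(1, sizeof(*buffer));
        if (!buffer)
                return 1;
        buffer->id = 1;
        free(buffer);
        return 0;
}

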
Re: [PATCH] PCI/MSI: pci-xgene-msi: Enable MSI support in ACPI boot for X-Gene v1

2016-05-26 Thread Lorenzo Pieralisi
Hi Duc,

On Wed, May 25, 2016 at 04:13:35PM -0700, Duc Dang wrote:
> On Thu, Feb 25, 2016 at 9:38 AM, Lorenzo Pieralisi
>  wrote:
> > On Wed, Feb 24, 2016 at 02:28:10PM -0800, Duc Dang wrote:
> >> On Wed, Feb 24, 2016 at 8:16 AM, Marc Zyngier  wrote:
> >> > On 24/02/16 16:09, Mark Salter wrote:
> >> >> On Tue, 2016-02-09 at 17:56 -0800, Duc Dang wrote:
> >> >>> This patch makes pci-xgene-msi driver ACPI-aware and provides
> >> >>> MSI capability for X-Gene v1 PCIe controllers in ACPI boot mode.
> >> >>>
> >> >>> Signed-off-by: Duc Dang 
> >> >>> ---
> >> >>>  drivers/pci/host/pci-xgene-msi.c | 35 
> >> >>> ---
> >> >>>  1 file changed, 32 insertions(+), 3 deletions(-)
> >> >>>
> >> >>> diff --git a/drivers/pci/host/pci-xgene-msi.c 
> >> >>> b/drivers/pci/host/pci-xgene-msi.c
> >> >>> index a6456b5..466aa93 100644
> >> >>> --- a/drivers/pci/host/pci-xgene-msi.c
> >> >>> +++ b/drivers/pci/host/pci-xgene-msi.c
> >> >>> @@ -24,6 +24,7 @@
> >> >>>  #include 
> >> >>>  #include 
> >> >>>  #include 
> >> >>> +#include 
> >> >>>
> >> >>>  #define MSI_IR0 0x00
> >> >>>  #define MSI_INT0 0x80
> >> >>> @@ -39,7 +40,7 @@ struct xgene_msi_group {
> >> >>>  };
> >> >>>
> >> >>>  struct xgene_msi {
> >> >>> -struct device_node  *node;
> >> >>> +struct fwnode_handle*fwnode;
> >> >>>  struct irq_domain   *inner_domain;
> >> >>>  struct irq_domain   *msi_domain;
> >> >>>  u64 msi_addr;
> >> >>> @@ -249,6 +250,13 @@ static const struct irq_domain_ops msi_domain_ops 
> >> >>> = {
> >> >>>  .free   = xgene_irq_domain_free,
> >> >>>  };
> >> >>>
> >> >>> +#ifdef CONFIG_ACPI
> >> >>> +static struct fwnode_handle *xgene_msi_get_fwnode(struct device *dev)
> >> >>> +{
> >> >>> +return xgene_msi_ctrl.fwnode;
> >> >>> +}
> >> >>> +#endif
> >> >>> +
> >> >>>  static int xgene_allocate_domains(struct xgene_msi *msi)
> >> >>>  {
> >> >>>  msi->inner_domain = irq_domain_add_linear(NULL, NR_MSI_VEC,
> >> >>> @@ -256,7 +264,7 @@ static int xgene_allocate_domains(struct xgene_msi 
> >> >>> *msi)
> >> >>>  if (!msi->inner_domain)
> >> >>>  return -ENOMEM;
> >> >>>
> >> >>> -msi->msi_domain = 
> >> >>> pci_msi_create_irq_domain(of_node_to_fwnode(msi->node),
> >> >>> +msi->msi_domain = pci_msi_create_irq_domain(msi->fwnode,
> >> >>>  
> >> >>> _msi_domain_info,
> >> >>>  msi->inner_domain);
> >> >>
> >> >> This doesn't work for me (ACPI probing on Mustang) unless I change this
> >> >> to be pci_msi_create_default_irq_domain(). The problem seems to be that
> >> >> the MSI probe happens after the PCIe RC is probed so there is no MSI 
> >> >> domain
> >> >> at the time the PCIe root is initialized by ACPI.
> >> >
> >> > pci_msi_create_default_irq_domain is the wrong thing to use, especially
> >> > if you have multiple MSI controllers in the system. I certainly wouldn't
> >> > want to see it being used on arm64.
> >> >
> >> > This is the usual dependency hell. You try moving the probing earlier,
> >> > but that may break something else in the process.
> >>
> >> Hi Mark and Marc,
> >>
> >> I have modified Tianocore firmware to have MSI node declared before
> >> PCIe node in Dsdt table. With this modification, the MSI driver will
> >> be loaded before PCIe driver and MSI domain is available at the time
> >> PCIe root is initialized.
> >
> > I am totally against this. We should not hack ACPI tables to make
> > the kernel work (on top of that, with unwritten ordering rules that may
> > well require changes as the kernel evolves); we should have a standard set
> > of bindings that people use to describe HW in the DSDT, and the kernel(s)
> > have to cope with that. If there is a dependency problem in the description
> > we may solve it at bindings level, but I absolutely do not want to rely
> > on DSDT nodes ordering for things to work, that's fragile, no shortcuts
> > please.
> 
> Hi Lorenzo, Bjorn,
> 
> (Refresh this thread on this merge cycle)
> 
> I tried to use the _DEP method to describe the dependency of the PCIe
> node on the MSI node, but it turns out that the PCIe driver does not
> check for the dependency (only the EC (ec.c) and the I2C core
> (i2c-core.c) use acpi_walk_dep_device_list to check for dependencies
> before loading dependent drivers)
> 
> Do you have any other suggestions/examples about how to ensure strict
> ordering on driver loading in ACPI boot mode without changing the
> order of ACPI nodes in DSDT?

Before doing anything else, I would like to ask you to report the code
paths leading to the failures (what actually fails?), please; I want
to understand what "this does not work" means.

In particular:

- Why using pci_msi_create_default_irq_domain() works
- Why, if you change the DSDT nodes ordering, things work (please add
  some debugging to MSI 

Re: [PATCH] PCI/MSI: pci-xgene-msi: Enable MSI support in ACPI boot for X-Gene v1

2016-05-26 Thread Lorenzo Pieralisi
Hi Duc,

On Wed, May 25, 2016 at 04:13:35PM -0700, Duc Dang wrote:
> On Thu, Feb 25, 2016 at 9:38 AM, Lorenzo Pieralisi
>  wrote:
> > On Wed, Feb 24, 2016 at 02:28:10PM -0800, Duc Dang wrote:
> >> On Wed, Feb 24, 2016 at 8:16 AM, Marc Zyngier  wrote:
> >> > On 24/02/16 16:09, Mark Salter wrote:
> >> >> On Tue, 2016-02-09 at 17:56 -0800, Duc Dang wrote:
> >> >>> This patch makes the pci-xgene-msi driver ACPI-aware and provides
> >> >>> MSI capability for X-Gene v1 PCIe controllers in ACPI boot mode.
> >> >>>
> >> >>> Signed-off-by: Duc Dang 
> >> >>> ---
> >> >>>  drivers/pci/host/pci-xgene-msi.c | 35 
> >> >>> ---
> >> >>>  1 file changed, 32 insertions(+), 3 deletions(-)
> >> >>>
> >> >>> diff --git a/drivers/pci/host/pci-xgene-msi.c 
> >> >>> b/drivers/pci/host/pci-xgene-msi.c
> >> >>> index a6456b5..466aa93 100644
> >> >>> --- a/drivers/pci/host/pci-xgene-msi.c
> >> >>> +++ b/drivers/pci/host/pci-xgene-msi.c
> >> >>> @@ -24,6 +24,7 @@
> >> >>>  #include 
> >> >>>  #include 
> >> >>>  #include 
> >> >>> +#include 
> >> >>>
> >> >>>  #define MSI_IR0 0x00
> >> >>>  #define MSI_INT00x80
> >> >>> @@ -39,7 +40,7 @@ struct xgene_msi_group {
> >> >>>  };
> >> >>>
> >> >>>  struct xgene_msi {
> >> >>> -struct device_node  *node;
> >> >>> +struct fwnode_handle*fwnode;
> >> >>>  struct irq_domain   *inner_domain;
> >> >>>  struct irq_domain   *msi_domain;
> >> >>>  u64 msi_addr;
> >> >>> @@ -249,6 +250,13 @@ static const struct irq_domain_ops msi_domain_ops 
> >> >>> = {
> >> >>>  .free   = xgene_irq_domain_free,
> >> >>>  };
> >> >>>
> >> >>> +#ifdef CONFIG_ACPI
> >> >>> +static struct fwnode_handle *xgene_msi_get_fwnode(struct device *dev)
> >> >>> +{
> >> >>> +return xgene_msi_ctrl.fwnode;
> >> >>> +}
> >> >>> +#endif
> >> >>> +
> >> >>>  static int xgene_allocate_domains(struct xgene_msi *msi)
> >> >>>  {
> >> >>>  msi->inner_domain = irq_domain_add_linear(NULL, NR_MSI_VEC,
> >> >>> @@ -256,7 +264,7 @@ static int xgene_allocate_domains(struct xgene_msi 
> >> >>> *msi)
> >> >>>  if (!msi->inner_domain)
> >> >>>  return -ENOMEM;
> >> >>>
> >> >>> -msi->msi_domain = 
> >> >>> pci_msi_create_irq_domain(of_node_to_fwnode(msi->node),
> >> >>> +msi->msi_domain = pci_msi_create_irq_domain(msi->fwnode,
> >> >>>  
> >> >>> _msi_domain_info,
> >> >>>  msi->inner_domain);
> >> >>
> >> >> This doesn't work for me (ACPI probing on Mustang) unless I change this
> >> >> to be pci_msi_create_default_irq_domain(). The problem seems to be that
> >> >> the MSI probe happens after the PCIe RC is probed so there is no MSI 
> >> >> domain
> >> >> at the time the PCIe root is initialized by ACPI.
> >> >
> >> > pci_msi_create_default_irq_domain is the wrong thing to use, especially
> >> > if you have multiple MSI controllers in the system. I certainly wouldn't
> >> > want to see it being used on arm64.
> >> >
> >> > This is the usual dependency hell. You try moving the probing earlier,
> >> > but that may break something else in the process.
> >>
> >> Hi Mark and Marc,
> >>
> >> I have modified the Tianocore firmware to have the MSI node declared before
> >> the PCIe node in the DSDT table. With this modification, the MSI driver will
> >> be loaded before the PCIe driver, and the MSI domain is available at the time
> >> the PCIe root is initialized.
> >
> > I am totally against this. We should not hack ACPI tables to make
> > the kernel work (on top of that, with unwritten ordering rules that may
> > well require changes as the kernel evolves); we should have a standard set
> > of bindings that people use to describe HW in the DSDT, and the kernel(s)
> > has to cope with that. If there is a dependency problem in the description,
> > we may solve it at the bindings level, but I absolutely do not want to rely
> > on DSDT node ordering for things to work. That's fragile, no shortcuts
> > please.
> 
> Hi Lorenzo, Bjorn,
> 
> (Refreshing this thread for this merge cycle.)
> 
> I tried to use the _DEP method to describe the dependency of the PCIe node
> on the MSI node, but it turns out that the PCIe driver does not check for
> the dependency (only the EC (ec.c) and the I2C core (i2c-core.c) use
> acpi_walk_dep_device_list to check for dependencies before loading
> dependent drivers).
> 
> Do you have any other suggestions or examples of how to ensure strict
> ordering of driver loading in ACPI boot mode without changing the
> order of the ACPI nodes in the DSDT?

Before doing anything else, I would like to ask you to report the code
paths leading to the failures (what actually fails?), because I want
to understand what "this does not work" means.

In particular:

- Why using pci_msi_create_default_irq_domain() works
- Why, if you change the DSDT nodes ordering, things work (please add
  some debugging to the MSI allocation functions to detect the code paths
  leading to failures).
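
For reference, the kernel's usual answer to this kind of init-order problem
is probe deferral rather than DSDT reordering. Below is a minimal sketch of
the pattern, illustrative only: the example_* names are assumptions, and an
ACPI host bridge is not actually probed as a platform device like this. The
consumer looks up the MSI irqdomain and asks to be retried if the supplier
has not registered it yet. A _DEP-based alternative would have the MSI
driver signal readiness via acpi_walk_dep_device_list(), with consumers
deferring until their dependency is met, as the thread notes the EC and
I2C core do.

#include <linux/errno.h>
#include <linux/irqdomain.h>
#include <linux/platform_device.h>

static int example_pcie_probe(struct platform_device *pdev)
{
        struct fwnode_handle *msi_fwnode;
        struct irq_domain *msi_domain;

        msi_fwnode = example_lookup_msi_fwnode(pdev);   /* assumed helper */

        /* If the MSI driver has not registered its domain yet, ask the
         * driver core to retry this probe later. */
        msi_domain = irq_find_matching_fwnode(msi_fwnode, DOMAIN_BUS_PCI_MSI);
        if (!msi_domain)
                return -EPROBE_DEFER;

        /* ... normal host bridge setup, using msi_domain ... */
        return 0;
}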

Re: [PATCH v2] mmc: sdhci: fix wakeup configuration

2016-05-26 Thread Ludovic Desroches
On Fri, May 20, 2016 at 09:28:56PM +0300, Adrian Hunter wrote:
> On 20/05/2016 4:39 p.m., Ulf Hansson wrote:
> > On 20 May 2016 at 13:46, Adrian Hunter  wrote:
> > > On 13/05/16 16:16, Ludovic Desroches wrote:
> > > > Activating a wakeup event is not enough to get a wakeup signal. The
> > > > corresponding events have to be enabled in the Interrupt Status Enable
> > > > Register too. It follows the specification and is needed at least by
> > > > sdhci-of-at91.
> > > > 
> > > > Signed-off-by: Ludovic Desroches 
> > > 
> > > Acked-by: Adrian Hunter 
> > 
> > Is this material for stable, and a fix for 4.6?
> 
> Not as far as I know.
> 

System PM code for the Atmel SDHCI has not been submitted yet, so there is
no need to take it as a fix.

Regards

Ludovic

> > 
> > Kind regards
> > Uffe
> > 
> > > 
> > > 
> > > > ---
> > > >   drivers/mmc/host/sdhci.c | 15 ++-
> > > >   1 file changed, 14 insertions(+), 1 deletion(-)
> > > > 
> > > > Changes:
> > > > - v2:
> > > >- update commit message and comments
> > > >- do not rename val and mask variables
> > > > 
> > > > diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
> > > > index e010ea4..e351859 100644
> > > > --- a/drivers/mmc/host/sdhci.c
> > > > +++ b/drivers/mmc/host/sdhci.c
> > > > @@ -2605,18 +2605,31 @@ static irqreturn_t sdhci_thread_irq(int irq, 
> > > > void *dev_id)
> > > >   
> > > > \*/
> > > > 
> > > >   #ifdef CONFIG_PM
> > > > +/*
> > > > + * To enable wakeup events, the corresponding events have to be 
> > > > enabled in
> > > > + * the Interrupt Status Enable register too. See 'Table 1-6: Wakeup 
> > > > Signal
> > > > + * Table' in the SD Host Controller Standard Specification.
> > > > + * It is useless to restore SDHCI_INT_ENABLE state in
> > > > + * sdhci_disable_irq_wakeups() since it will be set by
> > > > + * sdhci_enable_card_detection() or sdhci_init().
> > > > + */
> > > >   void sdhci_enable_irq_wakeups(struct sdhci_host *host)
> > > >   {
> > > >u8 val;
> > > >u8 mask = SDHCI_WAKE_ON_INSERT | SDHCI_WAKE_ON_REMOVE
> > > >| SDHCI_WAKE_ON_INT;
> > > > + u32 irq_val = SDHCI_INT_CARD_INSERT | SDHCI_INT_CARD_REMOVE |
> > > > +   SDHCI_INT_CARD_INT;
> > > > 
> > > >val = sdhci_readb(host, SDHCI_WAKE_UP_CONTROL);
> > > >val |= mask ;
> > > >/* Avoid fake wake up */
> > > > - if (host->quirks & SDHCI_QUIRK_BROKEN_CARD_DETECTION)
> > > > + if (host->quirks & SDHCI_QUIRK_BROKEN_CARD_DETECTION) {
> > > >val &= ~(SDHCI_WAKE_ON_INSERT | SDHCI_WAKE_ON_REMOVE);
> > > > + irq_val &= ~(SDHCI_INT_CARD_INSERT | 
> > > > SDHCI_INT_CARD_REMOVE);
> > > > + }
> > > >sdhci_writeb(host, val, SDHCI_WAKE_UP_CONTROL);
> > > > + sdhci_writel(host, irq_val, SDHCI_INT_ENABLE);
> > > >   }
> > > >   EXPORT_SYMBOL_GPL(sdhci_enable_irq_wakeups);
> > > > 
> > > > 
> > > 
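
As a usage note, here is a minimal sketch of how a suspend path arms the
(now fully configured) wakeup sources. This mirrors what the sdhci core's
own suspend path does when the device may wake the system; everything
except the sdhci_* symbols is illustrative:

#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/pm_wakeup.h>
#include "sdhci.h"      /* driver-local header providing struct sdhci_host */

static int example_sdhci_suspend(struct device *dev)
{
        struct sdhci_host *host = dev_get_drvdata(dev);

        if (device_may_wakeup(dev)) {
                /* With this fix, the helper programs SDHCI_INT_ENABLE as
                 * well as SDHCI_WAKE_UP_CONTROL. */
                sdhci_enable_irq_wakeups(host);
                enable_irq_wake(host->irq);
        }
        return 0;
}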


Re: [PATCH 1/2] staging: dgnc: remove redundant NULL check for brd

2016-05-26 Thread Luis de Bethencourt
On 26/05/16 05:56, DaeSeok Youn wrote:
> 2016-05-26 6:48 GMT+09:00 Luis de Bethencourt :
>> On 20/05/16 10:51, Daeseok Youn wrote:
>>> the "brd" value cannot be NULL in dgnc_finalize_board_init().
>>> Because "brd" as a parameter of this function was already
>>> checked for NULL.
>>>
>>> Signed-off-by: Daeseok Youn 
>>> ---
>>>  drivers/staging/dgnc/dgnc_driver.c | 3 ---
>>>  1 file changed, 3 deletions(-)
>>>
>>> diff --git a/drivers/staging/dgnc/dgnc_driver.c 
>>> b/drivers/staging/dgnc/dgnc_driver.c
>>> index af2e835..22257d2 100644
>>> --- a/drivers/staging/dgnc/dgnc_driver.c
>>> +++ b/drivers/staging/dgnc/dgnc_driver.c
>>> @@ -579,9 +579,6 @@ static int dgnc_finalize_board_init(struct dgnc_board 
>>> *brd)
>>>  {
>>>   int rc = 0;
>>>
>>> - if (!brd || brd->magic != DGNC_BOARD_MAGIC)
>>> - return -ENODEV;
>>> -
>>>   if (brd->irq) {
>>>   rc = request_irq(brd->irq, brd->bd_ops->intr,
>>>IRQF_SHARED, "DGNC", brd);
>>>
>>
>> This is partially correct; the check for brd being NULL is at line 371.
> Hi Luis,
> 
> Yes, right. But brd->magic was also assigned DGNC_BOARD_MAGIC at line 384:
> brd->magic = DGNC_BOARD_MAGIC;
> And dgnc_finalize_board_init(), being a static function, is only
> called from dgnc_found_board(), right?
> 
>>
>> But there is a second check for brd->magic != DGNC_BOARD_MAGIC. Do you want
>> to keep that one?
> So I think it doesn't need to check DGNC_BOARD_MAGIC either.

This is good. I was just asking to make sure it was your intention.

Please add the reason for dropping that second check to the commit message
as well, so people reading the git log can understand both parts. Do this
for both patches.

Thanks for the fixes :)

Luis

> 
>>
>> Also, how did you find this issue? If it was through some static analysis
>> tool, it is useful to mention that in the commit message, for the benefit
>> of people using these tools in the future.
> There are some static analysis tools for checking Linux kernel code,
> but I didn't use any of them for this patch. I usually run the "smatch"
> tool when checking kernel code.
> 
> thanks.
> regards,
> Daeseok.
> 
>>
>> Thanks for the patch :)
>> Luis
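
To make the redundancy concrete, here is a condensed sketch of the caller
as described in this thread (simplified from dgnc_driver.c; the real
function does much more, and the dgnc types come from the driver's own
headers). The single caller both checks the allocation and sets the magic,
so the callee's re-check could never trigger:

#include <linux/pci.h>
#include <linux/slab.h>

static int dgnc_found_board(struct pci_dev *pdev, int id)
{
        struct dgnc_board *brd;

        brd = kzalloc(sizeof(*brd), GFP_KERNEL);
        if (!brd)                       /* the NULL check at line 371 */
                return -ENOMEM;

        brd->magic = DGNC_BOARD_MAGIC;  /* the assignment at line 384 */

        /* ... */

        /* brd is non-NULL with a valid magic here, so the removed check
         * in dgnc_finalize_board_init() was dead code. */
        return dgnc_finalize_board_init(brd);
}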



Re: [PATCH v2] tpm: Factor out common startup code

2016-05-26 Thread Jarkko Sakkinen
On Mon, May 16, 2016 at 12:25:47PM -0600, Jason Gunthorpe wrote:
> Provide some flags in tpm_class_ops to allow drivers to opt in to the
> common startup sequence. This is the sequence used by tpm_tis and
> tpm_crb.
> 
> All drivers should set this flag.
> 
> Signed-off-by: Jason Gunthorpe 
> Tested-by: Andrew Zamansky 

I assume a patch set is coming that this patch will be part of? In that
patch set (whoever makes it), could you put the updated description in
place?

/Jarkko


Re: ARM: dts: exynos: Add MFC memory banks for Peach boards

2016-05-26 Thread Javier Martinez Canillas
Hello Pankaj,

On 05/26/2016 05:10 AM, pankaj.dubey wrote:

[snip]

>>> [0.00] Kernel command line: cros_legacy console=ttySAC3,115200
>>> debug earlyprintk cros_debug no_console_suspend root=/dev/ram0 rw
>>> ramdisk=32768 initrd=0x4200,32M
>>
>> I see that you are loading an initrd at 0x4200 with a size of 32 MiB.
>>
>> [...]
>>
>>> [1.121421] Trying to unpack rootfs image as initramfs...
>>> [1.126940] rootfs image is not initramfs (junk in compressed
>>> archive); looks like an initrd
>>> [1.160139] Unable to handle kernel paging request at virtual address
>>> e300
>>
>> So I wonder if the problem is that memblock_remove() is called very
>> early, and so the kernel is not able to copy the initrd from 0x4200
>> to 0x4400, since it overlaps with the mfc-r memory (0x4300-0x4380).
>>
>> Especially since the NULL pointer dereference below happens in the
>> populate_rootfs() function when calling xwrite() to do the copy.
>>
>> Could you please try to change the load address for your initrd, or
>> change the mfc-r start address to see if that prevents the issue?
>>
> 
> Yes, you are correct. This was the case.
> I changed the initrd location from 0x4200 to 0x4400, and it is able
> to boot without any crash.
>

Great, good to confirm that this was causing your boot failure.
 
> Thanks,
> Pankaj Dubey
> 

Best regards,
-- 
Javier Martinez Canillas
Open Source Group
Samsung Research America
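
The lesson generalizes: memory carved out with memblock_remove() very early
must not overlap the initrd window given on the command line. A defensive
sketch follows (hypothetical helper, not mainline code; the initrd bounds
are passed in explicitly because the arch-private globals are not exported):

#include <linux/init.h>
#include <linux/memblock.h>
#include <linux/printk.h>

static void __init example_remove_region(phys_addr_t base, phys_addr_t size,
                                         phys_addr_t initrd_start,
                                         unsigned long initrd_size)
{
        /* Refuse to carve out memory that the initrd occupies. */
        if (initrd_size &&
            initrd_start < base + size &&
            base < initrd_start + initrd_size) {
                pr_warn("region %pa+%pa overlaps initrd, not removing\n",
                        &base, &size);
                return;
        }
        memblock_remove(base, size);
}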


Re: [RFC][PATCH 1/3] staging: ion: Move away from the DMA APIs for cache flushing

2016-05-26 Thread Christoph Hellwig
On Thu, May 26, 2016 at 11:59:22AM +0100, Russell King - ARM Linux wrote:
> I _really_ hate seeing architecture-internal functions being abused in
> drivers; architecture-internal functions are there to implement the
> official kernel APIs and are not for drivers to poke about with.

Exactly: if drivers like drm and ion want to do something outside the
current DMA API, we need to add a proper API, not hack around the lack of
one.
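
For illustration, this is the shape of the documented-API usage being asked
for. The example_* wrappers are assumptions; the point is that the
dma_sync_sg_* calls express the CPU/device ownership handoff without
touching architecture internals:

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* The CPU is done writing; hand the buffer to the device. */
static void example_sync_for_device(struct device *dev, struct sg_table *sgt)
{
        dma_sync_sg_for_device(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}

/* The device is done; make its writes visible to the CPU. */
static void example_sync_for_cpu(struct device *dev, struct sg_table *sgt)
{
        dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL);
}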


Re: [PATCH 1/1] arm64: fix flush_cache_range

2016-05-26 Thread Russell King - ARM Linux
On Thu, May 26, 2016 at 07:46:11PM +0800, Leizhen (ThunderTown) wrote:
> Hi,
> From my tracing, it returns at "if (!page_mapping(page))", because the
> mmap'ed pages are anonymous. When I commented out the code lines below, it
> works well.
>   
>   /* no flushing needed for anonymous pages */
>   if (!page_mapping(page))
>   return;
> 
> 
> I printed the page information three times, as below:
> page->mapping=8017baf36961, page->flags=0x10040048
> page->mapping=8017b265bf51, page->flags=0x10040048
> page->mapping=8017b94fc5a1, page->flags=0x10040048
> 
> PG_slab=7, PG_arch_1=9, PG_swapcache=15

Well, I think we first need to establish what the semantics of changing
any region from PROT_READ|PROT_WRITE to PROT_READ|PROT_EXEC are as far
as cache coherence goes.  As I've already said, POSIX says nothing about
this aspect, so the specifications don't define what the expectations
are here.  So, I don't think we can say that anything is wrong.

> > I ran our internal LTP version yesterday and it was fine but didn't
> > realise that we actually patched mprotect04.c to include:
> > 
> > __clear_cache((char *)func, (char *)func + page_sz);
> > 
> > just after memcpy().

The question here is what the justification for this LTP test is.
Surely, all tests in LTP should refer to some specification justifying
the test, which allows diagnosis of whether the LTP test is wrong or
whether the environment being tested is wrong.

However, the PPC modifications to it appear to justify our kernel
implementation: as has been pointed out, PPC has some additional code
which deals with cache coherency.  ARM has similar mechanisms with
the cacheflush syscall, whose whole point of existing is to support
self-modifying code (as userspace is not allowed to touch the cache
maintenance instructions).

So, in the presence of the test justification vacuum, and the fact
that PPC has added cache maintenance to the test, I think we should
take the exact same approach.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
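
For reference, a sketch of the userspace pattern the patched LTP test
follows (simplified: dst must be page-aligned for mprotect(), and error
checking is elided):

#include <string.h>
#include <sys/mman.h>

typedef int (*func_ptr_t)(void);

static func_ptr_t install_code(void *dst, const void *src, size_t len)
{
        memcpy(dst, src, len);
        /* Compiler builtin; on ARM this ends up in the cacheflush syscall. */
        __builtin___clear_cache((char *)dst, (char *)dst + len);
        mprotect(dst, len, PROT_READ | PROT_EXEC);
        return (func_ptr_t)dst;
}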

