[PATCH 2/9] percpu_ref: minor code and comment updates

2014-09-22 Thread Tejun Heo
* Some comments became stale.  Updated.
* percpu_ref_tryget() unnecessarily initializes @ret.  Removed.
* A blank line removed from percpu_ref_kill_rcu().
* Explicit function name in a WARN format string replaced with __func__.
* WARN_ON() in percpu_ref_reinit() converted to WARN_ON_ONCE().

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
---
 include/linux/percpu-refcount.h | 25 -
 lib/percpu-refcount.c   | 14 ++
 2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index f015f13..d44b027 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -115,8 +115,10 @@ static inline bool __pcpu_ref_alive(struct percpu_ref *ref,
  * percpu_ref_get - increment a percpu refcount
  * @ref: percpu_ref to get
  *
- * Analagous to atomic_inc().
-  */
+ * Analagous to atomic_long_inc().
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
 static inline void percpu_ref_get(struct percpu_ref *ref)
 {
unsigned long __percpu *pcpu_count;
@@ -138,12 +140,12 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
  * Increment a percpu refcount unless its count already reached zero.
  * Returns %true on success; %false on failure.
  *
- * The caller is responsible for ensuring that @ref stays accessible.
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 {
unsigned long __percpu *pcpu_count;
-   int ret = false;
+   int ret;
 
rcu_read_lock_sched();
 
@@ -166,12 +168,13 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
  * Increment a percpu refcount unless it has already been killed.  Returns
  * %true on success; %false on failure.
  *
- * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
- * will fail.  For such guarantee, percpu_ref_kill_and_confirm() should be
- * used.  After the confirm_kill callback is invoked, it's guaranteed that
- * no new reference will be given out by percpu_ref_tryget().
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that this
+ * function will fail.  For such guarantee, percpu_ref_kill_and_confirm()
+ * should be used.  After the confirm_kill callback is invoked, it's
+ * guaranteed that no new reference will be given out by
+ * percpu_ref_tryget_live().
  *
- * The caller is responsible for ensuring that @ref stays accessible.
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
@@ -196,6 +199,8 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
  *
 * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
+ *
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline void percpu_ref_put(struct percpu_ref *ref)
 {
@@ -216,6 +221,8 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
  * @ref: percpu_ref to test
  *
  * Returns %true if @ref reached zero.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline bool percpu_ref_is_zero(struct percpu_ref *ref)
 {
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 070dab5..8ef3f5c 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -108,7 +108,6 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 * reaching 0 before we add the percpu counts. But doing it at the same
 * time is equivalent and saves us atomic operations:
 */
-
atomic_long_add((long)count - PCPU_COUNT_BIAS, &ref->count);
 
WARN_ONCE(atomic_long_read(&ref->count) <= 0,
@@ -120,8 +119,8 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
ref->confirm_kill(ref);
 
/*
-* Now we're in single atomic_t mode with a consistent refcount, so it's
-* safe to drop our initial ref:
+* Now we're in single atomic_long_t mode with a consistent
+* refcount, so it's safe to drop our initial ref:
 */
percpu_ref_put(ref);
 }
@@ -134,8 +133,8 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  * Equivalent to percpu_ref_kill() but also schedules kill confirmation if
  * @confirm_kill is not NULL.  @confirm_kill, which may not block, will be
  * called after @ref is seen as dead from all CPUs - all further
- * invocations of percpu_ref_tryget() will fail.  See percpu_ref_tryget()
- * for more details.
+ * invocations of percpu_ref_tryget_live() will fail.  See
+ * percpu_ref_tryget_live() for more details.
  *
  * Due to the way percpu_ref is implemented, @confirm_kill will be called
  * after at least one full RCU grace period has passed but this is an
@@ -145,8 +144,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 

[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()

2014-09-22 Thread Tejun Heo
Hello,

Over the past several months, percpu_ref grew use cases where it's
used as a persistent on/off switch which may be cycled multiple times
using percpu_ref_reinit().  One such use case is blk-mq's
mq_usage_counter which tracks the number of in-flight commands and is
used to drain them.  Unfortunately, SCSI device probing involves
synchronously creating and destroying request_queues for non-existent
devices and the sched RCU grace period involved in percpu_ref killing
adds up to a significant amount of latency.

The block layer has already experienced the same issue in other areas and
works around it by starting the queue in a degraded mode which is
faster to shut down and making it fully functional only after it's
known that the queue isn't a temporary one for probing.

This patchset implements percpu_ref mechanisms to manually switch
between atomic and percpu operation modes so that blk-mq can implement
a similar degraded operation mode.  This will also allow implementing
debug mode for percpu_ref so that underflow can be detected sooner.
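
As a rough illustration, here is how a blk-mq-like user might use the
new interfaces (a hypothetical sketch, not code from this series;
"q->usage" and my_release() are made up, the init flag comes from
patch 0008 and the switch call from patch 0007):

	static void my_release(struct percpu_ref *ref)
	{
		/* last ref dropped; complete the drain here */
	}

	/* probe: start in atomic mode so teardown needs no RCU wait */
	err = percpu_ref_init(&q->usage, my_release,
			      PERCPU_REF_INIT_ATOMIC, GFP_KERNEL);

	/* the queue turned out to be persistent: go fully percpu */
	percpu_ref_switch_to_percpu(&q->usage);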

This patchset contains the following nine patches.

 0001-percpu_ref-relocate-percpu_ref_reinit.patch
 0002-percpu_ref-minor-code-and-comment-updates.patch
 0003-percpu_ref-replace-pcpu_-prefix-with-percpu_.patch
 0004-percpu_ref-rename-things-to-prepare-for-decoupling-p.patch
 0005-percpu_ref-add-PCPU_REF_DEAD.patch
 0006-percpu_ref-decouple-switching-to-atomic-mode-and-kil.patch
 0007-percpu_ref-decouple-switching-to-percpu-mode-and-rei.patch
 0008-percpu_ref-add-PERCPU_REF_INIT_-flags.patch
 0009-percpu_ref-make-INIT_ATOMIC-and-switch_to_atomic-sti.patch

0001-0005 are prep patches.

0006-0007 implement percpu_ref_switch_to_atomic/percpu().

0008 extends percpu_ref_init() so that a percpu_ref can be initialized
in different states including atomic mode.

0009 makes atomic mode sticky so that it survives through reinits.

This patchset is on top of percpu/for-3.18 and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git 
review-percpu_ref-switch

diffstat follows.

 block/blk-mq.c  |2 
 fs/aio.c|4 
 include/linux/percpu-refcount.h |  108 +-
 kernel/cgroup.c |7 
 lib/percpu-refcount.c   |  291 +---
 5 files changed, 295 insertions(+), 117 deletions(-)

Thanks.

--
tejun


[PATCH 4/9] percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch

2014-09-22 Thread Tejun Heo
percpu_ref will be restructured so that percpu/atomic mode switching
and reference killing are decoupled.  In preparation, do the following
renames.

* percpu_ref->confirm_kill  -> percpu_ref->confirm_switch
* __PERCPU_REF_DEAD -> __PERCPU_REF_ATOMIC
* __percpu_ref_alive()  -> __ref_is_percpu()

This patch is pure rename and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
---
 include/linux/percpu-refcount.h | 25 ++---
 lib/percpu-refcount.c   | 22 +++---
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 3d463a3..910e5f7 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -54,6 +54,11 @@
 struct percpu_ref;
 typedef void (percpu_ref_func_t)(struct percpu_ref *);
 
+/* flags set in the lower bits of percpu_ref->percpu_count_ptr */
+enum {
+   __PERCPU_REF_ATOMIC = 1LU << 0, /* operating in atomic mode */
+};
+
 struct percpu_ref {
atomic_long_t   count;
/*
@@ -62,7 +67,7 @@ struct percpu_ref {
 */
unsigned long   percpu_count_ptr;
percpu_ref_func_t   *release;
-   percpu_ref_func_t   *confirm_kill;
+   percpu_ref_func_t   *confirm_switch;
struct rcu_head rcu;
 };
 
@@ -88,23 +93,21 @@ static inline void percpu_ref_kill(struct percpu_ref *ref)
return percpu_ref_kill_and_confirm(ref, NULL);
 }
 
-#define __PERCPU_REF_DEAD  1
-
 /*
  * Internal helper.  Don't use outside percpu-refcount proper.  The
  * function doesn't return the pointer and let the caller test it for NULL
  * because doing so forces the compiler to generate two conditional
  * branches as it can't assume that @ref->percpu_count is not NULL.
  */
-static inline bool __percpu_ref_alive(struct percpu_ref *ref,
- unsigned long __percpu **percpu_countp)
+static inline bool __ref_is_percpu(struct percpu_ref *ref,
+ unsigned long __percpu **percpu_countp)
 {
unsigned long percpu_ptr = ACCESS_ONCE(ref->percpu_count_ptr);
 
/* paired with smp_store_release() in percpu_ref_reinit() */
smp_read_barrier_depends();
 
-   if (unlikely(percpu_ptr & __PERCPU_REF_DEAD))
+   if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC))
return false;
 
*percpu_countp = (unsigned long __percpu *)percpu_ptr;
@@ -125,7 +128,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 
rcu_read_lock_sched();
 
-   if (__percpu_ref_alive(ref, &percpu_count))
+   if (__ref_is_percpu(ref, &percpu_count))
this_cpu_inc(*percpu_count);
else
atomic_long_inc(&ref->count);
@@ -149,7 +152,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 
rcu_read_lock_sched();
 
-   if (__percpu_ref_alive(ref, &percpu_count)) {
+   if (__ref_is_percpu(ref, &percpu_count)) {
this_cpu_inc(*percpu_count);
ret = true;
} else {
@@ -183,7 +186,7 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 
rcu_read_lock_sched();
 
-   if (__percpu_ref_alive(ref, &percpu_count)) {
+   if (__ref_is_percpu(ref, &percpu_count)) {
this_cpu_inc(*percpu_count);
ret = true;
}
@@ -208,7 +211,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 
rcu_read_lock_sched();
 
-   if (__percpu_ref_alive(ref, &percpu_count))
+   if (__ref_is_percpu(ref, &percpu_count))
this_cpu_dec(*percpu_count);
else if (unlikely(atomic_long_dec_and_test(&ref->count)))
ref->release(ref);
@@ -228,7 +231,7 @@ static inline bool percpu_ref_is_zero(struct percpu_ref *ref)
 {
unsigned long __percpu *percpu_count;
 
-   if (__percpu_ref_alive(ref, &percpu_count))
+   if (__ref_is_percpu(ref, &percpu_count))
return false;
return !atomic_long_read(&ref->count);
 }
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 5aea6b7..7aef590 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -34,7 +34,7 @@
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
return (unsigned long __percpu *)
-   (ref->percpu_count_ptr & ~__PERCPU_REF_DEAD);
+   (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
 }
 
 /**
@@ -80,7 +80,7 @@ void percpu_ref_exit(struct percpu_ref *ref)
 
if (percpu_count) {
free_percpu(percpu_count);
-   ref->percpu_count_ptr = __PERCPU_REF_DEAD;
+   ref->percpu_count_ptr = __PERCPU_REF_ATOMIC;
}
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
@@ -117,8 +117,8 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  ref->release, atomic_long_read(&ref->count));
 
/* @ref is viewed as dead on all CPUs, send out kill confirmation */

[PATCH 1/9] percpu_ref: relocate percpu_ref_reinit()

2014-09-22 Thread Tejun Heo
percpu_ref is gonna go through restructuring.  Move
percpu_ref_reinit() after percpu_ref_kill_and_confirm().  This will
make later changes easier to follow and result in cleaner
organization.

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
---
 include/linux/percpu-refcount.h |  2 +-
 lib/percpu-refcount.c   | 70 -
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 5df6784..f015f13 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,10 +68,10 @@ struct percpu_ref {
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 percpu_ref_func_t *release, gfp_t gfp);
-void percpu_ref_reinit(struct percpu_ref *ref);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
+void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 559ee0b..070dab5 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -63,41 +63,6 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 EXPORT_SYMBOL_GPL(percpu_ref_init);
 
 /**
- * percpu_ref_reinit - re-initialize a percpu refcount
- * @ref: perpcu_ref to re-initialize
- *
- * Re-initialize @ref so that it's in the same state as when it finished
- * percpu_ref_init().  @ref must have been initialized successfully, killed
- * and reached 0 but not exited.
- *
- * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
- * this function is in progress.
- */
-void percpu_ref_reinit(struct percpu_ref *ref)
-{
-   unsigned long __percpu *pcpu_count = pcpu_count_ptr(ref);
-   int cpu;
-
-   BUG_ON(!pcpu_count);
-   WARN_ON(!percpu_ref_is_zero(ref));
-
-   atomic_long_set(&ref->count, 1 + PCPU_COUNT_BIAS);
-
-   /*
-* Restore per-cpu operation.  smp_store_release() is paired with
-* smp_read_barrier_depends() in __pcpu_ref_alive() and guarantees
-* that the zeroing is visible to all percpu accesses which can see
-* the following PCPU_REF_DEAD clearing.
-*/
-   for_each_possible_cpu(cpu)
-   *per_cpu_ptr(pcpu_count, cpu) = 0;
-
-   smp_store_release(&ref->pcpu_count_ptr,
- ref->pcpu_count_ptr & ~PCPU_REF_DEAD);
-}
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
-
-/**
  * percpu_ref_exit - undo percpu_ref_init()
  * @ref: percpu_ref to exit
  *
@@ -189,3 +154,38 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
+
+/**
+ * percpu_ref_reinit - re-initialize a percpu refcount
+ * @ref: perpcu_ref to re-initialize
+ *
+ * Re-initialize @ref so that it's in the same state as when it finished
+ * percpu_ref_init().  @ref must have been initialized successfully, killed
+ * and reached 0 but not exited.
+ *
+ * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
+ * this function is in progress.
+ */
+void percpu_ref_reinit(struct percpu_ref *ref)
+{
+   unsigned long __percpu *pcpu_count = pcpu_count_ptr(ref);
+   int cpu;
+
+   BUG_ON(!pcpu_count);
+   WARN_ON(!percpu_ref_is_zero(ref));
+
+   atomic_long_set(&ref->count, 1 + PCPU_COUNT_BIAS);
+
+   /*
+* Restore per-cpu operation.  smp_store_release() is paired with
+* smp_read_barrier_depends() in __pcpu_ref_alive() and guarantees
+* that the zeroing is visible to all percpu accesses which can see
+* the following PCPU_REF_DEAD clearing.
+*/
+   for_each_possible_cpu(cpu)
+   *per_cpu_ptr(pcpu_count, cpu) = 0;
+
+   smp_store_release(&ref->pcpu_count_ptr,
+ ref->pcpu_count_ptr & ~PCPU_REF_DEAD);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_reinit);
-- 
1.9.3



Re: [GIT PULL] x86 fixes

2014-09-22 Thread Ingo Molnar

* Ingo Molnar  wrote:

> * Ingo Molnar  wrote:
> 
> > 
> > * Linus Torvalds  wrote:
> > 
> > > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar  wrote:
> > > >
> > > > Please pull the latest x86-urgent-for-linus git tree from:
> > > 
> > > I only just noticed, but this pull request causes my Sony Vaio 
> > > laptop to immediately reboot at startup.
> > > 
> > > I'm assuming it's one of the efi changes, but I'm bisecting now 
> > > to say exactly where it happens. It will get reverted.
> > 
> > I've Cc:-ed Matt.
> > 
> > My guess would be one of these two EFI commits:
> > 
> >   * Fix early boot regression affecting x86 EFI boot stub when loading
> > initrds above 4GB - Yinghai Lu
> > 
> > 47226ad4f4cf x86/efi: Only load initrd above 4g on second try
> > 
> >   * Relocate GOT entries in the x86 EFI boot stub now that we have
> > symbols with global visibility - Matt Fleming
> > 
> > 9cb0e394234d x86/efi: Fixup GOT in all boot code paths
> > 
> > If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug 
> > in the assembly code. If so then we'd have to revert this, and 
> > reintroduce another regression, caused by EFI commit 
> > f23cf8bd5c1f49 in this merge window. The most recent commit is 
> > easy to revert, the older one not.
> > 
> > If it's 47226ad4f4cf then we'd reintroduce the regression caused 
> > by 4bf7111f501 in the previous merge window. They both revert 
> > cleanly after each other - but it might be safer to just revert 
> > the most recent one.
> >
> > My guess is that your regression is caused by 47226ad4f4cf.
> 
> Wrong sha1: my guess is on 9cb0e394234d, the GOT fixup.

So if it's the GOT fixup then I feel the safest option is to 
revert 9cb0e394234d straight away, and then to do a functional 
revert of f23cf8bd5c1f49 as a separate step, perhaps via 
something really crude like:

   #include "..//drivers/firmware/efi/libstub/efi-stub-helper.c"

or so. (Maybe someone else can think of something 
cleaner/simpler, because this method is really ugly, as we'd have 
to #include the whole libstub library into eboot.c AFAICS...)

Thanks,

Ingo


Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count

2014-09-22 Thread Tejun Heo
On Tue, Sep 23, 2014 at 01:56:48AM -0400, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> > Jens,
> > 
> > can we simply get these commits reverted from now if there's no better
> > fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> > support for scsi.
> 
> Patches going out right now.

And the original implementation was broken, so...

-- 
tejun


[PATCH 5/9] percpu_ref: add PCPU_REF_DEAD

2014-09-22 Thread Tejun Heo
percpu_ref will be restructured so that percpu/atomic mode switching
and reference killing are decoupled.  In preparation, add
PCPU_REF_DEAD and PCPU_REF_ATOMIC_DEAD, which is the OR of ATOMIC and DEAD.
For now, ATOMIC and DEAD are changed together and all PCPU_REF_ATOMIC
uses are converted to PCPU_REF_ATOMIC_DEAD without causing any
behavior changes.

BUILD_BUG_ON() is added to percpu_ref_init() so that later flag
additions don't accidentally clobber lower bits of the pointer in
percpu_ref->pcpu_count_ptr.
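
To illustrate what the BUILD_BUG_ON is guarding (a sketch using the
flag names from this series, not code from the patch): the percpu
counter pointer is at least __alignof__(unsigned long) aligned, so
its low bits are always zero and are free to carry the mode flags.

	unsigned long ptr = ref->percpu_count_ptr;

	bool atomic = ptr & __PERCPU_REF_ATOMIC;	/* bit 0 */
	bool dead = ptr & __PERCPU_REF_DEAD;		/* bit 1 */

	/* mask both flags off to recover the real percpu pointer */
	unsigned long __percpu *percpu_count =
		(unsigned long __percpu *)(ptr & ~__PERCPU_REF_ATOMIC_DEAD);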

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
---
 include/linux/percpu-refcount.h |  4 +++-
 lib/percpu-refcount.c   | 15 +--
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 910e5f7..24cf157 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -57,6 +57,8 @@ typedef void (percpu_ref_func_t)(struct percpu_ref *);
 /* flags set in the lower bits of percpu_ref->percpu_count_ptr */
 enum {
__PERCPU_REF_ATOMIC = 1LU << 0, /* operating in atomic mode */
+   __PERCPU_REF_DEAD   = 1LU << 1, /* (being) killed */
+   __PERCPU_REF_ATOMIC_DEAD = __PERCPU_REF_ATOMIC | __PERCPU_REF_DEAD,
 };
 
 struct percpu_ref {
@@ -107,7 +109,7 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
/* paired with smp_store_release() in percpu_ref_reinit() */
smp_read_barrier_depends();
 
-   if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC))
+   if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC_DEAD))
return false;
 
*percpu_countp = (unsigned long __percpu *)percpu_ptr;
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 7aef590..b0b8c09 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -34,7 +34,7 @@
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
return (unsigned long __percpu *)
-   (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
+   (ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
 }
 
 /**
@@ -52,6 +52,9 @@ static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
gfp_t gfp)
 {
+   BUILD_BUG_ON(__PERCPU_REF_ATOMIC_DEAD &
+~(__alignof__(unsigned long) - 1));
+
atomic_long_set(&ref->count, 1 + PERCPU_COUNT_BIAS);
 
ref->percpu_count_ptr =
@@ -80,7 +83,7 @@ void percpu_ref_exit(struct percpu_ref *ref)
 
if (percpu_count) {
free_percpu(percpu_count);
-   ref->percpu_count_ptr = __PERCPU_REF_ATOMIC;
+   ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
}
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
@@ -145,10 +148,10 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill)
 {
-   WARN_ONCE(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC,
+   WARN_ONCE(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC_DEAD,
  "%s called more than once on %pf!", __func__, ref->release);
 
-   ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
+   ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC_DEAD;
ref->confirm_switch = confirm_kill;
 
call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
@@ -180,12 +183,12 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 * Restore per-cpu operation.  smp_store_release() is paired with
 * smp_read_barrier_depends() in __ref_is_percpu() and guarantees
 * that the zeroing is visible to all percpu accesses which can see
-* the following __PERCPU_REF_ATOMIC clearing.
+* the following __PERCPU_REF_ATOMIC_DEAD clearing.
 */
for_each_possible_cpu(cpu)
*per_cpu_ptr(percpu_count, cpu) = 0;
 
smp_store_release(&ref->percpu_count_ptr,
- ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
+ ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_reinit);
-- 
1.9.3



Re: two more fixes for block/for-linus

2014-09-22 Thread Christoph Hellwig
On Mon, Sep 22, 2014 at 02:40:15PM -0400, Douglas Gilbert wrote:
> With these patches applied (actually a resync an hour
> ago with the for-linus tree which includes them), the
> freeze-during-boot-up problem that I have been seeing
> with an old SATA boot disk (perhaps 1.5 Gbps) for
> the last two weeks, has gone away.
>
> That SATA disk is connected to the motherboard (Gigabyte
> Z97M-D3H/Z97M-D3H, BIOS F5 05/30/2014) and has a standard
> AHCI interface as far as I can tell. dmesg confirms that.

Should have thought of the weird ATA error handling earlier.  Sorry Doug!



Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count

2014-09-22 Thread Tejun Heo
On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> Jens,
> 
> can we simply get these commits reverted from now if there's no better
> fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> support for scsi.

Patches going out right now.

Thanks.

-- 
tejun


[PATCH 9/9] percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky

2014-09-22 Thread Tejun Heo
Currently, a percpu_ref which is initialized with
PERCPU_REF_INIT_ATOMIC or switched to atomic mode via
switch_to_atomic() automatically reverts to percpu mode on the first
percpu_ref_reinit().  This makes the atomic mode difficult to use for
cases where a percpu_ref is used as a persistent on/off switch which
may be cycled multiple times.

This patch makes such atomic state sticky so that it survives through
kill/reinit cycles.  After this patch, atomic state is cleared only by
an explicit percpu_ref_switch_to_percpu() call.
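
For illustration, the behavior after this patch looks like the
following (a sketch using interfaces from earlier in the series, not
code from the patch):

	percpu_ref_switch_to_atomic(&ref, NULL);	/* atomic mode, now sticky */
	percpu_ref_kill(&ref);
	/* ... all refs drain, release() is invoked ... */
	percpu_ref_reinit(&ref);			/* revived, still atomic */
	percpu_ref_switch_to_percpu(&ref);		/* only this clears the sticky state */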

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Johannes Weiner 
---
 include/linux/percpu-refcount.h |  5 -
 lib/percpu-refcount.c   | 20 +++-
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 5f84bf0..8459d3a 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -65,7 +65,9 @@ enum {
 enum {
/*
 * Start w/ ref == 1 in atomic mode.  Can be switched to percpu
-* operation using percpu_ref_switch_to_percpu().
+* operation using percpu_ref_switch_to_percpu().  If initialized
+* with this flag, the ref will stay in atomic mode until
+* percpu_ref_switch_to_percpu() is invoked on it.
 */
PERCPU_REF_INIT_ATOMIC  = 1 << 0,
 
@@ -85,6 +87,7 @@ struct percpu_ref {
unsigned long   percpu_count_ptr;
percpu_ref_func_t   *release;
percpu_ref_func_t   *confirm_switch;
+   bool    force_atomic:1;
struct rcu_head rcu;
 };
 
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 74ec33e..c47e496 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -68,6 +68,8 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t 
*release,
if (!ref->percpu_count_ptr)
return -ENOMEM;
 
+   ref->force_atomic = flags & PERCPU_REF_INIT_ATOMIC;
+
if (flags & (PERCPU_REF_INIT_ATOMIC | PERCPU_REF_INIT_DEAD))
ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
else
@@ -203,7 +205,8 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
  * are guaraneed to be in atomic mode, @confirm_switch, which may not
  * block, is invoked.  This function may be invoked concurrently with all
  * the get/put operations and can safely be mixed with kill and reinit
- * operations.
+ * operations.  Note that @ref will stay in atomic mode across kill/reinit
+ * cycles until percpu_ref_switch_to_percpu() is called.
  *
  * This function normally doesn't block and can be called from any context
  * but it may block if @confirm_kill is specified and @ref is already in
@@ -217,6 +220,7 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_switch)
 {
+   ref->force_atomic = true;
__percpu_ref_switch_to_atomic(ref, confirm_switch);
 }
 
@@ -256,7 +260,10 @@ void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
  *
  * Switch @ref to percpu mode.  This function may be invoked concurrently
  * with all the get/put operations and can safely be mixed with kill and
- * reinit operations.
+ * reinit operations.  This function reverses the sticky atomic state set
+ * by PERCPU_REF_INIT_ATOMIC or percpu_ref_switch_to_atomic().  If @ref is
+ * dying or dead, the actual switching takes place on the following
+ * percpu_ref_reinit().
  *
  * This function normally doesn't block and can be called from any context
  * but it may block if @ref is in the process of switching to atomic mode
@@ -264,6 +271,8 @@ void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
  */
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
+   ref->force_atomic = false;
+
/* a dying or dead ref can't be switched to percpu mode w/o reinit */
if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD))
__percpu_ref_switch_to_percpu(ref);
@@ -305,8 +314,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
  * @ref: perpcu_ref to re-initialize
  *
  * Re-initialize @ref so that it's in the same state as when it finished
- * percpu_ref_init().  @ref must have been initialized successfully and
- * reached 0 but not exited.
+ * percpu_ref_init() ignoring %PERCPU_REF_INIT_DEAD.  @ref must have been
+ * initialized successfully and reached 0 but not exited.
  *
  * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
  * this function is in progress.
@@ -317,6 +326,7 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 
ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
percpu_ref_get(ref);
-   __percpu_ref_switch_to_percpu(ref);
+   if (!ref->force_atomic)
+   __percpu_ref_switch_to_percpu(ref);
 }
 

[PATCH 6/9] percpu_ref: decouple switching to atomic mode and killing

2014-09-22 Thread Tejun Heo
percpu_ref has treated the dropping of the base reference and
switching to atomic mode as an integral operation; however, there's
nothing inherent tying the two together.

The use cases for percpu_ref have been expanding continuously.  While
the current init/kill/reinit/exit model can cover a lot, the coupling
of kill/reinit with atomic/percpu mode switching is turning out to be
too restrictive for use cases where many percpu_refs are created and
destroyed back-to-back with only some of them reaching extended
operation.  The coupling also makes implementing always-atomic debug
mode difficult.

This patch separates out atomic mode switching into
percpu_ref_switch_to_atomic() and reimplements
percpu_ref_kill_and_confirm() on top of it.

* The handling of __PERCPU_REF_ATOMIC and __PERCPU_REF_DEAD is now
  differentiated.  Among get/put operations, percpu_ref_tryget_live()
  is the only one which cares about DEAD.

* percpu_ref_switch_to_atomic() can be called multiple times on the
  same ref.  This means that multiple @confirm_switch may get queued
  up, which we can't do reliably without an extra memory area.  This is
  handled by making the later invocation synchronously wait for the
  completion of the previous one.  This isn't particularly desirable
  but such synchronous waits shouldn't happen in most cases.
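
For example (a sketch of the caller-visible behavior, not code from
the patch), two back-to-back switches on the same ref now behave as
follows:

	percpu_ref_switch_to_atomic(&ref, first_confirm);
	/*
	 * first_confirm may not have run yet; the call below waits for
	 * it to complete before scheduling second_confirm, so it may
	 * sleep.
	 */
	percpu_ref_switch_to_atomic(&ref, second_confirm);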

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Johannes Weiner 
---
 include/linux/percpu-refcount.h |   8 ++-
 lib/percpu-refcount.c   | 141 +++-
 2 files changed, 116 insertions(+), 33 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 24cf157..03a02e9 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -76,9 +76,11 @@ struct percpu_ref {
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 percpu_ref_func_t *release, gfp_t gfp);
 void percpu_ref_exit(struct percpu_ref *ref);
+void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
+percpu_ref_func_t *confirm_switch);
+void percpu_ref_reinit(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
-void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
@@ -109,7 +111,7 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
/* paired with smp_store_release() in percpu_ref_reinit() */
smp_read_barrier_depends();
 
-   if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC_DEAD))
+   if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC))
return false;
 
*percpu_countp = (unsigned long __percpu *)percpu_ptr;
@@ -191,6 +193,8 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
if (__ref_is_percpu(ref, &percpu_count)) {
this_cpu_inc(*percpu_count);
ret = true;
+   } else if (!(ACCESS_ONCE(ref->percpu_count_ptr) & __PERCPU_REF_DEAD)) {
+   ret = atomic_long_inc_not_zero(&ref->count);
}
 
rcu_read_unlock_sched();
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b0b8c09..56a7c0d 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -1,6 +1,8 @@
 #define pr_fmt(fmt) "%s: " fmt "\n", __func__
 
#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
#include <linux/percpu-refcount.h>
 
 /*
@@ -31,6 +33,8 @@
 
 #define PERCPU_COUNT_BIAS  (1LU << (BITS_PER_LONG - 1))
 
+static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
+
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
return (unsigned long __percpu *)
@@ -88,7 +92,19 @@ void percpu_ref_exit(struct percpu_ref *ref)
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
 
-static void percpu_ref_kill_rcu(struct rcu_head *rcu)
+static void percpu_ref_call_confirm_rcu(struct rcu_head *rcu)
+{
+   struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
+
+   ref->confirm_switch(ref);
+   ref->confirm_switch = NULL;
+   wake_up_all(&percpu_ref_switch_waitq);
+
+   /* drop ref from percpu_ref_switch_to_atomic() */
+   percpu_ref_put(ref);
+}
+
+static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 {
struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
@@ -116,47 +132,79 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
 
WARN_ONCE(atomic_long_read(&ref->count) <= 0,
- "percpu ref (%pf) <= 0 (%ld) after killed",
+ "percpu ref (%pf) <= 0 (%ld) after switching to atomic",
  ref->release, atomic_long_read(&ref->count));
 
-   /* @ref is viewed as dead on all CPUs, send out kill confirmation */
-   if (ref->confirm_switch)
-   ref->confirm_switch(ref);
+   

Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count

2014-09-22 Thread Christoph Hellwig
Jens,

can we simply get these commits reverted from now if there's no better
fix?  I'd hate to have this boot stall in the first kernel with blk-mq
support for scsi.



[PATCH 8/9] percpu_ref: add PERCPU_REF_INIT_* flags

2014-09-22 Thread Tejun Heo
With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC: start ref in atomic mode
* PERCPU_REF_INIT_DEAD  : start ref killed and drained

These flags should be able to serve the above two use cases.
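
For example (hypothetical usage; "sw" and sw_release() are made up),
a persistent switch which should start in its off state:

	/* start killed and drained; this also implies atomic mode */
	err = percpu_ref_init(&sw->ref, sw_release,
			      PERCPU_REF_INIT_DEAD, GFP_KERNEL);
	if (err)
		return err;

	percpu_ref_reinit(&sw->ref);	/* turn the switch on */
	percpu_ref_kill(&sw->ref);	/* and back off */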

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Johannes Weiner 
---
 block/blk-mq.c  |  2 +-
 fs/aio.c|  4 ++--
 include/linux/percpu-refcount.h | 18 +-
 kernel/cgroup.c |  7 ---
 lib/percpu-refcount.c   | 24 +++-
 5 files changed, 43 insertions(+), 12 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 702df07..3f6e6f5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1777,7 +1777,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
goto err_hctxs;
 
if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-   GFP_KERNEL))
+   0, GFP_KERNEL))
goto err_map;
 
setup_timer(>timeout, blk_mq_rq_timer, (unsigned long) q);
diff --git a/fs/aio.c b/fs/aio.c
index 93fbcc0f..9b6d5d6 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -666,10 +666,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
 
INIT_LIST_HEAD(>active_reqs);
 
-   if (percpu_ref_init(&ctx->users, free_ioctx_users, GFP_KERNEL))
+   if (percpu_ref_init(&ctx->users, free_ioctx_users, 0, GFP_KERNEL))
goto err;
 
-   if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs, GFP_KERNEL))
+   if (percpu_ref_init(&ctx->reqs, free_ioctx_reqs, 0, GFP_KERNEL))
goto err;
 
ctx->cpu = alloc_percpu(struct kioctx_cpu);
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index e41ca20..5f84bf0 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -61,6 +61,21 @@ enum {
__PERCPU_REF_ATOMIC_DEAD = __PERCPU_REF_ATOMIC | __PERCPU_REF_DEAD,
 };
 
+/* @flags for percpu_ref_init() */
+enum {
+   /*
+* Start w/ ref == 1 in atomic mode.  Can be switched to percpu
+* operation using percpu_ref_switch_to_percpu().
+*/
+   PERCPU_REF_INIT_ATOMIC  = 1 << 0,
+
+   /*
+* Start dead w/ ref == 0 in atomic mode.  Must be revived with
+* percpu_ref_reinit() before used.  Implies INIT_ATOMIC.
+*/
+   PERCPU_REF_INIT_DEAD= 1 << 1,
+};
+
 struct percpu_ref {
atomic_long_t   count;
/*
@@ -74,7 +89,8 @@ struct percpu_ref {
 };
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
-percpu_ref_func_t *release, gfp_t gfp);
+percpu_ref_func_t *release, unsigned int flags,
+gfp_t gfp);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_switch);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 589b4d8..e2fbcc1 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1628,7 +1628,8 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask)
goto out;
root_cgrp->id = ret;
 
-   ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, GFP_KERNEL);
+   ret = percpu_ref_init(&root_cgrp->self.refcnt, css_release, 0,
+ GFP_KERNEL);
if (ret)
goto out;
 
@@ -4487,7 +4488,7 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
 
init_and_link_css(css, ss, cgrp);
 
-   err = percpu_ref_init(&css->refcnt, css_release, GFP_KERNEL);
+   err = percpu_ref_init(&css->refcnt, css_release, 0, GFP_KERNEL);
if (err)
goto err_free_css;
 
@@ -4555,7 +4556,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
goto out_unlock;
}
 
-   ret = percpu_ref_init(&cgrp->self.refcnt, css_release, GFP_KERNEL);
+   ret = percpu_ref_init(&cgrp->self.refcnt, css_release, 0, GFP_KERNEL);
if (ret)
goto out_free_cgrp;
 
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 548b19e..74ec33e 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c

[PATCH 3/9] percpu_ref: replace pcpu_ prefix with percpu_

2014-09-22 Thread Tejun Heo
percpu_ref uses pcpu_ prefix for internal stuff and percpu_ for
externally visible ones.  This is the same convention used in the
percpu allocator implementation.  It works fine there but percpu_ref
doesn't have too much internal-only stuff and scattered usages of
pcpu_ prefix are confusing than helpful.

This patch replaces all pcpu_ prefixes with percpu_.  This is pure
rename and there's no functional change.  Note that PCPU_REF_DEAD is
renamed to __PERCPU_REF_DEAD to signify that the flag is internal.

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
---
 include/linux/percpu-refcount.h | 46 -
 lib/percpu-refcount.c   | 56 +
 2 files changed, 52 insertions(+), 50 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index d44b027..3d463a3 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -13,7 +13,7 @@
  *
  * The refcount will have a range of 0 to ((1U << 31) - 1), i.e. one bit less
  * than an atomic_t - this is because of the way shutdown works, see
- * percpu_ref_kill()/PCPU_COUNT_BIAS.
+ * percpu_ref_kill()/PERCPU_COUNT_BIAS.
  *
  * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
  * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
@@ -60,7 +60,7 @@ struct percpu_ref {
 * The low bit of the pointer indicates whether the ref is in percpu
 * mode; if set, then get/put will manipulate the atomic_t.
 */
-   unsigned long   pcpu_count_ptr;
+   unsigned long   percpu_count_ptr;
percpu_ref_func_t   *release;
percpu_ref_func_t   *confirm_kill;
struct rcu_head rcu;
@@ -88,26 +88,26 @@ static inline void percpu_ref_kill(struct percpu_ref *ref)
return percpu_ref_kill_and_confirm(ref, NULL);
 }
 
-#define PCPU_REF_DEAD  1
+#define __PERCPU_REF_DEAD  1
 
 /*
  * Internal helper.  Don't use outside percpu-refcount proper.  The
  * function doesn't return the pointer and let the caller test it for NULL
  * because doing so forces the compiler to generate two conditional
- * branches as it can't assume that @ref->pcpu_count is not NULL.
+ * branches as it can't assume that @ref->percpu_count is not NULL.
  */
-static inline bool __pcpu_ref_alive(struct percpu_ref *ref,
-   unsigned long __percpu **pcpu_countp)
+static inline bool __percpu_ref_alive(struct percpu_ref *ref,
+ unsigned long __percpu **percpu_countp)
 {
-   unsigned long pcpu_ptr = ACCESS_ONCE(ref->pcpu_count_ptr);
+   unsigned long percpu_ptr = ACCESS_ONCE(ref->percpu_count_ptr);
 
/* paired with smp_store_release() in percpu_ref_reinit() */
smp_read_barrier_depends();
 
-   if (unlikely(pcpu_ptr & PCPU_REF_DEAD))
+   if (unlikely(percpu_ptr & __PERCPU_REF_DEAD))
return false;
 
-   *pcpu_countp = (unsigned long __percpu *)pcpu_ptr;
+   *percpu_countp = (unsigned long __percpu *)percpu_ptr;
return true;
 }
 
@@ -121,12 +121,12 @@ static inline bool __pcpu_ref_alive(struct percpu_ref *ref,
  */
 static inline void percpu_ref_get(struct percpu_ref *ref)
 {
-   unsigned long __percpu *pcpu_count;
+   unsigned long __percpu *percpu_count;
 
rcu_read_lock_sched();
 
-   if (__pcpu_ref_alive(ref, &pcpu_count))
-   this_cpu_inc(*pcpu_count);
+   if (__percpu_ref_alive(ref, &percpu_count))
+   this_cpu_inc(*percpu_count);
else
atomic_long_inc(&ref->count);
 
@@ -144,13 +144,13 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
  */
 static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 {
-   unsigned long __percpu *pcpu_count;
+   unsigned long __percpu *percpu_count;
int ret;
 
rcu_read_lock_sched();
 
-   if (__pcpu_ref_alive(ref, &pcpu_count)) {
-   this_cpu_inc(*pcpu_count);
+   if (__percpu_ref_alive(ref, &percpu_count)) {
+   this_cpu_inc(*percpu_count);
ret = true;
} else {
ret = atomic_long_inc_not_zero(&ref->count);
@@ -178,13 +178,13 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
  */
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
-   unsigned long __percpu *pcpu_count;
+   unsigned long __percpu *percpu_count;
int ret = false;
 
rcu_read_lock_sched();
 
-   if (__pcpu_ref_alive(ref, &pcpu_count)) {
-   this_cpu_inc(*pcpu_count);
+   if (__percpu_ref_alive(ref, &percpu_count)) {
+   this_cpu_inc(*percpu_count);
ret = true;
}
 
@@ -204,12 +204,12 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
  */
 static inline void percpu_ref_put(struct percpu_ref *ref)
 {
-   unsigned long __percpu *pcpu_count;
+   unsigned long __percpu *percpu_count;

[PATCH 7/9] percpu_ref: decouple switching to percpu mode and reinit

2014-09-22 Thread Tejun Heo
percpu_ref has treated the dropping of the base reference and
switching to atomic mode as an integral operation; however, there's
nothing inherent tying the two together.

The use cases for percpu_ref have been expanding continuously.  While
the current init/kill/reinit/exit model can cover a lot, the coupling
of kill/reinit with atomic/percpu mode switching is turning out to be
too restrictive for use cases where many percpu_refs are created and
destroyed back-to-back with only some of them reaching extended
operation.  The coupling also makes implementing always-atomic debug
mode difficult.

This patch separates out percpu mode switching into
percpu_ref_switch_to_percpu() and reimplements percpu_ref_reinit() on
top of it.

* DEAD still requires ATOMIC.  A dead ref can't be switched to percpu
  mode w/o going through reinit.
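
For illustration (a sketch, not code from the patch), the resulting
rule for a dying or dead ref:

	percpu_ref_kill(&ref);			/* ref is now ATOMIC | DEAD */
	percpu_ref_switch_to_percpu(&ref);	/* ignored while DEAD */
	percpu_ref_reinit(&ref);		/* revives @ref in percpu mode */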

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Johannes Weiner 
---
 include/linux/percpu-refcount.h |  3 +-
 lib/percpu-refcount.c   | 73 ++---
 2 files changed, 56 insertions(+), 20 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 03a02e9..e41ca20 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -78,9 +78,10 @@ int __must_check percpu_ref_init(struct percpu_ref *ref,
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_switch);
-void percpu_ref_reinit(struct percpu_ref *ref);
+void percpu_ref_switch_to_percpu(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 percpu_ref_func_t *confirm_kill);
+void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 56a7c0d..548b19e 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -206,40 +206,54 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
__percpu_ref_switch_to_atomic(ref, confirm_switch);
 }
 
-/**
- * percpu_ref_reinit - re-initialize a percpu refcount
- * @ref: perpcu_ref to re-initialize
- *
- * Re-initialize @ref so that it's in the same state as when it finished
- * percpu_ref_init().  @ref must have been initialized successfully, killed
- * and reached 0 but not exited.
- *
- * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
- * this function is in progress.
- */
-void percpu_ref_reinit(struct percpu_ref *ref)
+void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
int cpu;
 
BUG_ON(!percpu_count);
-   WARN_ON_ONCE(!percpu_ref_is_zero(ref));
 
-   atomic_long_set(&ref->count, 1 + PERCPU_COUNT_BIAS);
+   if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC))
+   return;
+
+   wait_event(percpu_ref_switch_waitq, !ref->confirm_switch);
+
+   atomic_long_add(PERCPU_COUNT_BIAS, &ref->count);
 
/*
 * Restore per-cpu operation.  smp_store_release() is paired with
 * smp_read_barrier_depends() in __ref_is_percpu() and guarantees
 * that the zeroing is visible to all percpu accesses which can see
-* the following __PERCPU_REF_ATOMIC_DEAD clearing.
+* the following __PERCPU_REF_ATOMIC clearing.
 */
for_each_possible_cpu(cpu)
*per_cpu_ptr(percpu_count, cpu) = 0;
 
smp_store_release(&ref->percpu_count_ptr,
- ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
+ ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
+}
+
+/**
+ * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode
+ * @ref: percpu_ref to switch to percpu mode
+ *
+ * There's no reason to use this function for the usual reference counting.
+ * To re-use an expired ref, use percpu_ref_reinit().
+ *
+ * Switch @ref to percpu mode.  This function may be invoked concurrently
+ * with all the get/put operations and can safely be mixed with kill and
+ * reinit operations.
+ *
+ * This function normally doesn't block and can be called from any context
+ * but it may block if @ref is in the process of switching to atomic mode
+ * by percpu_ref_switch_atomic().
+ */
+void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
+{
+   /* a dying or dead ref can't be switched to percpu mode w/o reinit */
+   if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD))
+   __percpu_ref_switch_to_percpu(ref);
 }
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
 
 /**
  * percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation
@@ -253,8 +267,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_reinit);
  * percpu_ref_tryget_live() for details.
  *
  * This function normally doesn't block and can be called from any context
- * but it may block if 

[PATCH] ata: Disabling the async PM for JMicron chips

2014-09-22 Thread Chuansheng Liu
Similar to commit (ata: Disabling the async PM for JMicron chip 363/361),
Barto found the same issue for JMicron chip 368: 363/368 have no
parent-child relationship, but they do have a power dependency.

So here we exclude the JMicron chips from the pm_async method directly,
to avoid further similar issues.

Details in:
https://bugzilla.kernel.org/show_bug.cgi?id=84861

Reported-and-tested-by: Barto 
Signed-off-by: Chuansheng Liu 
---
 drivers/ata/ahci.c |6 +++---
 drivers/ata/pata_jmicron.c |6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index a0cc0ed..c096d49 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1345,10 +1345,10 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 * follow the sequence one by one, otherwise one of them can not be
 * powered on successfully, so here we disable the async suspend
 * method for these chips.
+* Jmicron chip 368 has been found has the similar issue, here we can
+* exclude the Jmicron family directly to avoid other similar issues.
 */
-   if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
-   (pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
-   pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
+   if (pdev->vendor == PCI_VENDOR_ID_JMICRON)
device_disable_async_suspend(&pdev->dev);
 
/* acquire resources */
diff --git a/drivers/ata/pata_jmicron.c b/drivers/ata/pata_jmicron.c
index 47e418b..48c993b 100644
--- a/drivers/ata/pata_jmicron.c
+++ b/drivers/ata/pata_jmicron.c
@@ -149,10 +149,10 @@ static int jmicron_init_one (struct pci_dev *pdev, const struct pci_device_id *id)
 * follow the sequence one by one, otherwise one of them can not be
 * powered on successfully, so here we disable the async suspend
 * method for these chips.
+* Jmicron chip 368 has been found has the similar issue, here we can
+* exclude the Jmicron family directly to avoid other similar issues.
 */
-   if (pdev->vendor == PCI_VENDOR_ID_JMICRON &&
-   (pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 ||
-   pdev->device == PCI_DEVICE_ID_JMICRON_JMB361))
+   if (pdev->vendor == PCI_VENDOR_ID_JMICRON)
device_disable_async_suspend(&pdev->dev);
 
return ata_pci_bmdma_init_one(pdev, ppi, _sht, NULL, 0);
-- 
1.7.9.5



Re: [PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread Tejun Heo
On Tue, Sep 23, 2014 at 03:40:58PM +1000, NeilBrown wrote:
> > Oh, I meant the buffer seqfile read op writes to, so it depends on the
> > fact that the allocation is only on the first read?  That seems
> > extremely brittle to me, especially for an issue which tends to be
> > difficult to reproduce.
> 
> It is easy for user-space to ensure they read once before any critical time..

Sure, but it's a hard and subtle dependency on an extremely obscure
implementation detail.

> > I'd much rather keep things direct and make it explicitly allocate r/w
> > buffer(s) on open and disallow seq_file operations on such files.
> 
> As far as I can tell, seq_read is used on all sysfs files that are
> readable except for 'binary' files.  Are you suggesting all files that might
> need to be accessed without a kmalloc have to be binary files?

kernfs ->direct_read() callback doesn't go through seq_file.  sysfs
can be extended to support that for regular files, I guess.  Or just
make those special files binary?

> Having to identify those files which are important in advance seems the more
> "brittle" approach to me.  I would much rather it "just worked"

I disagree.  The files which shouldn't involve memory allocations must
be identified no matter what.  They're *very* special.  And the rules
that userland has to follow seem completely broken to me.  "Small"
writes are okay, whatever that means, and "small" reads are okay too
as long as it isn't the first read.  Ooh, BTW, if the second read ends
up expanding the initial buffer, it isn't okay - the initial boundary
is PAGE_SIZE and the buffer is expanded twice on each overflow.  How
are these rules okay?  This is borderline crazy.  In addition, the
read path involves a lot more code this way.  It ends up locking down
buffer policies of the whole seqfile implementation.

> Would you prefer a new per-attribute flag which directed sysfs to
> pre-allocate a full page, or a 'max_size' attribute which caused a buffer of
> that size to be allocated on open?
> The same size would be used to pre-allocate the seqfile buf (like
> single_open_size does) if reads were supported.

Yes but I really think we should avoid seqfile dependency.

Thanks.

-- 
tejun


Re: [PATCH net-next] mellanox: Change en_print to return void

2014-09-22 Thread Amir Vadai
On 9/22/2014 8:40 PM, Joe Perches wrote:
> No caller or macro uses the return value so make it void.
> 
> Signed-off-by: Joe Perches 
> ---
> This change is associated to a desire to eventually
> change printk to return void.
> 
>  drivers/net/ethernet/mellanox/mlx4/en_main.c | 17 +++--
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 ++--
>  2 files changed, 9 insertions(+), 12 deletions(-)

Thanks Joe.

Acked-By: Amir Vadai 





Re: [PATCH v3 1/5] x86, mm, pat: Set WT to PA7 slot of PAT MSR

2014-09-22 Thread Juergen Gross

On 09/17/2014 09:48 PM, Toshi Kani wrote:

This patch sets WT to the PA7 slot in the PAT MSR when the processor
is not affected by the PAT errata.  The PA7 slot is chosen to further
minimize the risk of using the PAT bit as the PA3 slot is UC and is
not currently used.

The following Intel processors are affected by the PAT errata.

errata   cpuid

Pentium 2, A52   family 0x6, model 0x5
Pentium 3, E27   family 0x6, model 0x7, 0x8
Pentium 3 Xeon, G26  family 0x6, model 0x7, 0x8, 0xa
Pentium M, Y26   family 0x6, model 0x9
Pentium M 90nm, X9   family 0x6, model 0xd
Pentium 4, N46   family 0xf, model 0x0

Instead of making sharp boundary checks, this patch makes conservative
checks to exclude all Pentium 2, 3, M and 4 family processors.  For
such processors, _PAGE_CACHE_MODE_WT is redirected to UC- per the
default setup in __cachemode2pte_tbl[].

Signed-off-by: Toshi Kani 


Reviewed-by: Juergen Gross 


---
  arch/x86/mm/pat.c |   64 +
  1 file changed, 49 insertions(+), 15 deletions(-)

diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c
index ff31851..db687c3 100644
--- a/arch/x86/mm/pat.c
+++ b/arch/x86/mm/pat.c
@@ -133,6 +133,7 @@ void pat_init(void)
  {
u64 pat;
bool boot_cpu = !boot_pat_state;
+   struct cpuinfo_x86 *c = &boot_cpu_data;

if (!pat_enabled)
return;
@@ -153,21 +154,54 @@ void pat_init(void)
}
}

-   /* Set PWT to Write-Combining. All other bits stay the same */
-   /*
-* PTE encoding used in Linux:
-*  PAT
-*  |PCD
-*  ||PWT
-*  |||
-*  000 WB  _PAGE_CACHE_WB
-*  001 WC  _PAGE_CACHE_WC
-*  010 UC- _PAGE_CACHE_UC_MINUS
-*  011 UC  _PAGE_CACHE_UC
-* PAT bit unused
-*/
-   pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
- PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, UC);
+   if ((c->x86_vendor == X86_VENDOR_INTEL) &&
+   (((c->x86 == 0x6) && (c->x86_model <= 0xd)) ||
+((c->x86 == 0xf) && (c->x86_model <= 0x6)))) {
+   /*
+* PAT support with the lower four entries. Intel Pentium 2,
+* 3, M, and 4 are affected by PAT errata, which makes the
+* upper four entries unusable.  We do not use the upper four
+* entries for all the affected processor families for safe.
+*
+*  PTE encoding used in Linux:
+*  PAT
+*  |PCD
+*  ||PWT  PAT
+*  |||    slot
+*  000    0    WB : _PAGE_CACHE_MODE_WB
+*  001    1    WC : _PAGE_CACHE_MODE_WC
+*  010    2    UC-: _PAGE_CACHE_MODE_UC_MINUS
+*  011    3    UC : _PAGE_CACHE_MODE_UC
+* PAT bit unused
+*
+* NOTE: When WT or WP is used, it is redirected to UC- per
+* the default setup in __cachemode2pte_tbl[].
+*/
+   pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
+ PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, UC);
+   } else {
+   /*
+* PAT full support. WT is set to slot 7, which minimizes
+* the risk of using the PAT bit as slot 3 is UC and is
+* currently unused. Slot 4 should remain as reserved.
+*
+*  PTE encoding used in Linux:
+*  PAT
+*  |PCD
+*  ||PWT  PAT
+*  |||     slot
+*  000 0    WB : _PAGE_CACHE_MODE_WB
+*  001 1    WC : _PAGE_CACHE_MODE_WC
+*  010 2    UC-: _PAGE_CACHE_MODE_UC_MINUS
+*  011 3    UC : _PAGE_CACHE_MODE_UC
+*  100 4
+*  101 5
+*  110 6
+*  111 7    WT : _PAGE_CACHE_MODE_WT
+*/
+   pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) |
+ PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, WT);
+   }

/* Boot CPU check */
if (!boot_pat_state)

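For readers who want to double-check the two register values above, here is a
small stand-alone sketch (illustrative only; the PAT() macro and the x86
memory-type encodings UC=0, WC=1, WT=4, WP=5, WB=6, UC-=7 mirror the ones
pat_init() uses):

#include <stdint.h>
#include <stdio.h>

/* x86 memory-type encodings, as used by pat_init() */
enum { PAT_UC = 0, PAT_WC = 1, PAT_WT = 4, PAT_WP = 5,
       PAT_WB = 6, PAT_UC_MINUS = 7 };

/* each PAT MSR slot is one byte wide */
#define PAT(slot, type) ((uint64_t)(type) << ((slot) * 8))

int main(void)
{
	/* errata case: the upper four slots mirror the lower four */
	uint64_t pat_errata =
		PAT(0, PAT_WB) | PAT(1, PAT_WC) | PAT(2, PAT_UC_MINUS) |
		PAT(3, PAT_UC) | PAT(4, PAT_WB) | PAT(5, PAT_WC) |
		PAT(6, PAT_UC_MINUS) | PAT(7, PAT_UC);
	/* full-PAT case: WT goes into slot 7 */
	uint64_t pat_full =
		PAT(0, PAT_WB) | PAT(1, PAT_WC) | PAT(2, PAT_UC_MINUS) |
		PAT(3, PAT_UC) | PAT(4, PAT_WB) | PAT(5, PAT_WC) |
		PAT(6, PAT_UC_MINUS) | PAT(7, PAT_WT);

	printf("errata PAT MSR: %#018llx\n",	/* 0x0007010600070106 */
	       (unsigned long long)pat_errata);
	printf("full   PAT MSR: %#018llx\n",	/* 0x0407010600070106 */
	       (unsigned long long)pat_full);
	return 0;
}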


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] x86 fixes

2014-09-22 Thread H. Peter Anvin
That would be my guess, too.

On September 22, 2014 10:37:11 PM PDT, Ingo Molnar  wrote:
>
>* Ingo Molnar  wrote:
>
>> 
>> * Linus Torvalds  wrote:
>> 
>> > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar 
>wrote:
>> > >
>> > > Please pull the latest x86-urgent-for-linus git tree from:
>> > 
>> > I only just noticed, but this pull request causes my Sony Vaio 
>> > laptop to immediately reboot at startup.
>> > 
>> > I'm assuming it's one of the efi changes, but I'm bisecting now 
>> > to say exactly where it happens. It will get reverted.
>> 
>> I've Cc:-ed Matt.
>> 
>> My guess would be one of these two EFI commits:
>> 
>>   * Fix early boot regression affecting x86 EFI boot stub when
>loading
>> initrds above 4GB - Yinghai Lu
>> 
>> 47226ad4f4cf x86/efi: Only load initrd above 4g on second try
>> 
>>   * Relocate GOT entries in the x86 EFI boot stub now that we
>have
>> symbols with global visibility - Matt Fleming
>> 
>> 9cb0e394234d x86/efi: Fixup GOT in all boot code paths
>> 
>> If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug 
>> in the assembly code. If so then we'd have to revert this, and 
>> reintroduce another regression, caused by EFI commit 
>> f23cf8bd5c1f49 in this merge window. The most recent commit is 
>> easy to revert, the older one not.
>> 
>> If it's 47226ad4f4cf then we'd reintroduce the regression caused 
>> by 4bf7111f501 in the previous merge window. They both revert 
>> cleanly after each other - but it might be safer to just revert 
>> the most recent one.
>>
>> My guess is that your regression is caused by 47226ad4f4cf.
>
>Wrong sha1: my guess is on 9cb0e394234d, the GOT fixup.
>
>Thanks,
>
>   Ingo

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


Re: linux-next: manual merge of the tiny tree with the tip tree

2014-09-22 Thread Ingo Molnar

* Stephen Rothwell  wrote:

> Hi Josh,
> 
> Today's linux-next merge of the tiny tree got conflicts in
> arch/x86/kernel/process_32.c and arch/x86/kernel/process_64.c between
> commits dc56c0f9b870 ("x86, fpu: Shift "fpu_counter = 0" from
> copy_thread() to arch_dup_task_struct()") and 6f46b3aef003 ("x86:
> copy_thread: Don't nullify ->ptrace_bps twice") from the tip tree and
> commits a1cf09f93e66 ("x86: process: Unify 32-bit and 64-bit
> copy_thread I/O bitmap handling") and e4a191d1e05b ("x86: Support
> compiling out userspace I/O (iopl and ioperm)") from the tiny tree.

Why are such changes in the 'tiny' tree? These are sensitive 
arch/x86 files, and any unification and compilation-out support 
patches need to go through the proper review channels and be 
merged upstream via the x86 tree if accepted...

In particular, the gratuitous sprinkling of #ifdef 
CONFIG_X86_IOPORTs around x86 code looks ugly.

Josh, don't do that, this route is really unacceptable. Please 
resubmit the latest patches and remove these from linux-next.

Thanks,

Ingo


Re: [PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread NeilBrown
On Tue, 23 Sep 2014 00:55:49 -0400 Tejun Heo  wrote:

> Hello, Neil.
> 
> On Tue, Sep 23, 2014 at 02:46:50PM +1000, NeilBrown wrote:
> > seqfile is only safe for reads.  sysfs via kernfs uses seq_read(), so there
> > is only a single allocation on the first read.
> > 
> > It doesn't really relate to fixing writes, except to point out that only
> > writes need to be "fixed".  Reads already work.
> 
> Oh, I meant the buffer seqfile read op writes to, so it depends on the
> fact that the allocation is only on the first read?  That seems
> extremely brittle to me, especially for an issue which tends to be
> difficult to reproduce.

It is easy for user-space to ensure they read once before any critical time.

> 
> > Separately:
> > 
> > > Ugh... :( If this can't be avoided at all, I'd much prefer it to be
> > > something explicit - a flag marking the file as needing a persistent
> > > write buffer which is allocated on open.  "Small" writes on stack
> > > feels way too implicit to me.
> > 
> > How about if we add seq_getbuf() and seq_putbuf() to seqfile
> > which takes a 'struct seq_file' and a size and returns the ->buf
> > after making sure it is big enough.
> > It also claims and releases the seqfile ->lock.
> > 
> > Then we would be using the same buffer for reads and write.
> > 
> > Does that sound suitable?  It uses existing infrastructure and avoids having
> > to identify in advance which attributes it is important for.
> 
> I'd much rather keep things direct and make it explicitly allocate r/w
> buffer(s) on open and disallow seq_file operations on such files.

As far as I can tell, seq_read is used on all sysfs files that are
readable except for 'binary' files.  Are you suggesting all files that might
need to be accessed without a kmalloc have to be binary files?

Having to identify those files which are important in advance seems the more
"brittle" approach to me.  I would much rather it "just worked"

Would you prefer a new per-attribute flag which directed sysfs to
pre-allocate a full page, or a 'max_size' attribute which caused a buffer of
that size to be allocated on open?
The same size would be used to pre-allocate the seqfile buf (like
single_open_size does) if reads were supported.
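
A rough sketch of the seq_getbuf()/seq_putbuf() helpers proposed above might
look like the following. This is hypothetical, not existing seq_file API, and
it glosses over the interaction with an in-progress read:

/* Hypothetical helpers, not in mainline.  Return m->buf, grown to at
 * least @size, with m->lock held; seq_putbuf() drops the lock. */
static void *seq_getbuf(struct seq_file *m, size_t size)
{
	mutex_lock(&m->lock);
	if (m->size < size) {
		char *buf = kmalloc(size, GFP_KERNEL);

		if (!buf) {
			mutex_unlock(&m->lock);
			return NULL;
		}
		kfree(m->buf);
		m->buf = buf;
		m->size = size;
	}
	return m->buf;
}

static void seq_putbuf(struct seq_file *m)
{
	mutex_unlock(&m->lock);
}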

Thanks,
NeilBrown





Re: [PATCH] i2c: move acpi code back into the core

2014-09-22 Thread Wolfram Sang

> Sorry for the late response; I was sick. I couldn't write this patch in
> time. Sorry again. I will test it soon.

Oh, get well soon! Please say so next time, so I know.





Re: [GIT PULL] x86 fixes

2014-09-22 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Linus Torvalds  wrote:
> 
> > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar  wrote:
> > >
> > > Please pull the latest x86-urgent-for-linus git tree from:
> > 
> > I only just noticed, but this pull request causes my Sony Vaio 
> > laptop to immediately reboot at startup.
> > 
> > I'm assuming it's one of the efi changes, but I'm bisecting now 
> > to say exactly where it happens. It will get reverted.
> 
> I've Cc:-ed Matt.
> 
> My guess would be one of these two EFI commits:
> 
>   * Fix early boot regression affecting x86 EFI boot stub when loading
> initrds above 4GB - Yinghai Lu
> 
> 47226ad4f4cf x86/efi: Only load initrd above 4g on second try
> 
>   * Relocate GOT entries in the x86 EFI boot stub now that we have
> symbols with global visibility - Matt Fleming
> 
> 9cb0e394234d x86/efi: Fixup GOT in all boot code paths
> 
> If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug 
> in the assembly code. If so then we'd have to revert this, and 
> reintroduce another regression, caused by EFI commit 
> f23cf8bd5c1f49 in this merge window. The most recent commit is 
> easy to revert, the older one not.
> 
> If it's 47226ad4f4cf then we'd reintroduce the regression caused 
> by 4bf7111f501 in the previous merge window. They both revert 
> cleanly after each other - but it might be safer to just revert 
> the most recent one.
>
> My guess is that your regression is caused by 47226ad4f4cf.

Wrong sha1: my guess is on 9cb0e394234d, the GOT fixup.

Thanks,

Ingo


Re: [GIT PULL] x86 fixes

2014-09-22 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar  wrote:
> >
> > Please pull the latest x86-urgent-for-linus git tree from:
> 
> I only just noticed, but this pull request causes my Sony Vaio 
> laptop to immediately reboot at startup.
> 
> I'm assuming it's one of the efi changes, but I'm bisecting now 
> to say exactly where it happens. It will get reverted.

I've Cc:-ed Matt.

My guess would be one of these two EFI commits:

  * Fix early boot regression affecting x86 EFI boot stub when loading
initrds above 4GB - Yinghai Lu

47226ad4f4cf x86/efi: Only load initrd above 4g on second try

  * Relocate GOT entries in the x86 EFI boot stub now that we have
symbols with global visibility - Matt Fleming

9cb0e394234d x86/efi: Fixup GOT in all boot code paths

If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug 
in the assembly code. If so then we'd have to revert this, and 
reintroduce another regression, caused by EFI commit 
f23cf8bd5c1f49 in this merge window. The most recent commit is 
easy to revert, the older one not.

If it's 47226ad4f4cf then we'd reintroduce the regression caused 
by 4bf7111f501 in the previous merge window. They both revert 
cleanly after each other - but it might be safer to just revert 
the most recent one.

My guess is that your regression is caused by 47226ad4f4cf.

Sorry about this, the timing is unfortunate.

Thanks,

Ingo


Re: [PATCH V3 0/3] x86: Full support of PAT

2014-09-22 Thread Juergen Gross

Hi,

any chance to have this in 3.18?

Juergen

On 09/12/2014 12:35 PM, Juergen Gross wrote:

The x86 architecture offers via the PAT (Page Attribute Table) a way to
specify different caching modes in page table entries. The PAT MSR contains
8 entries each specifying one of 6 possible cache modes. A pte references one
of those entries via 3 bits: _PAGE_PAT, _PAGE_PWT and _PAGE_PCD.

The Linux kernel currently supports only 4 different cache modes. The PAT MSR
is set up in a way that the setting of _PAGE_PAT in a pte doesn't matter: the
top 4 entries in the PAT MSR are the same as the 4 lower entries.

This results in the kernel not supporting e.g. write-through mode. Especially
this cache mode would speed up drivers of video cards which now have to use
uncached accesses.

OTOH some old processors (Pentium) don't support PAT correctly and the Xen
hypervisor has been using a different PAT MSR configuration for some time now
and can't change that as this setting is part of the ABI.

This patch set abstracts the cache mode from the pte and introduces tables to
translate between cache mode and pte bits (the default cache mode "write back"
is hard-wired to PAT entry 0). The tables are statically initialized with
values being compatible with old processors and current usage. As soon as the
PAT MSR is changed (or - in case of Xen - is read at boot time) the tables are
changed accordingly. Requests of mappings with special cache modes are always
possible now; if they are not supported, there will be a fallback to a
compatible but slower mode.
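
As a rough illustration of the translation described above (modeled on the
real __cachemode2pte_tbl[]; the exact default entries here are an assumption,
but the hard-wired WB at entry 0 and the WT/WP fallback to UC- follow the
description in this cover letter):

/* cache mode -> PTE bits (PAT/PCD/PWT); static defaults, sketch only */
static uint16_t __cachemode2pte_tbl[_PAGE_CACHE_MODE_NUM] = {
	[_PAGE_CACHE_MODE_WB]       = 0,		/* PAT entry 0 */
	[_PAGE_CACHE_MODE_WC]       = _PAGE_PWT,
	[_PAGE_CACHE_MODE_UC_MINUS] = _PAGE_PCD,
	[_PAGE_CACHE_MODE_UC]       = _PAGE_PCD | _PAGE_PWT,
	[_PAGE_CACHE_MODE_WT]       = _PAGE_PCD,	/* UC- fallback */
	[_PAGE_CACHE_MODE_WP]       = _PAGE_PCD,	/* UC- fallback */
};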

Summing it up, this patch set adds the following features:
- capability to support WT and WP cache modes on processors with full PAT
   support
- processors with no or incorrect PAT support still work as they do today, even
   if the WT or WP cache mode is selected by drivers for some pages
- reduction of Xen special handling regarding cache mode

Changes in V3:
- corrected two minor nits (UC_MINUS, again) detected by Toshi Kani

Changes in V2:
- simplified handling of PAT MSR write under Xen as suggested by David Vrabel
- removed resetting of pat_enabled under Xen
- two small corrections requested by Toshi Kani (UC_MINUS cache mode in
   vermilion driver, fix 32 bit kernel build failure)
- correct build error on non-x86 arch by moving definition of
   update_cache_mode_entry() to x86 specific header

Changes since RFC:
- renamed functions and variables as suggested by Toshi Kani
- corrected cache mode bits for WT and WP
- modified handling of PAT MSR write under Xen as suggested by Jan Beulich


Juergen Gross (3):
   x86: Make page cache mode a real type
   x86: Enable PAT to use cache mode translation tables
   Support Xen pv-domains using PAT

  arch/x86/include/asm/cacheflush.h |  38 ---
  arch/x86/include/asm/fb.h |   6 +-
  arch/x86/include/asm/io.h |   2 +-
  arch/x86/include/asm/pat.h|   7 +-
  arch/x86/include/asm/pgtable.h|  19 ++--
  arch/x86/include/asm/pgtable_types.h  |  96 
  arch/x86/mm/dump_pagetables.c |  24 ++--
  arch/x86/mm/init.c|  37 ++
  arch/x86/mm/init_64.c |   9 +-
  arch/x86/mm/iomap_32.c|  15 ++-
  arch/x86/mm/ioremap.c |  63 ++-
  arch/x86/mm/mm_internal.h |   2 +
  arch/x86/mm/pageattr.c|  84 --
  arch/x86/mm/pat.c | 181 +++---
  arch/x86/mm/pat_internal.h|  22 ++--
  arch/x86/mm/pat_rbtree.c  |   8 +-
  arch/x86/pci/i386.c   |   4 +-
  arch/x86/xen/enlighten.c  |  25 ++---
  arch/x86/xen/mmu.c|  48 +---
  arch/x86/xen/xen-ops.h|   1 -
  drivers/video/fbdev/gbefb.c   |   3 +-
  drivers/video/fbdev/vermilion/vermilion.c |   6 +-
  22 files changed, 421 insertions(+), 279 deletions(-)





Re: [f2fs-dev] [PATCH 08/10] f2fs: remove redundant operation during roll-forward recovery

2014-09-22 Thread Jaegeuk Kim
Hi Chao,

I fixed that. :)

Thanks,

On Mon, Sep 22, 2014 at 05:22:27PM +0800, Chao Yu wrote:
> > -Original Message-
> > From: Jaegeuk Kim [mailto:jaeg...@kernel.org]
> > Sent: Monday, September 15, 2014 6:14 AM
> > To: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org;
> > linux-f2fs-de...@lists.sourceforge.net
> > Cc: Jaegeuk Kim
> > Subject: [f2fs-dev] [PATCH 08/10] f2fs: remove redundant operation during 
> > roll-forward recovery
> > 
> > If the same data is updated multiple times, we don't need to redo all of
> > the operations.
> > Let's just update the latest one.
> 
> Reviewed-by: Chao Yu 
> 
> And one comment as following.
> 
> > 
> > Signed-off-by: Jaegeuk Kim 
> > ---
> >  fs/f2fs/f2fs.h |  4 +++-
> >  fs/f2fs/recovery.c | 41 +
> >  2 files changed, 20 insertions(+), 25 deletions(-)
> > 
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index 48d7d46..74dde99 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -137,7 +137,9 @@ struct discard_entry {
> >  struct fsync_inode_entry {
> > struct list_head list;  /* list head */
> > struct inode *inode;/* vfs inode pointer */
> > -   block_t blkaddr;/* block address locating the last inode */
> > +   block_t blkaddr;/* block address locating the last fsync */
> > +   block_t last_dentry;/* block address locating the last dentry */
> > +   block_t last_inode; /* block address locating the last inode */
> >  };
> > 
> >  #define nats_in_cursum(sum)(le16_to_cpu(sum->n_nats))
> > diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
> > index 6f7fbfa..95d9dc9 100644
> > --- a/fs/f2fs/recovery.c
> > +++ b/fs/f2fs/recovery.c
> > @@ -66,7 +66,7 @@ static struct fsync_inode_entry *get_fsync_inode(struct 
> > list_head *head,
> > return NULL;
> >  }
> > 
> > -static int recover_dentry(struct page *ipage, struct inode *inode)
> > +static int recover_dentry(struct inode *inode, struct page *ipage)
> >  {
> > struct f2fs_inode *raw_inode = F2FS_INODE(ipage);
> > nid_t pino = le32_to_cpu(raw_inode->i_pino);
> > @@ -140,7 +140,7 @@ out:
> > return err;
> >  }
> > 
> > -static void __recover_inode(struct inode *inode, struct page *page)
> > +static void recover_inode(struct inode *inode, struct page *page)
> >  {
> > struct f2fs_inode *raw = F2FS_INODE(page);
> > 
> > @@ -152,21 +152,9 @@ static void __recover_inode(struct inode *inode, 
> > struct page *page)
> > inode->i_atime.tv_nsec = le32_to_cpu(raw->i_mtime_nsec);
> > inode->i_ctime.tv_nsec = le32_to_cpu(raw->i_ctime_nsec);
> > inode->i_mtime.tv_nsec = le32_to_cpu(raw->i_mtime_nsec);
> > -}
> > -
> > -static int recover_inode(struct inode *inode, struct page *node_page)
> > -{
> > -   if (!IS_INODE(node_page))
> > -   return 0;
> > -
> > -   __recover_inode(inode, node_page);
> > -
> > -   if (is_dent_dnode(node_page))
> > -   return recover_dentry(node_page, inode);
> > 
> > f2fs_msg(inode->i_sb, KERN_NOTICE, "recover_inode: ino = %x, name = %s",
> > -   ino_of_node(node_page), F2FS_INODE(node_page)->i_name);
> > -   return 0;
> > +   ino_of_node(page), F2FS_INODE(page)->i_name);
> >  }
> > 
> >  static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head 
> > *head)
> > @@ -214,12 +202,11 @@ static int find_fsync_dnodes(struct f2fs_sb_info 
> > *sbi, struct list_head
> > *head)
> > }
> > 
> > /* add this fsync inode to the list */
> > -   entry = kmem_cache_alloc(fsync_entry_slab, GFP_NOFS);
> > +   entry = kmem_cache_alloc(fsync_entry_slab, 
> > GFP_F2FS_ZERO);
> > if (!entry) {
> > err = -ENOMEM;
> > break;
> > }
> > -
> > /*
> >  * CP | dnode(F) | inode(DF)
> >  * For this case, we should not give up now.
> > @@ -236,9 +223,11 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, 
> > struct list_head
> > *head)
> > }
> > entry->blkaddr = blkaddr;
> > 
> > -   err = recover_inode(entry->inode, page);
> > -   if (err && err != -ENOENT)
> > -   break;
> > +   if (IS_INODE(page)) {
> > +   entry->last_inode = blkaddr;
> > +   if (is_dent_dnode(page))
> > +   entry->last_dentry = blkaddr;
> > +   }
> >  next:
> > /* check next segment */
> > blkaddr = next_blkaddr_of_node(page);
> > @@ -455,11 +444,15 @@ static int recover_data(struct f2fs_sb_info *sbi,
> > /*
> >  * inode(x) | CP | inode(x) | dnode(F)
> >  * In this case, we can lose the latest inode(x).
> > -* So, call __recover_inode for the inode update.
> > +* So, call recover_inode for the inode update.

Re: [PATCH 0/4] ipc/shm.c: increase the limits for SHMMAX, SHMALL

2014-09-22 Thread Michael Kerrisk (man-pages)
On 06/03/2014 09:26 PM, Davidlohr Bueso wrote:
> On Fri, 2014-05-02 at 15:16 +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Manfred,
>>
>> On Mon, Apr 21, 2014 at 4:26 PM, Manfred Spraul
>>  wrote:
>>> Hi all,
>>>
>>> the increase of SHMMAX/SHMALL is now a 4 patch series.
>>> I don't have ideas how to improve it further.
>>
>> On the assumption that your patches are heading to mainline, could you
>> send me a man-pages patch for the changes?
> 
> It seems we're still behind here and the 3.16 merge window is already
> open. Please consider this, and again feel free to add/modify as
> necessary. I think adding a note as below is enough; I was hesitant to
> add a lot of details... Thanks.
> 
> 8<--
> From: Davidlohr Bueso 
> Subject: [PATCH] shmget.2: document new limits for shmmax/shmall
> 
> These limits have been recently enlarged and
> modifying them is no longer really necessary.
> Update the manpage.
> 
> Signed-off-by: Davidlohr Bueso 
> ---
>  man2/shmget.2 | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/man2/shmget.2 b/man2/shmget.2
> index f781048..77764ea 100644
> --- a/man2/shmget.2
> +++ b/man2/shmget.2
> @@ -299,6 +299,11 @@ with 8kB page size, it yields 2^20 (1048576).
>  
>  On Linux, this limit can be read and modified via
>  .IR /proc/sys/kernel/shmall .
> +As of Linux 3.16, the default value for this limit is increased to
> +.B ULONG_MAX - 2^24
> +pages, which is as large as it can be without helping userspace overflow
> +the values. Modifying this limit is therefore discouraged. This is suitable
> +for both 32 and 64-bit systems.
>  .TP
>  .B SHMMAX
>  Maximum size in bytes for a shared memory segment.
> @@ -306,6 +311,12 @@ Since Linux 2.2, the default value of this limit is 
> 0x2000000 (32MB).
>  
>  On Linux, this limit can be read and modified via
>  .IR /proc/sys/kernel/shmmax .
> +As of Linux 3.16, the default value for this limit is increased from 32 MB
> +to
> +.B ULONG_MAX - 2^24
> +bytes, which is as large as it can be without helping userspace overflow
> +the values. Modifying this limit is therefore discouraged. This is suitable
> +for both 32 and 64-bit systems.
>  .TP
>  .B SHMMIN
>  Minimum size in bytes for a shared memory segment: implementation

David,

I applied various pieces from your patch on top of material
that I already had, so that now we have the text below describing
these limits.  Comments/suggestions/improvements from all welcome.

Cheers,

Michael

   SHMALL System-wide limit on the number of pages of shared memory.

  On  Linux,  this  limit  can  be  read  and  modified  via
  /proc/sys/kernel/shmall.  Since Linux  3.16,  the  default
  value for this limit is:

  ULONG_MAX - 2^24

  The  effect  of  this  value  (which  is suitable for both
  32-bit and 64-bit systems) is to impose no  limitation  on
  allocations.   This value, rather than ULONG_MAX, was cho‐
  sen as the default to prevent some cases where  historical
  applications  simply  raised  the  existing  limit without
  first checking its current value.  Such applications would
  cause  the  value  to  overflow  if  the  limit was set at
  ULONG_MAX.

  From Linux 2.4 up to Linux 3.15,  the  default  value  for
  this limit was:

  SHMMAX / PAGE_SIZE * (SHMMNI / 16)

  If  SHMMAX  and SHMMNI were not modified, then multiplying
  the result of this formula by the  page  size  (to  get  a
  value  in  bytes)  yielded a value of 8 GB as the limit on
  the total memory used by all shared memory segments.

   SHMMAX Maximum size in bytes for a shared memory segment.

  On  Linux,  this  limit  can  be  read  and  modified  via
  /proc/sys/kernel/shmmax.   Since  Linux  3.16, the default
  value for this limit is:

  ULONG_MAX - 2^24

  The effect of this  value  (which  is  suitable  for  both
  32-bit  and  64-bit systems) is to impose no limitation on
  allocations.  See the description of SHMALL for a  discus‐
  sion  of why this default value (rather than ULONG_MAX) is
  used.

  From Linux 2.2 up to Linux 3.15, the default value of this
   limit was 0x2000000 (32MB).

  Because  it  is  not possible to map just part of a shared
  memory  segment,  the  amount  of  virtual  memory  places
  another limit on the maximum size of a usable segment: for
  example, on i386 the largest segments that can  be  mapped
  have  a  size of around 2.8 GB, and on x86_64 the limit is
  around 127 TB.
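
The old formula can be checked with a few lines of C (assuming the common
defaults PAGE_SIZE = 4096 and SHMMNI = 4096):

#include <limits.h>
#include <stdio.h>

int main(void)
{
	unsigned long shmmax = 0x2000000;	/* 32 MB, Linux 2.2..3.15 */
	unsigned long page_size = 4096;		/* assumed typical */
	unsigned long shmmni = 4096;		/* assumed default */

	/* default SHMALL, Linux 2.4..3.15 */
	unsigned long shmall = shmmax / page_size * (shmmni / 16);
	printf("SHMALL = %lu pages = %lu GB\n",
	       shmall, shmall * page_size >> 30);	/* 2097152 pages, 8 GB */

	/* default since Linux 3.16 */
	printf("new default = %#lx\n", ULONG_MAX - (1UL << 24));
	return 0;
}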



-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/

Re: [GIT PULL rcu/next] RCU commits for 3.18

2014-09-22 Thread Ingo Molnar

* Paul E. McKenney  wrote:

> Hello, Ingo,
> 
> The changes in this series include:
> 
> 1.Update RCU documentation.  These were posted to LKML at
>   https://lkml.org/lkml/2014/8/28/378.
> 
> 2.Miscellaneous fixes.  These were posted to LKML at
>   https://lkml.org/lkml/2014/8/28/386.  An additional fix that
>   eliminates a documented (but now inconvenient) deadlock between
>   RCU hotplug and expedited grace periods was posted at
>   https://lkml.org/lkml/2014/8/28/573.
> 
> 3.Changes related to No-CBs CPUs and NO_HZ_FULL.  These were posted
>   to LKML at https://lkml.org/lkml/2014/8/28/412.
> 
> 4.Torture-test updates.  These were posted to LKML at
>   https://lkml.org/lkml/2014/8/28/546 and at
>   https://lkml.org/lkml/2014/9/11/1114.
> 
> 5.RCU-tasks implementation.  These were posted to LKML at
>   https://lkml.org/lkml/2014/8/28/540.
> 
> All of these have been exposed to -next testing.
> These changes are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next
> 
> for you to fetch changes up to dd56af42bd829c6e770ed69812bd65a04eaeb1e4:
> 
>   rcu: Eliminate deadlock between CPU hotplug and expedited grace periods 
> (2014-09-18 16:22:27 -0700)
> 
> 
> Ard Biesheuvel (1):
>   rcu: Define tracepoint strings only if CONFIG_TRACING is set
> 
> Davidlohr Bueso (9):
>   locktorture: Rename locktorture_runnable parameter
>   locktorture: Add documentation
>   locktorture: Support mutexes
>   locktorture: Teach about lock debugging
>   locktorture: Make statistics generic
>   torture: Address race in module cleanup
>   locktorture: Add infrastructure for torturing read locks
>   locktorture: Support rwsems
>   locktorture: Introduce torture context
> 
> Joe Perches (1):
>   rcu: Use pr_alert/pr_cont for printing logs
> 
> Oleg Nesterov (1):
>   rcu: Uninline rcu_read_lock_held()
> 
> Paul E. McKenney (46):
>   memory-barriers: Fix control-ordering no-transitivity example
>   memory-barriers: Retain barrier() in fold-to-zero example
>   memory-barriers: Fix description of 2-legged-if-based control 
> dependencies
>   rcu: Break more call_rcu() deadlock involving scheduler and perf
>   rcu: Make TINY_RCU tinier by putting error checks under #ifdef
>   rcu: Replace flush_signals() with WARN_ON(signal_pending())
>   rcu: Add step to initrd documentation
>   rcutorture: Test partial nohz_full= configuration
>   rcutorture: Specify MAXSMP=y for TREE01
>   rcutorture: Specify CONFIG_CPUMASK_OFFSTACK=y for TREE07
>   rcutorture: Add callback-flood test
>   torture: Print PID in hung-kernel diagnostic message
>   torture: Check for nul bytes in console output
>   rcu: Add call_rcu_tasks()
>   rcu: Provide cond_resched_rcu_qs() to force quiescent states in long 
> loops
>   rcu: Add synchronous grace-period waiting for RCU-tasks
>   rcu: Make TASKS_RCU handle tasks that are almost done exiting
>   rcutorture: Add torture tests for RCU-tasks
>   rcutorture: Add RCU-tasks test cases
>   rcu: Add stall-warning checks for RCU-tasks
>   rcu: Improve RCU-tasks energy efficiency
>   documentation: Add verbiage on RCU-tasks stall warning messages
>   rcu: Defer rcu_tasks_kthread() creation till first call_rcu_tasks()
>   rcu: Make TASKS_RCU handle nohz_full= CPUs
>   rcu: Make rcu_tasks_kthread()'s GP-wait loop allow preemption
>   rcu: Remove redundant preempt_disable() from 
> rcu_note_voluntary_context_switch()
>   rcu: Additional information on RCU-tasks stall-warning messages
>   rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch()
>   rcu: Per-CPU operation cleanups to rcu_*_qs() functions
>   rcutorture: Add RCU-tasks tests to default rcutorture list
>   rcu: Fix attempt to avoid unsolicited offloading of callbacks
>   rcu: Rationalize kthread spawning
>   rcu: Create rcuo kthreads only for onlined CPUs
>   rcu: Eliminate redundant rcu_sysidle_state variable
>   rcu: Don't track sysidle state if no nohz_full= CPUs
>   rcu: Avoid misordering in __call_rcu_nocb_enqueue()
>   rcu: Handle NOCB callbacks from irq-disabled idle code
>   rcu: Avoid misordering in nocb_leader_wait()
>   Merge branches 'doc.2014.09.07a', 'fixes.2014.09.10a', 
> 'nocb-nohz.2014.09.16b' and 'torture.2014.09.07a' into HEAD
>   Merge branch 'rcu-tasks.2014.09.10a' into HEAD
>   locktorture: Make torture scripting account for new _runnable name
>   locktorture: Add test scenario for mutex_lock
>   locktorture: Add test scenario for rwsem_lock
>   rcutorture: Rename rcutorture_runnable parameter
>   locktorture: Document boot/module parameters
>   rcu: Eliminate deadlock between CPU hotplug and expedited grace periods
> 

Re: [GIT PULL] x86 fixes

2014-09-22 Thread Linus Torvalds
On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar  wrote:
>
> Please pull the latest x86-urgent-for-linus git tree from:

I only just noticed, but this pull request causes my Sony Vaio laptop
to immediately reboot at startup.

I'm assuming it's one of the efi changes, but I'm bisecting now to say
exactly where it happens. It will get reverted.

 Linus


Re: [PATCH] ARM: mach-bcm: offer a new maintainer and process

2014-09-22 Thread Florian Fainelli
2014-09-22 22:03 GMT-07:00 Olof Johansson :
> On Fri, Sep 19, 2014 at 11:17:11AM -0700, Florian Fainelli wrote:
>> Hi all,
>>
>> As some of you may have seen in the news, Broadcom has recently stopped
>> its mobile SoC activities. Upstream support for Broadcom's Mobile SoCs
>> was an effort initially started by Christian Daudt and his team, and then
>> continued by Alex Elder and Matt Porter, assigned to a particular landing
>> team within Linaro to help Broadcom do so.
>>
>> As part of this effort, Christian and Matt volunteered for centralizing pull
>> requests coming from the arch/arm/mach-bcm/* directory and as of today, they
>> are still responsible for merging mach-bcm pull requests coming from brcmstb,
>> bcm5301x, bcm2835 and bcm63xx, creating an intermediate layer to the arm-soc
>> tree.
>>
>> Following the mobile group shut down, our group (in which Brian, Gregory, 
>> Marc,
>> Kevin and myself are) inherited these mobile SoC platforms, although at this
>> point we cannot comment on the future of mobile platforms, we know that our
>> Linaro activities have been stopped.
>>
>> We have not heard much from Christian and Matt in a while, and some of our 
>> pull
>> requests have been stalling as a result. We would like to offer both a new
>> maintainer for the mobile platforms as well as reworking the pull request
>> process:
>>
>> - our group has now full access to these platforms, putting us in the best
>>   position to support Mobile SoCs questions
>
> So, one question I have is whether it makes sense to keep the mobile
> platforms in the kernel if the line of business is ending?

I leave it to Scott for more details, but last we talked he mentioned
what has been upstreamed is useful for some other platforms he cares
about.

>
> While I truly do appreciate the work done by Matt and others, there's
> also little chance that it'll see substantial use by anyone. The Capri
> boards aren't common out in the wild and I'm not aware of any dev
> boards or consumer products with these SoCs that might want to run
> mainline? Critical things such as power management and graphics are
> missing from the current platform support in the kernel, so nobody is
> likely to want it on their Android phone, etc.
>
> Maybe the answer to this is "keep it for now, revisit sometime later",
> which is perfectly sane -- it has practically no cost to keep it around
> the way it's looking now.

Right, let's adopt that approach for now, and we can revisit that
later in light of Scott and his group's work.
--
Florian


Re: [PATCH 2/5] extcon: gpio: Convert the driver to use gpio desc API's

2014-09-22 Thread George Cherian


On 09/23/2014 04:44 AM, Chanwoo Choi wrote:

On 09/22/2014 06:51 PM, George Cherian wrote:

On 09/22/2014 01:37 PM, Chanwoo Choi wrote:

Hi George,

This patch removes 'gpio_active_low' field of struct gpio_extcon_data.
But, include/linux/extcon-gpio.h has the description of 'gpio_active_low' field.

Yes, I didn't want the platform data users to break.
Actually, I couldn't find any platform users for this driver. Could you please
point me to one in case I missed it? If none are present, why can't we get rid
of platform data altogether?

Right,
But why do you still support platform data in your following patch?
- [PATCH 3/5] extcon: gpio: Add dt support for the driver.
According to your comment, you had to remove the support for platform data.
My intention with this series was to add dt support while keeping the
existing platform data.
Now that we know there are no platform data users, I will rework this and
keep only dt support.


IMO,
I think the sequence of this patchset needs to be reordered.
Also, this patchset needs a more detailed description.

I will rework and submit a v2.

Also,
This patch does not include any description/comment about removing
'gpio_active_low'.

Also,
How is the 'FLAG_ACTIVE_LOW' bit set for the gpio when using platform data?

Now that we are using the gpiod_* APIs, we need not check gpio_active_low in
this driver.
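
(For reference, a sketch of the difference being referred to: with the
descriptor API, active-low polarity is a property of the gpio_desc itself, so
gpiod_get_value() already returns the logical state.)

	/* legacy API: the driver must undo the polarity itself */
	state = gpio_get_value(data->gpio);
	if (data->gpio_active_low)
		state = !state;

	/* gpiod API: active-low handling happens inside the call */
	state = gpiod_get_value(data->gpiod);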

This patch just uses the gpiod API instead of the legacy gpio API.

I think that if extcon-gpio doesn't need to check the gpio_active_low field,
you have to implement dt support patch before this patch.

Yes, will do in v2.

Thanks for your review.

This patch doesn't call the 'set_bit()' function to set the FLAG_ACTIVE_LOW flag.

Thanks,
Chanwoo Choi

On 09/09/2014 01:14 PM, George Cherian wrote:

Convert the driver to use gpiod_* API's.

Signed-off-by: George Cherian 
---
   drivers/extcon/extcon-gpio.c | 18 +++---
   1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/extcon/extcon-gpio.c b/drivers/extcon/extcon-gpio.c
index 72f19a3..25269f6 100644
--- a/drivers/extcon/extcon-gpio.c
+++ b/drivers/extcon/extcon-gpio.c
@@ -33,8 +33,7 @@
 struct gpio_extcon_data {
   struct extcon_dev *edev;
-unsigned gpio;
-bool gpio_active_low;
+struct gpio_desc *gpiod;
   const char *state_on;
   const char *state_off;
   int irq;
@@ -50,9 +49,7 @@ static void gpio_extcon_work(struct work_struct *work)
   container_of(to_delayed_work(work), struct gpio_extcon_data,
work);
   -state = gpio_get_value(data->gpio);
-if (data->gpio_active_low)
-state = !state;
+state = gpiod_get_value(data->gpiod);
   extcon_set_state(data->edev, state);
   }
   @@ -106,22 +103,21 @@ static int gpio_extcon_probe(struct platform_device 
*pdev)
   }
   extcon_data->edev->name = pdata->name;
   -extcon_data->gpio = pdata->gpio;
-extcon_data->gpio_active_low = pdata->gpio_active_low;
+extcon_data->gpiod = gpio_to_desc(pdata->gpio);
   extcon_data->state_on = pdata->state_on;
   extcon_data->state_off = pdata->state_off;
   extcon_data->check_on_resume = pdata->check_on_resume;
   if (pdata->state_on && pdata->state_off)
   extcon_data->edev->print_state = extcon_gpio_print_state;
-ret = devm_gpio_request_one(&pdev->dev, extcon_data->gpio, GPIOF_DIR_IN,
+ret = devm_gpio_request_one(&pdev->dev, pdata->gpio, GPIOF_DIR_IN,
   pdev->name);
   if (ret < 0)
   return ret;
 if (pdata->debounce) {
-ret = gpio_set_debounce(extcon_data->gpio,
-pdata->debounce * 1000);
+ret = gpiod_set_debounce(extcon_data->gpiod,
+ pdata->debounce * 1000);
   if (ret < 0)
   extcon_data->debounce_jiffies =
   msecs_to_jiffies(pdata->debounce);
@@ -133,7 +129,7 @@ static int gpio_extcon_probe(struct platform_device *pdev)
  INIT_DELAYED_WORK(&extcon_data->work, gpio_extcon_work);
   -extcon_data->irq = gpio_to_irq(extcon_data->gpio);
+extcon_data->irq = gpiod_to_irq(extcon_data->gpiod);
   if (extcon_data->irq < 0)
   return extcon_data->irq;
  




-George


3.17 kernel crash while loading IPoIB

2014-09-22 Thread Sharma, Karun
Hello:

I am facing an issue wherein kernel 3.17 crashes while loading the IPoIB module. I 
guess the issue discussed in this thread 
(https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20963.html) is 
similar.

We were able to reproduce the issue with RC6 also. Here are the steps I 
followed:

I compiled and installed the 3.17 kernel on top of RHEL 6.5. 
Then I changed rdma.conf to not load IPoIB (if I don't do this, the kernel 
crashes while booting and starting the RDMA service).
After the server comes up, I just did "modprobe ib_ipoib" and the kernel crashes.
Please see below the kernel back trace.

Seeing the announcement, it looks like RC6 will be the last RC for the 3.17 
kernel. Will the release happen with this issue? Is there any workaround 
available for it?
I am not sure what mechanism/process is used to report issues to the kernel 
community.

Regards
Karun


Kernel Stack back-trace:

crash> bt
PID: 145TASK: 88081a580d90  CPU: 3   COMMAND: "kworker/3:1"
#0 [88081a587750] machine_kexec at 8103c5d9
#1 [88081a5877a0] crash_kexec at 810d0ff8
#2 [88081a587870] oops_end at 81007570
#3 [88081a5878a0] no_context at 81046e5e
#4 [88081a5878f0] __bad_area_nosemaphore at 8104704d
#5 [88081a587940] bad_area_nosemaphore at 81047163
#6 [88081a587950] __do_page_fault at 81047722
#7 [88081a587a70] do_page_fault at 8104798c
#8 [88081a587a80] page_fault at 815aad62
[exception RIP: __dev_queue_xmit+894]
RIP: 814e17be  RSP: 88081a587b38  RFLAGS: 00010282
RAX: 88087c1679fe  RBX: 880812cc2500  RCX: 0044
RDX: 0008  RSI:   RDI: 88081a363a9c
RBP: 88081a587b78   R8:    R9: 0040
R10:   R11: 7c1679ff  R12: 88081a363a00
R13: 880814f3e000  R14: 880809535600  R15: 
ORIG_RAX:   CS: 0010  SS: 0018
#9 [88081a587b30] __dev_queue_xmit at 814e158b
#10 [88081a587b80] dev_queue_xmit at 814e1930
#11 [88081a587b90] neigh_connected_output at 814e81e8
#12 [88081a587be0] ip6_finish_output2 at a05ff8dd [ipv6]
#13 [88081a587c40] ip6_finish_output at a05ffe5f [ipv6]
#14 [88081a587c60] ip6_output at a05fff18 [ipv6]
#15 [88081a587c90] ndisc_send_skb at a06169a9 [ipv6]
#16 [88081a587d40] ndisc_send_ns at a0616bf6 [ipv6]
#17 [88081a587db0] addrconf_dad_work at a06076cb [ipv6]
#18 [88081a587df0] process_one_work at 8106b23e
#19 [88081a587e40] worker_thread at 8106b63f
#20 [88081a587ec0] kthread at 8107041e
#21 [88081a587f50] ret_from_fork at 815a92ac
-

Regards,
Karun Sharma



Re: [PATCH] mfd: inherit coherent_dma_mask from parent device

2014-09-22 Thread Boris BREZILLON
Hi Arnd,

On Mon, 22 Sep 2014 21:45:40 +0200
Arnd Bergmann  wrote:

> On Monday 22 September 2014 21:37:55 Boris BREZILLON wrote:
> > dma_mask and dma_parms are already inherited from the parent device but
> > dma_coherent_mask was left uninitialized (set to zero thanks to kzalloc).
> > Set sub-device coherent_dma_mask to its parent value to simplify
> > sub-drivers making use of dma coherent helper functions (those drivers
> > currently have to explicitly set the dma coherent mask using
> > dma_set_coherent_mask function).
> > 
> > Signed-off-by: Boris BREZILLON 
> > ---
> > 
> > Hi,
> > 
> > This patch is follow-up of a discussion we had on a KMS driver thread [1].
> > This patch is only copying the parent device coherent_dma_mask to avoid
> > calling specific dma_set_coherent_mask in case the coherent mask is the
> > default one.
> > 
> > I'm a bit surprised this hasn't been done earlier while other dma fields
> > (mask and parms) are already inherited from the parent device, so please
> > tell me if there already was an attempt to do the same, and if so, what
> > was the reson for rejecting it :-).
> > 
> > 
> 
> Seems reasonable to me. It's not clear whether we should always inherit
> the dma_mask, but I see no point in copying just dma_mask but not
> coherent_dma_mask.

I thought about adding a dma_mask field to mfd_cell to override the
default behavior (allocate a new dma_mask and copy the value 
provided by mfd_cell if it's not zero), but I don't see any real use
case where a sub-device does not share the dma capabilities with its
parent.
IMHO, it's safer to keep it as is until someone really needs to set a
different dma_mask on a sub-device.
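
For context, the change under discussion is essentially one extra copy next
to the existing inheritance in mfd_add_device() (a sketch; exact placement in
drivers/mfd/mfd-core.c is approximate):

	pdev->dev.parent = parent;
	pdev->dev.dma_mask = parent->dma_mask;		/* already inherited */
	pdev->dev.dma_parms = parent->dma_parms;	/* already inherited */
	pdev->dev.coherent_dma_mask = parent->coherent_dma_mask; /* new */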

Best Regards,

Boris



-- 
Boris Brezillon, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com


Re: [PATCH] ARM: mach-bcm: offer a new maintainer and process

2014-09-22 Thread Olof Johansson
On Fri, Sep 19, 2014 at 11:17:11AM -0700, Florian Fainelli wrote:
> Hi all,
> 
> As some of you may have seen in the news, Broadcom has recently stopped
> its mobile SoC activities. Upstream support for Broadcom's Mobile SoCs
> was an effort initially started by Christian Daudt and his team, and then
> continued by Alex Elder and Matt Porter, assigned to a particular landing
> team within Linaro to help Broadcom do so.
> 
> As part of this effort, Christian and Matt volunteered for centralizing pull
> requests coming from the arch/arm/mach-bcm/* directory and as of today, they
> are still responsible for merging mach-bcm pull requests coming from brcmstb,
> bcm5301x, bcm2835 and bcm63xx, creating an intermediate layer to the arm-soc
> tree.
> 
> Following the mobile group shut down, our group (in which Brian, Gregory, 
> Marc,
> Kevin and myself are) inherited these mobile SoC platforms, although at this
> point we cannot comment on the future of mobile platforms, we know that our
> Linaro activities have been stopped.
> 
> We have not heard much from Christian and Matt in a while, and some of our 
> pull
> requests have been stalling as a result. We would like to offer both a new
> maintainer for the mobile platforms as well as reworking the pull request
> process:
> 
> - our group has now full access to these platforms, putting us in the best
>   position to support Mobile SoCs questions

So, one question I have is whether it makes sense to keep the mobile
platforms in the kernel if the line of business is ending?

While I truly do appreciate the work done by Matt and others, there's
also little chance that it'll see substantial use by anyone. The Capri
boards aren't common out in the wild and I'm not aware of any dev
boards or consumer products with these SoCs that might want to run
mainline? Critical things such as power management and graphics are
missing from the current platform support in the kernel, so nobody is
likely to want it on their Android phone, etc.

Maybe the answer to this is "keep it for now, revisit sometime later",
which is perfectly sane -- it has practically no cost to keep it around
the way it's looking now.


-Olof



Re: [PATCH 3.4 00/45] 3.4.104-rc1 review

2014-09-22 Thread Guenter Roeck

On 09/22/2014 07:27 PM, Zefan Li wrote:

From: Zefan Li 

This is the start of the stable review cycle for the 3.4.104 release.
There are 45 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Thu Sep 25 02:03:31 UTC 2014.
Anything received after that time might be too late.



Build results:
total: 119 pass: 116 fail: 3
Failed builds:
score:defconfig
sparc64:allmodconfig
xtensa:allmodconfig

Qemu test results:
total: 18 pass: 17 fail: 1
Failed tests:
arm:arm_versatile_defconfig

This is an improvement over the previous release, where we had six build
failures. The failing qemu test is a recent addition which is expected
to fail for the 3.4 kernel. The failure is due to Versatile SCSI driver
and interrupt handling problems; those were fixed in later kernels but
would be difficult to back-port.

Guenter



linux-next: build failure after merge of the tiny tree

2014-09-22 Thread Stephen Rothwell
Hi Josh,

After merging the tiny tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

mm/built-in.o: In function `.isolate_migratepages_range':
(.text+0x2fbd8): undefined reference to `.balloon_page_isolate'
mm/built-in.o: In function `.putback_movable_pages':
(.text+0x713c4): undefined reference to `.balloon_page_putback'
mm/built-in.o: In function `.migrate_pages':
(.text+0x72a00): undefined reference to `.balloon_page_migrate'

Caused by commit b37a3fee8450 ("mm: Disable mm/balloon_compaction.c
completely when !CONFIG_VIRTIO_BALLOON").

I have reverted that commit for today.
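
One conventional way to make such a compile-out link-safe (a sketch under the
assumption that the header keeps the 3.17-era prototypes, not the actual fix
that ended up in the tiny tree) is to provide inline stubs when the feature is
disabled, so the callers in mm/ still link:

/* include/linux/balloon_compaction.h, sketch */
#ifdef CONFIG_BALLOON_COMPACTION
extern bool balloon_page_isolate(struct page *page);
extern void balloon_page_putback(struct page *page);
extern int balloon_page_migrate(struct page *newpage,
				struct page *page, enum migrate_mode mode);
#else
static inline bool balloon_page_isolate(struct page *page)
{
	return false;		/* never isolated when compiled out */
}
static inline void balloon_page_putback(struct page *page)
{
}
static inline int balloon_page_migrate(struct page *newpage,
				       struct page *page,
				       enum migrate_mode mode)
{
	return 0;		/* nothing to migrate */
}
#endif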
-- 
Cheers,
Stephen Rothwells...@canb.auug.org.au




Re: [PATCH v1 5/5] zram: add fullness knob to control swap full

2014-09-22 Thread Minchan Kim
On Mon, Sep 22, 2014 at 02:17:33PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:11 +0900 Minchan Kim  wrote:
> 
> > Some zram use cases could want a lower fullness than the default 80 to
> > avoid unnecessary swapout-and-fail-recover overhead.
> > 
> > A typical example is multiple swap devices with a high priority
> > zram-swap and a low priority HDD-swap, so there could still be enough
> > free swap space although one of the swap devices (i.e., zram) is full.
> > It would be better to fail over to HDD-swap rather than failing the
> > swap write to zram in this case.
> > 
> > This patch exports fullness to userspace so the user can control it
> > via the knob.
> 
> Adding new userspace interfaces requires a pretty strong justification
> and it's unclear to me that this is being met.  In fact the whole
> patchset reads like "we have some problem, don't know how to fix it so
> let's add a userspace knob to make it someone else's problem".

I explained rationale in 4/5's reply but if it's not enough or wrong,
please tell me.

> 
> > index b13dc993291f..817738d14061 100644
> > --- a/Documentation/ABI/testing/sysfs-block-zram
> > +++ b/Documentation/ABI/testing/sysfs-block-zram
> > @@ -138,3 +138,13 @@ Description:
> > amount of memory ZRAM can use to store the compressed data.  The
> > limit could be changed in run time and "0" means disable the
> > limit.  No limit is the initial state.  Unit: bytes
> > +
> > +What:  /sys/block/zram/fullness
> > +Date:  August 2014
> > +Contact:   Minchan Kim 
> > +Description:
> > +   The fullness file is read/write and specifies how easily
> > +   zram reaches the full state; if you set it to a lower value,
> > +   zram reaches the full state more easily than with a higher
> > +   value. Currently, the initial value is 80%, but it can be changed.
> > +   Unit: Percentage
> 
> And I don't think that there is sufficient information here for a user
> to be able to work out what to do with this tunable.

I will put more words.

> 
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -136,6 +136,37 @@ static ssize_t max_comp_streams_show(struct device 
> > *dev,
> > return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> >  }
> >  
> > +static ssize_t fullness_show(struct device *dev,
> > +   struct device_attribute *attr, char *buf)
> > +{
> > +   int val;
> > +   struct zram *zram = dev_to_zram(dev);
> > +
> > +   down_read(&zram->init_lock);
> > +   val = zram->fullness;
> > +   up_read(&zram->init_lock);
> 
> Did we really need to take a lock to display a value which became
> out-of-date as soon as we released that lock?
> 
> > +   return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> > +}
> > +
> > +static ssize_t fullness_store(struct device *dev,
> > +   struct device_attribute *attr, const char *buf, size_t len)
> > +{
> > +   int err;
> > +   unsigned long val;
> > +   struct zram *zram = dev_to_zram(dev);
> > +
> > +   err = kstrtoul(buf, 10, &val);
> > +   if (err || val > 100)
> > +   return -EINVAL;
> 
> This overwrites the kstrtoul() return value.

Will fix.
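
(A corrected error path might simply preserve the kstrtoul() return value,
for example:)

	err = kstrtoul(buf, 10, &val);
	if (err)
		return err;
	if (val > 100)
		return -EINVAL;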

Thanks for the reivew, Andrew.
-- 
Kind regards,
Minchan Kim


Re: [PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread Tejun Heo
Hello, Neil.

On Tue, Sep 23, 2014 at 02:46:50PM +1000, NeilBrown wrote:
> seqfile is only safe for reads.  sysfs via kernfs uses seq_read(), so there
> is only a single allocation on the first read.
> 
> It doesn't really relate to fixing writes, except to point out that only
> writes need to be "fixed".  Reads already work.

Oh, I meant the buffer seqfile read op writes to, so it depends on the
fact that the allocation is only on the first read?  That seems
extremely brittle to me, especially for an issue which tends to be
difficult to reproduce.

> Separately:
> 
> > Ugh... :( If this can't be avoided at all, I'd much prefer it to be
> > something explicit - a flag marking the file as needing a persistent
> > write buffer which is allocated on open.  "Small" writes on stack
> > feels way too implicit to me.
> 
> How about if we add seq_getbuf() and seq_putbuf() to seqfile
> which takes a 'struct seq_file' and a size and returns the ->buf
> after making sure it is big enough.
> It also claims and releases the seqfile ->lock.
> 
> Then we would be using the same buffer for reads and write.
> 
> Does that sound suitable?  It uses existing infrastructure and avoids having
> to identify in advance which attributes it is important for.

I'd much rather keep things direct and make it explicitly allocate r/w
buffer(s) on open and disallow seq_file operations on such files.

Thanks.

-- 
tejun


Re: [PATCH 3.4 00/45] 3.4.104-rc1 review

2014-09-22 Thread Satoru Takeuchi
Hi Li,

At Tue, 23 Sep 2014 10:27:39 +0800,
Zefan Li wrote:
> 
> From: Zefan Li 
> 
> This is the start of the stable review cycle for the 3.4.104 release.
> There are 45 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Thu Sep 25 02:03:31 UTC 2014.
> Anything received after that time might be too late.

This kernel passed my test.

 - Test Cases:
   - Build this kernel.
   - Boot this kernel.
   - Build the latest mainline kernel with this kernel.

 - Test Tool:
   https://github.com/satoru-takeuchi/test-linux-stable

 - Test Result (kernel .config, ktest config and test log):
   http://satoru-takeuchi.org/test-linux-stable/results/-.tar.xz

 - Build Environment:
   - OS: Debian Jessy x86_64
   - CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz x 4
   - memory: 8GB

 - Test Target Environment:
   - Debian Jessy x86_64 (KVM guest on the Build Environment)
   - # of vCPU: 2
   - memory: 2GB

Thanks,
Satoru

> 
> A combined patch relative to 3.4.103 will be posted as an additional
> response to this.  A shortlog and diffstat can be found below.
> 
> thanks,
> 
> Zefan Li
> 
> 
> 
> Aaro Koskinen (1):
>   MIPS: OCTEON: make get_system_type() thread-safe
> 
> Alan Douglas (1):
>   xtensa: fix address checks in dma_{alloc,free}_coherent
> 
> Andi Kleen (1):
>   slab/mempolicy: always use local policy from interrupt context
> 
> Anton Blanchard (1):
>   ibmveth: Fix endian issues with rx_no_buffer statistic
> 
> Arjun Sreedharan (1):
>   pata_scc: propagate return value of scc_wait_after_reset
> 
> Benjamin Tissoires (1):
>   HID: logitech-dj: prevent false errors to be shown
> 
> Brennan Ashton (1):
>   USB: option: add VIA Telecom CDS7 chipset device id
> 
> Daniel Mack (1):
>   ASoC: pxa-ssp: drop SNDRV_PCM_FMTBIT_S24_LE
> 
> Dave Chiluk (1):
>   stable_kernel_rules: Add pointer to netdev-FAQ for network patches
> 
> Fengguang Wu (1):
>   unicore32: select generic atomic64_t support
> 
> Florian Fainelli (1):
>   MIPS: perf: Fix build error caused by unused
> counters_per_cpu_to_total()
> 
> Greg KH (1):
>   USB: serial: pl2303: add device id for ztek device
> 
> Guan Xuetao (2):
>   UniCore32-bugfix: Remove definitions in asm/bug.h to solve difference
> between native and cross compiler
>   UniCore32-bugfix: fix mismatch return value of __xchg_bad_pointer
> 
> Hans de Goede (1):
>   xhci: Treat not finding the event_seg on COMP_STOP the same as
> COMP_STOP_INVAL
> 
> Huang Rui (1):
>   usb: xhci: amd chipset also needs short TX quirk
> 
> James Forshaw (1):
>   USB: whiteheat: Added bounds checking for bulk command response
> 
> Jan Kara (2):
>   isofs: Fix unbounded recursion when processing relocated directories
>   ext2: Fix fs corruption in ext2_get_xip_mem()
> 
> Jaša Bartelj (1):
>   USB: ftdi_sio: Added PID for new ekey device
> 
> Jiri Kosina (4):
>   HID: fix a couple of off-by-ones
>   HID: logitech: perform bounds checking on device_id early enough
>   HID: magicmouse: sanity check report size in raw_event() callback
>   HID: picolcd: sanity check report size in raw_event() callback
> 
> Joerg Roedel (1):
>   iommu/amd: Fix cleanup_domain for mass device removal
> 
> Johan Hovold (3):
>   USB: ftdi_sio: add Basic Micro ATOM Nano USB2Serial PID
>   USB: serial: fix potential stack buffer overflow
>   USB: serial: fix potential heap buffer overflow
> 
> Mark Einon (1):
>   staging: et131x: Fix errors caused by phydev->addr accesses before
> initialisation
> 
> Mark Rutland (2):
>   ARM: 8128/1: abort: don't clear the exclusive monitors
>   ARM: 8129/1: errata: work around Cortex-A15 erratum 830321 using
> dummy strex
> 
> Max Filippov (3):
>   xtensa: replace IOCTL code definitions with constants
>   xtensa: fix TLBTEMP_BASE_2 region handling in fast_second_level_miss
>   xtensa: fix a6 and a7 handling in fast_syscall_xtensa
> 
> Michael Cree (2):
>   alpha: Fix fall-out from disintegrating asm/system.h
>   alpha: add io{read,write}{16,32}be functions
> 
> Michael S. Tsirkin (1):
>   kvm: iommu: fix the third parameter of kvm_iommu_put_pages
> (CVE-2014-3601)
> 
> NeilBrown (1):
>   md/raid6: avoid data corruption during recovery of double-degraded
> RAID6
> 
> Paul Gortmaker (1):
>   8250_pci: fix warnings in backport of Broadcom TruManage support
> 
> Pavel Shilovsky (1):
>   CIFS: Fix wrong directory attributes after rename
> 
> Ralf Baechle (1):
>   MIPS: Fix accessing to per-cpu data when flushing the cache
> 
> Stefan Kristiansson (1):
>   openrisc: add missing header inclusion
> 
> Stephen Hemminger (1):
>   USB: sisusb: add device id for Magic Control USB video
> 
> Takashi Iwai (1):
>   ALSA: hda/realtek - Avoid setting wrong COEF on ALC269 & co
> 
> Trond Myklebust (1):
>   NFSv4: Fix problems with close in the presence of a delegation
> 
>  Documentation/stable_kernel_rules.txt |3 ++
>  

Re: [PATCH v1 4/5] zram: add swap full hint

2014-09-22 Thread Minchan Kim
On Mon, Sep 22, 2014 at 02:11:18PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:10 +0900 Minchan Kim  wrote:
> 
> > This patch implements a SWAP_FULL handler in zram so that the VM can
> > know whether zram is full or not and use it to stop anonymous
> > page reclaim.
> > 
> > Fullness is judged as follows:
> > 
> > fullness = (100 * used space / total space)
> > 
> > It means that the higher fullness is, the slower we reach the zram-full
> > state. The default fullness is 80, which biases toward more memory
> > consumption rather than early OOM kill.
> 
> It's unclear to me why this is being done.  What's wrong with "use it
> until it's full then stop", which is what I assume the current code
> does?  Why add this stuff?  What goes wrong with the current code and
> how does this fix it?
> 
> ie: better explanation and justification in the changelogs, please.

My bad. I should have written about the zram allocator's fragmentation
problem.

zsmalloc has various size class so it has a fragmentation problem.
For example, a page swap out -> comprssed 32 byte -> has a empty slot
of zsmalloc's 32 size class -> successful write.

Another swap out -> compresses to 256 bytes -> no empty slot in zsmalloc's
256-byte size class -> zsmalloc should allocate a new zspage, but that would
go over the limit, so the write fails.

The problem is that the swap layer cannot know the compressed size of a page
in advance, so it cannot predict whether a swap-write will succeed,
while it can get an empty swap slot easily since zram's virtual disk
size is fairly large.

Given zsmalloc's fragmentation, it would be an *early OOM* if zram said
*full* as soon as it reached the page limit, because it could still have
empty slots in various size classes. IOW, that ignores the fragmentation
problem, so this patch suggests two conditions to solve it.

if (total_pages >= zram->limit_pages) {

	compr_pages = atomic64_read(&zram->stats.compr_data_size)
			>> PAGE_SHIFT;
	if ((100 * compr_pages / total_pages)
			>= zram->fullness)
		return 1;
}

First of all, zram-consumed pages should reach the *limit*, and only then
do we consider fullness. If used space is over 80%, we regard it as full
in this implementation, because I want to favor memory usage over an early
OOM kill, considering zram's popular use case in embedded systems.

> 
> > Above logic works only when the used space of zram hits the limit,
> > but zram also pretends to be full once 32 consecutive allocation
> > failures happen. It's a safeguard to prevent a system hang caused by
> > fragmentation uncertainty.
> 
> So allocation requests are of variable size, yes?  If so, the above
> statement should read "32 consecutive allocation attempts for regions
> of size 2 or more slots".  Because a failure of a single-slot
> allocation attempt is an immediate failure.
> 
> The 32-in-a-row thing sounds like a hack.  Why can't we do this
> deterministically?  If one request for four slots fails then the next
> one will as well, so why bother retrying?

The problem is that the swap layer cannot predict the final compressed size
in advance without actually compressing. If the page compresses to a size
for which zsmalloc has an empty slot in a size class, the write will succeed.
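
For illustration, the two conditions could be combined like the sketch
below; the alloc_fail counter, the ZRAM_ALLOC_FAIL_MAX name and the helper
itself are assumptions for the sketch, not the exact patch:

#define ZRAM_ALLOC_FAIL_MAX	32	/* assumed threshold, matches the "32" above */

static bool zram_is_full(struct zram *zram)
{
	u64 total_pages, compr_pages;

	/* safeguard: too many consecutive zs_malloc() failures */
	if (atomic_read(&zram->alloc_fail) >= ZRAM_ALLOC_FAIL_MAX)
		return true;

	total_pages = zs_get_total_pages(zram->meta->mem_pool);
	if (total_pages < zram->limit_pages)
		return false;

	/* fullness: used space over the configured percentage of the limit */
	compr_pages = atomic64_read(&zram->stats.compr_data_size)
			>> PAGE_SHIFT;
	return (100 * compr_pages / total_pages) >= zram->fullness;
}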

> 
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
> >  /* Module params (documentation at end) */
> >  static unsigned int num_devices = 1;
> >  
> > +/*
> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT,
> > + * we regard it as zram-full. It means that the higher
> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> > + */
> 
> I just don't understand this patch :( To me, the above implies that the
> user who sets 80% has elected to never use 20% of the zram capacity. 
> Why on earth would anyone do that?  This changelog doesn't tell me.

I hope my words above make it clear.

> 
> > +#define ZRAM_FULLNESS_PERCENT 80
> 
> We've had problems in the past where 1% is just too large an increment
> for large systems.

So, do you want fullness_bytes like dirty_bytes?

> 
> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct 
> > bio_vec *bvec, u32 index,
> > }
> >  
> > alloced_pages = zs_get_total_pages(meta->mem_pool);
> > -   if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> > -   zs_free(meta->mem_pool, handle);
> > -   ret = -ENOMEM;
> > -   goto out;
> > +   if (zram->limit_pages) {
> > +   if (alloced_pages > zram->limit_pages) {
> 
> This is all a bit racy, isn't it?  pool->pages_allocated and
> zram->limit_pages could be changing under our feet.

limit_pages cannot be changed thanks to init_lock; pool->pages_allocated can
change under us, but the result of the race is not critical:

1. the swap write fails, so the swap layer can simply make the page
   dirty again; no problem.
Or
2. alloc_fail race so zram 

[PATCH 3/3] f2fs: refactor flush_nat_entries to remove costly reorganizing ops

2014-09-22 Thread Jaegeuk Kim
Previously, f2fs tried to reorganize the dirty nat entries into multiple sets
according to their nid ranges. This can improve the flushing of nat pages;
however, if there are a lot of cached nat entries, it becomes a bottleneck.

This patch introduces a new set management flow by removing the dirty nat list
and adding a series of set operations when a nat entry becomes dirty.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/f2fs.h |  13 +--
 fs/f2fs/node.c | 299 +
 fs/f2fs/node.h |   9 +-
 3 files changed, 162 insertions(+), 159 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 7b1e1d2..94cfdc4 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -164,6 +164,9 @@ struct fsync_inode_entry {
 #define sit_in_journal(sum, i) (sum->sit_j.entries[i].se)
 #define segno_in_journal(sum, i)   (sum->sit_j.entries[i].segno)
 
+#define MAX_NAT_JENTRIES(sum)  (NAT_JOURNAL_ENTRIES - nats_in_cursum(sum))
+#define MAX_SIT_JENTRIES(sum)  (SIT_JOURNAL_ENTRIES - sits_in_cursum(sum))
+
 static inline int update_nats_in_cursum(struct f2fs_summary_block *rs, int i)
 {
int before = nats_in_cursum(rs);
@@ -182,9 +185,8 @@ static inline bool __has_cursum_space(struct 
f2fs_summary_block *sum, int size,
int type)
 {
if (type == NAT_JOURNAL)
-   return nats_in_cursum(sum) + size <= NAT_JOURNAL_ENTRIES;
-
-   return sits_in_cursum(sum) + size <= SIT_JOURNAL_ENTRIES;
+   return size <= MAX_NAT_JENTRIES(sum);
+   return size <= MAX_SIT_JENTRIES(sum);
 }
 
 /*
@@ -292,11 +294,10 @@ struct f2fs_nm_info {
 
/* NAT cache management */
struct radix_tree_root nat_root;/* root of the nat entry cache */
+   struct radix_tree_root nat_set_root;/* root of the nat set cache */
rwlock_t nat_tree_lock; /* protect nat_tree_lock */
-   unsigned int nat_cnt;   /* the # of cached nat entries */
struct list_head nat_entries;   /* cached nat entry list (clean) */
-   struct list_head dirty_nat_entries; /* cached nat entry list (dirty) */
-   struct list_head nat_entry_set; /* nat entry set list */
+   unsigned int nat_cnt;   /* the # of cached nat entries */
unsigned int dirty_nat_cnt; /* total num of nat entries in set */
 
/* free node ids management */
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 21ed91b..f5a21f4 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -123,6 +123,57 @@ static void __del_from_nat_cache(struct f2fs_nm_info 
*nm_i, struct nat_entry *e)
kmem_cache_free(nat_entry_slab, e);
 }
 
+static void __set_nat_cache_dirty(struct f2fs_nm_info *nm_i,
+   struct nat_entry *ne)
+{
+   nid_t set = ne->ni.nid / NAT_ENTRY_PER_BLOCK;
+   struct nat_entry_set *head;
+
+   if (get_nat_flag(ne, IS_DIRTY))
+   return;
+retry:
+   head = radix_tree_lookup(&nm_i->nat_set_root, set);
+   if (!head) {
+   head = f2fs_kmem_cache_alloc(nat_entry_set_slab, GFP_ATOMIC);
+
+   INIT_LIST_HEAD(&head->entry_list);
+   INIT_LIST_HEAD(&head->set_list);
+   head->set = set;
+   head->entry_cnt = 0;
+
+   if (radix_tree_insert(&nm_i->nat_set_root, set, head)) {
+   cond_resched();
+   goto retry;
+   }
+   }
+   list_move_tail(&ne->list, &head->entry_list);
+   nm_i->dirty_nat_cnt++;
+   head->entry_cnt++;
+   set_nat_flag(ne, IS_DIRTY, true);
+}
+
+static void __clear_nat_cache_dirty(struct f2fs_nm_info *nm_i,
+   struct nat_entry *ne)
+{
+   nid_t set = ne->ni.nid / NAT_ENTRY_PER_BLOCK;
+   struct nat_entry_set *head;
+
+   head = radix_tree_lookup(&nm_i->nat_set_root, set);
+   if (head) {
+   list_move_tail(&ne->list, &nm_i->nat_entries);
+   set_nat_flag(ne, IS_DIRTY, false);
+   head->entry_cnt--;
+   nm_i->dirty_nat_cnt--;
+   }
+}
+
+static unsigned int __gang_lookup_nat_set(struct f2fs_nm_info *nm_i,
+   nid_t start, unsigned int nr, struct nat_entry_set **ep)
+{
+   return radix_tree_gang_lookup(&nm_i->nat_set_root, (void **)ep,
+   start, nr);
+}
+
 bool is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid)
 {
struct f2fs_nm_info *nm_i = NM_I(sbi);
@@ -1739,79 +1790,6 @@ skip:
return err;
 }
 
-static struct nat_entry_set *grab_nat_entry_set(void)
-{
-   struct nat_entry_set *nes =
-   f2fs_kmem_cache_alloc(nat_entry_set_slab, GFP_ATOMIC);
-
-   nes->entry_cnt = 0;
-   INIT_LIST_HEAD(&nes->set_list);
-   INIT_LIST_HEAD(&nes->entry_list);
-   return nes;
-}
-
-static void release_nat_entry_set(struct nat_entry_set *nes,
-   

[PATCH 2/3] f2fs: introduce FITRIM in f2fs_ioctl

2014-09-22 Thread Jaegeuk Kim
This patch introduces FITRIM in f2fs_ioctl.
In this case, f2fs will issue as many small discards and prefree discards as
possible for the given area.
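
For reference, a minimal userspace sketch that exercises the ioctl through
the standard fstrim_range interface (the program itself is illustrative;
FITRIM and struct fstrim_range come from linux/fs.h):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fstrim_range range;
	int fd = open(argv[1], O_RDONLY);	/* any file on the f2fs mount */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(&range, 0, sizeof(range));
	range.len = ULLONG_MAX;			/* trim the whole filesystem */

	if (ioctl(fd, FITRIM, &range) < 0) {	/* kernel writes back range.len */
		perror("FITRIM");
		return 1;
	}
	printf("trimmed %llu bytes\n", (unsigned long long)range.len);
	return 0;
}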

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/checkpoint.c|   4 +-
 fs/f2fs/f2fs.h  |   9 +++-
 fs/f2fs/file.c  |  29 
 fs/f2fs/segment.c   | 110 +++-
 fs/f2fs/super.c |   1 +
 include/trace/events/f2fs.h |   3 +-
 6 files changed, 141 insertions(+), 15 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index e401ffd..5d793ba 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -997,7 +997,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, struct 
cp_control *cpc)
 
mutex_lock(&sbi->cp_mutex);
 
-   if (!sbi->s_dirty)
+   if (!sbi->s_dirty && cpc->reason != CP_DISCARD)
goto out;
if (unlikely(f2fs_cp_error(sbi)))
goto out;
@@ -1020,7 +1020,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, struct 
cp_control *cpc)
 
/* write cached NAT/SIT entries to NAT/SIT area */
flush_nat_entries(sbi);
-   flush_sit_entries(sbi);
+   flush_sit_entries(sbi, cpc);
 
/* unlock all the fs_lock[] in do_checkpoint() */
do_checkpoint(sbi, cpc);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 5298924..7b1e1d2 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -99,10 +99,15 @@ enum {
 enum {
CP_UMOUNT,
CP_SYNC,
+   CP_DISCARD,
 };
 
 struct cp_control {
int reason;
+   __u64 trim_start;
+   __u64 trim_end;
+   __u64 trim_minlen;
+   __u64 trimmed;
 };
 
 /*
@@ -1276,9 +1281,11 @@ void destroy_flush_cmd_control(struct f2fs_sb_info *);
 void invalidate_blocks(struct f2fs_sb_info *, block_t);
 void refresh_sit_entry(struct f2fs_sb_info *, block_t, block_t);
 void clear_prefree_segments(struct f2fs_sb_info *);
+void release_discard_addrs(struct f2fs_sb_info *);
 void discard_next_dnode(struct f2fs_sb_info *, block_t);
 int npages_for_summary_flush(struct f2fs_sb_info *);
 void allocate_new_segments(struct f2fs_sb_info *);
+int f2fs_trim_fs(struct f2fs_sb_info *, struct fstrim_range *);
 struct page *get_sum_page(struct f2fs_sb_info *, unsigned int);
 void write_meta_page(struct f2fs_sb_info *, struct page *);
 void write_node_page(struct f2fs_sb_info *, struct page *,
@@ -1295,7 +1302,7 @@ void write_data_summaries(struct f2fs_sb_info *, block_t);
 void write_node_summaries(struct f2fs_sb_info *, block_t);
 int lookup_journal_in_cursum(struct f2fs_summary_block *,
int, unsigned int, int);
-void flush_sit_entries(struct f2fs_sb_info *);
+void flush_sit_entries(struct f2fs_sb_info *, struct cp_control *);
 int build_segment_manager(struct f2fs_sb_info *);
 void destroy_segment_manager(struct f2fs_sb_info *);
 int __init create_segment_manager_caches(void);
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index ac8c680..1184207 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -860,6 +860,35 @@ out:
mnt_drop_write_file(filp);
return ret;
}
+   case FITRIM:
+   {
+   struct super_block *sb = inode->i_sb;
+   struct request_queue *q = bdev_get_queue(sb->s_bdev);
+   struct fstrim_range range;
+   int ret = 0;
+
+   if (!capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
+   if (!blk_queue_discard(q))
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(&range, (struct fstrim_range __user *)arg,
+   sizeof(range)))
+   return -EFAULT;
+
+   range.minlen = max((unsigned int)range.minlen,
+  q->limits.discard_granularity);
+   ret = f2fs_trim_fs(F2FS_SB(sb), &range);
+   if (ret < 0)
+   return ret;
+
+   if (copy_to_user((struct fstrim_range __user *)arg, &range,
+   sizeof(range)))
+   return -EFAULT;
+
+   return 0;
+   }
default:
return -ENOTTY;
}
diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 3125a3d..b423005 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -386,45 +386,92 @@ void discard_next_dnode(struct f2fs_sb_info *sbi, block_t 
blkaddr)
}
 }
 
-static void add_discard_addrs(struct f2fs_sb_info *sbi,
-   unsigned int segno, struct seg_entry *se)
+static void add_discard_addrs(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 {
struct list_head *head = &SM_I(sbi)->discard_list;
struct discard_entry *new;
int entries = SIT_VBLOCK_MAP_SIZE / sizeof(unsigned long);
int max_blocks = sbi->blocks_per_seg;
+   struct seg_entry *se = get_seg_entry(sbi, cpc->trim_start);
unsigned long *cur_map = 

[PATCH 1/3] f2fs: introduce cp_control structure

2014-09-22 Thread Jaegeuk Kim
This patch adds a new data structure to control checkpoint parameters.
Currently, it carries the reason for the checkpoint, such as umount and
normal sync.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/checkpoint.c| 16 
 fs/f2fs/f2fs.h  | 11 ++-
 fs/f2fs/gc.c|  7 +--
 fs/f2fs/recovery.c  |  5 -
 fs/f2fs/super.c | 13 ++---
 include/trace/events/f2fs.h | 15 ++-
 6 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
index e519aaf..e401ffd 100644
--- a/fs/f2fs/checkpoint.c
+++ b/fs/f2fs/checkpoint.c
@@ -826,7 +826,7 @@ static void wait_on_all_pages_writeback(struct f2fs_sb_info 
*sbi)
finish_wait(&sbi->cp_wait, &wait);
 }
 
-static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
+static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 {
struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_WARM_NODE);
@@ -894,7 +894,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool 
is_umount)
ckpt->cp_pack_start_sum = cpu_to_le32(1 + cp_payload_blks +
orphan_blocks);
 
-   if (is_umount) {
+   if (cpc->reason == CP_UMOUNT) {
set_ckpt_flags(ckpt, CP_UMOUNT_FLAG);
ckpt->cp_pack_total_block_count = cpu_to_le32(F2FS_CP_PACKS+
cp_payload_blks + data_sum_blocks +
@@ -948,7 +948,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool 
is_umount)
 
write_data_summaries(sbi, start_blk);
start_blk += data_sum_blocks;
-   if (is_umount) {
+   if (cpc->reason == CP_UMOUNT) {
write_node_summaries(sbi, start_blk);
start_blk += NR_CURSEG_NODE_TYPE;
}
@@ -988,12 +988,12 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool 
is_umount)
 /*
  * We guarantee that this checkpoint procedure will not fail.
  */
-void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount)
+void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc)
 {
struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi);
unsigned long long ckpt_ver;
 
-   trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops");
+   trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "start block_ops");
 
mutex_lock(&sbi->cp_mutex);
 
@@ -1004,7 +1004,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, bool 
is_umount)
if (block_operations(sbi))
goto out;
 
-   trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops");
+   trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish block_ops");
 
f2fs_submit_merged_bio(sbi, DATA, WRITE);
f2fs_submit_merged_bio(sbi, NODE, WRITE);
@@ -1023,13 +1023,13 @@ void write_checkpoint(struct f2fs_sb_info *sbi, bool 
is_umount)
flush_sit_entries(sbi);
 
/* unlock all the fs_lock[] in do_checkpoint() */
-   do_checkpoint(sbi, is_umount);
+   do_checkpoint(sbi, cpc);
 
unblock_operations(sbi);
stat_inc_cp_count(sbi->stat_info);
 out:
mutex_unlock(&sbi->cp_mutex);
-   trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint");
+   trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish checkpoint");
 }
 
 void init_ino_entry_info(struct f2fs_sb_info *sbi)
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 3b70b01..5298924 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -96,6 +96,15 @@ enum {
SIT_BITMAP
 };
 
+enum {
+   CP_UMOUNT,
+   CP_SYNC,
+};
+
+struct cp_control {
+   int reason;
+};
+
 /*
  * For CP/NAT/SIT/SSA readahead
  */
@@ -1314,7 +1323,7 @@ void update_dirty_page(struct inode *, struct page *);
 void add_dirty_dir_inode(struct inode *);
 void remove_dirty_dir_inode(struct inode *);
 void sync_dirty_dir_inodes(struct f2fs_sb_info *);
-void write_checkpoint(struct f2fs_sb_info *, bool);
+void write_checkpoint(struct f2fs_sb_info *, struct cp_control *);
 void init_ino_entry_info(struct f2fs_sb_info *);
 int __init create_checkpoint_caches(void);
 void destroy_checkpoint_caches(void);
diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c
index 7bf8392..e88fcf6 100644
--- a/fs/f2fs/gc.c
+++ b/fs/f2fs/gc.c
@@ -694,6 +694,9 @@ int f2fs_gc(struct f2fs_sb_info *sbi)
int gc_type = BG_GC;
int nfree = 0;
int ret = -1;
+   struct cp_control cpc = {
+   .reason = CP_SYNC,
+   };
 
INIT_LIST_HEAD(&ilist);
 gc_more:
@@ -704,7 +707,7 @@ gc_more:
 
if (gc_type == BG_GC && has_not_enough_free_secs(sbi, nfree)) {
gc_type = FG_GC;
-   write_checkpoint(sbi, false);
+   write_checkpoint(sbi, &cpc);
}
 
if (!__get_victim(sbi, &segno, gc_type, NO_CHECK_TYPE))
@@ -729,7 +732,7 @@ gc_more:
goto gc_more;
 
if (gc_type == FG_GC)
-   

Re: [f2fs-dev] [PATCH 2/3] f2fs: fix conditions to remain recovery information in f2fs_sync_file

2014-09-22 Thread Jaegeuk Kim
On Mon, Sep 22, 2014 at 05:20:19PM +0800, Chao Yu wrote:
> > -Original Message-
> > From: Huang Ying [mailto:ying.hu...@intel.com]
> > Sent: Monday, September 22, 2014 3:39 PM
> > To: Chao Yu
> > Cc: 'Jaegeuk Kim'; linux-kernel@vger.kernel.org; 
> > linux-fsde...@vger.kernel.org;
> > linux-f2fs-de...@lists.sourceforge.net
> > Subject: Re: [f2fs-dev] [PATCH 2/3] f2fs: fix conditions to remain recovery 
> > information in
> > f2fs_sync_file
> > 
> > On Mon, 2014-09-22 at 15:24 +0800, Chao Yu wrote:
> > > Hi Jaegeuk, Huang,
> > >
> > > > -Original Message-
> > > > From: Jaegeuk Kim [mailto:jaeg...@kernel.org]
> > > > Sent: Thursday, September 18, 2014 1:51 PM
> > > > To: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org;
> > > > linux-f2fs-de...@lists.sourceforge.net
> > > > Cc: Jaegeuk Kim; Huang Ying
> > > > Subject: [f2fs-dev] [PATCH 2/3] f2fs: fix conditions to remain recovery 
> > > > information in
> > > > f2fs_sync_file
> > > >
> > > > This patch revisits the whole recovery information during
> > > > f2fs_sync_file.
> > > >
> > > > In this patch, there are three pieces of information used to make a decision.
> > > >
> > > > a) IS_CHECKPOINTED, /* is it checkpointed before? */
> > > > b) HAS_FSYNCED_INODE,   /* is the inode fsynced before? */
> > > > c) HAS_LAST_FSYNC,  /* has the latest node fsync mark? */
> > > >
> > > > And, the scenarios for our rule are based on:
> > > >
> > > > [Term] F: fsync_mark, D: dentry_mark
> > > >
> > > > 1. inode(x) | CP | inode(x) | dnode(F)
> > > > 2. inode(x) | CP | inode(F) | dnode(F)
> > > > 3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
> > > > 4. inode(x) | CP | dnode(F) | inode(F)
> > > > 5. CP | inode(x) | dnode(F) | inode(DF)
> > > > 6. CP | inode(DF) | dnode(F)
> > > > 7. CP | dnode(F) | inode(DF)
> > > > 8. CP | dnode(F) | inode(x) | inode(DF)
> > >
> > > Not sure, did we miss these cases:
> > > inode(x) | CP | inode(F) | dnode(x) -> write inode(F)
> > > CP | inode(DF) | dnode(x) -> write inode(F)
> > >
> > > In these cases we will write another inode with the fsync flag because
> > > our last dnode was written out to disk by the bdi-flusher
> > > (HAS_LAST_FSYNC is not marked). But this appended inode is not useful.
> > >
> > > AFAIK, HAS_LAST_FSYNC (AKA fsync_done) was introduced in commit
> > > 479f40c44ae3 ("f2fs: skip unnecessary node writes during fsync") to
> > > avoid writing multiple unneeded inode pages on redundant fsync calls.
> > > But for now, its role can be taken by HAS_FSYNCED_INODE.
> > > So, can we remove this flag to simplify our logic of the fsync flow?
> > >
> > > Then in fsync flow, the rule can be:
> > > If CPed before, there must be an inode(F) written in the warm node chain;
> > 
> > How about
> > 
> > inode(x) | CP | dnode(F)
> 
> Oh, I missed this one, thanks for reminding me of that.
> 
> There is another case:
> inode(x) | CP | dnode(F) | dnode(x) -> write inode(F)
> It seems we will also append another unneeded inode(F) with this patch, due
> to there being no HAS_LAST_FSYNC in the inode's nat entry cache.

As per the current rule for roll-forward recovery, we need inode(F) to find
the latest mark. This can also be used to distinguish fsynced inodes from
written-back inodes.

> 
> > 
> > > If not CPed before, there must be an inode(DF) written in the warm node chain.
> > 
> > For example below:
> > 
> > 1) checkpoint
> > 2) create "a", change "a"
> > 3) fsync "a"
> > 4) open "a", change "a"
> > 
> > Do we want recovery to stop at dnode(F) in step 3) or stop at dnode(x)
> > produced by step 4)?
> 
> To my understanding, we will recover all info related to fsynced nodes in
> the warm node chain. So we will proceed to step 4 if the changed nodes in
> step 4 are flushed to the device.

The current rule should stop at 3) fsync "a". It won't recover 4)'s inode,
since it was just written back.

If we'd like to recover the whole inode and its data, we should traverse all
the recovery paths from scratch.

Thanks,

> 
> Thanks,
> Yu
> > 
> > Best Regards,
> > Huang, Ying
> > 
> > > >
> > > > For example, #3, the three conditions should be changed as follows.
> > > >
> > > >       inode(x) | CP | dnode(F) | inode(x) | inode(F)
> > > > a)       x       o       o          o         o
> > > > b)       x       x       x          x         o
> > > > c)       x       o       o          x         o
> > > >
> > > > If f2fs_sync_file stops                -------^,
> > > >  it should write inode(F)    --------------------------^
> > > >
> > > > So, the need_inode_block_update should return true, since
> > > >  c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
> > > >
> > > > For example, #8,
> > > >       CP | alloc | dnode(F) | inode(x) | inode(DF)
> > > > a)    o      x        x          x          x
> > > > b)    x               x          x          o
> > > > c)    o               o          x          o
> > > >
> > > > If f2fs_sync_file stops              --------^,
> > > >  it should write inode(DF)    -------------------------^
> > > >
> > > > Note that, the roll-forward policy should follow this rule, which 

Re: [PATCH] Fix the issue that lowmemkiller fell into a cycle that try to kill a task

2014-09-22 Thread 朱辉


On 09/23/14 12:18, Greg KH wrote:
> On Tue, Sep 23, 2014 at 10:57:09AM +0800, Hui Zhu wrote:
>> The cause of this issue is that when the free memory size is low and a lot
>> of tasks are trying to shrink memory, the task that is killed by the
>> lowmemorykiller cannot get the CPU to exit itself.
>>
>> Fix this issue by changing the scheduling policy to SCHED_FIFO if a task's
>> flag is TIF_MEMDIE in the lowmemorykiller.
>>
>> Signed-off-by: Hui Zhu 
>> ---
>>   drivers/staging/android/lowmemorykiller.c | 4 
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/staging/android/lowmemorykiller.c 
>> b/drivers/staging/android/lowmemorykiller.c
>> index b545d3d..ca1ffac 100644
>> --- a/drivers/staging/android/lowmemorykiller.c
>> +++ b/drivers/staging/android/lowmemorykiller.c
>> @@ -129,6 +129,10 @@ static unsigned long lowmem_scan(struct shrinker *s, 
>> struct shrink_control *sc)
>>
>>  if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
>>  time_before_eq(jiffies, lowmem_deathpending_timeout)) {
>> +struct sched_param param = { .sched_priority = 1 };
>> +
>> +if (p->policy == SCHED_NORMAL)
>> +sched_setscheduler(p, SCHED_FIFO, );
>
> This seems really specific to a specific scheduler pattern now.  Isn't
> there some other way to resolve this?

I tried to let the task that calls the lowmemorykiller sleep for some time
when it tries to kill the same task, but that didn't work.
I think the issue is that the free memory size is so low that more and more
tasks come in and call the lowmemorykiller.

Thanks,
Hui

>
> thanks,
>
> greg k-h
>


[PATCH v4 06/12] crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c

2014-09-22 Thread behanw
From: Behan Webster 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Reviewed-by: Jan-Simon Möller 
Acked-by: Herbert Xu 
---
 drivers/crypto/omap-sham.c | 28 +++-
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/crypto/omap-sham.c b/drivers/crypto/omap-sham.c
index 710d863..24ef489 100644
--- a/drivers/crypto/omap-sham.c
+++ b/drivers/crypto/omap-sham.c
@@ -949,17 +949,14 @@ static int omap_sham_finish_hmac(struct ahash_request 
*req)
struct omap_sham_hmac_ctx *bctx = tctx->base;
int bs = crypto_shash_blocksize(bctx->shash);
int ds = crypto_shash_digestsize(bctx->shash);
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(bctx->shash)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, bctx->shash);
 
-   desc.shash.tfm = bctx->shash;
-   desc.shash.flags = 0; /* not CRYPTO_TFM_REQ_MAY_SLEEP */
+   shash->tfm = bctx->shash;
+   shash->flags = 0; /* not CRYPTO_TFM_REQ_MAY_SLEEP */
 
-   return crypto_shash_init(&desc.shash) ?:
-  crypto_shash_update(&desc.shash, bctx->opad, bs) ?:
-  crypto_shash_finup(&desc.shash, req->result, ds, req->result);
+   return crypto_shash_init(shash) ?:
+  crypto_shash_update(shash, bctx->opad, bs) ?:
+  crypto_shash_finup(shash, req->result, ds, req->result);
 }
 
 static int omap_sham_finish(struct ahash_request *req)
@@ -1118,18 +1115,15 @@ static int omap_sham_update(struct ahash_request *req)
return omap_sham_enqueue(req, OP_UPDATE);
 }
 
-static int omap_sham_shash_digest(struct crypto_shash *shash, u32 flags,
+static int omap_sham_shash_digest(struct crypto_shash *tfm, u32 flags,
  const u8 *data, unsigned int len, u8 *out)
 {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(shash)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm);
 
-   desc.shash.tfm = shash;
-   desc.shash.flags = flags & CRYPTO_TFM_REQ_MAY_SLEEP;
+   shash->tfm = tfm;
+   shash->flags = flags & CRYPTO_TFM_REQ_MAY_SLEEP;
 
-   return crypto_shash_digest(&desc.shash, data, len, out);
+   return crypto_shash_digest(shash, data, len, out);
 }
 
 static int omap_sham_final_shash(struct ahash_request *req)
-- 
1.9.1



[PATCH v4 05/12] crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c

2014-09-22 Thread behanw
From: Behan Webster 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Reviewed-by: Jan-Simon Möller 
Acked-by: Herbert Xu 
---
 drivers/crypto/n2_core.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/n2_core.c b/drivers/crypto/n2_core.c
index 7263c10..f8e3207 100644
--- a/drivers/crypto/n2_core.c
+++ b/drivers/crypto/n2_core.c
@@ -445,10 +445,7 @@ static int n2_hmac_async_setkey(struct crypto_ahash *tfm, 
const u8 *key,
struct n2_hmac_ctx *ctx = crypto_ahash_ctx(tfm);
struct crypto_shash *child_shash = ctx->child_shash;
struct crypto_ahash *fallback_tfm;
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(child_shash)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, child_shash);
int err, bs, ds;
 
fallback_tfm = ctx->base.fallback_tfm;
@@ -456,15 +453,15 @@ static int n2_hmac_async_setkey(struct crypto_ahash *tfm, 
const u8 *key,
if (err)
return err;
 
-   desc.shash.tfm = child_shash;
-   desc.shash.flags = crypto_ahash_get_flags(tfm) &
+   shash->tfm = child_shash;
+   shash->flags = crypto_ahash_get_flags(tfm) &
CRYPTO_TFM_REQ_MAY_SLEEP;
 
bs = crypto_shash_blocksize(child_shash);
ds = crypto_shash_digestsize(child_shash);
BUG_ON(ds > N2_HASH_KEY_MAX);
if (keylen > bs) {
-   err = crypto_shash_digest(&desc.shash, key, keylen,
+   err = crypto_shash_digest(shash, key, keylen,
  ctx->hash_key);
if (err)
return err;
-- 
1.9.1



Re: [PATCH v5] x86, cpu-hotplug: fix llc shared map unreleased during cpu hotplug

2014-09-22 Thread Kamezawa Hiroyuki
(2014/09/17 16:17), Wanpeng Li wrote:
> BUG: unable to handle kernel NULL pointer dereference at 0004
> IP: [..] find_busiest_group
> PGD 5a9d5067 PUD 13067 PMD 0
> Oops:  [#3] SMP
> [...]
> Call Trace:
> load_balance
> ? _raw_spin_unlock_irqrestore
> idle_balance
> __schedule
> schedule
> schedule_timeout
> ? lock_timer_base
> schedule_timeout_uninterruptible
> msleep
> lock_device_hotplug_sysfs
> online_store
> dev_attr_store
> sysfs_write_file
> vfs_write
> SyS_write
> system_call_fastpath
> 
> This bug can be triggered by hot-adding and removing a large number of xen
> domain0's vcpus repeatedly.
> 
> The last level cache shared map is built during cpu up, and the build sched
> domain routine takes advantage of it to set up the sched domain cpu topology;
> however, the llc shared map is not released during cpu disable, which leads
> to an invalid sched domain cpu topology. This patch fixes it by releasing the
> llc shared map correctly during cpu disable.
> 
> Reviewed-by: Toshi Kani 
> Reviewed-by: Yasuaki Ishimatsu 
> Tested-by: Linn Crosetto 
> Signed-off-by: Wanpeng Li 

Yasuaki reported this can happen on our real hardware. 
https://lkml.org/lkml/2014/7/22/1018

Our case is here.
==
Here is a example on my system.
My system has 4 sockets and each socket has 15 cores and HT is enabled.
In this case, each core of the sockets is numbered as follows:

  | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-44, 90-104
Socket#3 | 45-59, 105-119
Then llc_shared_mask of CPU#30 has 0x3fff8001fffc000.
It means that last level cache of Socket#2 is shared with
CPU#30-44 and 90-104.
When hot-removing socket#2 and #3, each core of the sockets is numbered
as follows:

  | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30 remains
having 0x3fff8001fffc000.
After that, when hot-adding socket#2 and #3, each core of the sockets is
numbered as follows:

  | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-59
Socket#3 | 90-119
Then llc_shared_mask of CPU#30 becomes 0x3fff8000fffc000.
It means that last level cache of Socket#2 is shared with CPU#30-59
and 90-104. So the mask has wrong value.
At first, I cleared the hot-removed CPU number's bit from llc_shared_map
when hot-removing a CPU. But Borislav suggested that the problem would
disappear if a re-added CPU were assigned the same CPU number, and that
llc_shared_map must not be changed.
==
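
For reference, the patch under review releases the llc shared map on the
cpu-disable path; roughly, the idea is the sketch below (the function name
is illustrative, not the exact patch):

static void cleanup_llc_shared_map(int cpu)
{
	int sibling;

	/* drop @cpu from every sibling's llc mask, then clear its own */
	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
	cpumask_clear(cpu_llc_shared_mask(cpu));
}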

So, please.

Thanks,
-Kame



Re: [f2fs-dev] [PATCH 07/10] f2fs: use meta_inode cache to improve roll-forward speed

2014-09-22 Thread Jaegeuk Kim
Hi Chao,

On Mon, Sep 22, 2014 at 10:36:25AM +0800, Chao Yu wrote:
> Hi Jaegeuk,
> 
> > -Original Message-
> > From: Jaegeuk Kim [mailto:jaeg...@kernel.org]
> > Sent: Monday, September 15, 2014 6:14 AM
> > To: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org;
> > linux-f2fs-de...@lists.sourceforge.net
> > Cc: Jaegeuk Kim
> > Subject: [f2fs-dev] [PATCH 07/10] f2fs: use meta_inode cache to improve 
> > roll-forward speed
> > 
> > Previously, all the dnode pages had to be read during roll-forward
> > recovery. Even worse, the whole chain was traversed twice.
> > This patch removes those redundant and costly read operations by using the
> > page cache of meta_inode and the readahead function as well.
> > 
> > Signed-off-by: Jaegeuk Kim 
> > ---
> >  fs/f2fs/checkpoint.c | 11 --
> >  fs/f2fs/f2fs.h   |  5 +++--
> >  fs/f2fs/recovery.c   | 59 
> > +---
> >  fs/f2fs/segment.h|  5 +++--
> >  4 files changed, 43 insertions(+), 37 deletions(-)
> > 
> > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> > index 7262d99..d1ed889 100644
> > --- a/fs/f2fs/checkpoint.c
> > +++ b/fs/f2fs/checkpoint.c
> > @@ -82,6 +82,8 @@ static inline int get_max_meta_blks(struct f2fs_sb_info 
> > *sbi, int type)
> > case META_SSA:
> > case META_CP:
> > return 0;
> > +   case META_POR:
> > +   return SM_I(sbi)->main_blkaddr + sbi->user_block_count;
> 
> Here we will skip virtual over-provision segments, so better to use 
> TOTAL_BLKS(sbi).
> 
> > default:
> > BUG();
> > }
> > @@ -90,11 +92,11 @@ static inline int get_max_meta_blks(struct f2fs_sb_info 
> > *sbi, int type)
> >  /*
> >   * Readahead CP/NAT/SIT/SSA pages
> >   */
> > -int ra_meta_pages(struct f2fs_sb_info *sbi, int start, int nrpages, int 
> > type)
> > +int ra_meta_pages(struct f2fs_sb_info *sbi, block_t start, int nrpages, 
> > int type)
> >  {
> > block_t prev_blk_addr = 0;
> > struct page *page;
> > -   int blkno = start;
> > +   block_t blkno = start;
> > int max_blks = get_max_meta_blks(sbi, type);
> > 
> > struct f2fs_io_info fio = {
> > @@ -128,6 +130,11 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, int start, 
> > int nrpages, int
> > type)
> > /* get ssa/cp block addr */
> > blk_addr = blkno;
> > break;
> > +   case META_POR:
> > +   if (unlikely(blkno >= max_blks))
> > +   goto out;
> > +   blk_addr = blkno;
> > +   break;
> 
> The real modification in patch which is merged to dev of f2fs is as following:
> 
> - /* get ssa/cp block addr */
> + case META_POR:
> + if (blkno >= max_blks || blkno < min_blks)
> + goto out;
> 
> IMHO, it's better to verify the boundary separately for META_{SSA,CP,POR}
> with unlikely.
> What do you think?

Not bad.
Could you check the v2 below?

> 
> > default:
> > BUG();
> > }
> > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
> > index 4f84d2a..48d7d46 100644
> > --- a/fs/f2fs/f2fs.h
> > +++ b/fs/f2fs/f2fs.h
> > @@ -103,7 +103,8 @@ enum {
> > META_CP,
> > META_NAT,
> > META_SIT,
> > -   META_SSA
> > +   META_SSA,
> > +   META_POR,
> >  };
> > 
> >  /* for the list of ino */
> > @@ -1291,7 +1292,7 @@ void destroy_segment_manager_caches(void);
> >   */
> >  struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t);
> >  struct page *get_meta_page(struct f2fs_sb_info *, pgoff_t);
> > -int ra_meta_pages(struct f2fs_sb_info *, int, int, int);
> > +int ra_meta_pages(struct f2fs_sb_info *, block_t, int, int);
> >  long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long);
> >  void add_dirty_inode(struct f2fs_sb_info *, nid_t, int type);
> >  void remove_dirty_inode(struct f2fs_sb_info *, nid_t, int type);
> > diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c
> > index 3736728..6f7fbfa 100644
> > --- a/fs/f2fs/recovery.c
> > +++ b/fs/f2fs/recovery.c
> > @@ -173,7 +173,7 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, 
> > struct list_head
> > *head)
> >  {
> > unsigned long long cp_ver = cur_cp_version(F2FS_CKPT(sbi));
> > struct curseg_info *curseg;
> > -   struct page *page;
> > +   struct page *page = NULL;
> > block_t blkaddr;
> > int err = 0;
> > 
> > @@ -181,20 +181,19 @@ static int find_fsync_dnodes(struct f2fs_sb_info 
> > *sbi, struct list_head
> > *head)
> > curseg = CURSEG_I(sbi, CURSEG_WARM_NODE);
> > blkaddr = NEXT_FREE_BLKADDR(sbi, curseg);
> > 
> > -   /* read node page */
> > -   page = alloc_page(GFP_F2FS_ZERO);
> > -   if (!page)
> > -   return -ENOMEM;
> > -   lock_page(page);
> > -
> > while (1) {
> > struct fsync_inode_entry *entry;
> > 
> > -   err = f2fs_submit_page_bio(sbi, page, blkaddr, READ_SYNC);
> > -   if (err)
> > -   return err;
> > +   if (blkaddr < 

[PATCH v4 07/12] crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c

2014-09-22 Thread behanw
From: Behan Webster 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Reviewed-by: Jan-Simon Möller 
Acked-by: Herbert Xu 
---
 drivers/crypto/qat/qat_common/qat_algs.c | 31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/qat_algs.c 
b/drivers/crypto/qat/qat_common/qat_algs.c
index 59df488..9cabadd 100644
--- a/drivers/crypto/qat/qat_common/qat_algs.c
+++ b/drivers/crypto/qat/qat_common/qat_algs.c
@@ -152,10 +152,7 @@ static int qat_alg_do_precomputes(struct 
icp_qat_hw_auth_algo_blk *hash,
  const uint8_t *auth_key,
  unsigned int auth_keylen, uint8_t *auth_state)
 {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(ctx->hash_tfm)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, ctx->hash_tfm);
struct sha1_state sha1;
struct sha256_state sha256;
struct sha512_state sha512;
@@ -167,12 +164,12 @@ static int qat_alg_do_precomputes(struct 
icp_qat_hw_auth_algo_blk *hash,
__be64 *hash512_state_out;
int i, offset;
 
-   desc.shash.tfm = ctx->hash_tfm;
-   desc.shash.flags = 0x0;
+   shash->tfm = ctx->hash_tfm;
+   shash->flags = 0x0;
 
if (auth_keylen > block_size) {
char buff[SHA512_BLOCK_SIZE];
-   int ret = crypto_shash_digest(&desc.shash, auth_key,
+   int ret = crypto_shash_digest(shash, auth_key,
  auth_keylen, buff);
if (ret)
return ret;
@@ -195,10 +192,10 @@ static int qat_alg_do_precomputes(struct 
icp_qat_hw_auth_algo_blk *hash,
*opad_ptr ^= 0x5C;
}
 
-   if (crypto_shash_init(&desc.shash))
+   if (crypto_shash_init(shash))
return -EFAULT;
 
-   if (crypto_shash_update(&desc.shash, ipad, block_size))
+   if (crypto_shash_update(shash, ipad, block_size))
return -EFAULT;
 
hash_state_out = (__be32 *)hash->sha.state1;
@@ -206,19 +203,19 @@ static int qat_alg_do_precomputes(struct 
icp_qat_hw_auth_algo_blk *hash,
 
switch (ctx->qat_hash_alg) {
case ICP_QAT_HW_AUTH_ALGO_SHA1:
-   if (crypto_shash_export(&desc.shash, &sha1))
+   if (crypto_shash_export(shash, &sha1))
return -EFAULT;
for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
*hash_state_out = cpu_to_be32(*(sha1.state + i));
break;
case ICP_QAT_HW_AUTH_ALGO_SHA256:
-   if (crypto_shash_export(&desc.shash, &sha256))
+   if (crypto_shash_export(shash, &sha256))
return -EFAULT;
for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
*hash_state_out = cpu_to_be32(*(sha256.state + i));
break;
case ICP_QAT_HW_AUTH_ALGO_SHA512:
-   if (crypto_shash_export(&desc.shash, &sha512))
+   if (crypto_shash_export(shash, &sha512))
return -EFAULT;
for (i = 0; i < digest_size >> 3; i++, hash512_state_out++)
*hash512_state_out = cpu_to_be64(*(sha512.state + i));
@@ -227,10 +224,10 @@ static int qat_alg_do_precomputes(struct 
icp_qat_hw_auth_algo_blk *hash,
return -EFAULT;
}
 
-   if (crypto_shash_init(&desc.shash))
+   if (crypto_shash_init(shash))
return -EFAULT;
 
-   if (crypto_shash_update(&desc.shash, opad, block_size))
+   if (crypto_shash_update(shash, opad, block_size))
return -EFAULT;
 
offset = round_up(qat_get_inter_state_size(ctx->qat_hash_alg), 8);
@@ -239,19 +236,19 @@ static int qat_alg_do_precomputes(struct 
icp_qat_hw_auth_algo_blk *hash,
 
switch (ctx->qat_hash_alg) {
case ICP_QAT_HW_AUTH_ALGO_SHA1:
-   if (crypto_shash_export(&desc.shash, &sha1))
+   if (crypto_shash_export(shash, &sha1))
return -EFAULT;
for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
*hash_state_out = cpu_to_be32(*(sha1.state + i));
break;
case ICP_QAT_HW_AUTH_ALGO_SHA256:
-   if (crypto_shash_export(&desc.shash, &sha256))
+   if (crypto_shash_export(shash, &sha256))
return -EFAULT;
for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
*hash_state_out = cpu_to_be32(*(sha256.state + i));
break;
case ICP_QAT_HW_AUTH_ALGO_SHA512:
-   if (crypto_shash_export(&desc.shash, &sha512))
+   if (crypto_shash_export(shash, &sha512))
return -EFAULT;
 

[PATCH v4 02/12] btrfs: LLVMLinux: Remove VLAIS

2014-09-22 Thread behanw
From: Vinícius Tinti 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent.  This patch instead allocates the appropriate amount of
memory in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Vinícius Tinti 
Reviewed-by: Jan-Simon Möller 
Reviewed-by: Mark Charlebois 
Signed-off-by: Behan Webster 
Acked-by: Chris Mason 
Acked-by: Herbert Xu 
Cc: "David S. Miller" 
---
 fs/btrfs/hash.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/hash.c b/fs/btrfs/hash.c
index 85889aa..4bf4d3a 100644
--- a/fs/btrfs/hash.c
+++ b/fs/btrfs/hash.c
@@ -33,18 +33,16 @@ void btrfs_hash_exit(void)
 
 u32 btrfs_crc32c(u32 crc, const void *address, unsigned int length)
 {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm);
+   u32 *ctx = (u32 *)shash_desc_ctx(shash);
int err;
 
-   desc.shash.tfm = tfm;
-   desc.shash.flags = 0;
-   *(u32 *)desc.ctx = crc;
+   shash->tfm = tfm;
+   shash->flags = 0;
+   *ctx = crc;
 
-   err = crypto_shash_update(&desc.shash, address, length);
+   err = crypto_shash_update(shash, address, length);
BUG_ON(err);
 
-   return *(u32 *)desc.ctx;
+   return *ctx;
 }
-- 
1.9.1



[PATCH v4 08/12] crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt

2014-09-22 Thread behanw
From: Jan-Simon Möller 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller 
Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Acked-by: Herbert Xu 
Cc: pagee...@freemail.hu
Cc: gmazyl...@gmail.com
Cc: "David S. Miller" 
---
 drivers/md/dm-crypt.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index cd15e08..fc93b93 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -526,29 +526,26 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 
*iv,
u8 *data)
 {
struct iv_lmk_private *lmk = >iv_gen_private.lmk;
-   struct {
-   struct shash_desc desc;
-   char ctx[crypto_shash_descsize(lmk->hash_tfm)];
-   } sdesc;
+   SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
struct md5_state md5state;
__le32 buf[4];
int i, r;
 
-   sdesc.desc.tfm = lmk->hash_tfm;
-   sdesc.desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+   desc->tfm = lmk->hash_tfm;
+   desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
 
-   r = crypto_shash_init(&sdesc.desc);
+   r = crypto_shash_init(desc);
if (r)
return r;
 
if (lmk->seed) {
-   r = crypto_shash_update(&sdesc.desc, lmk->seed, LMK_SEED_SIZE);
+   r = crypto_shash_update(desc, lmk->seed, LMK_SEED_SIZE);
if (r)
return r;
}
 
/* Sector is always 512B, block size 16, add data of blocks 1-31 */
-   r = crypto_shash_update(&sdesc.desc, data + 16, 16 * 31);
+   r = crypto_shash_update(desc, data + 16, 16 * 31);
if (r)
return r;
 
@@ -557,12 +554,12 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 
*iv,
buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
buf[2] = cpu_to_le32(4024);
buf[3] = 0;
-   r = crypto_shash_update(&sdesc.desc, (u8 *)buf, sizeof(buf));
+   r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
if (r)
return r;
 
/* No MD5 padding here */
-   r = crypto_shash_export(&sdesc.desc, &md5state);
+   r = crypto_shash_export(desc, &md5state);
if (r)
return r;
 
@@ -679,10 +676,7 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
struct iv_tcw_private *tcw = >iv_gen_private.tcw;
u64 sector = cpu_to_le64((u64)dmreq->iv_sector);
u8 buf[TCW_WHITENING_SIZE];
-   struct {
-   struct shash_desc desc;
-   char ctx[crypto_shash_descsize(tcw->crc32_tfm)];
-   } sdesc;
+   SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
int i, r;
 
/* xor whitening with sector number */
@@ -691,16 +685,16 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
crypto_xor(&buf[8], (u8 *)&sector, 8);
 
/* calculate crc32 for every 32bit part and xor it */
-   sdesc.desc.tfm = tcw->crc32_tfm;
-   sdesc.desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+   desc->tfm = tcw->crc32_tfm;
+   desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
for (i = 0; i < 4; i++) {
-   r = crypto_shash_init(&sdesc.desc);
+   r = crypto_shash_init(desc);
if (r)
goto out;
-   r = crypto_shash_update(&sdesc.desc, &buf[i * 4], 4);
+   r = crypto_shash_update(desc, &buf[i * 4], 4);
if (r)
goto out;
-   r = crypto_shash_final(&sdesc.desc, &buf[i * 4]);
+   r = crypto_shash_final(desc, &buf[i * 4]);
if (r)
goto out;
}
-- 
1.9.1



Re: [PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread NeilBrown
On Tue, 23 Sep 2014 00:18:17 -0400 Tejun Heo  wrote:

> On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote:
> ...
> > Note that reads from a sysfs file are already safe due to the use of
> > seqfile.  The first read will allocate a buffer (m->buf) which will
> > be used for all subsequent reads.
> 
> Hmmm?  How is seqfile safe?  Where would the seq op write to?

seqfile is only safe for reads.  sysfs via kernfs uses seq_read(), so there
is only a single allocation on the first read.

It isn't really related to fixing writes, except to point out that only
writes need to be "fixed".  Reads already work.

Separately:

> Ugh... :( If this can't be avoided at all, I'd much prefer it to be
> something explicit - a flag marking the file as needing a persistent
> write buffer which is allocated on open.  "Small" writes on stack
> feel way too implicit to me.

How about if we add seq_getbuf() and seq_putbuf() to seqfile,
which take a 'struct seq_file' and a size and return the ->buf
after making sure it is big enough.
They also claim and release the seqfile ->lock.

Then we would be using the same buffer for reads and writes.

Does that sound suitable?  It uses existing infrastructure and avoids having
to identify in advance which attributes it is important for.
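
Roughly, the helpers might look like the sketch below (illustrative only:
seq_getbuf()/seq_putbuf() don't exist yet, and the allocation details are
assumed):

static void *seq_getbuf(struct seq_file *m, size_t size)
{
	mutex_lock(&m->lock);
	if (m->size < size) {
		/* grow m->buf so it is big enough for this access */
		kfree(m->buf);
		m->size = roundup_pow_of_two(size);
		m->buf = kmalloc(m->size, GFP_KERNEL);
		if (!m->buf) {
			m->size = 0;
			mutex_unlock(&m->lock);
			return NULL;
		}
	}
	return m->buf;
}

static void seq_putbuf(struct seq_file *m)
{
	mutex_unlock(&m->lock);
}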

Thanks,
NeilBrown




[PATCH v4 01/12] crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code

2014-09-22 Thread behanw
From: Behan Webster 

Add a macro which replaces the use of a Variable Length Array In Struct (VLAIS)
with a C99 compliant equivalent. This macro instead allocates the appropriate
amount of memory using a char array.

The new code can be compiled with both gcc and clang.

struct shash_desc contains a flexible array member ctx declared with
CRYPTO_MINALIGN_ATTR, so sizeof(struct shash_desc) aligns the beginning
of the array declared after struct shash_desc with long long.

No trailing padding is required because it is not a struct type that can
be used in an array.

The CRYPTO_MINALIGN_ATTR is required so that desc is aligned with long long
as would be the case for a struct containing a member with
CRYPTO_MINALIGN_ATTR.

If you want to get to the ctx at the end of the shash_desc as before, you can
do so using shash_desc_ctx(shash).

Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Acked-by: Herbert Xu 
Cc: Michał Mirosław 
---
 include/crypto/hash.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/crypto/hash.h b/include/crypto/hash.h
index a391955..74b13ec 100644
--- a/include/crypto/hash.h
+++ b/include/crypto/hash.h
@@ -58,6 +58,11 @@ struct shash_desc {
void *__ctx[] CRYPTO_MINALIGN_ATTR;
 };
 
+#define SHASH_DESC_ON_STACK(shash, ctx)  \
+   char __##shash##_desc[sizeof(struct shash_desc) + \
+   crypto_shash_descsize(ctx)] CRYPTO_MINALIGN_ATTR; \
+   struct shash_desc *shash = (struct shash_desc *)__##shash##_desc
+
 struct shash_alg {
int (*init)(struct shash_desc *desc);
int (*update)(struct shash_desc *desc, const u8 *data,
-- 
1.9.1
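
For reference, a caller-side sketch of the new macro; example_digest is a
hypothetical caller, and the hash tfm is assumed to have been allocated
elsewhere (e.g. with crypto_alloc_shash()):

static int example_digest(struct crypto_shash *tfm,
			  const u8 *data, unsigned int len, u8 *out)
{
	SHASH_DESC_ON_STACK(shash, tfm);	/* desc + ctx in one stack buffer */

	shash->tfm = tfm;
	shash->flags = 0;

	return crypto_shash_digest(shash, data, len, out);
}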



[PATCH v4 09/12] crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c

2014-09-22 Thread behanw
From: Jan-Simon Möller 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller 
Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Acked-by: Herbert Xu 
Cc: pagee...@freemail.hu
---
 crypto/hmac.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/crypto/hmac.c b/crypto/hmac.c
index 8d9544c..e392219 100644
--- a/crypto/hmac.c
+++ b/crypto/hmac.c
@@ -52,20 +52,17 @@ static int hmac_setkey(struct crypto_shash *parent,
struct hmac_ctx *ctx = align_ptr(opad + ss,
 crypto_tfm_ctx_alignment());
struct crypto_shash *hash = ctx->hash;
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(hash)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, hash);
unsigned int i;
 
-   desc.shash.tfm = hash;
-   desc.shash.flags = crypto_shash_get_flags(parent) &
-   CRYPTO_TFM_REQ_MAY_SLEEP;
+   shash->tfm = hash;
+   shash->flags = crypto_shash_get_flags(parent)
+   & CRYPTO_TFM_REQ_MAY_SLEEP;
 
if (keylen > bs) {
int err;
 
-   err = crypto_shash_digest(&desc.shash, inkey, keylen, ipad);
+   err = crypto_shash_digest(shash, inkey, keylen, ipad);
if (err)
return err;
 
@@ -81,12 +78,12 @@ static int hmac_setkey(struct crypto_shash *parent,
opad[i] ^= 0x5c;
}
 
-   return crypto_shash_init(&desc.shash) ?:
-  crypto_shash_update(&desc.shash, ipad, bs) ?:
-  crypto_shash_export(&desc.shash, ipad) ?:
-  crypto_shash_init(&desc.shash) ?:
-  crypto_shash_update(&desc.shash, opad, bs) ?:
-  crypto_shash_export(&desc.shash, opad);
+   return crypto_shash_init(shash) ?:
+  crypto_shash_update(shash, ipad, bs) ?:
+  crypto_shash_export(shash, ipad) ?:
+  crypto_shash_init(shash) ?:
+  crypto_shash_update(shash, opad, bs) ?:
+  crypto_shash_export(shash, opad);
 }
 
 static int hmac_export(struct shash_desc *pdesc, void *out)
-- 
1.9.1



Re: [PATCH v1 2/5] mm: add full variable in swap_info_struct

2014-09-22 Thread Minchan Kim
On Mon, Sep 22, 2014 at 01:45:22PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:08 +0900 Minchan Kim  wrote:
> 
> > Now, swap leans on !p->highest_bit to indicate that a swap is full.
> > It works well for normal swap because every slot on the swap device
> > is used up when the swap is full, but in the case of zram, swap still
> > sees many empty slots although the backing device (ie, zram) is full
> > since zram's limit has been exceeded, so it can cause trouble when
> > swap uses highest_bit to select a new slot via free_cluster.
> > 
> > This patch introduces a full variable in swap_info_struct
> > to solve the problem.
> > 
> > ...
> >
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -224,6 +224,7 @@ struct swap_info_struct {
> > struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
> > unsigned int lowest_bit;/* index of first free in swap_map */
> > unsigned int highest_bit;   /* index of last free in swap_map */
> > +   boolfull;   /* whether swap is full or not */
> 
> This is protected by swap_info_struct.lock, I worked out.
> 
> There's a large comment at swap_info_struct.lock which could be updated.

Sure.

-- 
Kind regards,
Minchan Kim


[PATCH v4 03/12] crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c

2014-09-22 Thread behanw
From: Jan-Simon Möller 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller 
Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Acked-by: Herbert Xu 
---
 drivers/crypto/ccp/ccp-crypto-sha.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-sha.c 
b/drivers/crypto/ccp/ccp-crypto-sha.c
index 873f234..9653157 100644
--- a/drivers/crypto/ccp/ccp-crypto-sha.c
+++ b/drivers/crypto/ccp/ccp-crypto-sha.c
@@ -198,10 +198,9 @@ static int ccp_sha_setkey(struct crypto_ahash *tfm, const 
u8 *key,
 {
struct ccp_ctx *ctx = crypto_tfm_ctx(crypto_ahash_tfm(tfm));
struct crypto_shash *shash = ctx->u.sha.hmac_tfm;
-   struct {
-   struct shash_desc sdesc;
-   char ctx[crypto_shash_descsize(shash)];
-   } desc;
+
+   SHASH_DESC_ON_STACK(sdesc, shash);
+
unsigned int block_size = crypto_shash_blocksize(shash);
unsigned int digest_size = crypto_shash_digestsize(shash);
int i, ret;
@@ -216,11 +215,11 @@ static int ccp_sha_setkey(struct crypto_ahash *tfm, const 
u8 *key,
 
if (key_len > block_size) {
/* Must hash the input key */
-   desc.sdesc.tfm = shash;
-   desc.sdesc.flags = crypto_ahash_get_flags(tfm) &
+   sdesc->tfm = shash;
+   sdesc->flags = crypto_ahash_get_flags(tfm) &
CRYPTO_TFM_REQ_MAY_SLEEP;
 
-   ret = crypto_shash_digest(&desc.sdesc, key, key_len,
+   ret = crypto_shash_digest(sdesc, key, key_len,
  ctx->u.sha.key);
if (ret) {
crypto_ahash_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
-- 
1.9.1



[PATCH v4 10/12] crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c

2014-09-22 Thread behanw
From: Jan-Simon Möller 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller 
Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Acked-by: Herbert Xu 
Cc: pagee...@freemail.hu
Cc: "David S. Miller" 
---
 lib/libcrc32c.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/lib/libcrc32c.c b/lib/libcrc32c.c
index b3131f5..6a08ce7 100644
--- a/lib/libcrc32c.c
+++ b/lib/libcrc32c.c
@@ -41,20 +41,18 @@ static struct crypto_shash *tfm;
 
 u32 crc32c(u32 crc, const void *address, unsigned int length)
 {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm);
+   u32 *ctx = (u32 *)shash_desc_ctx(shash);
int err;
 
-   desc.shash.tfm = tfm;
-   desc.shash.flags = 0;
-   *(u32 *)desc.ctx = crc;
+   shash->tfm = tfm;
+   shash->flags = 0;
+   *ctx = crc;
 
-   err = crypto_shash_update(&desc.shash, address, length);
+   err = crypto_shash_update(shash, address, length);
BUG_ON(err);
 
-   return *(u32 *)desc.ctx;
+   return *ctx;
 }
 
 EXPORT_SYMBOL(crc32c);
-- 
1.9.1



Re: [PATCH v1 1/5] zram: generalize swap_slot_free_notify

2014-09-22 Thread Minchan Kim
Hi Andrew,

On Mon, Sep 22, 2014 at 01:41:09PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:07 +0900 Minchan Kim  wrote:
> 
> > Currently, swap_slot_free_notify is used by zram to free
> > the duplicated copy of a page, for memory efficiency, when it knows
> > there is no reference to the swap slot.
> > 
> > This patch generalizes it so that it can be used for other
> > swap hints to communicate with the VM.
> > 
> 
> I really think we need to do a better job of documenting the code.
> 
> > index 94d93b1f8b53..c262bfbeafa9 100644
> > --- a/Documentation/filesystems/Locking
> > +++ b/Documentation/filesystems/Locking
> > @@ -405,7 +405,7 @@ prototypes:
> > void (*unlock_native_capacity) (struct gendisk *);
> > int (*revalidate_disk) (struct gendisk *);
> > int (*getgeo)(struct block_device *, struct hd_geometry *);
> > -   void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > +   int (*swap_hint) (struct block_device *, unsigned int, void *);
> >  
> >  locking rules:
> > bd_mutex
> > @@ -418,7 +418,7 @@ media_changed:  no
> >  unlock_native_capacity:no
> >  revalidate_disk:   no
> >  getgeo:no
> > -swap_slot_free_notify: no  (see below)
> > +swap_hint: no  (see below)
> 
> This didn't tell anyone anything much.

Yeb. :(

> 
> > index d78b245bae06..22a37764c409 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -926,7 +926,8 @@ error:
> > bio_io_error(bio);
> >  }
> >  
> > -static void zram_slot_free_notify(struct block_device *bdev,
> > +/* this callback is with swap_lock and sometimes page table lock held */
> 
> OK, that was useful.
> 
> It's called "page_table_lock".
> 
> Also *which* page_table_lock?  current->mm?

It depends on ALLOC_SPLIT_PTLOCKS, so it could be page->ptl, too.
So it would be better to call it *ptlock*?
Since it's the ptlock, it isn't tied to any particular mm struct.
What we need to hold is just the ptlock of the page table that
points to this swap page.

So, I want this.

diff --git a/Documentation/filesystems/Locking 
b/Documentation/filesystems/Locking
index c262bfbeafa9..19d2726e34f4 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -423,8 +423,8 @@ swap_hint:  no  (see below)
 media_changed, unlock_native_capacity and revalidate_disk are called only from
 check_disk_change().
 
-swap_slot_free_notify is called with swap_lock and sometimes the page lock
-held.
+swap_hint is called with swap_info_struct->lock and sometimes the ptlock
+of the page table pointing to the swap page.
 
 
 --- file_operations ---

> 
> > +static int zram_slot_free_notify(struct block_device *bdev,
> > unsigned long index)
> >  {
> > struct zram *zram;
> >
> > ...
> >
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1609,6 +1609,10 @@ static inline bool 
> > blk_integrity_is_initialized(struct gendisk *g)
> >  
> >  #endif /* CONFIG_BLK_DEV_INTEGRITY */
> >  
> > +enum swap_blk_hint {
> > +   SWAP_FREE,
> > +};
> 
> This would be a great place to document SWAP_FREE.

Yes,

> 
> >  struct block_device_operations {
> > int (*open) (struct block_device *, fmode_t);
> > void (*release) (struct gendisk *, fmode_t);
> > @@ -1624,8 +1628,7 @@ struct block_device_operations {
> > void (*unlock_native_capacity) (struct gendisk *);
> > int (*revalidate_disk) (struct gendisk *);
> > int (*getgeo)(struct block_device *, struct hd_geometry *);
> > -   /* this callback is with swap_lock and sometimes page table lock held */
> > -   void (*swap_slot_free_notify) (struct block_device *, unsigned long);
> > +   int (*swap_hint)(struct block_device *, unsigned int, void *);
> 
> And this would be a suitable place to document ->swap_hint().

If we expect to add more hints in future, documenting them all there could
get verbose, so IMO it would be better to describe each hint in enum
swap_hint. :)

> 
> - Hint from who to who?  Is it the caller providing the callee a hint
>   or is the caller asking the callee for a hint?
> 
> - What is the meaning of the return value?
> 
> - What are the meaning of the arguments?

Okay.
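
Maybe something along these lines? (rough sketch only -- the argument
names and wording are illustrative, not final):

/*
 * swap_hint: a hint delivered from the VM (the caller) to the driver
 * of the backing device (the callee).
 *
 * @bdev: block device backing the swap area
 * @hint: which hint is being delivered, see enum swap_hint
 * @arg:  hint-specific payload; for SWAP_FREE the swap offset
 *        (unsigned long) of the slot that lost its last user
 *
 * Returns 0 on success or a negative errno. For purely advisory
 * hints the caller may ignore the return value.
 */
int (*swap_hint)(struct block_device *bdev, enum swap_hint hint, void *arg);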

> 
> Please don't omit the argument names like this.  They are useful!  How
> is a reader to know what that "unsigned int" and "void *" actually
> *do*?

Yes.

> 
> The second arg-which-doesn't-have-a-name should have had type
> swap_blk_hint, yes?

Yes.

> 
> swap_blk_hint should be called swap_block_hint.  I assume that's what
> "blk" means.  Why does the name have "block" in there anyway?  It has
> something to do with disk blocks?  How is anyone supposed to work that
> out?

Yeb, I think we don't need block in name. I will remove it.

> 
> ->swap_hint was converted to return an `int', but all the callers
> simply ignore the return value.

You're right. No caller uses it in this patch, but this 

[PATCH v4 11/12] security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c

2014-09-22 Thread behanw
From: Behan Webster 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array via the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Reviewed-by: Jan-Simon Möller 
Acked-by: Herbert Xu 
Cc: t...@linutronix.de
---
 security/integrity/ima/ima_crypto.c | 47 +++--
 1 file changed, 19 insertions(+), 28 deletions(-)

diff --git a/security/integrity/ima/ima_crypto.c 
b/security/integrity/ima/ima_crypto.c
index 0bd7328..e35f5d9 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -380,17 +380,14 @@ static int ima_calc_file_hash_tfm(struct file *file,
loff_t i_size, offset = 0;
char *rbuf;
int rc, read = 0;
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm);
 
-   desc.shash.tfm = tfm;
-   desc.shash.flags = 0;
+   shash->tfm = tfm;
+   shash->flags = 0;
 
hash->length = crypto_shash_digestsize(tfm);
 
-   rc = crypto_shash_init(&desc.shash);
+   rc = crypto_shash_init(shash);
if (rc != 0)
return rc;
 
@@ -420,7 +417,7 @@ static int ima_calc_file_hash_tfm(struct file *file,
break;
offset += rbuf_len;
 
-   rc = crypto_shash_update(&desc.shash, rbuf, rbuf_len);
+   rc = crypto_shash_update(shash, rbuf, rbuf_len);
if (rc)
break;
}
@@ -429,7 +426,7 @@ static int ima_calc_file_hash_tfm(struct file *file,
kfree(rbuf);
 out:
if (!rc)
-   rc = crypto_shash_final(&desc.shash, hash->digest);
+   rc = crypto_shash_final(shash, hash->digest);
return rc;
 }
 
@@ -487,18 +484,15 @@ static int ima_calc_field_array_hash_tfm(struct 
ima_field_data *field_data,
 struct ima_digest_data *hash,
 struct crypto_shash *tfm)
 {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm);
int rc, i;
 
-   desc.shash.tfm = tfm;
-   desc.shash.flags = 0;
+   shash->tfm = tfm;
+   shash->flags = 0;
 
hash->length = crypto_shash_digestsize(tfm);
 
-   rc = crypto_shash_init(&desc.shash);
+   rc = crypto_shash_init(shash);
if (rc != 0)
return rc;
 
@@ -508,7 +502,7 @@ static int ima_calc_field_array_hash_tfm(struct 
ima_field_data *field_data,
u32 datalen = field_data[i].len;
 
if (strcmp(td->name, IMA_TEMPLATE_IMA_NAME) != 0) {
-   rc = crypto_shash_update(&desc.shash,
+   rc = crypto_shash_update(shash,
(const u8 *) &field_data[i].len,
sizeof(field_data[i].len));
if (rc)
@@ -518,13 +512,13 @@ static int ima_calc_field_array_hash_tfm(struct 
ima_field_data *field_data,
data_to_hash = buffer;
datalen = IMA_EVENT_NAME_LEN_MAX + 1;
}
-   rc = crypto_shash_update(&desc.shash, data_to_hash, datalen);
+   rc = crypto_shash_update(shash, data_to_hash, datalen);
if (rc)
break;
}
 
if (!rc)
-   rc = crypto_shash_final(&desc.shash, hash->digest);
+   rc = crypto_shash_final(shash, hash->digest);
 
return rc;
 }
@@ -565,15 +559,12 @@ static int __init ima_calc_boot_aggregate_tfm(char 
*digest,
 {
u8 pcr_i[TPM_DIGEST_SIZE];
int rc, i;
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm);
 
-   desc.shash.tfm = tfm;
-   desc.shash.flags = 0;
+   shash->tfm = tfm;
+   shash->flags = 0;
 
-   rc = crypto_shash_init(&desc.shash);
+   rc = crypto_shash_init(shash);
if (rc != 0)
return rc;
 
@@ -581,10 +572,10 @@ static int __init ima_calc_boot_aggregate_tfm(char 
*digest,
for (i = TPM_PCR0; i < TPM_PCR8; i++) {
ima_pcrread(i, pcr_i);
/* now accumulate with current aggregate */
-   rc = crypto_shash_update(&desc.shash, pcr_i, TPM_DIGEST_SIZE);
+   rc = crypto_shash_update(shash, pcr_i, TPM_DIGEST_SIZE);
}
if (!rc)
-   crypto_shash_final(&desc.shash, digest);
+   crypto_shash_final(shash, digest);
return rc;
 }
 
-- 
1.9.1


[PATCH v4 00/12] LLVMLinux: Patches to enable the kernel to be compiled with clang/LLVM

2014-09-22 Thread behanw
From: Behan Webster 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. These patches allocate the appropriate amount of memory
in a char array via the SHASH_DESC_ON_STACK macro.

There are places in the kernel whose maintainers have previously taken our
patches to remove VLAIS from their crypto code. Once this patch set is accepted
into mainline, I'll go back and resubmit patches to these maintainers to use
this new macro so the same approach is used consistently in all places in the
kernel.

The LLVMLinux project aims to fully build the Linux kernel using both gcc and
clang (the C front end for the LLVM compiler infrastructure project). 


Behan Webster (6):
  crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code
  crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c
  crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c
  crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c
  crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c
  security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c

Jan-Simon Möller (5):
  crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c
  crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt
  crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c
  crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c
  crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c

Vinícius Tinti (1):
  btrfs: LLVMLinux: Remove VLAIS

 crypto/hmac.c| 25 -
 crypto/testmgr.c | 14 --
 drivers/crypto/ccp/ccp-crypto-sha.c  | 13 -
 drivers/crypto/mv_cesa.c | 41 
 drivers/crypto/n2_core.c | 11 +++-
 drivers/crypto/omap-sham.c   | 28 ---
 drivers/crypto/qat/qat_common/qat_algs.c | 31 ++---
 drivers/md/dm-crypt.c| 34 ++-
 fs/btrfs/hash.c  | 16 +--
 include/crypto/hash.h|  5 
 lib/libcrc32c.c  | 16 +--
 security/integrity/ima/ima_crypto.c  | 47 +---
 12 files changed, 122 insertions(+), 159 deletions(-)

-- 
1.9.1



[PATCH v4 12/12] crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c

2014-09-22 Thread behanw
From: Jan-Simon Möller 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array via the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller 
Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Acked-by: Herbert Xu 
Cc: pagee...@freemail.hu
---
 crypto/testmgr.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index ac2b631..b959c0c 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -1714,16 +1714,14 @@ static int alg_test_crc32c(const struct alg_test_desc 
*desc,
}
 
do {
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm)];
-   } sdesc;
+   SHASH_DESC_ON_STACK(shash, tfm);
+   u32 *ctx = (u32 *)shash_desc_ctx(shash);
 
-   sdesc.shash.tfm = tfm;
-   sdesc.shash.flags = 0;
+   shash->tfm = tfm;
+   shash->flags = 0;
 
-   *(u32 *)sdesc.ctx = le32_to_cpu(420553207);
-   err = crypto_shash_final(&sdesc.shash, (u8 *)&val);
+   *ctx = le32_to_cpu(420553207);
+   err = crypto_shash_final(shash, (u8 *)&val);
if (err) {
printk(KERN_ERR "alg: crc32c: Operation failed for "
   "%s: %d\n", driver, err);
-- 
1.9.1



[PATCH v4 04/12] crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c

2014-09-22 Thread behanw
From: Behan Webster 

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
in a char array via the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster 
Reviewed-by: Mark Charlebois 
Reviewed-by: Jan-Simon Möller 
Acked-by: Herbert Xu 
---
 drivers/crypto/mv_cesa.c | 41 ++---
 1 file changed, 18 insertions(+), 23 deletions(-)

diff --git a/drivers/crypto/mv_cesa.c b/drivers/crypto/mv_cesa.c
index 29d0ee5..032c72c 100644
--- a/drivers/crypto/mv_cesa.c
+++ b/drivers/crypto/mv_cesa.c
@@ -402,26 +402,23 @@ static int mv_hash_final_fallback(struct ahash_request 
*req)
 {
const struct mv_tfm_hash_ctx *tfm_ctx = crypto_tfm_ctx(req->base.tfm);
struct mv_req_hash_ctx *req_ctx = ahash_request_ctx(req);
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(tfm_ctx->fallback)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, tfm_ctx->fallback);
int rc;
 
-   desc.shash.tfm = tfm_ctx->fallback;
-   desc.shash.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+   shash->tfm = tfm_ctx->fallback;
+   shash->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
if (unlikely(req_ctx->first_hash)) {
-   crypto_shash_init(&desc.shash);
-   crypto_shash_update(&desc.shash, req_ctx->buffer,
+   crypto_shash_init(shash);
+   crypto_shash_update(shash, req_ctx->buffer,
req_ctx->extra_bytes);
} else {
/* only SHA1 for now
 */
-   rc = mv_hash_import_sha1_ctx(req_ctx, &desc.shash);
+   rc = mv_hash_import_sha1_ctx(req_ctx, shash);
if (rc)
goto out;
}
-   rc = crypto_shash_final(&desc.shash, req->result);
+   rc = crypto_shash_final(shash, req->result);
 out:
return rc;
 }
@@ -794,23 +791,21 @@ static int mv_hash_setkey(struct crypto_ahash *tfm, const 
u8 * key,
ss = crypto_shash_statesize(ctx->base_hash);
 
{
-   struct {
-   struct shash_desc shash;
-   char ctx[crypto_shash_descsize(ctx->base_hash)];
-   } desc;
+   SHASH_DESC_ON_STACK(shash, ctx->base_hash);
+
unsigned int i;
char ipad[ss];
char opad[ss];
 
-   desc.shash.tfm = ctx->base_hash;
-   desc.shash.flags = crypto_shash_get_flags(ctx->base_hash) &
+   shash->tfm = ctx->base_hash;
+   shash->flags = crypto_shash_get_flags(ctx->base_hash) &
CRYPTO_TFM_REQ_MAY_SLEEP;
 
if (keylen > bs) {
int err;
 
err =
-   crypto_shash_digest(&desc.shash, key, keylen, ipad);
+   crypto_shash_digest(shash, key, keylen, ipad);
if (err)
return err;
 
@@ -826,12 +821,12 @@ static int mv_hash_setkey(struct crypto_ahash *tfm, const 
u8 * key,
opad[i] ^= 0x5c;
}
 
-   rc = crypto_shash_init(&desc.shash) ? :
-   crypto_shash_update(&desc.shash, ipad, bs) ? :
-   crypto_shash_export(&desc.shash, ipad) ? :
-   crypto_shash_init(&desc.shash) ? :
-   crypto_shash_update(&desc.shash, opad, bs) ? :
-   crypto_shash_export(&desc.shash, opad);
+   rc = crypto_shash_init(shash) ? :
+   crypto_shash_update(shash, ipad, bs) ? :
+   crypto_shash_export(shash, ipad) ? :
+   crypto_shash_init(shash) ? :
+   crypto_shash_update(shash, opad, bs) ? :
+   crypto_shash_export(shash, opad);
 
if (rc == 0)
mv_hash_init_ivs(ctx, ipad, opad);
-- 
1.9.1



linux-next: manual merge of the tiny tree with the tip tree

2014-09-22 Thread Stephen Rothwell
Hi Josh,

Today's linux-next merge of the tiny tree got conflicts in
arch/x86/kernel/process_32.c and arch/x86/kernel/process_64.c between
commits dc56c0f9b870 ("x86, fpu: Shift "fpu_counter = 0" from
copy_thread() to arch_dup_task_struct()") and 6f46b3aef003 ("x86:
copy_thread: Don't nullify ->ptrace_bps twice") from the tip tree and
commits a1cf09f93e66 ("x86: process: Unify 32-bit and 64-bit
copy_thread I/O bitmap handling") and e4a191d1e05b ("x86: Support
compiling out userspace I/O (iopl and ioperm)") from the tiny tree.

I fixed it up (I think - see below) and can carry the fix as necessary
(no action is required).

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au

diff --cc arch/x86/kernel/process_32.c
index 8f3ebfe710d0,e37f006fda6e..
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@@ -153,7 -153,9 +154,7 @@@ int copy_thread(unsigned long clone_fla
childregs->orig_ax = -1;
childregs->cs = __KERNEL_CS | get_kernel_rpl();
childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED;
-   p->thread.io_bitmap_ptr = NULL;
 -  p->thread.fpu_counter = 0;
+   clear_thread_io_bitmap(p);
 -  memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
return 0;
}
*childregs = *current_pt_regs();
@@@ -164,22 -166,12 +165,9 @@@
p->thread.ip = (unsigned long) ret_from_fork;
task_user_gs(p) = get_user_gs(current_pt_regs());
  
-   p->thread.io_bitmap_ptr = NULL;
 -  p->thread.fpu_counter = 0;
+   clear_thread_io_bitmap(p);
tsk = current;
-   err = -ENOMEM;
- 
-   if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) {
-   p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr,
-   IO_BITMAP_BYTES, GFP_KERNEL);
-   if (!p->thread.io_bitmap_ptr) {
-   p->thread.io_bitmap_max = 0;
-   return -ENOMEM;
-   }
-   set_tsk_thread_flag(p, TIF_IO_BITMAP);
-   }
- 
-   err = 0;
  
 -  memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
 -
/*
 * Set a new TLS for the child thread?
 */
diff --cc arch/x86/kernel/process_64.c
index 3ed4a68d4013,80f348659edd..
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@@ -163,7 -164,8 +164,7 @@@ int copy_thread(unsigned long clone_fla
p->thread.sp = (unsigned long) childregs;
p->thread.usersp = me->thread.usersp;
set_tsk_thread_flag(p, TIF_FORK);
-   p->thread.io_bitmap_ptr = NULL;
 -  p->thread.fpu_counter = 0;
+   clear_thread_io_bitmap(p);
  
savesegment(gs, p->thread.gsindex);
p->thread.gs = p->thread.gsindex ? 0 : me->thread.gs;
@@@ -191,17 -193,8 +192,6 @@@
if (sp)
childregs->sp = sp;
  
-   err = -ENOMEM;
-   if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
-   p->thread.io_bitmap_ptr = kmemdup(me->thread.io_bitmap_ptr,
- IO_BITMAP_BYTES, GFP_KERNEL);
-   if (!p->thread.io_bitmap_ptr) {
-   p->thread.io_bitmap_max = 0;
-   return -ENOMEM;
-   }
-   set_tsk_thread_flag(p, TIF_IO_BITMAP);
-   }
 -  memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps));
--
/*
 * Set a new TLS for the child thread?
 */




linux-next: manual merge of the tiny tree with the tip tree

2014-09-22 Thread Stephen Rothwell
Hi Josh,

Today's linux-next merge of the tiny tree got a conflict in
arch/x86/kernel/cpu/common.c between commit ce4b1b16502b ("x86/smpboot:
Initialize secondary CPU only if master CPU will wait for it") from the
tip tree and commit e4a191d1e05b ("x86: Support compiling out userspace
I/O (iopl and ioperm)") from the tiny tree.

I fixed it up (see below) and can carry the fix as necessary (no action
is required).

-- 
Cheers,
Stephen Rothwell  s...@canb.auug.org.au

diff --cc arch/x86/kernel/cpu/common.c
index 3d05d4699dbd,11e08cefdb6e..
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@@ -1294,10 -1281,7 +1294,9 @@@ void cpu_init(void
struct task_struct *me;
struct tss_struct *t;
unsigned long v;
 -  int cpu;
 +  int cpu = stack_smp_processor_id();
-   int i;
 +
 +  wait_for_master_cpu(cpu);
  
/*
 * Load microcode on this cpu if a valid microcode is available.




Re: [PATCH] ath: change logging functions to return void

2014-09-22 Thread Kalle Valo
Joe Perches  writes:

> The return values are not used by callers of these functions
> so change the functions to return void.
>
> Other miscellanea:
>
> o add __printf verification to wil6210 logging functions
>   No format/argument mismatches found
>
> Signed-off-by: Joe Perches 
> ---
> This change is associated to a desire to eventually
> change printk to return void.
>
>  drivers/net/wireless/ath/ath10k/debug.c| 18 +-
>  drivers/net/wireless/ath/ath10k/debug.h|  6 +++---
>  drivers/net/wireless/ath/ath6kl/common.h   |  2 +-
>  drivers/net/wireless/ath/ath6kl/debug.c| 28 
>  drivers/net/wireless/ath/ath6kl/debug.h| 13 ++---

For ath6kl and ath10k:

Acked-by: Kalle Valo 

>  drivers/net/wireless/ath/wil6210/debug.c   | 14 --
>  drivers/net/wireless/ath/wil6210/wil6210.h |  7 +--
>  7 files changed, 32 insertions(+), 56 deletions(-)

John, as this patch also contains a wil6210 change how do you want to
handle this?

-- 
Kalle Valo


RE: [PATCH v2] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition

2014-09-22 Thread Dexuan Cui
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Dexuan Cui
> Sent: Tuesday, September 23, 2014 13:01 PM
> To: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> jasow...@redhat.com
> Cc: KY Srinivasan; Haiyang Zhang
> Subject: [PATCH v2] Tools: hv: vssdaemon: ignore the EBUSY on multiple
> freezing the same partition
> 
> v2: I added "errno = 0;" in the ioctl()
typo -- "in the ioctl()" should be "before the ioctl()".

Thanks,
-- Dexuan


Re: sleeping while atomic in blk_free_devt

2014-09-22 Thread Jens Axboe

> On Sep 22, 2014, at 8:49 PM, Dave Jones  wrote:
> 
> Just got this when removing a USB memory stick.
> 
> BUG: sleeping function called from invalid context at block/genhd.c:448

Fixed in for-linus, it's going out tomorrow.




Re: [PATCH] Fix the issue that lowmemkiller fell into a cycle that try to kill a task

2014-09-22 Thread Greg KH
On Tue, Sep 23, 2014 at 10:57:09AM +0800, Hui Zhu wrote:
> The cause of this issue is that when the amount of free memory is low and
> a lot of tasks are trying to shrink the memory, the task killed by
> lowmemkiller cannot get CPU time to exit.
> 
> Fix this issue by changing the scheduling policy to SCHED_FIFO in
> lowmemkiller if a task's TIF_MEMDIE flag is set.
> 
> Signed-off-by: Hui Zhu 
> ---
>  drivers/staging/android/lowmemorykiller.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/staging/android/lowmemorykiller.c 
> b/drivers/staging/android/lowmemorykiller.c
> index b545d3d..ca1ffac 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -129,6 +129,10 @@ static unsigned long lowmem_scan(struct shrinker *s, 
> struct shrink_control *sc)
>  
>   if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
>   time_before_eq(jiffies, lowmem_deathpending_timeout)) {
> + struct sched_param param = { .sched_priority = 1 };
> +
> + if (p->policy == SCHED_NORMAL)
> + sched_setscheduler(p, SCHED_FIFO, &param);

This seems really specific to a specific scheduler pattern now.  Isn't
there some other way to resolve this?

thanks,

greg k-h


Re: [PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread Tejun Heo
On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote:
...
> Note that reads from a sysfs file are already safe due to the use of
> seqfile.  The first read will allocate a buffer (m->buf) which will
> be used for all subsequent reads.

Hmmm?  How is seqfile safe?  Where would the seq op write to?

Thanks.

-- 
tejun


Re: [PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread Tejun Heo
Hello, Neil.

On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote:
>  When mdmon needs to update metadata after a device failure in an array
>  there are two 'kmalloc' sources that can trigger deadlock if memory is tight
>  and needs to be written to the array (which cannot be allowed until mdmon
>  updates the metadata).
>  One is in O_DIRECT writes which I have patches for.  The other is when
>  writing to the sysfs file to tell md that it is safe to continue.
>  This simple patch removes the second.

Ugh... :( If this can't be avoided at all, I'd much prefer it to be
something explicit - a flag marking the file as needing a persistent
write buffer which is allocated on open.  "Small" writes on stack
feel way too implicit to me.
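
E.g., hand-waving, something like the following; the flag and field names
are invented here for illustration, not an existing kernfs API:

/* in struct kernfs_ops: an explicit opt-in flag */
bool prealloc_write_buf;	/* file needs a persistent write buffer */

/* at open time, assuming an of->prealloc_buf field: */
if (ops->prealloc_write_buf) {
	of->prealloc_buf = kmalloc(of->atomic_write_len ?: PAGE_SIZE,
				   GFP_KERNEL);
	if (!of->prealloc_buf)
		return -ENOMEM;
}

kernfs_fop_write() would then use of->prealloc_buf (serialized by
of->mutex) instead of allocating, and release would kfree() it.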

Thanks.

-- 
tejun


[PATCH] kernfs: use stack-buf for small writes.

2014-09-22 Thread NeilBrown

For a write <= 128 characters, don't use kmalloc.

mdmon, part of mdadm, will sometimes need to write
to a sysfs file in order to allow writes to the array
to continue.  This is important to support RAID metadata
types that the kernel doesn't know about.

It is important that this write doesn't block on
memory allocation.  The safest way to ensure that is to
use an on-stack buffer.

Writes are always small, typically less than 10 characters.

Note that reads from a sysfs file are already safe due to the use of
seqfile.  The first read will allocate a buffer (m->buf) which will
be used for all subsequent reads.

Signed-off-by: NeilBrown 

---
Hi Tejun,
 I wonder if you would consider this patch.
 When mdmon needs to update metadata after a device failure in an array
 there are two 'kmalloc' sources that can trigger deadlock if memory is tight
 and needs to be written to the array (which cannot be allowed until mdmon
 updates the metadata).
 One is in O_DIRECT writes which I have patches for.  The other is when
 writing to the sysfs file to tell md that it is safe to continue.
 This simple patch removes the second.

Thanks,
NeilBrown


diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 4429d6d9217f..75b58669ce55 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -269,6 +269,7 @@ static ssize_t kernfs_fop_write(struct file *file, const 
char __user *user_buf,
const struct kernfs_ops *ops;
size_t len;
char *buf;
+   char stackbuf[129];
 
if (of->atomic_write_len) {
len = count;
@@ -278,7 +279,10 @@ static ssize_t kernfs_fop_write(struct file *file, const 
char __user *user_buf,
len = min_t(size_t, count, PAGE_SIZE);
}
 
-   buf = kmalloc(len + 1, GFP_KERNEL);
+   if (len < sizeof(stackbuf))
+   buf = stackbuf;
+   else
+   buf = kmalloc(len + 1, GFP_KERNEL);
if (!buf)
return -ENOMEM;
 
@@ -311,7 +315,8 @@ static ssize_t kernfs_fop_write(struct file *file, const 
char __user *user_buf,
if (len > 0)
*ppos += len;
 out_free:
-   kfree(buf);
+   if (buf != stackbuf)
+   kfree(buf);
return len;
 }
 




[PATCH v2 1/2] cap1106: Add support for various cap11xx devices

2014-09-22 Thread Matt Ranostay
Several other variants of the cap11xx device exist with a varying
number of capacitance detection channels. Add support for creating
the channels dynamically.

Signed-off-by: Matt Ranostay 
---
 drivers/input/keyboard/cap1106.c | 64 +++-
 1 file changed, 44 insertions(+), 20 deletions(-)

diff --git a/drivers/input/keyboard/cap1106.c b/drivers/input/keyboard/cap1106.c
index d70b65a..07f9e88 100644
--- a/drivers/input/keyboard/cap1106.c
+++ b/drivers/input/keyboard/cap1106.c
@@ -55,8 +55,6 @@
 #define CAP1106_REG_MANUFACTURER_ID 0xfe
 #define CAP1106_REG_REVISION   0xff
 
-#define CAP1106_NUM_CHN 6
-#define CAP1106_PRODUCT_ID 0x55
 #define CAP1106_MANUFACTURER_ID 0x5d
 
 struct cap1106_priv {
@@ -64,7 +62,25 @@ struct cap1106_priv {
struct input_dev *idev;
 
/* config */
-   unsigned short keycodes[CAP1106_NUM_CHN];
+   u32 *keycodes;
+   unsigned int num_channels;
+};
+
+struct cap11xx_hw_model {
+   uint8_t product_id;
+   unsigned int num_channels;
+};
+
+enum {
+   CAP1106,
+   CAP1126,
+   CAP1188,
+};
+
+struct cap11xx_hw_model cap11xx_devices[] = {
+   [CAP1106] = { .product_id = 0x55, .num_channels = 6 },
+   [CAP1126] = { .product_id = 0x53, .num_channels = 6 },
+   [CAP1188] = { .product_id = 0x50, .num_channels = 8 },
 };
 
 static const struct reg_default cap1106_reg_defaults[] = {
@@ -151,7 +167,7 @@ static irqreturn_t cap1106_thread_func(int irq_num, void 
*data)
if (ret < 0)
goto out;
 
-   for (i = 0; i < CAP1106_NUM_CHN; i++)
+   for (i = 0; i < priv->num_channels; i++)
input_report_key(priv->idev, priv->keycodes[i],
 status & (1 << i));
 
@@ -188,14 +204,23 @@ static int cap1106_i2c_probe(struct i2c_client 
*i2c_client,
struct device *dev = &i2c_client->dev;
struct cap1106_priv *priv;
struct device_node *node;
+   struct cap11xx_hw_model *cap = &cap11xx_devices[id->driver_data];
int i, error, irq, gain = 0;
unsigned int val, rev;
-   u32 gain32, keycodes[CAP1106_NUM_CHN];
+   u32 gain32;
 
priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
if (!priv)
return -ENOMEM;
 
+   BUG_ON(!cap->num_channels);
+
+   priv->num_channels = cap->num_channels;
+   priv->keycodes = devm_kcalloc(dev,
+   priv->num_channels, sizeof(u32), GFP_KERNEL);
+   if (!priv->keycodes)
+   return -ENOMEM;
+
priv->regmap = devm_regmap_init_i2c(i2c_client, &cap1106_regmap_config);
if (IS_ERR(priv->regmap))
return PTR_ERR(priv->regmap);
@@ -204,9 +229,9 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client,
if (error)
return error;
 
-   if (val != CAP1106_PRODUCT_ID) {
+   if (val != cap->product_id) {
dev_err(dev, "Product ID: Got 0x%02x, expected 0x%02x\n",
-   val, CAP1106_PRODUCT_ID);
+   val, cap->product_id);
return -ENODEV;
}
 
@@ -235,17 +260,12 @@ static int cap1106_i2c_probe(struct i2c_client 
*i2c_client,
dev_err(dev, "Invalid sensor-gain value %d\n", gain32);
}
 
-   BUILD_BUG_ON(ARRAY_SIZE(keycodes) != ARRAY_SIZE(priv->keycodes));
-
/* Provide some useful defaults */
-   for (i = 0; i < ARRAY_SIZE(keycodes); i++)
-   keycodes[i] = KEY_A + i;
+   for (i = 0; i < priv->num_channels; i++)
+   priv->keycodes[i] = KEY_A + i;
 
of_property_read_u32_array(node, "linux,keycodes",
-  keycodes, ARRAY_SIZE(keycodes));
-
-   for (i = 0; i < ARRAY_SIZE(keycodes); i++)
-   priv->keycodes[i] = keycodes[i];
+   priv->keycodes, priv->num_channels);
 
error = regmap_update_bits(priv->regmap, CAP1106_REG_MAIN_CONTROL,
   CAP1106_REG_MAIN_CONTROL_GAIN_MASK,
@@ -269,17 +289,17 @@ static int cap1106_i2c_probe(struct i2c_client 
*i2c_client,
if (of_property_read_bool(node, "autorepeat"))
__set_bit(EV_REP, priv->idev->evbit);
 
-   for (i = 0; i < CAP1106_NUM_CHN; i++)
+   for (i = 0; i < priv->num_channels; i++)
__set_bit(priv->keycodes[i], priv->idev->keybit);
 
__clear_bit(KEY_RESERVED, priv->idev->keybit);
 
priv->idev->keycode = priv->keycodes;
-   priv->idev->keycodesize = sizeof(priv->keycodes[0]);
-   priv->idev->keycodemax = ARRAY_SIZE(priv->keycodes);
+   priv->idev->keycodesize = sizeof(u32);
+   priv->idev->keycodemax = priv->num_channels;
 
priv->idev->id.vendor = CAP1106_MANUFACTURER_ID;
-   priv->idev->id.product = CAP1106_PRODUCT_ID;
+   priv->idev->id.product = cap->product_id;
priv->idev->id.version = rev;
 
priv->idev->open = cap1106_input_open;
@@ -313,12 +333,16 @@ static 

[PATCH v2 0/2] cap1106: add support for cap11xx variants

2014-09-22 Thread Matt Ranostay
Changes from v1:

 * Reworked support for the various devices to check the product ID of
   the respective device.
 * Added a check for an invalid zero channel count.
 * Renamed the active-high option to the clearer irq-active-high
 * Use regmap_update_bits() instead of regmap_write_bits() (see the
   sketch below)
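
   A quick sketch of the resulting call, reusing names from patch 2/2
   (illustrative only). As I understand it, regmap_update_bits() does a
   read-modify-write and can skip the bus write when the masked bits are
   already at the requested value, whereas regmap_write_bits() forces the
   write even when nothing changes:

	error = regmap_update_bits(priv->regmap, CAP1106_REG_CONFIG2,
				   CAP1106_REG_CONFIG2_ALT_POL, /* mask */
				   0);                          /* new value */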

Matt Ranostay (2):
  cap1106: Add support for various cap11xx devices
  cap1106: support for irq-active-high option

 .../devicetree/bindings/input/cap1106.txt  |  4 ++
 drivers/input/keyboard/cap1106.c   | 70 --
 2 files changed, 55 insertions(+), 19 deletions(-)

-- 
1.9.1



[PATCH v2 2/2] cap1106: support for irq-active-high option

2014-09-22 Thread Matt Ranostay
Some applications need to use the irq-active-high push-pull option.
This allows it to be enabled in the device tree child node.

Signed-off-by: Matt Ranostay 
---
 Documentation/devicetree/bindings/input/cap1106.txt | 4 
 drivers/input/keyboard/cap1106.c| 8 
 2 files changed, 12 insertions(+)

diff --git a/Documentation/devicetree/bindings/input/cap1106.txt 
b/Documentation/devicetree/bindings/input/cap1106.txt
index 4b46390..6f5a143 100644
--- a/Documentation/devicetree/bindings/input/cap1106.txt
+++ b/Documentation/devicetree/bindings/input/cap1106.txt
@@ -26,6 +26,10 @@ Optional properties:
Valid values are 1, 2, 4, and 8.
By default, a gain of 1 is set.
 
+   microchip,irq-active-high:  By default the interrupt pin is active low
+   open drain. This property allows using the active
+   high push-pull output.
+
linux,keycodes: Specifies an array of numeric keycode values to
be used for the channels. If this property is
omitted, KEY_A, KEY_B, etc are used as
diff --git a/drivers/input/keyboard/cap1106.c b/drivers/input/keyboard/cap1106.c
index 07f9e88..d5ce060 100644
--- a/drivers/input/keyboard/cap1106.c
+++ b/drivers/input/keyboard/cap1106.c
@@ -47,6 +47,7 @@
 #define CAP1106_REG_STANDBY_SENSITIVITY0x42
 #define CAP1106_REG_STANDBY_THRESH 0x43
 #define CAP1106_REG_CONFIG2 0x44
+#define CAP1106_REG_CONFIG2_ALT_POL BIT(6)
 #define CAP1106_REG_SENSOR_BASE_CNT(X) (0x50 + (X))
 #define CAP1106_REG_SENSOR_CALIB   (0xb1 + (X))
 #define CAP1106_REG_SENSOR_CALIB_LSB1  0xb9
@@ -260,6 +261,13 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client,
dev_err(dev, "Invalid sensor-gain value %d\n", gain32);
}
 
+   if (of_property_read_bool(node, "microchip,irq-active-high")) {
+   error = regmap_update_bits(priv->regmap, CAP1106_REG_CONFIG2,
+  CAP1106_REG_CONFIG2_ALT_POL, 0);
+   if (error)
+   return error;
+   }
+
/* Provide some useful defaults */
for (i = 0; i < priv->num_channels; i++)
priv->keycodes[i] = KEY_A + i;
-- 
1.9.1



RE: [PATCH] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition

2014-09-22 Thread Dexuan Cui
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Dexuan Cui
> Sent: Tuesday, September 23, 2014 2:02 AM
> To: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-
> de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com;
> jasow...@redhat.com
> Cc: KY Srinivasan; Haiyang Zhang
> Subject: [PATCH] Tools: hv: vssdaemon: ignore the EBUSY on multiple
> freezing the same partition
> 
> Signed-off-by: Dexuan Cui 
> Reviewed-by: K. Y. Srinivasan 
> ---
>  tools/hv/hv_vss_daemon.c | 21 +

Please use the v2 patch I sent out just now.
I added a "errno = 0;' before the ioctl() to fix some false warnings.

Thanks,
-- Dexuan


Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup

2014-09-22 Thread Juergen Gross

On 09/17/2014 04:59 PM, Juergen Gross wrote:

Direct Xen to place the initial P->M table outside of the initial
mapping, as otherwise the 1G (implementation) / 2G (theoretical)
restriction on the size of the initial mapping limits the amount
of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to
domain private memory and its initial virtual mapping is dropped,
the easiest way to avoid virtual address conflicts with other
addresses in the kernel is to use a user address area for the
virtual address of the initial P->M table. This allows us to just
throw away the page tables of the initial mapping after the copy
without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to USE
more than 512 GB of RAM. It just enables it to be started with a
P->M table covering more memory. This is especially important for
being able to boot a Dom0 on a system with more than 512 GB memory.

Signed-off-by: Juergen Gross 
Signed-off-by: Jan Beulich 


Any Acks/Naks?

Juergen


---
  arch/x86/xen/mmu.c  | 119 +---
  arch/x86/xen/setup.c|  65 ++
  arch/x86/xen/xen-head.S |   2 +
  3 files changed, 151 insertions(+), 35 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 16fb009..3bd403b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1198,6 +1198,78 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
 * instead of somewhere later and be confusing. */
xen_mc_flush();
  }
+
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+   void *vaddr = __va(paddr);
+   void *vaddr_end = vaddr + size;
+
+   for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+   make_lowmem_page_readwrite(vaddr);
+
+   memblock_free(paddr, size);
+}
+
+static void xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+   unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+   ClearPagePinned(virt_to_page(__va(pa)));
+   xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+   unsigned long va = vaddr & PMD_MASK;
+   unsigned long pa;
+   pgd_t *pgd = pgd_offset_k(va);
+   pud_t *pud_page = pud_offset(pgd, 0);
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+   unsigned int i;
+
+   set_pgd(pgd, __pgd(0));
+   do {
+   pud = pud_page + pud_index(va);
+   if (pud_none(*pud)) {
+   va += PUD_SIZE;
+   } else if (pud_large(*pud)) {
+   pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+   xen_free_ro_pages(pa, PUD_SIZE);
+   va += PUD_SIZE;
+   } else {
+   pmd = pmd_offset(pud, va);
+   if (pmd_large(*pmd)) {
+   pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+   xen_free_ro_pages(pa, PMD_SIZE);
+   } else if (!pmd_none(*pmd)) {
+   pte = pte_offset_kernel(pmd, va);
+   for (i = 0; i < PTRS_PER_PTE; ++i) {
+   if (pte_none(pte[i]))
+   break;
+   pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+   xen_free_ro_pages(pa, PAGE_SIZE);
+   }
+   xen_cleanmfnmap_free_pgtbl(pte);
+   }
+   va += PMD_SIZE;
+   if (pmd_index(va))
+   continue;
+   xen_cleanmfnmap_free_pgtbl(pmd);
+   }
+
+   } while (pud_index(va) || pmd_index(va));
+   xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
  static void __init xen_pagetable_p2m_copy(void)
  {
unsigned long size;
@@ -1217,18 +1289,23 @@ static void __init xen_pagetable_p2m_copy(void)
/* using __ka address and sticking INVALID_P2M_ENTRY! */
memset((void *)xen_start_info->mfn_list, 0xff, size);

-   /* We should be in __ka space. */
-   BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
addr = xen_start_info->mfn_list;
-   /* We roundup to the PMD, which means that if anybody at this stage is
+   /* We could be in __ka space.
+* We roundup to the PMD, which means that if anybody at this stage is
 * using the __ka address of xen_start_info or 
xen_start_info->shared_info
 * they are in going to crash. Fortunatly we have already revectored
 * in xen_setup_kernel_pagetable and in 

[PATCH v2] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition

2014-09-22 Thread Dexuan Cui
v2: I added "errno = 0;" in the ioctl()

Signed-off-by: Dexuan Cui 
Reviewed-by: K. Y. Srinivasan 
---
 tools/hv/hv_vss_daemon.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 6a213b8..c1af658 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -50,7 +50,35 @@ static int vss_do_freeze(char *dir, unsigned int cmd, char 
*fs_op)
 
if (fd < 0)
return 1;
+
+   /* A successful syscall doesn't set errno to 0. Without this line,
+* the below strerror(errno) can accidentally show the errno of the
+* previous failed syscall.
+*/
+   errno = 0;
+
ret = ioctl(fd, cmd, 0);
+
+   /*
+* If a partition is mounted more than once, only the first
+* FREEZE/THAW can succeed and the later ones will get
+* EBUSY/EINVAL respectively: there could be 2 cases:
+* 1) a user may mount the same partition to different directories
+*  by mistake or on purpose;
+* 2) The subvolume of btrfs appears to have the same partition
+* mounted more than once.
+*/
+   if (ret) {
+   if ((cmd == FIFREEZE && errno == EBUSY) ||
+   (cmd == FITHAW && errno == EINVAL)) {
+   syslog(LOG_INFO, "VSS: %s of %s: %s: ignored\n",
+   fs_op, dir,
+   errno == EBUSY ? "EBUSY" : "EINVAL");
+   close(fd);
+   return 0;
+   }
+   }
+
syslog(LOG_INFO, "VSS: %s of %s: %s\n", fs_op, dir, strerror(errno));
close(fd);
return !!ret;
-- 
1.9.1



From Daniel Klimowicz

2014-09-22 Thread Daniel Klimowicz


Dear Sir,

I am requesting for your help, to assist me in getting £42,000,000.00 to your 
account. please do indicate your interest for more information's.

REPLY  ( klimowi...@yahoo.com.hk )

Yours Truly,

From Daniel Klimowicz


Re: [PATCH 3/3] ARM: dts: add rk3288 power-domain node

2014-09-22 Thread Kever Yang


On 09/23/2014 10:55 AM, jinkun.hong wrote:

From: "jinkun.hong" 

Any summary for the rk3288 power controller?
Maybe you can say something about how the rk3288 TRM describes this module.

Signed-off-by: Jack Dai 
Signed-off-by: Wang Caesar 
Signed-off-by: jinkun.hong 

---

  arch/arm/boot/dts/rk3288.dtsi |   45 +
  1 file changed, 45 insertions(+)

diff --git a/arch/arm/boot/dts/rk3288.dtsi b/arch/arm/boot/dts/rk3288.dtsi
index 3bb5230..714b9d9 100644
--- a/arch/arm/boot/dts/rk3288.dtsi
+++ b/arch/arm/boot/dts/rk3288.dtsi
@@ -15,6 +15,7 @@
  #include 
  #include 
  #include 
+#include <dt-bindings/power-domain/rk3288.h>
  #include "skeleton.dtsi"
  
  / {

@@ -467,6 +468,50 @@
compatible = "rockchip,rk3288-pmu", "syscon";
reg = <0xff73 0x100>;
};
+   power: power-controller {
+   compatible = "rockchip,rk3288-power-controller";
+   #power-domain-cells = <1>;
+   rockchip,pmu = <&pmu>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   pd_gpu {
+   reg = <RK3288_PD_GPU>;
+   clocks = <&cru ACLK_GPU>;
+   };
+
+   pd_vio {
+   reg = <RK3288_PD_VIO>;
+   clocks = <&cru HCLK_RGA>, <&cru HCLK_VOP0>,
+   <&cru HCLK_VOP1>, <&cru HCLK_VIO_AHB_ARBI>,
+   <&cru HCLK_VIO_NIU>, <&cru HCLK_VIP>,
+   <&cru HCLK_IEP>, <&cru HCLK_ISP>,
+   <&cru HCLK_VIO2_H2P>, <&cru PCLK_MIPI_DSI0>,
+   <&cru PCLK_MIPI_DSI1>, <&cru PCLK_MIPI_CSI>,
+   <&cru PCLK_LVDS_PHY>, <&cru PCLK_EDP_CTRL>,
+   <&cru PCLK_HDMI_CTRL>, <&cru PCLK_VIO2_H2P>,
+   <&cru ACLK_VOP0>, <&cru ACLK_IEP>,
+   <&cru ACLK_VIO0_NIU>, <&cru ACLK_VIP>,
+   <&cru ACLK_VOP1>, <&cru ACLK_ISP>,
+   <&cru ACLK_VIO1_NIU>, <&cru ACLK_RGA>,
+   <&cru ACLK_RGA_NIU>, <&cru SCLK_RGA>,
+   <&cru DCLK_VOP0>, <&cru DCLK_VOP1>,
+   <&cru SCLK_EDP_24M>, <&cru SCLK_EDP>,
+   <&cru SCLK_ISP>, <&cru SCLK_ISP_JPE>,
+   <&cru SCLK_HDMI_HDCP>, <&cru SCLK_HDMI_CEC>;
+   };

Some of the clock IDs here are not upstream or on the list yet; I will send
a patch including all these clock IDs later. Maybe you should mention that
in your commit message?

+
+   pd_video {
+   reg = <RK3288_PD_VIDEO>;
+   /* FIXME: add clocks */
+   };
+
+   pd_hevc {
+   reg = <RK3288_PD_HEVC>;
+   clocks = <&cru ACLK_HEVC>, <&cru HCLK_HEVC>,
+   <&cru SCLK_HEVC_CABAC>, <&cru SCLK_HEVC_CORE>;
+   };
+   };
  
  	sgrf: syscon@ff74 {

compatible = "rockchip,rk3288-sgrf", "syscon";




RE: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() with remap_pfn_range().

2014-09-22 Thread chen.f...@freescale.com
Hans,
Do you have any more comments on this patch?

Best regards,
Fancy Fang

-Original Message-
From: Fang Chen-B47543 
Sent: Wednesday, September 10, 2014 3:29 PM
To: 'Hans Verkuil'; m.che...@samsung.com; v...@zeniv.linux.org.uk
Cc: Guo Shawn-R65073; linux-me...@vger.kernel.org; 
linux-kernel@vger.kernel.org; Marek Szyprowski
Subject: RE: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() 
with remap_pfn_range().

On the Freescale imx6 platform, which belongs to the ARM architecture. The
driver is our local v4l2 output driver, which unfortunately is not upstream
yet.

Best regards,
Fancy Fang

-Original Message-
From: Hans Verkuil [mailto:hverk...@xs4all.nl]
Sent: Wednesday, September 10, 2014 3:21 PM
To: Fang Chen-B47543; m.che...@samsung.com; v...@zeniv.linux.org.uk
Cc: Guo Shawn-R65073; linux-me...@vger.kernel.org; 
linux-kernel@vger.kernel.org; Marek Szyprowski
Subject: Re: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() 
with remap_pfn_range().

On 09/10/14 09:14, chen.f...@freescale.com wrote:
> It is not a theoretical issue, it is a real case: the mapping failure
> happens in the 3.14.y kernel but not in the previous 3.10.y kernel.
> So I need your confirmation on it.

With which driver does this happen? On which architecture?

Regards,

Hans

> 
> Thanks.
> 
> Best regards,
> Fancy Fang
> 
> -Original Message-
> From: Hans Verkuil [mailto:hverk...@xs4all.nl]
> Sent: Wednesday, September 10, 2014 3:01 PM
> To: Fang Chen-B47543; m.che...@samsung.com; v...@zeniv.linux.org.uk
> Cc: Guo Shawn-R65073; linux-me...@vger.kernel.org; 
> linux-kernel@vger.kernel.org; Marek Szyprowski
> Subject: Re: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() 
> with remap_pfn_range().
> 
> On 09/10/14 07:28, Fancy Fang wrote:
>> When user requests V4L2_MEMORY_MMAP type buffers, the videobuf-core 
>> will assign the corresponding offset to the 'boff' field of the 
>> videobuf_buffer for each requested buffer sequentially. Later, user 
>> may call mmap() to map one or all of the buffers with the 'offset'
>> parameter which is equal to its 'boff' value. Obviously, the 'offset'
>> value is only used to find the matching buffer, not to be the real
>> offset from the buffer's physical start address as assumed by
>> vm_iomap_memory(). So, in the case where the offset is not zero,
>> vm_iomap_memory() will fail.
> 
> Is this just a fix for something that can fail theoretically, or do you 
> actually have a case where this happens? I am very reluctant to make any 
> changes to videobuf. Drivers should all migrate to vb2.
> 
> I have CC-ed Marek as well since he knows a lot more about this stuff than I 
> do.
> 
> Regards,
> 
>   Hans
> 
>>
>> Signed-off-by: Fancy Fang 
>> ---
>>  drivers/media/v4l2-core/videobuf-dma-contig.c | 4 +++-
>>  1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c
>> b/drivers/media/v4l2-core/videobuf-dma-contig.c
>> index bf80f0f..8bd9889 100644
>> --- a/drivers/media/v4l2-core/videobuf-dma-contig.c
>> +++ b/drivers/media/v4l2-core/videobuf-dma-contig.c
>> @@ -305,7 +305,9 @@ static int __videobuf_mmap_mapper(struct videobuf_queue 
>> *q,
>>  /* Try to remap memory */
>>  size = vma->vm_end - vma->vm_start;
>>  vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>> -retval = vm_iomap_memory(vma, mem->dma_handle, size);
>> +retval = remap_pfn_range(vma, vma->vm_start,
>> + mem->dma_handle >> PAGE_SHIFT,
>> + size, vma->vm_page_prot);
>>  if (retval) {
>>  dev_err(q->dev, "mmap: remap failed with error %d. ",
>>  retval);
>>
> 



[PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size

2014-09-22 Thread Alexey Kardashevskiy
This checks that the TCE table page size is not bigger than the size of
the page we just pinned and whose physical address we are going to put
into the table.

Otherwise the hardware gets unwanted access to the physical memory between
the end of the actual page and the end of the aligned-up TCE page (e.g. a
64K TCE entry backed by a 4K page would expose the adjacent 60K).

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a10642a..b378f78 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1059,16 +1060,37 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, 
unsigned long entry,
tce, entry << tbl->it_page_shift, ret); */
return -EFAULT;
}
+
+   /*
+* Check that the TCE table granularity is not bigger than the size of
+* a page we just found. Otherwise the hardware can get access to
+* a bigger memory chunk that it should.
+*/
+   if (PageHuge(page)) {
+   struct page *head = compound_head(page);
+   long shift = PAGE_SHIFT + compound_order(head);
+
+   if (shift < tbl->it_page_shift) {
+   ret = -EINVAL;
+   goto put_page_exit;
+   }
+
+   }
+
hwaddr = (unsigned long) page_address(page) + offset;
 
ret = iommu_tce_build(tbl, entry, hwaddr, direction);
if (ret)
-   put_page(page);
+   goto put_page_exit;
 
-   if (ret < 0)
-   pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+   return 0;
+
+put_page_exit:
+   pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
__func__, entry << tbl->it_page_shift, tce, ret);
 
+   put_page(page);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
-- 
2.0.0



[PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback

2014-09-22 Thread Alexey Kardashevskiy
At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via
container_of(tbl). Since we are going to have to add Dynamic DMA windows
and that means having 2 IOMMU tables per PE, this is not going to work.

This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback.

This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer
to PE there. PNV's ppc_md.tce_build() call uses this to find PE and
do the invalidation. This will be used later for Dynamic DMA windows too.

This registers invalidate() callbacks for IODA1 and IODA2:
- pnv_pci_ioda1_tce_invalidate;
- pnv_pci_ioda2_tce_invalidate.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* changed commit log to explain why this change is needed
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 35 ---
 arch/powerpc/platforms/powernv/pci.c  | 31 ---
 arch/powerpc/platforms/powernv/pci.h  | 13 +++-
 3 files changed, 48 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index df241b1..136e765 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -857,7 +857,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, 
struct pci_dev *pdev
 
pe = &phb->ioda.pe_array[pdn->pe_number];
WARN_ON(get_dma_ops(&pdev->dev) != &dma_iommu_ops);
-   set_iommu_table_base_and_group(&pdev->dev, &pe->tce32_table);
+   set_iommu_table_base_and_group(&pdev->dev, &pe->tce32.table);
 }
 
 static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
@@ -884,7 +884,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb,
} else {
dev_info(&pdev->dev, "Using 32-bit DMA via iommu\n");
set_dma_ops(&pdev->dev, &dma_iommu_ops);
-   set_iommu_table_base(&pdev->dev, &pe->tce32_table);
+   set_iommu_table_base(&pdev->dev, &pe->tce32.table);
}
*pdev->dev.dma_mask = dma_mask;
return 0;
@@ -899,9 +899,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe,
list_for_each_entry(dev, &bus->devices, bus_list) {
if (add_to_iommu_group)
set_iommu_table_base_and_group(&dev->dev,
-  &pe->tce32_table);
+  &pe->tce32.table);
else
-   set_iommu_table_base(&dev->dev, &pe->tce32_table);
+   set_iommu_table_base(&dev->dev, &pe->tce32.table);
 
if (dev->subordinate)
pnv_ioda_setup_bus_dma(pe, dev->subordinate,
@@ -988,19 +988,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct 
pnv_ioda_pe *pe,
}
 }
 
-void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl,
-__be64 *startp, __be64 *endp, bool rm)
-{
-   struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
-   struct pnv_phb *phb = pe->phb;
-
-   if (phb->type == PNV_PHB_IODA1)
-   pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm);
-   else
-   pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm);
-}
-
 static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
  struct pnv_ioda_pe *pe, unsigned int base,
  unsigned int segs)
@@ -1058,9 +1045,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
}
 
/* Setup linux iommu table */
-   tbl = &pe->tce32_table;
+   tbl = &pe->tce32.table;
pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs,
  base << 28, IOMMU_PAGE_SHIFT_4K);
+   pe->tce32.pe = pe;
+   pe->tce32.invalidate_fn = pnv_pci_ioda1_tce_invalidate;
 
/* OPAL variant of P7IOC SW invalidated TCEs */
swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL);
@@ -1097,7 +1086,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb,
 static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
 {
struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32_table);
+ tce32.table);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
 
@@ -1142,10 +1131,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
pnv_phb *phb,
pe->tce_bypass_base = 1ull << 59;
 
/* Install set_bypass callback for VFIO */
-   pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass;
+   pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
 
/* Enable bypass by default */
-   pnv_pci_ioda2_set_bypass(&pe->tce32_table, true);
+   pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
 }
 
 static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
@@ -1193,9 +1182,11 @@ 

[PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops

2014-09-22 Thread Alexey Kardashevskiy
Modern IBM POWERPC systems support multiple IOMMU tables per PE
so we need a more reliable way (compared to container_of()) to get
a PE pointer from the iommu_table struct pointer used in IOMMU functions.

At the moment IOMMU group data points to an iommu_table struct. This
introduces a spapr_tce_iommu_group struct which keeps an iommu_owner
and a spapr_tce_iommu_ops struct. For IODA, iommu_owner is a pointer to
the pnv_ioda_pe struct, for others it is still a pointer to
the iommu_table struct. The ops structs correspond to the type which
iommu_owner points to.

This defines a get_table() callback which returns an iommu_table
by its number.

As the IOMMU group data pointer points to variable type instead of
iommu_table, VFIO SPAPR TCE driver is updated to use the new type.
This changes the tce_container struct to store iommu_group instead of
iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via
iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
- iommu_table;
- pnv_ioda_pe;

This uses pnv_ioda1_iommu_get_table for both IODA1 and IODA2, but IODA2
will get its own pnv_ioda2_iommu_get_table soon, after which
pnv_ioda1_iommu_get_table will only be used for IODA1.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h|   6 ++
 arch/powerpc/include/asm/tce.h  |  13 +++
 arch/powerpc/kernel/iommu.c |  35 ++-
 arch/powerpc/platforms/powernv/pci-ioda.c   |  31 +-
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   1 +
 arch/powerpc/platforms/powernv/pci.c|   2 +-
 arch/powerpc/platforms/pseries/iommu.c  |  10 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 148 ++--
 8 files changed, 208 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 42632c7..84ee339 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, 
const char *node_name);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
int nid);
+
+struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
 extern void iommu_register_group(struct iommu_table *tbl,
+void *iommu_owner,
+struct spapr_tce_iommu_ops *ops,
 int pci_domain_number, unsigned long pe_num);
 extern int iommu_add_device(struct device *dev);
 extern void iommu_del_device(struct device *dev);
 #else
 static inline void iommu_register_group(struct iommu_table *tbl,
+   void *iommu_owner,
+   struct spapr_tce_iommu_ops *ops,
int pci_domain_number,
unsigned long pe_num)
 {
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 743f36b..9f159eb 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -50,5 +50,18 @@
 #define TCE_PCI_READ   0x1 /* read from PCI allowed */
 #define TCE_VB_WRITE   0x1 /* write from VB allowed */
 
+struct spapr_tce_iommu_group;
+
+struct spapr_tce_iommu_ops {
+   struct iommu_table *(*get_table)(
+   struct spapr_tce_iommu_group *data,
+   int num);
+};
+
+struct spapr_tce_iommu_group {
+   void *iommu_owner;
+   struct spapr_tce_iommu_ops *ops;
+};
+
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_TCE_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index b378f78..1c5dae7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t 
size,
  */
 static void group_release(void *iommu_data)
 {
-   struct iommu_table *tbl = iommu_data;
-   tbl->it_group = NULL;
+   kfree(iommu_data);
 }
 
+static struct iommu_table *spapr_tce_default_get_table(
+   struct spapr_tce_iommu_group *data, int num)
+{
+   struct iommu_table *tbl = data->iommu_owner;
+
+   switch (num) {
+   case 0:
+   if (tbl->it_size)
+   return tbl;
+   /* fallthru */
+   default:
+   return NULL;
+   }
+}
+
+static struct spapr_tce_iommu_ops spapr_tce_default_ops = {
+   .get_table = spapr_tce_default_get_table
+};
+
 void iommu_register_group(struct iommu_table *tbl,
+   void *iommu_owner, struct spapr_tce_iommu_ops *ops,
int pci_domain_number, unsigned long pe_num)
 {

[PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()

2014-09-22 Thread Alexey Kardashevskiy
At the moment the iommu_table struct has a set_bypass() callback which
enables/disables DMA bypass on an IODA2 PHB. This is exposed to the POWERPC
IOMMU code, which calls this callback when external IOMMU users such as
VFIO are about to take over a PHB.

Since set_bypass() is not really an iommu_table function but a PE
function, and we have an ops struct per IOMMU owner, let's move
set_bypass() to the spapr_tce_iommu_ops struct.

As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
has very little to do with PEs, this moves take_ownership() calls to
the VFIO SPAPR TCE driver.

This renames set_bypass() to take_ownership(): it does not necessarily
just enable bypassing and may do something else or more, so a generic name
fits better. The bool parameter is inverted.
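
For illustration, the VFIO side now reaches the PE through the group data
rather than the table; a minimal sketch (mirroring the
tce_iommu_take_ownership_notify() helper this series adds on the VFIO side):

static void notify_take_ownership(struct spapr_tce_iommu_group *data,
		bool enable)
{
	if (data && data->ops && data->ops->take_ownership)
		data->ops->take_ownership(data, enable);
}

That is, take_ownership(data, true) corresponds to the old
set_bypass(tbl, false).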

Signed-off-by: Alexey Kardashevskiy 
Reviewed-by: Gavin Shan 
---
 arch/powerpc/include/asm/iommu.h  |  1 -
 arch/powerpc/include/asm/tce.h|  2 ++
 arch/powerpc/kernel/iommu.c   | 12 
 arch/powerpc/platforms/powernv/pci-ioda.c | 20 
 drivers/vfio/vfio_iommu_spapr_tce.c   | 16 
 5 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 84ee339..2b0b01d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -77,7 +77,6 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
 #endif
-   void (*set_bypass)(struct iommu_table *tbl, bool enable);
 };
 
 /* Pure 2^n version of get_order */
diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index 9f159eb..e6355f9 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops {
struct iommu_table *(*get_table)(
struct spapr_tce_iommu_group *data,
int num);
+   void (*take_ownership)(struct spapr_tce_iommu_group *data,
+   bool enable);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 1c5dae7..c2c8d9d 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
memset(tbl->it_map, 0xff, sz);
iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
 
-   /*
-* Disable iommu bypass, otherwise the user can DMA to all of
-* our physical memory via the bypass window instead of just
-* the pages that has been explicitly mapped into the iommu
-*/
-   if (tbl->set_bypass)
-   tbl->set_bypass(tbl, false);
-
return 0;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
@@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
-
-   /* The kernel owns the device now, we can restore the iommu bypass */
-   if (tbl->set_bypass)
-   tbl->set_bypass(tbl, true);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2d32a1c..8cb2f31 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
*phb,
__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
 }
 
-static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
+static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
 {
-   struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
- tce32.table);
uint16_t window_id = (pe->pe_number << 1 ) + 1;
int64_t rc;
 
@@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table 
*tbl, bool enable)
 * host side.
 */
if (pe->pdev)
-   set_iommu_table_base(&pe->pdev->dev, tbl);
+   set_iommu_table_base(&pe->pdev->dev, &pe->tce32.table);
else
pnv_ioda_setup_bus_dma(pe, pe->pbus, false);
}
@@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
pnv_phb *phb,
/* TVE #1 is selected by PCI address bit 59 */
pe->tce_bypass_base = 1ull << 59;
 
-   /* Install set_bypass callback for VFIO */
-   pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
-
/* Enable bypass by default */
-   pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
+   pnv_pci_ioda2_set_bypass(pe, true);
+}
+
+static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
+bool enable)
+{
+   struct 

[PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA

2014-09-22 Thread Alexey Kardashevskiy
SPAPR defines an interface to create additional DMA windows dynamically.
"Dynamically" means that the window is not allocated before the guest
even started, the guest can request it later. In practice, existing linux
guests check for the capability and if it is there, they create and map
a DMA window as big as the entire guest RAM.

This adds 4 callbacks to the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns number/size of windows
which can be created (one, any page size);

2. create - ibm,create-pe-dma-window - creates a window;

3. remove - ibm,remove-pe-dma-window - removes a window; removing
the default 32bit window is not allowed by this patch; this will be added
later if needed;

4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration
to the default state; as the default window cannot be removed, it only
removes the additional window if one was created.

The next patch will add corresponding ioctls to VFIO SPAPR TCE driver to
provide necessary support to the userspace.
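
To illustrate how the four callbacks chain together, a rough in-kernel
sketch (hypothetical caller, error handling trimmed; 24 is the page shift
for 16MB TCEs):

static long ddw_example(struct spapr_tce_iommu_group *data)
{
	struct iommu_table *tbl;
	__u32 cur, avail, pgmask;
	long ret;

	ret = data->ops->query(data, &cur, &avail, &pgmask);
	if (ret)
		return ret;
	if (!avail || !(pgmask & DDW_PGSIZE_16M))
		return -ENOSPC;

	/* a 16MB-page window spanning 1<<40 bytes */
	ret = data->ops->create(data, 24, 40, &tbl);
	if (ret)
		return ret;

	/* ... map TCEs, do DMA ... */

	return data->ops->remove(data, tbl);
}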

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/tce.h|  22 +
 arch/powerpc/platforms/powernv/pci-ioda.c | 159 +-
 arch/powerpc/platforms/powernv/pci.h  |   1 +
 3 files changed, 181 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index e6355f9..23b0362 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -58,6 +58,28 @@ struct spapr_tce_iommu_ops {
int num);
void (*take_ownership)(struct spapr_tce_iommu_group *data,
bool enable);
+
+   /* Dynamic DMA window */
+   /* Page size flags for ibm,query-pe-dma-window */
+#define DDW_PGSIZE_4K   0x01
+#define DDW_PGSIZE_64K  0x02
+#define DDW_PGSIZE_16M  0x04
+#define DDW_PGSIZE_32M  0x08
+#define DDW_PGSIZE_64M  0x10
+#define DDW_PGSIZE_128M 0x20
+#define DDW_PGSIZE_256M 0x40
+#define DDW_PGSIZE_16G  0x80
+   long (*query)(struct spapr_tce_iommu_group *data,
+   __u32 *current_windows,
+   __u32 *windows_available,
+   __u32 *page_size_mask);
+   long (*create)(struct spapr_tce_iommu_group *data,
+   __u32 page_shift,
+   __u32 window_shift,
+   struct iommu_table **ptbl);
+   long (*remove)(struct spapr_tce_iommu_group *data,
+   struct iommu_table *tbl);
+   long (*reset)(struct spapr_tce_iommu_group *data);
 };
 
 struct spapr_tce_iommu_group {
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 296f49b..a6318cb 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1154,6 +1154,26 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb 
*phb,
pnv_pci_ioda2_set_bypass(pe, true);
 }
 
+static struct iommu_table *pnv_ioda2_iommu_get_table(
+   struct spapr_tce_iommu_group *data,
+   int num)
+{
+   struct pnv_ioda_pe *pe = data->iommu_owner;
+
+   switch (num) {
+   case 0:
+   if (pe->tce32.table.it_size)
+   return &pe->tce32.table;
+   return NULL;
+   case 1:
+   if (pe->tce64.table.it_size)
+   return &pe->tce64.table;
+   return NULL;
+   default:
+   return NULL;
+   }
+}
+
 static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
 bool enable)
 {
@@ -1162,9 +1182,146 @@ static void pnv_ioda2_take_ownership(struct 
spapr_tce_iommu_group *data,
pnv_pci_ioda2_set_bypass(pe, !enable);
 }
 
+static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data,
+   __u32 *current_windows,
+   __u32 *windows_available, __u32 *page_size_mask)
+{
+   struct pnv_ioda_pe *pe = data->iommu_owner;
+
+   *windows_available = 2;
+   *current_windows = 0;
+   if (pe->tce32.table.it_size) {
+   --*windows_available;
+   ++*current_windows;
+   }
+   if (pe->tce64.table.it_size) {
+   --*windows_available;
+   ++*current_windows;
+   }
+   *page_size_mask =
+   DDW_PGSIZE_4K |
+   DDW_PGSIZE_64K |
+   DDW_PGSIZE_16M;
+
+   return 0;
+}
+
+static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data,
+   __u32 page_shift, __u32 window_shift,
+   struct iommu_table **ptbl)
+{
+   struct pnv_ioda_pe *pe = data->iommu_owner;
+   struct pnv_phb *phb = pe->phb;
+   struct page *tce_mem = NULL;
+   void *addr;
+   long ret;
+   unsigned long tce_table_size =
+   (1ULL << (window_shift - page_shift)) * 8;
+   unsigned order;
+   

[PATCH] Fix the issue where lowmemorykiller falls into a cycle trying to kill a task

2014-09-22 Thread Hui Zhu
The cause of this issue is that when free memory is low and a lot of tasks are
trying to shrink memory, the task that is killed by lowmemorykiller cannot get
the CPU to exit itself.

Fix this issue by changing the scheduling policy to SCHED_FIFO in
lowmemorykiller if a task has the TIF_MEMDIE flag set.

Signed-off-by: Hui Zhu 
---
 drivers/staging/android/lowmemorykiller.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/staging/android/lowmemorykiller.c 
b/drivers/staging/android/lowmemorykiller.c
index b545d3d..ca1ffac 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -129,6 +129,10 @@ static unsigned long lowmem_scan(struct shrinker *s, 
struct shrink_control *sc)
 
if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
time_before_eq(jiffies, lowmem_deathpending_timeout)) {
+   struct sched_param param = { .sched_priority = 1 };
+
+   if (p->policy == SCHED_NORMAL)
+   sched_setscheduler(p, SCHED_FIFO, &param);
task_unlock(p);
rcu_read_unlock();
return 0;
-- 
1.9.1



[PATCH v2 07/13] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE

2014-09-22 Thread Alexey Kardashevskiy
Normally a bitmap in the iommu_table is used to track which TCE entries
are in use. Since we are going to use the iommu_table without its locks and
do xchg() instead, it becomes essential not to set bits which are not
implied by the direction flag.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/powernv/pci.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index deddcad..ab79e2d 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -628,10 +628,18 @@ static int pnv_tce_build(struct iommu_table *tbl, long 
index, long npages,
__be64 *tcep, *tces;
u64 rpn;
 
-   proto_tce = TCE_PCI_READ; // Read allowed
-
-   if (direction != DMA_TO_DEVICE)
-   proto_tce |= TCE_PCI_WRITE;
+   switch (direction) {
+   case DMA_BIDIRECTIONAL:
+   case DMA_FROM_DEVICE:
+   proto_tce = TCE_PCI_READ | TCE_PCI_WRITE;
+   break;
+   case DMA_TO_DEVICE:
+   proto_tce = TCE_PCI_READ;
+   break;
+   default:
+   proto_tce = 0;
+   break;
+   }
 
tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset;
rpn = __pa(uaddr) >> tbl->it_page_shift;
-- 
2.0.0



[PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows

2014-09-22 Thread Alexey Kardashevskiy
This defines and implements a VFIO IOMMU API which lets userspace
create and remove DMA windows.

This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of
available windows and page mask.

This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE
to allow the user space to create and remove window(s).

The VFIO IOMMU driver does basic sanity checks and calls corresponding
SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge)
implements them.

This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via
VFIO_IOMMU_SPAPR_TCE_GET_INFO.

This calls platform DDW reset() callback when IOMMU is being disabled
to reset the DMA configuration to its original state.
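
For illustration, a hedged userspace sketch of the new ioctl; the field
names follow the kernel-side checks in this patch (argsz/flags and the
start_addr the kernel fills in) plus the page_shift/window_shift arguments
of the platform create() callback, and may not match the final uapi exactly:

#include <sys/ioctl.h>
#include <linux/vfio.h>

static int create_huge_window(int container_fd)
{
	struct vfio_iommu_spapr_tce_create create = {
		.argsz = sizeof(create),
		.flags = 0,
		.page_shift = 16,	/* 64K TCEs */
		.window_shift = 40,	/* 1TB window */
	};

	if (ioctl(container_fd, VFIO_IOMMU_SPAPR_TCE_CREATE, &create))
		return -1;

	/* on success the kernel is expected to fill create.start_addr */
	return 0;
}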

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++--
 include/uapi/linux/vfio.h   |  25 ++-
 2 files changed, 153 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 0dccbc4..b518891 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container 
*container)
 
container->enabled = false;
 
-   if (!container->grp || !current->mm)
+   if (!container->grp)
return;
 
data = iommu_group_get_iommudata(container->grp);
if (!data || !data->iommu_owner || !data->ops->get_table)
return;
 
-   tbl = data->ops->get_table(data, 0);
-   if (!tbl)
-   return;
+   if (current->mm) {
+   tbl = data->ops->get_table(data, 0);
+   if (tbl)
+   decrement_locked_vm(tbl);
 
-   decrement_locked_vm(tbl);
+   tbl = data->ops->get_table(data, 1);
+   if (tbl)
+   decrement_locked_vm(tbl);
+   }
+
+   if (data->ops->reset)
+   data->ops->reset(data);
 }
 
 static void *tce_iommu_open(unsigned long arg)
@@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data,
 unsigned int cmd, unsigned long arg)
 {
struct tce_container *container = iommu_data;
-   unsigned long minsz;
+   unsigned long minsz, ddwsz;
long ret;
 
switch (cmd) {
@@ -288,6 +295,28 @@ static long tce_iommu_ioctl(void *iommu_data,
info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
 
+   ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info,
+   page_size_mask);
+
+   if (info.argsz == ddwsz) {
+   if (data->ops->query && data->ops->create &&
+   data->ops->remove) {
+   info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW;
+
+   ret = data->ops->query(data,
+   &info.current_windows,
+   &info.windows_available,
+   &info.page_size_mask);
+   if (ret)
+   return ret;
+   } else {
+   info.current_windows = 0;
+   info.windows_available = 0;
+   info.page_size_mask = 0;
+   }
+   minsz = ddwsz;
+   }
+
if (copy_to_user((void __user *)arg, &info, minsz))
return -EFAULT;
 
@@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data,
tce_iommu_disable(container);
mutex_unlock(&container->lock);
return 0;
+
case VFIO_EEH_PE_OP:
if (!container->grp)
return -ENODEV;
 
return vfio_spapr_iommu_eeh_ioctl(container->grp,
  cmd, arg);
+
+   case VFIO_IOMMU_SPAPR_TCE_CREATE: {
+   struct vfio_iommu_spapr_tce_create create;
+   struct spapr_tce_iommu_group *data;
+   struct iommu_table *tbl;
+
+   if (WARN_ON(!container->grp))
+   return -ENXIO;
+
+   data = iommu_group_get_iommudata(container->grp);
+
+   minsz = offsetofend(struct vfio_iommu_spapr_tce_create,
+   start_addr);
+
+   if (copy_from_user(&create, (void __user *)arg, minsz))
+   return -EFAULT;
+
+   if (create.argsz < minsz)
+   return -EINVAL;
+
+   if (create.flags)
+   return -EINVAL;
+
+   if (!data->ops->create || !data->iommu_owner)
+   return -ENOSYS;
+
+   BUG_ON(!data || !data->ops || !data->ops->remove);
+
+   ret = 

[PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO

2014-09-22 Thread Alexey Kardashevskiy
The previous patch introduced the iommu_table_ops::exchange() callback
which effectively disabled VFIO on pseries. This implements exchange()
for pseries/lpar so VFIO can work in nested guests.

Since the exchange() callback returns the old TCE, it has to call H_GET_TCE
for every TCE being put into the table, so VFIO performance in guests
running under PR KVM is expected to be slower than in guests running under
HV KVM or on bare metal hosts.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9a7364f..ae15b5a 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, 
long, long);
 
 static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
long npages, unsigned long uaddr,
+   unsigned long *old_tces,
enum dma_data_direction direction,
struct dma_attrs *attrs)
 {
u64 rc = 0;
u64 proto_tce, tce;
u64 rpn;
-   int ret = 0;
+   int ret = 0, i = 0;
long tcenum_start = tcenum, npages_start = npages;
 
rpn = __pa(uaddr) >> TCE_SHIFT;
@@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, 
long tcenum,
 
while (npages--) {
tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
+   if (old_tces)
+   plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12,
+   &old_tces[i++]);
rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
 
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
@@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, 
long tcenum,
 
 static DEFINE_PER_CPU(__be64 *, tce_page);
 
-static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum,
 long npages, unsigned long uaddr,
+unsigned long *old_tces,
 enum dma_data_direction direction,
 struct dma_attrs *attrs)
 {
@@ -195,6 +200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+  old_tces,
   direction, attrs);
}
 
@@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
+   old_tces,
direction, attrs);
}
__get_cpu_var(tce_page) = tcep;
@@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
for (l = 0; l < limit; l++) {
tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) 
<< TCE_RPN_SHIFT);
rpn++;
+   if (old_tces)
+   plpar_tce_get((u64)tbl->it_index,
+   (u64)(tcenum + l) << 12,
+   &old_tces[tcenum + l]);
}
 
rc = plpar_tce_put_indirect((u64)tbl->it_index,
@@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
return ret;
 }
 
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+long npages, unsigned long uaddr,
+enum dma_data_direction direction,
+struct dma_attrs *attrs)
+{
+   return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL,
+   direction, attrs);
+}
+
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long 
npages)
 {
u64 rc;
@@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus)
 
 struct iommu_table_ops iommu_table_lpar_multi_ops = {
.set = tce_buildmulti_pSeriesLP,
+   .exchange = tce_xchg_pSeriesLP,
.clear = tce_freemulti_pSeriesLP,
.get = tce_get_pSeriesLP
 };
-- 
2.0.0


[PATCH v2 08/13] powerpc/powernv: Release replaced TCE

2014-09-22 Thread Alexey Kardashevskiy
At the moment writing a new TCE value to the IOMMU table fails with EBUSY
if there is a valid entry already. However, the PAPR specification allows
the guest to write a new TCE value without clearing the old one first.

Another problem this patch addresses is the use of pool locks for
external IOMMU users such as VFIO. The pool locks are there to protect the
DMA page allocator rather than the entries, and since the host kernel does
not control which pages are in use, there is no point in the pool locks;
exchange()+put_page(oldtce) is sufficient to avoid possible races.

This adds an exchange() callback to iommu_table_ops which does the same
thing as set() plus it returns replaced TCE(s) so the caller can release
the pages afterwards.

This makes iommu_tce_build() put pages returned by exchange().

This replaces iommu_clear_tce() with iommu_tce_build(), which can now
call exchange() with TCE==NULL (i.e. clear).

This preserves permission bits in TCE in iommu_put_tce_user_mode().

This removes the use of pool locks for external IOMMU users.

This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not
implement the exchange() callback. Therefore the "powernv" platform is
the only supported one after this patch.
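
The caller pattern this enables, roughly (a simplified fragment of the
iommu_tce_build() rework below; error handling trimmed, and the __pa()
mirrors the v2 change note above):

	unsigned long oldtce = 0;
	int ret;

	ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, &oldtce,
			direction, NULL);
	if (!ret && (oldtce & (TCE_PCI_READ | TCE_PCI_WRITE))) {
		struct page *page = pfn_to_page(__pa(oldtce) >> PAGE_SHIFT);

		if (oldtce & TCE_PCI_WRITE)
			SetPageDirty(page);
		put_page(page);
	}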

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* added missing __pa() for TCE which was read from the table

---
 arch/powerpc/include/asm/iommu.h |  8 +++--
 arch/powerpc/kernel/iommu.c  | 62 
 arch/powerpc/platforms/powernv/pci.c | 40 +++
 3 files changed, 67 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index c725e4a..8e0537d 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -49,6 +49,12 @@ struct iommu_table_ops {
unsigned long uaddr,
enum dma_data_direction direction,
struct dma_attrs *attrs);
+   int (*exchange)(struct iommu_table *tbl,
+   long index, long npages,
+   unsigned long uaddr,
+   unsigned long *old_tces,
+   enum dma_data_direction direction,
+   struct dma_attrs *attrs);
void (*clear)(struct iommu_table *tbl,
long index, long npages);
unsigned long (*get)(struct iommu_table *tbl, long index);
@@ -209,8 +215,6 @@ extern int iommu_tce_put_param_check(struct iommu_table 
*tbl,
unsigned long ioba, unsigned long tce);
 extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction);
-extern unsigned long iommu_clear_tce(struct iommu_table *tbl,
-   unsigned long entry);
 extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages);
 extern int iommu_put_tce_user_mode(struct iommu_table *tbl,
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 678fee8..39ccce7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1006,43 +1006,11 @@ int iommu_tce_put_param_check(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_put_param_check);
 
-unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry)
-{
-   unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
-
-   spin_lock(&(pool->lock));
-
-   oldtce = tbl->it_ops->get(tbl, entry);
-   if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ))
-   tbl->it_ops->clear(tbl, entry, 1);
-   else
-   oldtce = 0;
-
-   spin_unlock(&(pool->lock));
-
-   return oldtce;
-}
-EXPORT_SYMBOL_GPL(iommu_clear_tce);
-
 int iommu_clear_tces_and_put_pages(struct iommu_table *tbl,
unsigned long entry, unsigned long pages)
 {
-   unsigned long oldtce;
-   struct page *page;
-
for ( ; pages; --pages, ++entry) {
-   oldtce = iommu_clear_tce(tbl, entry);
-   if (!oldtce)
-   continue;
-
-   page = pfn_to_page(oldtce >> PAGE_SHIFT);
-   WARN_ON(!page);
-   if (page) {
-   if (oldtce & TCE_PCI_WRITE)
-   SetPageDirty(page);
-   put_page(page);
-   }
+   iommu_tce_build(tbl, entry, 0, DMA_NONE);
}
 
return 0;
@@ -1056,18 +1024,19 @@ EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages);
 int iommu_tce_build(struct iommu_table *tbl, unsigned long entry,
unsigned long hwaddr, enum dma_data_direction direction)
 {
-   int ret = -EBUSY;
+   int ret;
unsigned long oldtce;
-   struct iommu_pool *pool = get_pool(tbl, entry);
 
-   spin_lock(&(pool->lock));
+   ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, &oldtce,
+   direction, NULL);
 
-  

[PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers

2014-09-22 Thread Alexey Kardashevskiy
This moves the locked pages accounting to helpers.
Later they will be reused for Dynamic DMA windows (DDW).

While we are here, update the comment explaining why RLIMIT_MEMLOCK
might be required to be bigger than the guest RAM.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 71 +++--
 1 file changed, 53 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index 1c1a9c4..c9fac97 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -29,6 +29,46 @@
 static void tce_iommu_detach_group(void *iommu_data,
struct iommu_group *iommu_group);
 
+static long try_increment_locked_vm(struct iommu_table *tbl)
+{
+   long ret = 0, locked, lock_limit, npages;
+
+   if (!current || !current->mm)
+   return -ESRCH; /* process exited */
+
+   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+
+   down_write(&current->mm->mmap_sem);
+   locked = current->mm->locked_vm + npages;
+   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
+   pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
+   rlimit(RLIMIT_MEMLOCK));
+   ret = -ENOMEM;
+   } else {
+   current->mm->locked_vm += npages;
+   }
+   up_write(&current->mm->mmap_sem);
+
+   return ret;
+}
+
+static void decrement_locked_vm(struct iommu_table *tbl)
+{
+   long npages;
+
+   if (!current || !current->mm)
+   return; /* process exited */
+
+   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+
+   down_write(&current->mm->mmap_sem);
+   if (npages > current->mm->locked_vm)
+   npages = current->mm->locked_vm;
+   current->mm->locked_vm -= npages;
+   up_write(&current->mm->mmap_sem);
+}
+
 /*
  * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
  *
@@ -86,7 +126,6 @@ static void tce_iommu_take_ownership_notify(struct 
spapr_tce_iommu_group *data,
 static int tce_iommu_enable(struct tce_container *container)
 {
int ret = 0;
-   unsigned long locked, lock_limit, npages;
struct iommu_table *tbl;
struct spapr_tce_iommu_group *data;
 
@@ -120,24 +159,23 @@ static int tce_iommu_enable(struct tce_container 
*container)
 * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits,
 * that would effectively kill the guest at random points, much better
 * enforcing the limit based on the max that the guest can map.
+*
+* Unfortunately at the moment it counts whole tables, no matter how
+* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups
+* each with 2GB DMA window, 8GB will be counted here. The reason for
+* this is that we cannot tell here the amount of RAM used by the guest
+* as this information is only available from KVM and VFIO is
+* KVM agnostic.
 */
tbl = data->ops->get_table(data, 0);
if (!tbl)
return -ENXIO;
 
-   down_write(&current->mm->mmap_sem);
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-   locked = current->mm->locked_vm + npages;
-   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-   if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
-   pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
-   rlimit(RLIMIT_MEMLOCK));
-   ret = -ENOMEM;
-   } else {
-   current->mm->locked_vm += npages;
-   container->enabled = true;
-   }
-   up_write(&current->mm->mmap_sem);
+   ret = try_increment_locked_vm(tbl);
+   if (ret)
+   return ret;
+
+   container->enabled = true;
 
return ret;
 }
@@ -163,10 +201,7 @@ static void tce_iommu_disable(struct tce_container 
*container)
if (!tbl)
return;
 
-   down_write(&current->mm->mmap_sem);
-   current->mm->locked_vm -= (tbl->it_size <<
-   IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
-   up_write(&current->mm->mmap_sem);
+   decrement_locked_vm(tbl);
 }
 
 static void *tce_iommu_open(unsigned long arg)
-- 
2.0.0



[PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions

2014-09-22 Thread Alexey Kardashevskiy
This adds missing locks in iommu_take_ownership()/
iommu_release_ownership().

This marks all pages busy in iommu_table::it_map in order to catch
errors if there is an attempt to use this table while ownership over it
is taken.

This only clears the TCE contents if there is no page marked busy in it_map.
The clearing must be done outside of the table locks, as iommu_clear_tce(),
called from iommu_clear_tces_and_put_pages(), takes those locks itself.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/iommu.c | 36 +---
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index c2c8d9d..cd80867 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1126,33 +1126,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
 
 int iommu_take_ownership(struct iommu_table *tbl)
 {
-   unsigned long sz = (tbl->it_size + 7) >> 3;
+   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
+   int ret = 0, bit0 = 0;
+
+   spin_lock_irqsave(&tbl->large_pool.lock, flags);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_lock(&tbl->pools[i].lock);
 
if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   bit0 = test_and_clear_bit(0, tbl->it_map);
 
if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
pr_err("iommu_tce: it_map is not empty");
-   return -EBUSY;
+   ret = -EBUSY;
+   if (bit0)
+   set_bit(0, tbl->it_map);
+   } else {
+   memset(tbl->it_map, 0xff, sz);
}
 
-   memset(tbl->it_map, 0xff, sz);
-   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_unlock(&tbl->pools[i].lock);
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 
-   return 0;
+   if (!ret)
+   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset,
+   tbl->it_size);
+   return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_take_ownership);
 
 void iommu_release_ownership(struct iommu_table *tbl)
 {
-   unsigned long sz = (tbl->it_size + 7) >> 3;
+   unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
 
iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
+
+   spin_lock_irqsave(&tbl->large_pool.lock, flags);
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_lock(&tbl->pools[i].lock);
+
memset(tbl->it_map, 0, sz);
 
/* Restore bit#0 set by iommu_init_table() */
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
+
+   for (i = 0; i < tbl->nr_pools; i++)
+   spin_unlock(&tbl->pools[i].lock);
+   spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
 EXPORT_SYMBOL_GPL(iommu_release_ownership);
 
-- 
2.0.0



[PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size

2014-09-22 Thread Alexey Kardashevskiy
This makes use of it_page_size from the iommu_table struct,
as the page size can differ.

This replaces the missing IOMMU_PAGE_SHIFT macro in commented debug code,
as the recently introduced IOMMU_PAGE_XXX macros do not include
IOMMU_PAGE_SHIFT.

Signed-off-by: Alexey Kardashevskiy 
---
 drivers/vfio/vfio_iommu_spapr_tce.c | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
b/drivers/vfio/vfio_iommu_spapr_tce.c
index c9fac97..0dccbc4 100644
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@ -36,7 +36,7 @@ static long try_increment_locked_vm(struct iommu_table *tbl)
if (!current || !current->mm)
return -ESRCH; /* process exited */
 
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 
down_write(&current->mm->mmap_sem);
locked = current->mm->locked_vm + npages;
@@ -60,7 +60,7 @@ static void decrement_locked_vm(struct iommu_table *tbl)
if (!current || !current->mm)
return; /* process exited */
 
-   npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
+   npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT;
 
down_write(&current->mm->mmap_sem);
if (npages > current->mm->locked_vm)
@@ -284,8 +284,8 @@ static long tce_iommu_ioctl(void *iommu_data,
if (info.argsz < minsz)
return -EINVAL;
 
-   info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K;
-   info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K;
+   info.dma32_window_start = tbl->it_offset << tbl->it_page_shift;
+   info.dma32_window_size = tbl->it_size << tbl->it_page_shift;
info.flags = 0;
 
if (copy_to_user((void __user *)arg, &info, minsz))
@@ -318,10 +318,6 @@ static long tce_iommu_ioctl(void *iommu_data,
VFIO_DMA_MAP_FLAG_WRITE))
return -EINVAL;
 
-   if ((param.size & ~IOMMU_PAGE_MASK_4K) ||
-   (param.vaddr & ~IOMMU_PAGE_MASK_4K))
-   return -EINVAL;
-
/* iova is checked by the IOMMU API */
tce = param.vaddr;
if (param.flags & VFIO_DMA_MAP_FLAG_READ)
@@ -334,21 +330,25 @@ static long tce_iommu_ioctl(void *iommu_data,
return -ENXIO;
BUG_ON(!tbl->it_group);
 
+   if ((param.size & ~IOMMU_PAGE_MASK(tbl)) ||
+   (param.vaddr & ~IOMMU_PAGE_MASK(tbl)))
+   return -EINVAL;
+
ret = iommu_tce_put_param_check(tbl, param.iova, tce);
if (ret)
return ret;
 
-   for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) {
+   for (i = 0; i < (param.size >> tbl->it_page_shift); ++i) {
ret = iommu_put_tce_user_mode(tbl,
-   (param.iova >> IOMMU_PAGE_SHIFT_4K) + i,
+   (param.iova >> tbl->it_page_shift) + i,
tce);
if (ret)
break;
-   tce += IOMMU_PAGE_SIZE_4K;
+   tce += IOMMU_PAGE_SIZE(tbl);
}
if (ret)
iommu_clear_tces_and_put_pages(tbl,
-   param.iova >> IOMMU_PAGE_SHIFT_4K, i);
+   param.iova >> tbl->it_page_shift, i);
 
iommu_flush_tce(tbl);
 
@@ -379,23 +379,23 @@ static long tce_iommu_ioctl(void *iommu_data,
if (param.flags)
return -EINVAL;
 
-   if (param.size & ~IOMMU_PAGE_MASK_4K)
-   return -EINVAL;
-
tbl = spapr_tce_find_table(container, data, param.iova);
if (!tbl)
return -ENXIO;
 
+   if (param.size & ~IOMMU_PAGE_MASK(tbl))
+   return -EINVAL;
+
BUG_ON(!tbl->it_group);
 
ret = iommu_tce_clear_param_check(tbl, param.iova, 0,
-   param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.size >> tbl->it_page_shift);
if (ret)
return ret;
 
ret = iommu_clear_tces_and_put_pages(tbl,
-   param.iova >> IOMMU_PAGE_SHIFT_4K,
-   param.size >> IOMMU_PAGE_SHIFT_4K);
+   param.iova >> tbl->it_page_shift,
+   param.size >> tbl->it_page_shift);
iommu_flush_tce(tbl);
 
return ret;
-- 
2.0.0


[PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table

2014-09-22 Thread Alexey Kardashevskiy
This adds an iommu_table_ops struct and puts a pointer to it into
the iommu_table struct. This moves the tce_build/tce_free/tce_get/tce_flush
callbacks from ppc_md to the new struct where they really belong.

This adds an extra @ops parameter to iommu_init_table() to make sure
that we do not leave any IOMMU table without iommu_table_ops. @it_ops is
initialized in the very beginning as iommu_init_table() calls
iommu_table_clear() and the latter uses callbacks already.

This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_" prefixes
for better readability.

This removes tce_xxx_rm handlers from ppc_md as well but does not add
them to iommu_table_ops, this will be done later if we decide to support
TCE hypercalls in real mode.

This always uses tce_buildmulti_pSeriesLP/tce_freemulti_pSeriesLP as
callbacks for pseries. This changes the "multi" callbacks to fall back to
tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not
present. The reason for this is that we still have to support the
"multitce=off" boot parameter in disable_multitce(), and we do not want to
walk through all IOMMU tables in the system and replace the "multi"
callbacks with single ones.
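
After this change a platform wires a table up roughly like this
(my_setup_table() is a hypothetical caller; the ops symbol is one this
patch exports for VIO):

static struct iommu_table *my_setup_table(struct iommu_table *tbl, int nid)
{
	/* iommu_init_table() stores @ops into tbl->it_ops right away,
	 * since iommu_table_clear() already needs the callbacks */
	return iommu_init_table(tbl, nid, &iommu_table_lpar_multi_ops);
}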

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/iommu.h| 20 +++-
 arch/powerpc/include/asm/machdep.h  | 25 ---
 arch/powerpc/kernel/iommu.c | 50 -
 arch/powerpc/kernel/vio.c   |  5 ++-
 arch/powerpc/platforms/cell/iommu.c |  9 --
 arch/powerpc/platforms/pasemi/iommu.c   |  8 +++--
 arch/powerpc/platforms/powernv/pci-ioda.c   |  4 +--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |  3 +-
 arch/powerpc/platforms/powernv/pci.c| 24 --
 arch/powerpc/platforms/powernv/pci.h|  1 +
 arch/powerpc/platforms/pseries/iommu.c  | 42 +---
 arch/powerpc/sysdev/dart_iommu.c| 13 
 12 files changed, 102 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index 2b0b01d..c725e4a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -43,6 +43,22 @@
 extern int iommu_is_off;
 extern int iommu_force_on;
 
+struct iommu_table_ops {
+   int (*set)(struct iommu_table *tbl,
+   long index, long npages,
+   unsigned long uaddr,
+   enum dma_data_direction direction,
+   struct dma_attrs *attrs);
+   void (*clear)(struct iommu_table *tbl,
+   long index, long npages);
+   unsigned long (*get)(struct iommu_table *tbl, long index);
+   void (*flush)(struct iommu_table *tbl);
+};
+
+/* These are used by VIO */
+extern struct iommu_table_ops iommu_table_lpar_multi_ops;
+extern struct iommu_table_ops iommu_table_pseries_ops;
+
 /*
  * IOMAP_MAX_ORDER defines the largest contiguous block
  * of dma space we can get.  IOMAP_MAX_ORDER = 13
@@ -77,6 +93,7 @@ struct iommu_table {
 #ifdef CONFIG_IOMMU_API
struct iommu_group *it_group;
 #endif
+   struct iommu_table_ops *it_ops;
 };
 
 /* Pure 2^n version of get_order */
@@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const 
char *node_name);
  * structure
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table * tbl,
-   int nid);
+   int nid,
+   struct iommu_table_ops *ops);
 
 struct spapr_tce_iommu_ops;
 #ifdef CONFIG_IOMMU_API
diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index b125cea..1fc824d 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -65,31 +65,6 @@ struct machdep_calls {
 * destroyed as well */
void(*hpte_clear_all)(void);
 
-   int (*tce_build)(struct iommu_table *tbl,
-long index,
-long npages,
-unsigned long uaddr,
-enum dma_data_direction direction,
-struct dma_attrs *attrs);
-   void(*tce_free)(struct iommu_table *tbl,
-   long index,
-   long npages);
-   unsigned long   (*tce_get)(struct iommu_table *tbl,
-   long index);
-   void(*tce_flush)(struct iommu_table *tbl);
-
-   /* _rm versions are for real mode use only */
-   int (*tce_build_rm)(struct iommu_table *tbl,
-long index,
-long npages,
-unsigned long uaddr,
-enum dma_data_direction direction,
-   

[PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows

2014-09-22 Thread Alexey Kardashevskiy

This enables the PAPR-defined feature called Dynamic DMA windows (DDW).

Each Partitionable Endpoint (IOMMU group) has a separate DMA window on
a PCI bus where devices are allowed to perform DMA. By default a 1GB or 2GB
window is allocated at host boot time, and these windows are used when
an IOMMU group is passed to the userspace (guest). These windows are
mapped at zero offset on a PCI bus.

High-speed devices may suffer from the limited size of this window. On the
host side, a TCE bypass mode is enabled on the POWER8 CPU, which implements
direct mapping of the host memory to a PCI bus at 1<<59.

For the guest, PAPR defines a DDW RTAS API which allows the pseries guest
to query the hypervisor whether it supports DDW and what the parameters
of possible windows are.

Currently POWER8 supports 2 DMA windows per PE: the already mentioned
small 32bit window, and a 64bit window which can only start from 1<<59 and
can support various page sizes.

This patchset reworks PPC IOMMU code and adds necessary structures
to extend it to support big windows.

When the guest detects the feature and the PE is capable of 64bit DMA,
it does:
1. query to hypervisor about number of available windows and page masks;
2. creates a window with the biggest possible page size (current guests can do
64K or 16MB TCEs);
3. maps the entire guest RAM via H_PUT_TCE* hypercalls;
4. switches dma_ops to direct_dma_ops on the selected PE.

Once this is done, H_PUT_TCE is not called anymore and the guest gets
maximum performance.
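
In pseudo-C, with every helper name hypothetical (the real guest code would
go through RTAS tokens and the hypercall API):

	/* 1. ibm,query-pe-dma-window */
	query_ddw(pe, &windows_available, &page_size_mask);

	if (windows_available && (page_size_mask & DDW_PGSIZE_16M)) {
		/* 2. ibm,create-pe-dma-window: 16MB TCEs, cover all RAM */
		win = create_ddw(pe, 24, order_base_2(ram_size));

		/* 3. map the entire guest RAM */
		map_all_ram(win);	/* H_PUT_TCE / H_PUT_TCE_INDIRECT */

		/* 4. stop using TCE hypercalls for this PE */
		set_dma_ops(dev, &dma_direct_ops);
	}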

Please comment. Thanks!


Changes:
v2:
* added missing __pa() in "powerpc/powernv: Release replaced TCE"
* reposted to make some noise :)



Alexey Kardashevskiy (13):
  powerpc/iommu: Check that TCE page size is equal to it_page_size
  powerpc/powernv: Make invalidate() a callback
  powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
  powerpc/powernv: Convert/move set_bypass() callback to
take_ownership()
  powerpc/iommu: Fix IOMMU ownership control functions
  powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
  powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
  powerpc/powernv: Release replaced TCE
  powerpc/pseries/lpar: Enable VFIO
  powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
  vfio: powerpc/spapr: Move locked_vm accounting to helpers
  vfio: powerpc/spapr: Use it_page_size
  vfio: powerpc/spapr: Enable Dynamic DMA windows

 arch/powerpc/include/asm/iommu.h|  35 ++-
 arch/powerpc/include/asm/machdep.h  |  25 --
 arch/powerpc/include/asm/tce.h  |  37 +++
 arch/powerpc/kernel/iommu.c | 213 +--
 arch/powerpc/kernel/vio.c   |   5 +-
 arch/powerpc/platforms/cell/iommu.c |   9 +-
 arch/powerpc/platforms/pasemi/iommu.c   |   8 +-
 arch/powerpc/platforms/powernv/pci-ioda.c   | 233 +++--
 arch/powerpc/platforms/powernv/pci-p5ioc2.c |   4 +-
 arch/powerpc/platforms/powernv/pci.c| 113 +---
 arch/powerpc/platforms/powernv/pci.h|  15 +-
 arch/powerpc/platforms/pseries/iommu.c  |  77 --
 arch/powerpc/sysdev/dart_iommu.c|  13 +-
 drivers/vfio/vfio_iommu_spapr_tce.c | 384 +++-
 include/uapi/linux/vfio.h   |  25 +-
 15 files changed, 925 insertions(+), 271 deletions(-)

-- 
2.0.0



Re: [PATCH 3.4 00/45] 3.4.104-rc1 review

2014-09-22 Thread Guenter Roeck

On 09/22/2014 07:42 PM, Guenter Roeck wrote:

On 09/22/2014 07:27 PM, Zefan Li wrote:

From: Zefan Li 

This is the start of the stable review cycle for the 3.4.104 release.
There are 45 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Thu Sep 25 02:03:31 UTC 2014.
Anything received after that time might be too late.

A combined patch relative to 3.4.103 will be posted as an additional
response to this.  A shortlog and diffstat can be found below.

thanks,

Zefan Li


Hi,

did you push the latest patch? I only see 43 patches in the queue.



Never mind, got it now.

Guenter




Re: [PATCH 3.4 00/45] 3.4.104-rc1 review

2014-09-22 Thread Zefan Li
This is the combined patch for 3.4.104-rc1 relative to 3.4.103.

---

diff --git a/Documentation/stable_kernel_rules.txt 
b/Documentation/stable_kernel_rules.txt
index b0714d8..8dfb6a5 100644
--- a/Documentation/stable_kernel_rules.txt
+++ b/Documentation/stable_kernel_rules.txt
@@ -29,6 +29,9 @@ Rules on what kind of patches are accepted, and which ones 
are not, into the
 
 Procedure for submitting patches to the -stable tree:
 
+ - If the patch covers files in net/ or drivers/net please follow netdev stable
+   submission guidelines as described in
+   Documentation/networking/netdev-FAQ.txt
  - Send the patch, after verifying that it follows the above rules, to
sta...@vger.kernel.org.  You must note the upstream commit ID in the
changelog of your submission, as well as the kernel version you wish
diff --git a/Makefile b/Makefile
index 36f0913..77a9aa6 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 4
-SUBLEVEL = 103
-EXTRAVERSION =
+SUBLEVEL = 104
+EXTRAVERSION = -rc1
 NAME = Saber-toothed Squirrel
 
 # *DOCUMENTATION*
diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
index 7a3d38d..5ebab58 100644
--- a/arch/alpha/include/asm/io.h
+++ b/arch/alpha/include/asm/io.h
@@ -489,6 +489,11 @@ extern inline void writeq(u64 b, volatile void __iomem 
*addr)
 }
 #endif
 
+#define ioread16be(p) be16_to_cpu(ioread16(p))
+#define ioread32be(p) be32_to_cpu(ioread32(p))
+#define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p))
+#define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p))
+
 #define inb_p  inb
 #define inw_p  inw
 #define inl_p  inl
diff --git a/arch/alpha/oprofile/common.c b/arch/alpha/oprofile/common.c
index a0a5d27..b8ce18f 100644
--- a/arch/alpha/oprofile/common.c
+++ b/arch/alpha/oprofile/common.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "op_impl.h"
 
diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S
index 9a8531e..9d95a46 100644
--- a/arch/arm/kernel/entry-header.S
+++ b/arch/arm/kernel/entry-header.S
@@ -76,26 +76,21 @@
 #ifndef CONFIG_THUMB2_KERNEL
.macro  svc_exit, rpsr
msr spsr_cxsf, \rpsr
-#if defined(CONFIG_CPU_V6)
-   ldr r0, [sp]
-   strex   r1, r2, [sp]@ clear the exclusive monitor
-   ldmib   sp, {r1 - pc}^  @ load r1 - pc, cpsr
-#elif defined(CONFIG_CPU_32v6K)
-   clrex   @ clear the exclusive monitor
-   ldmia   sp, {r0 - pc}^  @ load r0 - pc, cpsr
-#else
-   ldmia   sp, {r0 - pc}^  @ load r0 - pc, cpsr
+#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
+   @ We must avoid clrex due to Cortex-A15 erratum #830321
+   sub r0, sp, #4  @ uninhabited address
+   strex   r1, r2, [r0]@ clear the exclusive monitor
 #endif
+   ldmia   sp, {r0 - pc}^  @ load r0 - pc, cpsr
.endm
 
.macro  restore_user_regs, fast = 0, offset = 0
ldr r1, [sp, #\offset + S_PSR]  @ get calling cpsr
ldr lr, [sp, #\offset + S_PC]!  @ get pc
msr spsr_cxsf, r1   @ save in spsr_svc
-#if defined(CONFIG_CPU_V6)
+#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
+   @ We must avoid clrex due to Cortex-A15 erratum #830321
strex   r1, r2, [sp]@ clear the exclusive monitor
-#elif defined(CONFIG_CPU_32v6K)
-   clrex   @ clear the exclusive monitor
 #endif
.if \fast
ldmdb   sp, {r1 - lr}^  @ get calling r1 - lr
@@ -123,7 +118,10 @@
.macro  svc_exit, rpsr
ldr lr, [sp, #S_SP] @ top of the stack
ldrdr0, r1, [sp, #S_LR] @ calling lr and pc
-   clrex   @ clear the exclusive monitor
+
+   @ We must avoid clrex due to Cortex-A15 erratum #830321
+   strex   r2, r1, [sp, #S_LR] @ clear the exclusive monitor
+
stmdb   lr!, {r0, r1, \rpsr}@ calling lr and rfe context
ldmia   sp, {r0 - r12}
mov sp, lr
@@ -132,13 +130,16 @@
.endm
 
.macro  restore_user_regs, fast = 0, offset = 0
-   clrex   @ clear the exclusive monitor
mov r2, sp
load_user_sp_lr r2, r3, \offset + S_SP  @ calling sp, lr
ldr r1, [sp, #\offset + S_PSR]  @ get calling cpsr
ldr lr, [sp, #\offset + S_PC]   @ get pc
add sp, sp, #\offset + S_SP
msr spsr_cxsf, r1   @ save in spsr_svc
+
+   @ We must avoid clrex due to Cortex-A15 erratum #830321
+   strex   r1, r2, [sp]@ clear the exclusive monitor
+
.if \fast
ldmdb   sp, {r1 - r12}  @ get calling r1 - r12
.else
diff --git 

[PATCH 2/3] dt-bindings: add document of Rockchip power domain

2014-09-22 Thread jinkun.hong
From: "jinkun.hong" 

Signed-off-by: Jack Dai 
Signed-off-by: Caesar Wang 
Signed-off-by: jinkun.hong 
---

 .../bindings/arm/rockchip/power_domain.txt |   48 
 1 file changed, 48 insertions(+)
 create mode 100644 
Documentation/devicetree/bindings/arm/rockchip/power_domain.txt

diff --git a/Documentation/devicetree/bindings/arm/rockchip/power_domain.txt 
b/Documentation/devicetree/bindings/arm/rockchip/power_domain.txt
new file mode 100644
index 000..2a80d3f
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/rockchip/power_domain.txt
@@ -0,0 +1,48 @@
+* Rockchip Power Domains
+
+Rockchip processors include support for multiple power domains which can be
+powered up/down by software based on different application scenarios to
+save power.
+
+Required properties for power domain controller:
+- compatible: should be one of the following.
+* rockchip,rk3288-power-controller - for rk3288 type power domain.
+- #power-domain-cells: Number of cells in a power-domain specifier.
+  should be 1.
+- rockchip,pmu: phandle referencing a syscon providing the pmu registers
+- #address-cells: should be 1.
+- #size-cells: should be 0.
+
+Required properties for power domain sub nodes:
+- reg: index of the power domain, should use macros in:
+*  include/dt-bindings/power-domain/rk3288.h - for rk3288 type power 
domain.
+- clocks: phandles to clocks which need to be enabled while power domain
+  switches state.
+
+Example:
+
+   power: power-controller {
+  compatible = "rockchip,rk3288-power-controller";
+  #power-domain-cells = <1>;
+  rockchip,pmu = <&pmu>;
+  #address-cells = <1>;
+  #size-cells = <0>;
+
+  pd_gpu {
+  reg = <RK3288_PD_GPU>;
+  clocks = <&cru ACLK_GPU>;
+  };
+   };
+
+Node of a device using power domains must have a power-domains property,
+containing a phandle to the power device node and an index specifying which
+power domain to use.
+The index should use macros in:
+   * include/dt-bindings/power-domain/rk3288.h - for rk3288 type power domain.
+
+Example of the node using power domain:
+
+   node {
+   /* ... */
+   power-domains = <&power RK3288_PD_GPU>;
+   /* ... */
+   };
-- 
1.7.9.5


