[PATCH 2/9] percpu_ref: minor code and comment updates
* Some comments became stale.  Updated.
* percpu_ref_tryget() unnecessarily initializes @ret.  Removed.
* A blank line removed from percpu_ref_kill_rcu().
* Explicit function name in a WARN format string replaced with __func__.
* WARN_ON() in percpu_ref_reinit() converted to WARN_ON_ONCE().

Signed-off-by: Tejun Heo
Cc: Kent Overstreet
---
 include/linux/percpu-refcount.h | 25 +++++++++++++------------
 lib/percpu-refcount.c           | 14 +++++++++-----
 2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index f015f13..d44b027 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -115,8 +115,10 @@ static inline bool __pcpu_ref_alive(struct percpu_ref *ref,
  * percpu_ref_get - increment a percpu refcount
  * @ref: percpu_ref to get
  *
- * Analagous to atomic_inc().
- */
+ * Analagous to atomic_long_inc().
+ *
+ * This function is safe to call as long as @ref is between init and exit.
+ */
 static inline void percpu_ref_get(struct percpu_ref *ref)
 {
 	unsigned long __percpu *pcpu_count;
@@ -138,12 +140,12 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
  * Increment a percpu refcount unless its count already reached zero.
  * Returns %true on success; %false on failure.
  *
- * The caller is responsible for ensuring that @ref stays accessible.
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 {
 	unsigned long __percpu *pcpu_count;
-	int ret = false;
+	int ret;
 
 	rcu_read_lock_sched();
 
@@ -166,12 +168,13 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
  * Increment a percpu refcount unless it has already been killed.  Returns
  * %true on success; %false on failure.
  *
- * Completion of percpu_ref_kill() in itself doesn't guarantee that tryget
- * will fail.  For such guarantee, percpu_ref_kill_and_confirm() should be
- * used.  After the confirm_kill callback is invoked, it's guaranteed that
- * no new reference will be given out by percpu_ref_tryget().
+ * Completion of percpu_ref_kill() in itself doesn't guarantee that this
+ * function will fail.  For such guarantee, percpu_ref_kill_and_confirm()
+ * should be used.  After the confirm_kill callback is invoked, it's
+ * guaranteed that no new reference will be given out by
+ * percpu_ref_tryget_live().
  *
- * The caller is responsible for ensuring that @ref stays accessible.
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 {
@@ -196,6 +199,8 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
  *
  * Decrement the refcount, and if 0, call the release function (which was passed
  * to percpu_ref_init())
+ *
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline void percpu_ref_put(struct percpu_ref *ref)
 {
@@ -216,6 +221,8 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
  * @ref: percpu_ref to test
  *
  * Returns %true if @ref reached zero.
+ *
+ * This function is safe to call as long as @ref is between init and exit.
  */
 static inline bool percpu_ref_is_zero(struct percpu_ref *ref)
 {
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 070dab5..8ef3f5c 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -108,7 +108,6 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 	 * reaching 0 before we add the percpu counts.  But doing it at the same
 	 * time is equivalent and saves us atomic operations:
 	 */
-
 	atomic_long_add((long)count - PCPU_COUNT_BIAS, &ref->count);
 
 	WARN_ONCE(atomic_long_read(&ref->count) <= 0,
@@ -120,8 +119,8 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 		ref->confirm_kill(ref);
 
 	/*
-	 * Now we're in single atomic_t mode with a consistent refcount, so it's
-	 * safe to drop our initial ref:
+	 * Now we're in single atomic_long_t mode with a consistent
+	 * refcount, so it's safe to drop our initial ref:
 	 */
 	percpu_ref_put(ref);
 }
@@ -134,8 +133,8 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
  * Equivalent to percpu_ref_kill() but also schedules kill confirmation if
  * @confirm_kill is not NULL.  @confirm_kill, which may not block, will be
  * called after @ref is seen as dead from all CPUs - all further
- * invocations of percpu_ref_tryget() will fail.  See percpu_ref_tryget()
- * for more details.
+ * invocations of percpu_ref_tryget_live() will fail.  See
+ * percpu_ref_tryget_live() for more details.
  *
  * Due to the way percpu_ref is implemented, @confirm_kill will be called
  * after at least one full RCU grace period has passed but this is an
@@ -145,8 +144,7 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
[PATCHSET percpu/for-3.18] percpu_ref: implement switch_to_atomic/percpu()
Hello,

Over the past several months, percpu_ref grew use cases where it's
used as a persistent on/off switch which may be cycled multiple times
using percpu_ref_reinit().  One of such use cases is blk-mq's
mq_usage_counter which tracks the number of in-flight commands and is
used to drain them.  Unfortunately, SCSI device probing involves
synchronously creating and destroying request_queues for non-existent
devices and the sched RCU grace period involved in percpu_ref killing
adds up to a significant amount of latency.

Block layer already experienced the same issue in other areas and
works around it by starting the queue in a degraded mode which is
faster to shut down and making it fully functional only after it's
known that the queue isn't a temporary one for probing.  This patchset
implements percpu_ref mechanisms to manually switch between atomic and
percpu operation modes so that blk-mq can implement a similar degraded
operation mode.  This will also allow implementing debug mode for
percpu_ref so that underflow can be detected sooner.

This patchset contains the following nine patches.

 0001-percpu_ref-relocate-percpu_ref_reinit.patch
 0002-percpu_ref-minor-code-and-comment-updates.patch
 0003-percpu_ref-replace-pcpu_-prefix-with-percpu_.patch
 0004-percpu_ref-rename-things-to-prepare-for-decoupling-p.patch
 0005-percpu_ref-add-PCPU_REF_DEAD.patch
 0006-percpu_ref-decouple-switching-to-atomic-mode-and-kil.patch
 0007-percpu_ref-decouple-switching-to-percpu-mode-and-rei.patch
 0008-percpu_ref-add-PERCPU_REF_INIT_-flags.patch
 0009-percpu_ref-make-INIT_ATOMIC-and-switch_to_atomic-sti.patch

0001-0005 are prep patches.  0006-0007 implement
percpu_ref_switch_to_atomic/percpu().  0008 extends percpu_ref_init()
so that a percpu_ref can be initialized in different states including
atomic mode.  0009 makes atomic mode sticky so that it survives
through reinits.

This patchset is on top of percpu/for-3.18 and available in the
following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git review-percpu_ref-switch

diffstat follows.

 block/blk-mq.c                  |   2
 fs/aio.c                        |   4
 include/linux/percpu-refcount.h | 108 +-
 kernel/cgroup.c                 |   7
 lib/percpu-refcount.c           | 291 +---
 5 files changed, 295 insertions(+), 117 deletions(-)

Thanks.

--
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
[PATCH 4/9] percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch
percpu_ref will be restructured so that percpu/atomic mode switching
and reference killing are decoupled.  In preparation, do the following
renames.

* percpu_ref->confirm_kill	-> percpu_ref->confirm_switch
* __PERCPU_REF_DEAD		-> __PERCPU_REF_ATOMIC
* __percpu_ref_alive()		-> __ref_is_percpu()

This patch is pure rename and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo
Cc: Kent Overstreet
---
 include/linux/percpu-refcount.h | 25 +++++++++++++-----------
 lib/percpu-refcount.c           | 22 +++++++++-----------
 2 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 3d463a3..910e5f7 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -54,6 +54,11 @@ struct percpu_ref;
 typedef void (percpu_ref_func_t)(struct percpu_ref *);
 
+/* flags set in the lower bits of percpu_ref->percpu_count_ptr */
+enum {
+	__PERCPU_REF_ATOMIC	= 1LU << 0,	/* operating in atomic mode */
+};
+
 struct percpu_ref {
 	atomic_long_t		count;
 	/*
@@ -62,7 +67,7 @@ struct percpu_ref {
 	 */
 	unsigned long		percpu_count_ptr;
 	percpu_ref_func_t	*release;
-	percpu_ref_func_t	*confirm_kill;
+	percpu_ref_func_t	*confirm_switch;
 	struct rcu_head		rcu;
 };
 
@@ -88,23 +93,21 @@ static inline void percpu_ref_kill(struct percpu_ref *ref)
 	return percpu_ref_kill_and_confirm(ref, NULL);
 }
 
-#define __PERCPU_REF_DEAD	1
-
 /*
  * Internal helper.  Don't use outside percpu-refcount proper.  The
  * function doesn't return the pointer and let the caller test it for NULL
  * because doing so forces the compiler to generate two conditional
  * branches as it can't assume that @ref->percpu_count is not NULL.
  */
-static inline bool __percpu_ref_alive(struct percpu_ref *ref,
-				      unsigned long __percpu **percpu_countp)
+static inline bool __ref_is_percpu(struct percpu_ref *ref,
+				   unsigned long __percpu **percpu_countp)
 {
 	unsigned long percpu_ptr = ACCESS_ONCE(ref->percpu_count_ptr);
 
 	/* paired with smp_store_release() in percpu_ref_reinit() */
 	smp_read_barrier_depends();
 
-	if (unlikely(percpu_ptr & __PERCPU_REF_DEAD))
+	if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC))
 		return false;
 
 	*percpu_countp = (unsigned long __percpu *)percpu_ptr;
@@ -125,7 +128,7 @@ static inline void percpu_ref_get(struct percpu_ref *ref)
 	rcu_read_lock_sched();
 
-	if (__percpu_ref_alive(ref, &percpu_count))
+	if (__ref_is_percpu(ref, &percpu_count))
 		this_cpu_inc(*percpu_count);
 	else
 		atomic_long_inc(&ref->count);
@@ -149,7 +152,7 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref)
 	rcu_read_lock_sched();
 
-	if (__percpu_ref_alive(ref, &percpu_count)) {
+	if (__ref_is_percpu(ref, &percpu_count)) {
 		this_cpu_inc(*percpu_count);
 		ret = true;
 	} else {
@@ -183,7 +186,7 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 	rcu_read_lock_sched();
 
-	if (__percpu_ref_alive(ref, &percpu_count)) {
+	if (__ref_is_percpu(ref, &percpu_count)) {
 		this_cpu_inc(*percpu_count);
 		ret = true;
 	}
@@ -208,7 +211,7 @@ static inline void percpu_ref_put(struct percpu_ref *ref)
 	rcu_read_lock_sched();
 
-	if (__percpu_ref_alive(ref, &percpu_count))
+	if (__ref_is_percpu(ref, &percpu_count))
 		this_cpu_dec(*percpu_count);
 	else if (unlikely(atomic_long_dec_and_test(&ref->count)))
 		ref->release(ref);
@@ -228,7 +231,7 @@ static inline bool percpu_ref_is_zero(struct percpu_ref *ref)
 {
 	unsigned long __percpu *percpu_count;
 
-	if (__percpu_ref_alive(ref, &percpu_count))
+	if (__ref_is_percpu(ref, &percpu_count))
 		return false;
 
 	return !atomic_long_read(&ref->count);
 }
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 5aea6b7..7aef590 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -34,7 +34,7 @@ static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
 	return (unsigned long __percpu *)
-		(ref->percpu_count_ptr & ~__PERCPU_REF_DEAD);
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
 }
 
 /**
@@ -80,7 +80,7 @@ void percpu_ref_exit(struct percpu_ref *ref)
 	if (percpu_count) {
 		free_percpu(percpu_count);
-		ref->percpu_count_ptr = __PERCPU_REF_DEAD;
+		ref->percpu_count_ptr = __PERCPU_REF_ATOMIC;
 	}
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
@@ -117,8 +117,8 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 		  ref->release, atomic_long_read(&ref->count));
 
 	/* @ref is viewed as dead on all CPUs, send out
[PATCH 1/9] percpu_ref: relocate percpu_ref_reinit()
percpu_ref is gonna go through restructuring.  Move
percpu_ref_reinit() after percpu_ref_kill_and_confirm().  This will
make later changes easier to follow and result in cleaner
organization.

Signed-off-by: Tejun Heo
Cc: Kent Overstreet
---
 include/linux/percpu-refcount.h |  2 +-
 lib/percpu-refcount.c           | 70 ++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 5df6784..f015f13 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -68,10 +68,10 @@ struct percpu_ref {
 
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release, gfp_t gfp);
-void percpu_ref_reinit(struct percpu_ref *ref);
 void percpu_ref_exit(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
+void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 559ee0b..070dab5 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -63,41 +63,6 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 EXPORT_SYMBOL_GPL(percpu_ref_init);
 
 /**
- * percpu_ref_reinit - re-initialize a percpu refcount
- * @ref: perpcu_ref to re-initialize
- *
- * Re-initialize @ref so that it's in the same state as when it finished
- * percpu_ref_init().  @ref must have been initialized successfully, killed
- * and reached 0 but not exited.
- *
- * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
- * this function is in progress.
- */
-void percpu_ref_reinit(struct percpu_ref *ref)
-{
-	unsigned long __percpu *pcpu_count = pcpu_count_ptr(ref);
-	int cpu;
-
-	BUG_ON(!pcpu_count);
-	WARN_ON(!percpu_ref_is_zero(ref));
-
-	atomic_long_set(&ref->count, 1 + PCPU_COUNT_BIAS);
-
-	/*
-	 * Restore per-cpu operation.  smp_store_release() is paired with
-	 * smp_read_barrier_depends() in __pcpu_ref_alive() and guarantees
-	 * that the zeroing is visible to all percpu accesses which can see
-	 * the following PCPU_REF_DEAD clearing.
-	 */
-	for_each_possible_cpu(cpu)
-		*per_cpu_ptr(pcpu_count, cpu) = 0;
-
-	smp_store_release(&ref->pcpu_count_ptr,
-			  ref->pcpu_count_ptr & ~PCPU_REF_DEAD);
-}
-EXPORT_SYMBOL_GPL(percpu_ref_reinit);
-
-/**
  * percpu_ref_exit - undo percpu_ref_init()
  * @ref: percpu_ref to exit
  *
@@ -189,3 +154,38 @@ void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 	call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
+
+/**
+ * percpu_ref_reinit - re-initialize a percpu refcount
+ * @ref: perpcu_ref to re-initialize
+ *
+ * Re-initialize @ref so that it's in the same state as when it finished
+ * percpu_ref_init().  @ref must have been initialized successfully, killed
+ * and reached 0 but not exited.
+ *
+ * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
+ * this function is in progress.
+ */
+void percpu_ref_reinit(struct percpu_ref *ref)
+{
+	unsigned long __percpu *pcpu_count = pcpu_count_ptr(ref);
+	int cpu;
+
+	BUG_ON(!pcpu_count);
+	WARN_ON(!percpu_ref_is_zero(ref));
+
+	atomic_long_set(&ref->count, 1 + PCPU_COUNT_BIAS);
+
+	/*
+	 * Restore per-cpu operation.  smp_store_release() is paired with
+	 * smp_read_barrier_depends() in __pcpu_ref_alive() and guarantees
+	 * that the zeroing is visible to all percpu accesses which can see
+	 * the following PCPU_REF_DEAD clearing.
+	 */
+	for_each_possible_cpu(cpu)
+		*per_cpu_ptr(pcpu_count, cpu) = 0;
+
+	smp_store_release(&ref->pcpu_count_ptr,
+			  ref->pcpu_count_ptr & ~PCPU_REF_DEAD);
+}
+EXPORT_SYMBOL_GPL(percpu_ref_reinit);
--
1.9.3
Re: [GIT PULL] x86 fixes
* Ingo Molnar wrote:

> * Ingo Molnar wrote:
>
> > * Linus Torvalds wrote:
> >
> > > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar wrote:
> > > >
> > > > Please pull the latest x86-urgent-for-linus git tree from:
> > >
> > > I only just noticed, but this pull request causes my Sony Vaio
> > > laptop to immediately reboot at startup.
> > >
> > > I'm assuming it's one of the efi changes, but I'm bisecting now
> > > to say exactly where it happens.  It will get reverted.
> >
> > I've Cc:-ed Matt.
> >
> > My guess would be one of these two EFI commits:
> >
> >  * Fix early boot regression affecting x86 EFI boot stub when loading
> >    initrds above 4GB - Yinghai Lu
> >
> >      47226ad4f4cf x86/efi: Only load initrd above 4g on second try
> >
> >  * Relocate GOT entries in the x86 EFI boot stub now that we have
> >    symbols with global visibility - Matt Fleming
> >
> >      9cb0e394234d x86/efi: Fixup GOT in all boot code paths
> >
> > If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug
> > in the assembly code.  If so then we'd have to revert this, and
> > reintroduce another regression, caused by EFI commit
> > f23cf8bd5c1f49 in this merge window.  The most recent commit is
> > easy to revert, the older one not.
> >
> > If it's 47226ad4f4cf then we'd reintroduce the regression caused
> > by 4bf7111f501 in the previous merge window.  They both revert
> > cleanly after each other - but it might be safer to just revert
> > the most recent one.
> >
> > My guess is that your regression is caused by 47226ad4f4cf.
>
> Wrong sha1: my guess is on 9cb0e394234d, the GOT fixup.

So if it's the GOT fixup then I feel the safest option is to revert
9cb0e394234d straight away, and then to do a functional revert of
f23cf8bd5c1f49 as a separate step, perhaps via something really crude
like:

  #include "..//drivers/firmware/efi/libstub/efi-stub-helper.c"

or so.

(Maybe someone else can think of something cleaner/simpler, because
this method is really ugly, as we'd have to #include the whole libstub
library into eboot.c AFAICS...)

Thanks,

	Ingo
Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
On Tue, Sep 23, 2014 at 01:56:48AM -0400, Tejun Heo wrote:
> On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> > Jens,
> >
> > can we simply get these commits reverted for now if there's no better
> > fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> > support for scsi.
>
> Patches going out right now.

And the original implementation was broken, so...

--
tejun
[PATCH 5/9] percpu_ref: add PCPU_REF_DEAD
percpu_ref will be restructured so that percpu/atomic mode switching
and reference killing are decoupled.  In preparation, add
PCPU_REF_DEAD and PCPU_REF_ATOMIC_DEAD which is OR of ATOMIC and DEAD.
For now, ATOMIC and DEAD are changed together and all PCPU_REF_ATOMIC
uses are converted to PCPU_REF_ATOMIC_DEAD without causing any
behavior changes.

BUILD_BUG_ON() is added to percpu_ref_init() so that later flag
additions don't accidentally clobber lower bits of the pointer in
percpu_ref->pcpu_count_ptr.

Signed-off-by: Tejun Heo
Cc: Kent Overstreet
---
 include/linux/percpu-refcount.h |  4 +++-
 lib/percpu-refcount.c           | 15 +++++++++------
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 910e5f7..24cf157 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -57,6 +57,8 @@ typedef void (percpu_ref_func_t)(struct percpu_ref *);
 /* flags set in the lower bits of percpu_ref->percpu_count_ptr */
 enum {
 	__PERCPU_REF_ATOMIC	= 1LU << 0,	/* operating in atomic mode */
+	__PERCPU_REF_DEAD	= 1LU << 1,	/* (being) killed */
+	__PERCPU_REF_ATOMIC_DEAD = __PERCPU_REF_ATOMIC | __PERCPU_REF_DEAD,
 };
 
 struct percpu_ref {
@@ -107,7 +109,7 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
 	/* paired with smp_store_release() in percpu_ref_reinit() */
 	smp_read_barrier_depends();
 
-	if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC))
+	if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC_DEAD))
 		return false;
 
 	*percpu_countp = (unsigned long __percpu *)percpu_ptr;
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 7aef590..b0b8c09 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -34,7 +34,7 @@ static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
 	return (unsigned long __percpu *)
-		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
+		(ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
 }
 
 /**
@@ -52,6 +52,9 @@ static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 		    gfp_t gfp)
 {
+	BUILD_BUG_ON(__PERCPU_REF_ATOMIC_DEAD &
+		     ~(__alignof__(unsigned long) - 1));
+
 	atomic_long_set(&ref->count, 1 + PERCPU_COUNT_BIAS);
 
 	ref->percpu_count_ptr =
@@ -80,7 +83,7 @@ void percpu_ref_exit(struct percpu_ref *ref)
 	if (percpu_count) {
 		free_percpu(percpu_count);
-		ref->percpu_count_ptr = __PERCPU_REF_ATOMIC;
+		ref->percpu_count_ptr = __PERCPU_REF_ATOMIC_DEAD;
 	}
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
@@ -145,10 +148,10 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill)
 {
-	WARN_ONCE(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC,
+	WARN_ONCE(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC_DEAD,
 		  "%s called more than once on %pf!", __func__, ref->release);
 
-	ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
+	ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC_DEAD;
 
 	ref->confirm_switch = confirm_kill;
 
 	call_rcu_sched(&ref->rcu, percpu_ref_kill_rcu);
@@ -180,12 +183,12 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 	 * Restore per-cpu operation.  smp_store_release() is paired with
 	 * smp_read_barrier_depends() in __ref_is_percpu() and guarantees
 	 * that the zeroing is visible to all percpu accesses which can see
-	 * the following __PERCPU_REF_ATOMIC clearing.
+	 * the following __PERCPU_REF_ATOMIC_DEAD clearing.
 	 */
 	for_each_possible_cpu(cpu)
 		*per_cpu_ptr(percpu_count, cpu) = 0;
 
 	smp_store_release(&ref->percpu_count_ptr,
-			  ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC);
+			  ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD);
 }
 EXPORT_SYMBOL_GPL(percpu_ref_reinit);
--
1.9.3
Re: two more fixes for block/for-linus
On Mon, Sep 22, 2014 at 02:40:15PM -0400, Douglas Gilbert wrote:
> With these patches applied (actually a resync an hour
> ago with the for-linus tree which includes them), the
> freeze-during-boot-up problem that I have been seeing
> with an old SATA boot disk (perhaps 1.5 Gbps) for
> the last two weeks, has gone away.
>
> That SATA disk is connected to the motherboard (Gigabyte
> Z97M-D3H/Z97M-D3H, BIOS F5 05/30/2014) and has a standard
> AHCI interface as far as I can tell.  dmesg confirms that.

Should have thought of the weird ATA error handling earlier.  Sorry
Doug!
Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
On Tue, Sep 23, 2014 at 07:55:54AM +0200, Christoph Hellwig wrote:
> Jens,
>
> can we simply get these commits reverted for now if there's no better
> fix?  I'd hate to have this boot stall in the first kernel with blk-mq
> support for scsi.

Patches going out right now.  Thanks.

--
tejun
[PATCH 9/9] percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky
Currently, a percpu_ref which is initialized with
PERCPU_REF_INIT_ATOMIC or switched to atomic mode via
switch_to_atomic() automatically reverts to percpu mode on the first
percpu_ref_reinit().  This makes the atomic mode difficult to use for
cases where a percpu_ref is used as a persistent on/off switch which
may be cycled multiple times.

This patch makes such atomic state sticky so that it survives through
kill/reinit cycles.  After this patch, atomic state is cleared only by
an explicit percpu_ref_switch_to_percpu() call.

Signed-off-by: Tejun Heo
Cc: Kent Overstreet
Cc: Jens Axboe
Cc: Christoph Hellwig
Cc: Johannes Weiner
---
 include/linux/percpu-refcount.h |  5 ++++-
 lib/percpu-refcount.c           | 20 +++++++++++++++-----
 2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 5f84bf0..8459d3a 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -65,7 +65,9 @@ enum {
 enum {
 	/*
 	 * Start w/ ref == 1 in atomic mode.  Can be switched to percpu
-	 * operation using percpu_ref_switch_to_percpu().
+	 * operation using percpu_ref_switch_to_percpu().  If initialized
+	 * with this flag, the ref will stay in atomic mode until
+	 * percpu_ref_switch_to_percpu() is invoked on it.
 	 */
 	PERCPU_REF_INIT_ATOMIC	= 1 << 0,
 
@@ -85,6 +87,7 @@ struct percpu_ref {
 	unsigned long		percpu_count_ptr;
 	percpu_ref_func_t	*release;
 	percpu_ref_func_t	*confirm_switch;
+	bool			force_atomic:1;
 	struct rcu_head		rcu;
 };
 
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index 74ec33e..c47e496 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -68,6 +68,8 @@ int percpu_ref_init(struct percpu_ref *ref, percpu_ref_func_t *release,
 	if (!ref->percpu_count_ptr)
 		return -ENOMEM;
 
+	ref->force_atomic = flags & PERCPU_REF_INIT_ATOMIC;
+
 	if (flags & (PERCPU_REF_INIT_ATOMIC | PERCPU_REF_INIT_DEAD))
 		ref->percpu_count_ptr |= __PERCPU_REF_ATOMIC;
 	else
@@ -203,7 +205,8 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
  * are guaranteed to be in atomic mode, @confirm_switch, which may not
  * block, is invoked.  This function may be invoked concurrently with all
  * the get/put operations and can safely be mixed with kill and reinit
- * operations.
+ * operations.  Note that @ref will stay in atomic mode across kill/reinit
+ * cycles until percpu_ref_switch_to_percpu() is called.
  *
  * This function normally doesn't block and can be called from any context
  * but it may block if @confirm_kill is specified and @ref is already in
@@ -217,6 +220,7 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_switch)
 {
+	ref->force_atomic = true;
 	__percpu_ref_switch_to_atomic(ref, confirm_switch);
 }
 
@@ -256,7 +260,10 @@ void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
  *
  * Switch @ref to percpu mode.  This function may be invoked concurrently
  * with all the get/put operations and can safely be mixed with kill and
- * reinit operations.
+ * reinit operations.  This function reverses the sticky atomic state set
+ * by PERCPU_REF_INIT_ATOMIC or percpu_ref_switch_to_atomic().  If @ref is
+ * dying or dead, the actual switching takes place on the following
+ * percpu_ref_reinit().
  *
  * This function normally doesn't block and can be called from any context
  * but it may block if @ref is in the process of switching to atomic mode
@@ -264,6 +271,8 @@ void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
  */
 void percpu_ref_switch_to_percpu(struct percpu_ref *ref)
 {
+	ref->force_atomic = false;
+
 	/* a dying or dead ref can't be switched to percpu mode w/o reinit */
 	if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD))
 		__percpu_ref_switch_to_percpu(ref);
@@ -305,8 +314,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_kill_and_confirm);
  * @ref: perpcu_ref to re-initialize
  *
  * Re-initialize @ref so that it's in the same state as when it finished
- * percpu_ref_init().  @ref must have been initialized successfully and
- * reached 0 but not exited.
+ * percpu_ref_init() ignoring %PERCPU_REF_INIT_DEAD.  @ref must have been
+ * initialized successfully and reached 0 but not exited.
  *
  * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while
  * this function is in progress.
@@ -317,6 +326,7 @@ void percpu_ref_reinit(struct percpu_ref *ref)
 	ref->percpu_count_ptr &= ~__PERCPU_REF_DEAD;
 	percpu_ref_get(ref);
-	__percpu_ref_switch_to_percpu(ref);
+	if (!ref->force_atomic)
+		__percpu_ref_switch_to_percpu(ref);
 }
[PATCH 6/9] percpu_ref: decouple switching to atomic mode and killing
percpu_ref has treated the dropping of the base reference and
switching to atomic mode as an integral operation; however, there's
nothing inherent tying the two together.

The use cases for percpu_ref have been expanding continuously.  While
the current init/kill/reinit/exit model can cover a lot, the coupling
of kill/reinit with atomic/percpu mode switching is turning out to be
too restrictive for use cases where many percpu_refs are created and
destroyed back-to-back with only some of them reaching extended
operation.  The coupling also makes implementing always-atomic debug
mode difficult.

This patch separates out atomic mode switching into
percpu_ref_switch_to_atomic() and reimplements
percpu_ref_kill_and_confirm() on top of it.

* The handling of __PERCPU_REF_ATOMIC and __PERCPU_REF_DEAD is now
  differentiated.  Among get/put operations, percpu_ref_tryget_live()
  is the only one which cares about DEAD.

* percpu_ref_switch_to_atomic() can be called multiple times on the
  same ref.  This means that multiple @confirm_switch may get queued
  up which we can't do reliably without extra memory area.  This is
  handled by making the later invocation synchronously wait for the
  completion of the previous one.  This isn't particularly desirable
  but such synchronous waits shouldn't happen in most cases.

Signed-off-by: Tejun Heo
Cc: Kent Overstreet
Cc: Jens Axboe
Cc: Christoph Hellwig
Cc: Johannes Weiner
---
 include/linux/percpu-refcount.h |   8 ++-
 lib/percpu-refcount.c           | 141 +++++++++++++++++++++++++++-------
 2 files changed, 116 insertions(+), 33 deletions(-)

diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 24cf157..03a02e9 100644
--- a/include/linux/percpu-refcount.h
+++ b/include/linux/percpu-refcount.h
@@ -76,9 +76,11 @@ struct percpu_ref {
 int __must_check percpu_ref_init(struct percpu_ref *ref,
 				 percpu_ref_func_t *release, gfp_t gfp);
 void percpu_ref_exit(struct percpu_ref *ref);
+void percpu_ref_switch_to_atomic(struct percpu_ref *ref,
+				 percpu_ref_func_t *confirm_switch);
+void percpu_ref_reinit(struct percpu_ref *ref);
 void percpu_ref_kill_and_confirm(struct percpu_ref *ref,
 				 percpu_ref_func_t *confirm_kill);
-void percpu_ref_reinit(struct percpu_ref *ref);
 
 /**
  * percpu_ref_kill - drop the initial ref
@@ -109,7 +111,7 @@ static inline bool __ref_is_percpu(struct percpu_ref *ref,
 	/* paired with smp_store_release() in percpu_ref_reinit() */
 	smp_read_barrier_depends();
 
-	if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC_DEAD))
+	if (unlikely(percpu_ptr & __PERCPU_REF_ATOMIC))
 		return false;
 
 	*percpu_countp = (unsigned long __percpu *)percpu_ptr;
@@ -191,6 +193,8 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref)
 	if (__ref_is_percpu(ref, &percpu_count)) {
 		this_cpu_inc(*percpu_count);
 		ret = true;
+	} else if (!(ACCESS_ONCE(ref->percpu_count_ptr) & __PERCPU_REF_DEAD)) {
+		ret = atomic_long_inc_not_zero(&ref->count);
 	}
 
 	rcu_read_unlock_sched();
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index b0b8c09..56a7c0d 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -1,6 +1,8 @@
 #define pr_fmt(fmt) "%s: " fmt "\n", __func__
 
 #include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
 #include <linux/percpu-refcount.h>
 
 /*
@@ -31,6 +33,8 @@
 
 #define PERCPU_COUNT_BIAS	(1LU << (BITS_PER_LONG - 1))
 
+static DECLARE_WAIT_QUEUE_HEAD(percpu_ref_switch_waitq);
+
 static unsigned long __percpu *percpu_count_ptr(struct percpu_ref *ref)
 {
 	return (unsigned long __percpu *)
@@ -88,7 +92,19 @@ void percpu_ref_exit(struct percpu_ref *ref)
 }
 EXPORT_SYMBOL_GPL(percpu_ref_exit);
 
-static void percpu_ref_kill_rcu(struct rcu_head *rcu)
+static void percpu_ref_call_confirm_rcu(struct rcu_head *rcu)
+{
+	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
+
+	ref->confirm_switch(ref);
+	ref->confirm_switch = NULL;
+	wake_up_all(&percpu_ref_switch_waitq);
+
+	/* drop ref from percpu_ref_switch_to_atomic() */
+	percpu_ref_put(ref);
+}
+
+static void percpu_ref_switch_to_atomic_rcu(struct rcu_head *rcu)
 {
 	struct percpu_ref *ref = container_of(rcu, struct percpu_ref, rcu);
 	unsigned long __percpu *percpu_count = percpu_count_ptr(ref);
@@ -116,47 +132,79 @@ static void percpu_ref_kill_rcu(struct rcu_head *rcu)
 	atomic_long_add((long)count - PERCPU_COUNT_BIAS, &ref->count);
 
 	WARN_ONCE(atomic_long_read(&ref->count) <= 0,
-		  "percpu ref (%pf) <= 0 (%ld) after killed",
+		  "percpu ref (%pf) <= 0 (%ld) after switching to atomic",
 		  ref->release, atomic_long_read(&ref->count));
 
-	/* @ref is viewed as dead on all CPUs, send out kill confirmation */
-	if (ref->confirm_switch)
-		ref->confirm_switch(ref);
+
Re: boot stall regression due to blk-mq: use percpu_ref for mq usage count
Jens, can we simply get these commits reverted from now if there's no better fix? I'd hate to have this boot stall in the first kernel with blk-mq support for scsi. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 8/9] percpu_ref: add PERCPU_REF_INIT_* flags
With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC: start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. Signed-off-by: Tejun Heo Cc: Kent Overstreet Cc: Jens Axboe Cc: Christoph Hellwig Cc: Johannes Weiner --- block/blk-mq.c | 2 +- fs/aio.c| 4 ++-- include/linux/percpu-refcount.h | 18 +- kernel/cgroup.c | 7 --- lib/percpu-refcount.c | 24 +++- 5 files changed, 43 insertions(+), 12 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 702df07..3f6e6f5 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -1777,7 +1777,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set) goto err_hctxs; if (percpu_ref_init(>mq_usage_counter, blk_mq_usage_counter_release, - GFP_KERNEL)) + 0, GFP_KERNEL)) goto err_map; setup_timer(>timeout, blk_mq_rq_timer, (unsigned long) q); diff --git a/fs/aio.c b/fs/aio.c index 93fbcc0f..9b6d5d6 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -666,10 +666,10 @@ static struct kioctx *ioctx_alloc(unsigned nr_events) INIT_LIST_HEAD(>active_reqs); - if (percpu_ref_init(>users, free_ioctx_users, GFP_KERNEL)) + if (percpu_ref_init(>users, free_ioctx_users, 0, GFP_KERNEL)) goto err; - if (percpu_ref_init(>reqs, free_ioctx_reqs, GFP_KERNEL)) + if (percpu_ref_init(>reqs, 
free_ioctx_reqs, 0, GFP_KERNEL)) goto err; ctx->cpu = alloc_percpu(struct kioctx_cpu); diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index e41ca20..5f84bf0 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -61,6 +61,21 @@ enum { __PERCPU_REF_ATOMIC_DEAD = __PERCPU_REF_ATOMIC | __PERCPU_REF_DEAD, }; +/* @flags for percpu_ref_init() */ +enum { + /* +* Start w/ ref == 1 in atomic mode. Can be switched to percpu +* operation using percpu_ref_switch_to_percpu(). +*/ + PERCPU_REF_INIT_ATOMIC = 1 << 0, + + /* +* Start dead w/ ref == 0 in atomic mode. Must be revived with +* percpu_ref_reinit() before used. Implies INIT_ATOMIC. +*/ + PERCPU_REF_INIT_DEAD= 1 << 1, +}; + struct percpu_ref { atomic_long_t count; /* @@ -74,7 +89,8 @@ struct percpu_ref { }; int __must_check percpu_ref_init(struct percpu_ref *ref, -percpu_ref_func_t *release, gfp_t gfp); +percpu_ref_func_t *release, unsigned int flags, +gfp_t gfp); void percpu_ref_exit(struct percpu_ref *ref); void percpu_ref_switch_to_atomic(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch); diff --git a/kernel/cgroup.c b/kernel/cgroup.c index 589b4d8..e2fbcc1 100644 --- a/kernel/cgroup.c +++ b/kernel/cgroup.c @@ -1628,7 +1628,8 @@ static int cgroup_setup_root(struct cgroup_root *root, unsigned int ss_mask) goto out; root_cgrp->id = ret; - ret = percpu_ref_init(_cgrp->self.refcnt, css_release, GFP_KERNEL); + ret = percpu_ref_init(_cgrp->self.refcnt, css_release, 0, + GFP_KERNEL); if (ret) goto out; @@ -4487,7 +4488,7 @@ static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss, init_and_link_css(css, ss, cgrp); - err = percpu_ref_init(>refcnt, css_release, GFP_KERNEL); + err = percpu_ref_init(>refcnt, css_release, 0, GFP_KERNEL); if (err) goto err_free_css; @@ -4555,7 +4556,7 @@ static int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, goto out_unlock; } - ret = percpu_ref_init(>self.refcnt, css_release, GFP_KERNEL); + ret = 
percpu_ref_init(>self.refcnt, css_release, 0, GFP_KERNEL); if (ret) goto out_free_cgrp; diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index 548b19e..74ec33e 100644 --- a/lib/percpu-refcount.c +++
[PATCH 3/9] percpu_ref: replace pcpu_ prefix with percpu_
percpu_ref uses pcpu_ prefix for internal stuff and percpu_ for externally visible ones. This is the same convention used in the percpu allocator implementation. It works fine there but percpu_ref doesn't have too much internal-only stuff and scattered usages of pcpu_ prefix are confusing than helpful. This patch replaces all pcpu_ prefixes with percpu_. This is pure rename and there's no functional change. Note that PCPU_REF_DEAD is renamed to __PERCPU_REF_DEAD to signify that the flag is internal. Signed-off-by: Tejun Heo Cc: Kent Overstreet --- include/linux/percpu-refcount.h | 46 - lib/percpu-refcount.c | 56 + 2 files changed, 52 insertions(+), 50 deletions(-) diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index d44b027..3d463a3 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -13,7 +13,7 @@ * * The refcount will have a range of 0 to ((1U << 31) - 1), i.e. one bit less * than an atomic_t - this is because of the way shutdown works, see - * percpu_ref_kill()/PCPU_COUNT_BIAS. + * percpu_ref_kill()/PERCPU_COUNT_BIAS. * * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill() @@ -60,7 +60,7 @@ struct percpu_ref { * The low bit of the pointer indicates whether the ref is in percpu * mode; if set, then get/put will manipulate the atomic_t. */ - unsigned long pcpu_count_ptr; + unsigned long percpu_count_ptr; percpu_ref_func_t *release; percpu_ref_func_t *confirm_kill; struct rcu_head rcu; @@ -88,26 +88,26 @@ static inline void percpu_ref_kill(struct percpu_ref *ref) return percpu_ref_kill_and_confirm(ref, NULL); } -#define PCPU_REF_DEAD 1 +#define __PERCPU_REF_DEAD 1 /* * Internal helper. Don't use outside percpu-refcount proper. 
The * function doesn't return the pointer and let the caller test it for NULL * because doing so forces the compiler to generate two conditional - * branches as it can't assume that @ref->pcpu_count is not NULL. + * branches as it can't assume that @ref->percpu_count is not NULL. */ -static inline bool __pcpu_ref_alive(struct percpu_ref *ref, - unsigned long __percpu **pcpu_countp) +static inline bool __percpu_ref_alive(struct percpu_ref *ref, + unsigned long __percpu **percpu_countp) { - unsigned long pcpu_ptr = ACCESS_ONCE(ref->pcpu_count_ptr); + unsigned long percpu_ptr = ACCESS_ONCE(ref->percpu_count_ptr); /* paired with smp_store_release() in percpu_ref_reinit() */ smp_read_barrier_depends(); - if (unlikely(pcpu_ptr & PCPU_REF_DEAD)) + if (unlikely(percpu_ptr & __PERCPU_REF_DEAD)) return false; - *pcpu_countp = (unsigned long __percpu *)pcpu_ptr; + *percpu_countp = (unsigned long __percpu *)percpu_ptr; return true; } @@ -121,12 +121,12 @@ static inline bool __pcpu_ref_alive(struct percpu_ref *ref, */ static inline void percpu_ref_get(struct percpu_ref *ref) { - unsigned long __percpu *pcpu_count; + unsigned long __percpu *percpu_count; rcu_read_lock_sched(); - if (__pcpu_ref_alive(ref, _count)) - this_cpu_inc(*pcpu_count); + if (__percpu_ref_alive(ref, _count)) + this_cpu_inc(*percpu_count); else atomic_long_inc(>count); @@ -144,13 +144,13 @@ static inline void percpu_ref_get(struct percpu_ref *ref) */ static inline bool percpu_ref_tryget(struct percpu_ref *ref) { - unsigned long __percpu *pcpu_count; + unsigned long __percpu *percpu_count; int ret; rcu_read_lock_sched(); - if (__pcpu_ref_alive(ref, _count)) { - this_cpu_inc(*pcpu_count); + if (__percpu_ref_alive(ref, _count)) { + this_cpu_inc(*percpu_count); ret = true; } else { ret = atomic_long_inc_not_zero(>count); @@ -178,13 +178,13 @@ static inline bool percpu_ref_tryget(struct percpu_ref *ref) */ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref) { - unsigned long __percpu *pcpu_count; + 
unsigned long __percpu *percpu_count; int ret = false; rcu_read_lock_sched(); - if (__pcpu_ref_alive(ref, _count)) { - this_cpu_inc(*pcpu_count); + if (__percpu_ref_alive(ref, _count)) { + this_cpu_inc(*percpu_count); ret = true; } @@ -204,12 +204,12 @@ static inline bool percpu_ref_tryget_live(struct percpu_ref *ref) */ static inline void percpu_ref_put(struct percpu_ref *ref) { - unsigned long __percpu *pcpu_count; + unsigned long
[PATCH 7/9] percpu_ref: decouple switching to percpu mode and reinit
percpu_ref has treated the dropping of the base reference and switching to atomic mode as an integral operation; however, there's nothing inherent tying the two together. The use cases for percpu_ref have been expanding continuously. While the current init/kill/reinit/exit model can cover a lot, the coupling of kill/reinit with atomic/percpu mode switching is turning out to be too restrictive for use cases where many percpu_refs are created and destroyed back-to-back with only some of them reaching extended operation. The coupling also makes implementing always-atomic debug mode difficult. This patch separates out percpu mode switching into percpu_ref_switch_to_percpu() and reimplements percpu_ref_reinit() on top of it. * DEAD still requires ATOMIC. A dead ref can't be switched to percpu mode w/o going through reinit. Signed-off-by: Tejun Heo Cc: Kent Overstreet Cc: Jens Axboe Cc: Christoph Hellwig Cc: Johannes Weiner --- include/linux/percpu-refcount.h | 3 +- lib/percpu-refcount.c | 73 ++--- 2 files changed, 56 insertions(+), 20 deletions(-) diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h index 03a02e9..e41ca20 100644 --- a/include/linux/percpu-refcount.h +++ b/include/linux/percpu-refcount.h @@ -78,9 +78,10 @@ int __must_check percpu_ref_init(struct percpu_ref *ref, void percpu_ref_exit(struct percpu_ref *ref); void percpu_ref_switch_to_atomic(struct percpu_ref *ref, percpu_ref_func_t *confirm_switch); -void percpu_ref_reinit(struct percpu_ref *ref); +void percpu_ref_switch_to_percpu(struct percpu_ref *ref); void percpu_ref_kill_and_confirm(struct percpu_ref *ref, percpu_ref_func_t *confirm_kill); +void percpu_ref_reinit(struct percpu_ref *ref); /** * percpu_ref_kill - drop the initial ref diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c index 56a7c0d..548b19e 100644 --- a/lib/percpu-refcount.c +++ b/lib/percpu-refcount.c @@ -206,40 +206,54 @@ void percpu_ref_switch_to_atomic(struct percpu_ref *ref, 
__percpu_ref_switch_to_atomic(ref, confirm_switch); } -/** - * percpu_ref_reinit - re-initialize a percpu refcount - * @ref: perpcu_ref to re-initialize - * - * Re-initialize @ref so that it's in the same state as when it finished - * percpu_ref_init(). @ref must have been initialized successfully, killed - * and reached 0 but not exited. - * - * Note that percpu_ref_tryget[_live]() are safe to perform on @ref while - * this function is in progress. - */ -void percpu_ref_reinit(struct percpu_ref *ref) +void __percpu_ref_switch_to_percpu(struct percpu_ref *ref) { unsigned long __percpu *percpu_count = percpu_count_ptr(ref); int cpu; BUG_ON(!percpu_count); - WARN_ON_ONCE(!percpu_ref_is_zero(ref)); - atomic_long_set(>count, 1 + PERCPU_COUNT_BIAS); + if (!(ref->percpu_count_ptr & __PERCPU_REF_ATOMIC)) + return; + + wait_event(percpu_ref_switch_waitq, !ref->confirm_switch); + + atomic_long_add(PERCPU_COUNT_BIAS, >count); /* * Restore per-cpu operation. smp_store_release() is paired with * smp_read_barrier_depends() in __ref_is_percpu() and guarantees * that the zeroing is visible to all percpu accesses which can see -* the following __PERCPU_REF_ATOMIC_DEAD clearing. +* the following __PERCPU_REF_ATOMIC clearing. */ for_each_possible_cpu(cpu) *per_cpu_ptr(percpu_count, cpu) = 0; smp_store_release(>percpu_count_ptr, - ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC_DEAD); + ref->percpu_count_ptr & ~__PERCPU_REF_ATOMIC); +} + +/** + * percpu_ref_switch_to_percpu - switch a percpu_ref to percpu mode + * @ref: percpu_ref to switch to percpu mode + * + * There's no reason to use this function for the usual reference counting. + * To re-use an expired ref, use percpu_ref_reinit(). + * + * Switch @ref to percpu mode. This function may be invoked concurrently + * with all the get/put operations and can safely be mixed with kill and + * reinit operations. 
+ * + * This function normally doesn't block and can be called from any context + * but it may block if @ref is in the process of switching to atomic mode + * by percpu_ref_switch_to_atomic(). + */ +void percpu_ref_switch_to_percpu(struct percpu_ref *ref) +{ + /* a dying or dead ref can't be switched to percpu mode w/o reinit */ + if (!(ref->percpu_count_ptr & __PERCPU_REF_DEAD)) + __percpu_ref_switch_to_percpu(ref); } -EXPORT_SYMBOL_GPL(percpu_ref_reinit); /** * percpu_ref_kill_and_confirm - drop the initial ref and schedule confirmation @@ -253,8 +267,8 @@ EXPORT_SYMBOL_GPL(percpu_ref_reinit); * percpu_ref_tryget_live() for details. * * This function normally doesn't block and can be called from any context - * but it may block if
[PATCH] ata: Disabling the async PM for JMicron chips
Similar to commit "ata: Disabling the async PM for JMicron chip 363/361", Barto found the same issue for JMicron chip 368: as with 363/361, the functions have no parent-child relationship, but they do have a power dependency. So exclude the JMicron chips from the pm_async method directly, to avoid further similar issues. Details in: https://bugzilla.kernel.org/show_bug.cgi?id=84861 Reported-and-tested-by: Barto Signed-off-by: Chuansheng Liu --- drivers/ata/ahci.c |6 +++--- drivers/ata/pata_jmicron.c |6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c index a0cc0ed..c096d49 100644 --- a/drivers/ata/ahci.c +++ b/drivers/ata/ahci.c @@ -1345,10 +1345,10 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) * follow the sequence one by one, otherwise one of them can not be * powered on successfully, so here we disable the async suspend * method for these chips. +* JMicron chip 368 has been found to have a similar issue, so we +* exclude the JMicron family directly to avoid other similar issues. */ - if (pdev->vendor == PCI_VENDOR_ID_JMICRON && - (pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 || - pdev->device == PCI_DEVICE_ID_JMICRON_JMB361)) + if (pdev->vendor == PCI_VENDOR_ID_JMICRON) device_disable_async_suspend(&pdev->dev); /* acquire resources */ diff --git a/drivers/ata/pata_jmicron.c b/drivers/ata/pata_jmicron.c index 47e418b..48c993b 100644 --- a/drivers/ata/pata_jmicron.c +++ b/drivers/ata/pata_jmicron.c @@ -149,10 +149,10 @@ static int jmicron_init_one (struct pci_dev *pdev, const struct pci_device_id *i * follow the sequence one by one, otherwise one of them can not be * powered on successfully, so here we disable the async suspend * method for these chips. +* JMicron chip 368 has been found to have a similar issue, so we +* exclude the JMicron family directly to avoid other similar issues.
*/ - if (pdev->vendor == PCI_VENDOR_ID_JMICRON && - (pdev->device == PCI_DEVICE_ID_JMICRON_JMB363 || - pdev->device == PCI_DEVICE_ID_JMICRON_JMB361)) + if (pdev->vendor == PCI_VENDOR_ID_JMICRON) device_disable_async_suspend(&pdev->dev); return ata_pci_bmdma_init_one(pdev, ppi, &jmicron_sht, NULL, 0); -- 1.7.9.5
Re: [PATCH] kernfs: use stack-buf for small writes.
On Tue, Sep 23, 2014 at 03:40:58PM +1000, NeilBrown wrote: > > Oh, I meant the buffer seqfile read op writes to, so it depends on the > > fact that the allocation is only on the first read? That seems > > extremely brittle to me, especially for an issue which tends to be > > difficult to reproduce. > > It is easy for user-space to ensure they read once before any critical time.. Sure, but it's a hard and subtle dependency on an extremely obscure implementation detail. > > I'd much rather keep things direct and make it explicitly allocate r/w > > buffer(s) on open and disallow seq_file operations on such files. > > As far as I can tell, seq_read is used on all sysfs files that are > readable except for 'binary' files. Are you suggesting all files that might > need to be accessed without a kmalloc have to be binary files? kernfs ->direct_read() callback doesn't go through seq_file. sysfs can be extended to support that for regular files, I guess. Or just make those special files binary? > Having to identify those files which are important in advance seems the more > "brittle" approach to me. I would much rather it "just worked" I disagree. The files which shouldn't involve memory allocations must be identified no matter what. They're *very* special. And the rules that userland has to follow seem completely broken to me. "Small" writes are okay, whatever that means, and "small" reads are okay too as long as it isn't the first read. Ooh, BTW, if the second read ends up expanding the initial buffer, it isn't okay - the initial boundary is PAGE_SIZE and the buffer is expanded twice on each overflow. How are these rules okay? This is borderline crazy. In addition, the read path involves a lot more code this way. It ends up locking down buffer policies of the whole seqfile implementation. > Would you prefer a new per-attribute flag which directed sysfs to > pre-allocate a full page, or a 'max_size' attribute which caused a buffer of > that size to be allocated on open? 
> The same size would be used to pre-allocate the seqfile buf (like > single_open_size does) if reads were supported. Yes but I really think we should avoid seqfile dependency. Thanks. -- tejun
Re: [PATCH net-next] mellanox: Change en_print to return void
On 9/22/2014 8:40 PM, Joe Perches wrote: > No caller or macro uses the return value so make it void. > > Signed-off-by: Joe Perches > --- > This change is associated to a desire to eventually > change printk to return void. > > drivers/net/ethernet/mellanox/mlx4/en_main.c | 17 +++-- > drivers/net/ethernet/mellanox/mlx4/mlx4_en.h | 4 ++-- > 2 files changed, 9 insertions(+), 12 deletions(-) Thanks Joe. Acked-By: Amir Vadai
Re: [PATCH v3 1/5] x86, mm, pat: Set WT to PA7 slot of PAT MSR
On 09/17/2014 09:48 PM, Toshi Kani wrote: This patch sets WT to the PA7 slot in the PAT MSR when the processor is not affected by the PAT errata. The PA7 slot is chosen to further minimize the risk of using the PAT bit as the PA3 slot is UC and is not currently used. The following Intel processors are affected by the PAT errata. errata cpuid Pentium 2, A52 family 0x6, model 0x5 Pentium 3, E27 family 0x6, model 0x7, 0x8 Pentium 3 Xenon, G26 family 0x6, model 0x7, 0x8, 0xa Pentium M, Y26 family 0x6, model 0x9 Pentium M 90nm, X9 family 0x6, model 0xd Pentium 4, N46 family 0xf, model 0x0 Instead of making sharp boundary checks, this patch makes conservative checks to exclude all Pentium 2, 3, M and 4 family processors. For such processors, _PAGE_CACHE_MODE_WT is redirected to UC- per the default setup in __cachemode2pte_tbl[]. Signed-off-by: Toshi Kani Reviewed-by: Juergen Gross --- arch/x86/mm/pat.c | 64 + 1 file changed, 49 insertions(+), 15 deletions(-) diff --git a/arch/x86/mm/pat.c b/arch/x86/mm/pat.c index ff31851..db687c3 100644 --- a/arch/x86/mm/pat.c +++ b/arch/x86/mm/pat.c @@ -133,6 +133,7 @@ void pat_init(void) { u64 pat; bool boot_cpu = !boot_pat_state; + struct cpuinfo_x86 *c = _cpu_data; if (!pat_enabled) return; @@ -153,21 +154,54 @@ void pat_init(void) } } - /* Set PWT to Write-Combining. All other bits stay the same */ - /* -* PTE encoding used in Linux: -* PAT -* |PCD -* ||PWT -* ||| -* 000 WB _PAGE_CACHE_WB -* 001 WC _PAGE_CACHE_WC -* 010 UC- _PAGE_CACHE_UC_MINUS -* 011 UC _PAGE_CACHE_UC -* PAT bit unused -*/ - pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) | - PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, UC); + if ((c->x86_vendor == X86_VENDOR_INTEL) && + (((c->x86 == 0x6) && (c->x86_model <= 0xd)) || +((c->x86 == 0xf) && (c->x86_model <= 0x6 { + /* +* PAT support with the lower four entries. Intel Pentium 2, +* 3, M, and 4 are affected by PAT errata, which makes the +* upper four entries unusable. 
We do not use the upper four +* entries for all the affected processor families for safe. +* +* PTE encoding used in Linux: +* PAT +* |PCD +* ||PWT PAT +* |||slot +* 0000WB : _PAGE_CACHE_MODE_WB +* 0011WC : _PAGE_CACHE_MODE_WC +* 0102UC-: _PAGE_CACHE_MODE_UC_MINUS +* 0113UC : _PAGE_CACHE_MODE_UC +* PAT bit unused +* +* NOTE: When WT or WP is used, it is redirected to UC- per +* the default setup in __cachemode2pte_tbl[]. +*/ + pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) | + PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, UC); + } else { + /* +* PAT full support. WT is set to slot 7, which minimizes +* the risk of using the PAT bit as slot 3 is UC and is +* currently unused. Slot 4 should remain as reserved. +* +* PTE encoding used in Linux: +* PAT +* |PCD +* ||PWT PAT +* |||slot +* 0000WB : _PAGE_CACHE_MODE_WB +* 0011WC : _PAGE_CACHE_MODE_WC +* 0102UC-: _PAGE_CACHE_MODE_UC_MINUS +* 0113UC : _PAGE_CACHE_MODE_UC +* 1004 +* 1015 +* 1106 +* 1117WT : _PAGE_CACHE_MODE_WT +*/ + pat = PAT(0, WB) | PAT(1, WC) | PAT(2, UC_MINUS) | PAT(3, UC) | + PAT(4, WB) | PAT(5, WC) | PAT(6, UC_MINUS) | PAT(7, WT); + } /* Boot CPU check */ if (!boot_pat_state) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] x86 fixes
That would be my guess, too. On September 22, 2014 10:37:11 PM PDT, Ingo Molnar wrote: > >* Ingo Molnar wrote: > >> >> * Linus Torvalds wrote: >> >> > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar >wrote: >> > > >> > > Please pull the latest x86-urgent-for-linus git tree from: >> > >> > I only just noticed, but this pull request causes my Sony Vaio >> > laptop to immediately reboot at startup. >> > >> > I'm assuming it's one of the efi changes, but I'm bisecting now >> > to say exactly where it happens. It will get reverted. >> >> I've Cc:-ed Matt. >> >> My guess would be one of these two EFI commits: >> >> * Fix early boot regression affecting x86 EFI boot stub when >loading >> initrds above 4GB - Yinghai Lu >> >> 47226ad4f4cf x86/efi: Only load initrd above 4g on second try >> >> * Relocate GOT entries in the x86 EFI boot stub now that we >have >> symbols with global visibility - Matt Fleming >> >> 9cb0e394234d x86/efi: Fixup GOT in all boot code paths >> >> If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug >> in the assembly code. If so then we'd have to revert this, and >> reintroduce another regression, caused by EFI commit >> f23cf8bd5c1f49 in this merge window. The most recent commit is >> easy to revert, the older one not. >> >> If it's 47226ad4f4cf then we'd reintroduce the regression caused >> by 4bf7111f501 in the previous merge window. They both revert >> cleanly after each other - but it might be safer to just revert >> the most recent one. >> >> My guess is that your regression is caused by 47226ad4f4cf. > >Wrong sha1: my guess is on 9cb0e394234d, the GOT fixup. > >Thanks, > > Ingo -- Sent from my mobile phone. Please pardon brevity and lack of formatting. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: linux-next: manual merge of the tiny tree with the tip tree
* Stephen Rothwell wrote: > Hi Josh, > > Today's linux-next merge of the tiny tree got conflicts in > arch/x86/kernel/process_32.c and arch/x86/kernel/process_64.c between > commits dc56c0f9b870 ("x86, fpu: Shift "fpu_counter = 0" from > copy_thread() to arch_dup_task_struct()") and 6f46b3aef003 ("x86: > copy_thread: Don't nullify ->ptrace_bps twice") from the tip tree and > commits a1cf09f93e66 ("x86: process: Unify 32-bit and 64-bit > copy_thread I/O bitmap handling") and e4a191d1e05b ("x86: Support > compiling out userspace I/O (iopl and ioperm)") from the tiny tree. Why are such changes in the 'tiny' tree? These are sensitive arch/x86 files, and any unification and compilation-out support patches need to go through the proper review channels and be merged upstream via the x86 tree if accepted... In particular the gratuitous sprinkling of #ifdef CONFIG_X86_IOPORTs around x86 code looks ugly. Josh, don't do that, this route is really unacceptable. Please resubmit the latest patches and remove these from linux-next. Thanks, Ingo
Re: [PATCH] kernfs: use stack-buf for small writes.
On Tue, 23 Sep 2014 00:55:49 -0400 Tejun Heo wrote: > Hello, Neil. > > On Tue, Sep 23, 2014 at 02:46:50PM +1000, NeilBrown wrote: > > seqfile is only safe for reads. sysfs via kernfs uses seq_read(), so there > > is only a single allocation on the first read. > > > > It doesn't really related to fixing writes, except to point out that only > > writes need to be "fixed". Reads already work. > > Oh, I meant the buffer seqfile read op writes to, so it depends on the > fact that the allocation is only on the first read? That seems > extremely brittle to me, especially for an issue which tends to be > difficult to reproduce. It is easy for user-space to ensure they read once before any critical time.. > > > Separately: > > > > > Ugh... :( If this can't be avoided at all, I'd much prefer it to be > > > something explicit - a flag marking the file as needing a persistent > > > write buffer which is allocated on open. "Small" writes on stack > > > feels way to implicit to me. > > > > How about if we add seq_getbuf() and seq_putbuf() to seqfile > > which takes a 'struct seq_file' and a size and returns the ->buf > > after making sure it is big enough. > > It also claims and releases the seqfile ->lock. > > > > Then we would be using the same buffer for reads and write. > > > > Does that sound suitable? It uses existing infrastructure and avoids having > > to identify in advance which attributes it is important for. > > I'd much rather keep things direct and make it explicitly allocate r/w > buffer(s) on open and disallow seq_file operations on such files. As far as I can tell, seq_read is used on all sysfs files that are readable except for 'binary' files. Are you suggesting all files that might need to be accessed without a kmalloc have to be binary files? Having to identify those files which are important in advance seems the more "brittle" approach to me. 
I would much rather it "just worked". Would you prefer a new per-attribute flag which directed sysfs to pre-allocate a full page, or a 'max_size' attribute which caused a buffer of that size to be allocated on open? The same size would be used to pre-allocate the seqfile buf (like single_open_size does) if reads were supported. Thanks, NeilBrown
Re: [PATCH] i2c: move acpi code back into the core
> Sorry for later response due to sickness. I can't write this patch in > time. Sorry again. I will test it soon. Oh, get well soon! Please say so next time, so I know.
Re: [GIT PULL] x86 fixes
* Ingo Molnar wrote: > > * Linus Torvalds wrote: > > > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar wrote: > > > > > > Please pull the latest x86-urgent-for-linus git tree from: > > > > I only just noticed, but this pull request causes my Sony Vaio > > laptop to immediately reboot at startup. > > > > I'm assuming it's one of the efi changes, but I'm bisecting now > > to say exactly where it happens. It will get reverted. > > I've Cc:-ed Matt. > > My guess would be one of these two EFI commits: > > * Fix early boot regression affecting x86 EFI boot stub when loading > initrds above 4GB - Yinghai Lu > > 47226ad4f4cf x86/efi: Only load initrd above 4g on second try > > * Relocate GOT entries in the x86 EFI boot stub now that we have > symbols with global visibility - Matt Fleming > > 9cb0e394234d x86/efi: Fixup GOT in all boot code paths > > If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug > in the assembly code. If so then we'd have to revert this, and > reintroduce another regression, caused by EFI commit > f23cf8bd5c1f49 in this merge window. The most recent commit is > easy to revert, the older one not. > > If it's 47226ad4f4cf then we'd reintroduce the regression caused > by 4bf7111f501 in the previous merge window. They both revert > cleanly after each other - but it might be safer to just revert > the most recent one. > > My guess is that your regression is caused by 47226ad4f4cf. Wrong sha1: my guess is on 9cb0e394234d, the GOT fixup. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] x86 fixes
* Linus Torvalds wrote: > On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar wrote: > > > > Please pull the latest x86-urgent-for-linus git tree from: > > I only just noticed, but this pull request causes my Sony Vaio > laptop to immediately reboot at startup. > > I'm assuming it's one of the efi changes, but I'm bisecting now > to say exactly where it happens. It will get reverted. I've Cc:-ed Matt. My guess would be one of these two EFI commits: * Fix early boot regression affecting x86 EFI boot stub when loading initrds above 4GB - Yinghai Lu 47226ad4f4cf x86/efi: Only load initrd above 4g on second try * Relocate GOT entries in the x86 EFI boot stub now that we have symbols with global visibility - Matt Fleming 9cb0e394234d x86/efi: Fixup GOT in all boot code paths If it's 9cb0e394234d - then it's perhaps a build quirk, or a bug in the assembly code. If so then we'd have to revert this, and reintroduce another regression, caused by EFI commit f23cf8bd5c1f49 in this merge window. The most recent commit is easy to revert, the older one not. If it's 47226ad4f4cf then we'd reintroduce the regression caused by 4bf7111f501 in the previous merge window. They both revert cleanly after each other - but it might be safer to just revert the most recent one. My guess is that your regression is caused by 47226ad4f4cf. Sorry about this, the timing is unfortunate. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V3 0/3] x86: Full support of PAT
Hi, any chance to have this in 3.18? Juergen On 09/12/2014 12:35 PM, Juergen Gross wrote: The x86 architecture offers via the PAT (Page Attribute Table) a way to specify different caching modes in page table entries. The PAT MSR contains 8 entries each specifying one of 6 possible cache modes. A pte references one of those entries via 3 bits: _PAGE_PAT, _PAGE_PWT and _PAGE_PCD. The Linux kernel currently supports only 4 different cache modes. The PAT MSR is set up in a way that the setting of _PAGE_PAT in a pte doesn't matter: the top 4 entries in the PAT MSR are the same as the 4 lower entries. This results in the kernel not supporting e.g. write-through mode. Especially this cache mode would speed up drivers of video cards which now have to use uncached accesses. OTOH some old processors (Pentium) don't support PAT correctly and the Xen hypervisor has been using a different PAT MSR configuration for some time now and can't change that as this setting is part of the ABI. This patch set abstracts the cache mode from the pte and introduces tables to translate between cache mode and pte bits (the default cache mode "write back" is hard-wired to PAT entry 0). The tables are statically initialized with values being compatible to old processors and current usage. As soon as the PAT MSR is changed (or - in case of Xen - is read at boot time) the tables are changed accordingly. Requests of mappings with special cache modes are always possible now, in case they are not supported there will be a fallback to a compatible but slower mode. 
Summing it up, this patch set adds the following features: - capability to support WT and WP cache modes on processors with full PAT support - processors with no or incorrect PAT support are still working as today, even if WT or WP cache mode are selected by drivers for some pages - reduction of Xen special handling regarding cache mode Changes in V3: - corrected two minor nits (UC_MINUS, again) detected by Toshi Kani Changes in V2: - simplified handling of PAT MSR write under Xen as suggested by David Vrabel - removed resetting of pat_enabled under Xen - two small corrections requested by Toshi Kani (UC_MINUS cache mode in vermilion driver, fix 32 bit kernel build failure) - correct build error on non-x86 arch by moving definition of update_cache_mode_entry() to x86 specific header Changes since RFC: - renamed functions and variables as suggested by Toshi Kani - corrected cache mode bits for WT and WP - modified handling of PAT MSR write under Xen as suggested by Jan Beulich Juergen Gross (3): x86: Make page cache mode a real type x86: Enable PAT to use cache mode translation tables Support Xen pv-domains using PAT arch/x86/include/asm/cacheflush.h | 38 --- arch/x86/include/asm/fb.h | 6 +- arch/x86/include/asm/io.h | 2 +- arch/x86/include/asm/pat.h| 7 +- arch/x86/include/asm/pgtable.h| 19 ++-- arch/x86/include/asm/pgtable_types.h | 96 arch/x86/mm/dump_pagetables.c | 24 ++-- arch/x86/mm/init.c| 37 ++ arch/x86/mm/init_64.c | 9 +- arch/x86/mm/iomap_32.c| 15 ++- arch/x86/mm/ioremap.c | 63 ++- arch/x86/mm/mm_internal.h | 2 + arch/x86/mm/pageattr.c| 84 -- arch/x86/mm/pat.c | 181 +++--- arch/x86/mm/pat_internal.h| 22 ++-- arch/x86/mm/pat_rbtree.c | 8 +- arch/x86/pci/i386.c | 4 +- arch/x86/xen/enlighten.c | 25 ++--- arch/x86/xen/mmu.c| 48 +--- arch/x86/xen/xen-ops.h| 1 - drivers/video/fbdev/gbefb.c | 3 +- drivers/video/fbdev/vermilion/vermilion.c | 6 +- 22 files changed, 421 insertions(+), 279 deletions(-)
Re: [f2fs-dev] [PATCH 08/10] f2fs: remove redundant operation during roll-forward recovery
Hi Chao, I fixed that. :) Thanks, On Mon, Sep 22, 2014 at 05:22:27PM +0800, Chao Yu wrote: > > -Original Message- > > From: Jaegeuk Kim [mailto:jaeg...@kernel.org] > > Sent: Monday, September 15, 2014 6:14 AM > > To: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org; > > linux-f2fs-de...@lists.sourceforge.net > > Cc: Jaegeuk Kim > > Subject: [f2fs-dev] [PATCH 08/10] f2fs: remove redundant operation during > > roll-forward recovery > > > > If same data is updated multiple times, we don't need to redo whole the > > operations. > > Let's just update the lastest one. > > Reviewed-by: Chao Yu > > And one comment as following. > > > > > Signed-off-by: Jaegeuk Kim > > --- > > fs/f2fs/f2fs.h | 4 +++- > > fs/f2fs/recovery.c | 41 + > > 2 files changed, 20 insertions(+), 25 deletions(-) > > > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > > index 48d7d46..74dde99 100644 > > --- a/fs/f2fs/f2fs.h > > +++ b/fs/f2fs/f2fs.h > > @@ -137,7 +137,9 @@ struct discard_entry { > > struct fsync_inode_entry { > > struct list_head list; /* list head */ > > struct inode *inode;/* vfs inode pointer */ > > - block_t blkaddr;/* block address locating the last inode */ > > + block_t blkaddr;/* block address locating the last fsync */ > > + block_t last_dentry;/* block address locating the last dentry */ > > + block_t last_inode; /* block address locating the last inode */ > > }; > > > > #define nats_in_cursum(sum)(le16_to_cpu(sum->n_nats)) > > diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c > > index 6f7fbfa..95d9dc9 100644 > > --- a/fs/f2fs/recovery.c > > +++ b/fs/f2fs/recovery.c > > @@ -66,7 +66,7 @@ static struct fsync_inode_entry *get_fsync_inode(struct > > list_head *head, > > return NULL; > > } > > > > -static int recover_dentry(struct page *ipage, struct inode *inode) > > +static int recover_dentry(struct inode *inode, struct page *ipage) > > { > > struct f2fs_inode *raw_inode = F2FS_INODE(ipage); > > nid_t pino = le32_to_cpu(raw_inode->i_pino); > > @@ -140,7 +140,7 @@ 
out: > > return err; > > } > > > > -static void __recover_inode(struct inode *inode, struct page *page) > > +static void recover_inode(struct inode *inode, struct page *page) > > { > > struct f2fs_inode *raw = F2FS_INODE(page); > > > > @@ -152,21 +152,9 @@ static void __recover_inode(struct inode *inode, > > struct page *page) > > inode->i_atime.tv_nsec = le32_to_cpu(raw->i_mtime_nsec); > > inode->i_ctime.tv_nsec = le32_to_cpu(raw->i_ctime_nsec); > > inode->i_mtime.tv_nsec = le32_to_cpu(raw->i_mtime_nsec); > > -} > > - > > -static int recover_inode(struct inode *inode, struct page *node_page) > > -{ > > - if (!IS_INODE(node_page)) > > - return 0; > > - > > - __recover_inode(inode, node_page); > > - > > - if (is_dent_dnode(node_page)) > > - return recover_dentry(node_page, inode); > > > > f2fs_msg(inode->i_sb, KERN_NOTICE, "recover_inode: ino = %x, name = %s", > > - ino_of_node(node_page), F2FS_INODE(node_page)->i_name); > > - return 0; > > + ino_of_node(page), F2FS_INODE(page)->i_name); > > } > > > > static int find_fsync_dnodes(struct f2fs_sb_info *sbi, struct list_head > > *head) > > @@ -214,12 +202,11 @@ static int find_fsync_dnodes(struct f2fs_sb_info > > *sbi, struct list_head > > *head) > > } > > > > /* add this fsync inode to the list */ > > - entry = kmem_cache_alloc(fsync_entry_slab, GFP_NOFS); > > + entry = kmem_cache_alloc(fsync_entry_slab, > > GFP_F2FS_ZERO); > > if (!entry) { > > err = -ENOMEM; > > break; > > } > > - > > /* > > * CP | dnode(F) | inode(DF) > > * For this case, we should not give up now. 
> > @@ -236,9 +223,11 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, > > struct list_head > > *head) > > } > > entry->blkaddr = blkaddr; > > > > - err = recover_inode(entry->inode, page); > > - if (err && err != -ENOENT) > > - break; > > + if (IS_INODE(page)) { > > + entry->last_inode = blkaddr; > > + if (is_dent_dnode(page)) > > + entry->last_dentry = blkaddr; > > + } > > next: > > /* check next segment */ > > blkaddr = next_blkaddr_of_node(page); > > @@ -455,11 +444,15 @@ static int recover_data(struct f2fs_sb_info *sbi, > > /* > > * inode(x) | CP | inode(x) | dnode(F) > > * In this case, we can lose the latest inode(x). > > -* So, call __recover_inode for the inode update. > > +* So, call recover_inode for the inode
Re: [PATCH 0/4] ipc/shm.c: increase the limits for SHMMAX, SHMALL
On 06/03/2014 09:26 PM, Davidlohr Bueso wrote: > On Fri, 2014-05-02 at 15:16 +0200, Michael Kerrisk (man-pages) wrote: >> Hi Manfred, >> >> On Mon, Apr 21, 2014 at 4:26 PM, Manfred Spraul >> wrote: >>> Hi all, >>> >>> the increase of SHMMAX/SHMALL is now a 4 patch series. >>> I don't have ideas how to improve it further. >> >> On the assumption that your patches are heading to mainline, could you >> send me a man-pages patch for the changes? > > It seems we're still behind here and the 3.16 merge window is already > opened. Please consider this, and again feel free to add/modify as > necessary. I think adding a note as below is enough and was hesitant to > add a lot of details... Thanks. > > 8<-- > From: Davidlohr Bueso > Subject: [PATCH] shmget.2: document new limits for shmmax/shmall > > These limits have been recently enlarged and > modifying them is no longer really necessary. > Update the manpage. > > Signed-off-by: Davidlohr Bueso > --- > man2/shmget.2 | 11 +++ > 1 file changed, 11 insertions(+) > > diff --git a/man2/shmget.2 b/man2/shmget.2 > index f781048..77764ea 100644 > --- a/man2/shmget.2 > +++ b/man2/shmget.2 > @@ -299,6 +299,11 @@ with 8kB page size, it yields 2^20 (1048576). > > On Linux, this limit can be read and modified via > .IR /proc/sys/kernel/shmall . > +As of Linux 3.16, the default value for this limit is increased to > +.B ULONG_MAX - 2^24 > +pages, which is as large as it can be without helping userspace overflow > +the values. Modifying this limit is therefore discouraged. This is suitable > +for both 32 and 64-bit systems. > .TP > .B SHMMAX > Maximum size in bytes for a shared memory segment. > @@ -306,6 +311,12 @@ Since Linux 2.2, the default value of this limit is > 0x200 (32MB). > > On Linux, this limit can be read and modified via > .IR /proc/sys/kernel/shmmax . 
> +As of Linux 3.16, the default value for this limit is increased from 32Mb > +to > +.B ULONG_MAX - 2^24 > +bytes, which is as large as it can be without helping userspace overflow > +the values. Modifying this limit is therefore discouraged. This is suitable > +for both 32 and 64-bit systems. > .TP > .B SHMMIN > Minimum size in bytes for a shared memory segment: implementation David, I applied various pieces from your patch on top of material that I already had, so that now we have the text below describing these limits. Comments/suggestions/improvements from all welcome. Cheers, Michael SHMALL System-wide limit on the number of pages of shared memory. On Linux, this limit can be read and modified via /proc/sys/kernel/shmall. Since Linux 3.16, the default value for this limit is: ULONG_MAX - 2^24 The effect of this value (which is suitable for both 32-bit and 64-bit systems) is to impose no limitation on allocations. This value, rather than ULONG_MAX, was cho‐ sen as the default to prevent some cases where historical applications simply raised the existing limit without first checking its current value. Such applications would cause the value to overflow if the limit was set at ULONG_MAX. From Linux 2.4 up to Linux 3.15, the default value for this limit was: SHMMAX / PAGE_SIZE * (SHMMNI / 16) If SHMMAX and SHMMNI were not modified, then multiplying the result of this formula by the page size (to get a value in bytes) yielded a value of 8 GB as the limit on the total memory used by all shared memory segments. SHMMAX Maximum size in bytes for a shared memory segment. On Linux, this limit can be read and modified via /proc/sys/kernel/shmmax. Since Linux 3.16, the default value for this limit is: ULONG_MAX - 2^24 The effect of this value (which is suitable for both 32-bit and 64-bit systems) is to impose no limitation on allocations. See the description of SHMALL for a discus‐ sion of why this default value (rather than ULONG_MAX) is used. 
From Linux 2.2 up to Linux 3.15, the default value of this limit was 0x200 (32MB). Because it is not possible to map just part of a shared memory segment, the amount of virtual memory places another limit on the maximum size of a usable segment: for example, on i386 the largest segments that can be mapped have a size of around 2.8 GB, and on x86_64 the limit is around 127 TB. -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Re: [GIT PULL rcu/next] RCU commits for 3.18
* Paul E. McKenney wrote: > Hello, Ingo, > > The changes in this series include: > > 1.Update RCU documentation. These were posted to LKML at > https://lkml.org/lkml/2014/8/28/378. > > 2.Miscellaneous fixes. These were posted to LKML at > https://lkml.org/lkml/2014/8/28/386. An additional fix that > eliminates a documented (but now inconvenient) deadlock between > RCU hotplug and expedited grace periods was posted at > https://lkml.org/lkml/2014/8/28/573. > > 3.Changes related to No-CBs CPUs and NO_HZ_FULL. These were posted > to LKML at https://lkml.org/lkml/2014/8/28/412. > > 4.Torture-test updates. These were posted to LKML at > https://lkml.org/lkml/2014/8/28/546 and at > https://lkml.org/lkml/2014/9/11/1114. > > 5.RCU-tasks implementation. These were posted to LKML at > https://lkml.org/lkml/2014/8/28/540. > > All of these have been exposed to -next testing. > These changes are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git rcu/next > > for you to fetch changes up to dd56af42bd829c6e770ed69812bd65a04eaeb1e4: > > rcu: Eliminate deadlock between CPU hotplug and expedited grace periods > (2014-09-18 16:22:27 -0700) > > > Ard Biesheuvel (1): > rcu: Define tracepoint strings only if CONFIG_TRACING is set > > Davidlohr Bueso (9): > locktorture: Rename locktorture_runnable parameter > locktorture: Add documentation > locktorture: Support mutexes > locktorture: Teach about lock debugging > locktorture: Make statistics generic > torture: Address race in module cleanup > locktorture: Add infrastructure for torturing read locks > locktorture: Support rwsems > locktorture: Introduce torture context > > Joe Perches (1): > rcu: Use pr_alert/pr_cont for printing logs > > Oleg Nesterov (1): > rcu: Uninline rcu_read_lock_held() > > Paul E. 
McKenney (46): > memory-barriers: Fix control-ordering no-transitivity example > memory-barriers: Retain barrier() in fold-to-zero example > memory-barriers: Fix description of 2-legged-if-based control > dependencies > rcu: Break more call_rcu() deadlock involving scheduler and perf > rcu: Make TINY_RCU tinier by putting error checks under #ifdef > rcu: Replace flush_signals() with WARN_ON(signal_pending()) > rcu: Add step to initrd documentation > rcutorture: Test partial nohz_full= configuration > rcutorture: Specify MAXSMP=y for TREE01 > rcutorture: Specify CONFIG_CPUMASK_OFFSTACK=y for TREE07 > rcutorture: Add callback-flood test > torture: Print PID in hung-kernel diagnostic message > torture: Check for nul bytes in console output > rcu: Add call_rcu_tasks() > rcu: Provide cond_resched_rcu_qs() to force quiescent states in long > loops > rcu: Add synchronous grace-period waiting for RCU-tasks > rcu: Make TASKS_RCU handle tasks that are almost done exiting > rcutorture: Add torture tests for RCU-tasks > rcutorture: Add RCU-tasks test cases > rcu: Add stall-warning checks for RCU-tasks > rcu: Improve RCU-tasks energy efficiency > documentation: Add verbiage on RCU-tasks stall warning messages > rcu: Defer rcu_tasks_kthread() creation till first call_rcu_tasks() > rcu: Make TASKS_RCU handle nohz_full= CPUs > rcu: Make rcu_tasks_kthread()'s GP-wait loop allow preemption > rcu: Remove redundant preempt_disable() from > rcu_note_voluntary_context_switch() > rcu: Additional information on RCU-tasks stall-warning messages > rcu: Remove local_irq_disable() in rcu_preempt_note_context_switch() > rcu: Per-CPU operation cleanups to rcu_*_qs() functions > rcutorture: Add RCU-tasks tests to default rcutorture list > rcu: Fix attempt to avoid unsolicited offloading of callbacks > rcu: Rationalize kthread spawning > rcu: Create rcuo kthreads only for onlined CPUs > rcu: Eliminate redundant rcu_sysidle_state variable > rcu: Don't track sysidle state if no nohz_full= CPUs > 
rcu: Avoid misordering in __call_rcu_nocb_enqueue() > rcu: Handle NOCB callbacks from irq-disabled idle code > rcu: Avoid misordering in nocb_leader_wait() > Merge branches 'doc.2014.09.07a', 'fixes.2014.09.10a', > 'nocb-nohz.2014.09.16b' and 'torture.2014.09.07a' into HEAD > Merge branch 'rcu-tasks.2014.09.10a' into HEAD > locktorture: Make torture scripting account for new _runnable name > locktorture: Add test scenario for mutex_lock > locktorture: Add test scenario for rwsem_lock > rcutorture: Rename rcutorture_runnable parameter > locktorture: Document boot/module parameters > rcu: Eliminate deadlock between CPU hotplug and expedited grace periods >
Re: [GIT PULL] x86 fixes
On Fri, Sep 19, 2014 at 3:40 AM, Ingo Molnar wrote: > > Please pull the latest x86-urgent-for-linus git tree from: I only just noticed, but this pull request causes my Sony Vaio laptop to immediately reboot at startup. I'm assuming it's one of the efi changes, but I'm bisecting now to say exactly where it happens. It will get reverted. Linus
Re: [PATCH] ARM: mach-bcm: offer a new maintainer and process
2014-09-22 22:03 GMT-07:00 Olof Johansson : > On Fri, Sep 19, 2014 at 11:17:11AM -0700, Florian Fainelli wrote: >> Hi all, >> >> As some of you may have seen in the news, Broadcom has recently stopped >> its mobile SoC activities. Upstream support for Broadcom's Mobile SoCs >> was an effort initially started by Christian Daudt and his team, and then >> continued by Alex Elder and Matt Porter assigned to a particular landing >> team within Linaro to help Broadcom doing so. >> >> As part of this effort, Christian and Matt volunteered for centralizing pull >> requests coming from the arch/arm/mach-bcm/* directory and as of today, they >> are still responsible for merging mach-bcm pull requests coming from brcmstb, >> bcm5301x, bcm2835 and bcm63xx, creating an intermediate layer to the arm-soc >> tree. >> >> Following the mobile group shut down, our group (in which Brian, Gregory, >> Marc, >> Kevin and myself are) inherited these mobile SoC platforms, although at this >> point we cannot comment on the future of mobile platforms, we know that our >> Linaro activities have been stopped. >> >> We have not heard much from Christian and Matt in a while, and some of our >> pull >> requests have been stalling as a result. We would like to offer both a new >> maintainer for the mobile platforms as well as reworking the pull request >> process: >> >> - our group has now full access to these platforms, putting us in the best >> position to support Mobile SoCs questions > > So, one question I have is whether it makes sense to keep the mobile > platforms in the kernel if the line of business is ending? I leave it to Scott for more details, but last we talked he mentioned what has been upstreamed is useful for some other platforms he cares about. > > While I truly do appreciate the work done by Matt and others, there's > also little chance that it'll see substantial use by anyone.
The Capri > boards aren't common out in the wild and I'm not aware of any dev > boards or consumer products with these SoCs that might want to run > mainline? Critical things such as power management and graphics are > missing from the current platform support in the kernel, so nobody is > likely to want it on their Android phone, etc. > > Maybe the answer to this is "keep it for now, revisit sometime later", > which is perfectly sane -- it has practically no cost to keep it around > the way it's looking now. Right, let's adopt that approach for now, and we can revisit that later in light of Scott and his group's work. -- Florian
Re: [PATCH 2/5] extcon: gpio: Convert the driver to use gpio desc API's
On 09/23/2014 04:44 AM, Chanwoo Choi wrote: On 09/22/2014 06:51 PM, George Cherian wrote: On 09/22/2014 01:37 PM, Chanwoo Choi wrote: Hi George, This patch removes 'gpio_active_low' field of struct gpio_extcon_data. But, include/linux/extcon-gpio.h has the description of 'gpio_active_low' field. Yes didn't want the platform data users to break. Actually I couldn't find any platform users for this driver. Could you please point me to one if in case I missed it. If non present then why cant we get rid of platform data altogether. Right, But, Why do you support platform data on as following your patch? - [PATCH 3/5] extcon: gpio: Add dt support for the driver. According to your comment, you had to remove the support for platform data. My intention with this series was to add dt support by keeping the existing platform data. Now that we know there are no platform data users I will rework on this and keep only dt support. IMO, I think this patchset must need to reorder the sequence of patchset. Also, this patchset is more detailed description. I will rework and submit a v2. Also, This patch has not included the any description/comment of removing 'gpio_active_low'. Also, How to set 'FLAG_ACTIVE_LOW' bit for gpio when using platform data? Now that we are using gpiod_* API's we need not check for gpio_active_low from this driver. This patch just use gpiod API instead of legacy gpio API. I think that if extcon-gpio don't need to check gpio_activ_low field, you have to implement dt support patch before this patch. yes will do in v2 Thanks for your review. This patch don't call 'set_bit()' function to set FLAG_ACTIVE_LOW flag. Thanks, Chanwoo Choi On 09/09/2014 01:14 PM, George Cherian wrote: Convert the driver to use gpiod_* API's. 
Signed-off-by: George Cherian --- drivers/extcon/extcon-gpio.c | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/drivers/extcon/extcon-gpio.c b/drivers/extcon/extcon-gpio.c index 72f19a3..25269f6 100644 --- a/drivers/extcon/extcon-gpio.c +++ b/drivers/extcon/extcon-gpio.c @@ -33,8 +33,7 @@ struct gpio_extcon_data { struct extcon_dev *edev; -unsigned gpio; -bool gpio_active_low; +struct gpio_desc *gpiod; const char *state_on; const char *state_off; int irq; @@ -50,9 +49,7 @@ static void gpio_extcon_work(struct work_struct *work) container_of(to_delayed_work(work), struct gpio_extcon_data, work); -state = gpio_get_value(data->gpio); -if (data->gpio_active_low) -state = !state; +state = gpiod_get_value(data->gpiod); extcon_set_state(data->edev, state); } @@ -106,22 +103,21 @@ static int gpio_extcon_probe(struct platform_device *pdev) } extcon_data->edev->name = pdata->name; -extcon_data->gpio = pdata->gpio; -extcon_data->gpio_active_low = pdata->gpio_active_low; +extcon_data->gpiod = gpio_to_desc(pdata->gpio); extcon_data->state_on = pdata->state_on; extcon_data->state_off = pdata->state_off; extcon_data->check_on_resume = pdata->check_on_resume; if (pdata->state_on && pdata->state_off) extcon_data->edev->print_state = extcon_gpio_print_state; -ret = devm_gpio_request_one(&pdev->dev, extcon_data->gpio, GPIOF_DIR_IN, +ret = devm_gpio_request_one(&pdev->dev, pdata->gpio, GPIOF_DIR_IN, pdev->name); if (ret < 0) return ret; if (pdata->debounce) { -ret = gpio_set_debounce(extcon_data->gpio, -pdata->debounce * 1000); +ret = gpiod_set_debounce(extcon_data->gpiod, + pdata->debounce * 1000); if (ret < 0) extcon_data->debounce_jiffies = msecs_to_jiffies(pdata->debounce); @@ -133,7 +129,7 @@ static int gpio_extcon_probe(struct platform_device *pdev) INIT_DELAYED_WORK(&extcon_data->work, gpio_extcon_work); -extcon_data->irq = gpio_to_irq(extcon_data->gpio); +extcon_data->irq = gpiod_to_irq(extcon_data->gpiod); if (extcon_data->irq < 0) return extcon_data->irq; -George
3.17 kernel crash while loading IPoIB
Hello: I am facing an issue wherein kernel 3.17 crashes while loading IPoIB module. I guess the issue discussed in this thread (https://www.mail-archive.com/linux-rdma@vger.kernel.org/msg20963.html) is similar. We were able to reproduce the issue with RC6 also. Here are the steps I followed: I compiled and installed 3.17 kernel on top of RHEL 6.5. Then I changed rdma.conf to not load IPoIB (If I don't do this, the kernel crashes while booting and starting RDMA service.) After the server comes up, I just did "modprobe ib_ipoib" and kernel crashes. Please see below the kernel back trace. Seeing the announcement, it looks like RC6 will be the last RC for 3.17 kernel. Will the release happen with this issue? Is there any workaround available for this issue? I am not sure what mechanism/process is used to report issue to kernel community. Regards Karun Kernel Stack back-trace: crash> bt PID: 145TASK: 88081a580d90 CPU: 3 COMMAND: "kworker/3:1" #0 [88081a587750] machine_kexec at 8103c5d9 #1 [88081a5877a0] crash_kexec at 810d0ff8 #2 [88081a587870] oops_end at 81007570 #3 [88081a5878a0] no_context at 81046e5e #4 [88081a5878f0] __bad_area_nosemaphore at 8104704d #5 [88081a587940] bad_area_nosemaphore at 81047163 #6 [88081a587950] __do_page_fault at 81047722 #7 [88081a587a70] do_page_fault at 8104798c #8 [88081a587a80] page_fault at 815aad62 [exception RIP: __dev_queue_xmit+894] RIP: 814e17be RSP: 88081a587b38 RFLAGS: 00010282 RAX: 88087c1679fe RBX: 880812cc2500 RCX: 0044 RDX: 0008 RSI: RDI: 88081a363a9c RBP: 88081a587b78 R8: R9: 0040 R10: R11: 7c1679ff R12: 88081a363a00 R13: 880814f3e000 R14: 880809535600 R15: ORIG_RAX: CS: 0010 SS: 0018 #9 [88081a587b30] __dev_queue_xmit at 814e158b #10 [88081a587b80] dev_queue_xmit at 814e1930 #11 [88081a587b90] neigh_connected_output at 814e81e8 #12 [88081a587be0] ip6_finish_output2 at a05ff8dd [ipv6] #13 [88081a587c40] ip6_finish_output at a05ffe5f [ipv6] #14 [88081a587c60] ip6_output at a05fff18 [ipv6] #15 [88081a587c90] ndisc_send_skb 
at a06169a9 [ipv6] #16 [88081a587d40] ndisc_send_ns at a0616bf6 [ipv6] #17 [88081a587db0] addrconf_dad_work at a06076cb [ipv6] #18 [88081a587df0] process_one_work at 8106b23e #19 [88081a587e40] worker_thread at 8106b63f #20 [88081a587ec0] kthread at 8107041e #21 [88081a587f50] ret_from_fork at 815a92ac - Regards, Karun Sharma
Re: [PATCH] mfd: inherit coherent_dma_mask from parent device
Hi Arnd, On Mon, 22 Sep 2014 21:45:40 +0200 Arnd Bergmann wrote: > On Monday 22 September 2014 21:37:55 Boris BREZILLON wrote: > > dma_mask and dma_parms are already inherited from the parent device but > > dma_coherent_mask was left uninitialized (set to zero thanks to kzalloc). > > Set sub-device coherent_dma_mask to its parent value to simplify > > sub-drivers making use of dma coherent helper functions (those drivers > > currently have to explicitly set the dma coherent mask using > > dma_set_coherent_mask function). > > > > Signed-off-by: Boris BREZILLON > > --- > > > > Hi, > > > > This patch is a follow-up of a discussion we had on a KMS driver thread [1]. > > This patch is only copying the parent device coherent_dma_mask to avoid > > calling specific dma_set_coherent_mask in case the coherent mask is the > > default one. > > > > I'm a bit surprised this hasn't been done earlier while other dma fields > > (mask and parms) are already inherited from the parent device, so please > > tell me if there already was an attempt to do the same, and if so, what > > was the reason for rejecting it :-). > > > > > > Seems reasonable to me. It's not clear whether we should always inherit > the dma_mask, but I see no point in copying just dma_mask but not > coherent_dma_mask. I thought about adding a dma_mask field to mfd_cell to override the default behavior (allocate a new dma_mask and copy the value provided by mfd_cell if it's not zero), but I don't see any real use case where a sub-device does not share the dma capabilities with its parent. IMHO, it's safer to keep it as is until someone really needs to set a different dma_mask on a sub-device.
Best Regards, Boris -- Boris Brezillon, Free Electrons Embedded Linux and Kernel engineering http://free-electrons.com
Re: [PATCH] ARM: mach-bcm: offer a new maintainer and process
On Fri, Sep 19, 2014 at 11:17:11AM -0700, Florian Fainelli wrote: > Hi all, > > As some of you may have seen in the news, Broadcom has recently stopped > its mobile SoC activities. Upstream support for Broadcom's Mobile SoCs > was an effort initially started by Christian Daudt and his team, and then > continued by Alex Elder and Matt Porter assigned to a particular landing > team within Linaro to help Broadcom doing so. > > As part of this effort, Christian and Matt volunteered for centralizing pull > requests coming from the arch/arm/mach-bcm/* directory and as of today, they > are still responsible for merging mach-bcm pull requests coming from brcmstb, > bcm5301x, bcm2835 and bcm63xx, creating an intermediate layer to the arm-soc > tree. > > Following the mobile group shut down, our group (in which Brian, Gregory, > Marc, > Kevin and myself are) inherited these mobile SoC platforms, although at this > point we cannot comment on the future of mobile platforms, we know that our > Linaro activities have been stopped. > > We have not heard much from Christian and Matt in a while, and some of our > pull > requests have been stalling as a result. We would like to offer both a new > maintainer for the mobile platforms as well as reworking the pull request > process: > > - our group has now full access to these platforms, putting us in the best > position to support Mobile SoCs questions So, one question I have is whether it makes sense to keep the mobile platforms in the kernel if the line of business is ending? While I truly do appreciate the work done by Matt and others, there's also little chance that it'll see substantial use by anyone. The Capri boards aren't common out in the wild and I'm not aware of any dev boards or consumer products with these SoCs that might want to run mainline? Critical things such as power management and graphics are missing from the current platform support in the kernel, so nobody is likely to want it on their Android phone, etc.
Maybe the answer to this is "keep it for now, revisit sometime later", which is perfectly sane -- it has practically no cost to keep it around the way it's looking now. -Olof
Re: [PATCH 3.4 00/45] 3.4.104-rc1 review
On 09/22/2014 07:27 PM, Zefan Li wrote: From: Zefan Li This is the start of the stable review cycle for the 3.4.104 release. There are 45 patches in this series, all will be posted as a response to this one. If anyone has any issues with these being applied, please let me know. Responses should be made by Thu Sep 25 02:03:31 UTC 2014. Anything received after that time might be too late. Build results: total: 119 pass: 116 fail: 3 Failed builds: score:defconfig sparc64:allmodconfig xtensa:allmodconfig Qemu test results: total: 18 pass: 17 fail: 1 Failed tests: arm:arm_versatile_defconfig This is an improvement over the previous release, where we had six build failures. The failing qemu test is a recent addition which is expected to fail for the 3.4 kernel. The failure is due to Versatile SCSI driver and interrupt handling problems; those were fixed in later kernels but would be difficult to back-port. Guenter
linux-next: build failure after merge of the tiny tree
Hi Josh,

After merging the tiny tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

mm/built-in.o: In function `.isolate_migratepages_range':
(.text+0x2fbd8): undefined reference to `.balloon_page_isolate'
mm/built-in.o: In function `.putback_movable_pages':
(.text+0x713c4): undefined reference to `.balloon_page_putback'
mm/built-in.o: In function `.migrate_pages':
(.text+0x72a00): undefined reference to `.balloon_page_migrate'

Caused by commit b37a3fee8450 ("mm: Disable mm/balloon_compaction.c
completely when !CONFIG_VIRTIO_BALLOON"). I have reverted that commit
for today.

--
Cheers,
Stephen Rothwell
s...@canb.auug.org.au
Re: [PATCH v1 5/5] zram: add fullness knob to control swap full
On Mon, Sep 22, 2014 at 02:17:33PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:11 +0900 Minchan Kim wrote:
>
> > Some zram use cases could want a lower fullness than the default 80 to
> > avoid unnecessary swapout-and-fail-recover overhead.
> >
> > A typical example is multiple swap devices with a high-priority
> > zram-swap and a low-priority HDD-swap, where there could still be
> > enough free swap space even though one of the swap devices (ie, zram)
> > is full. It would be better to fail over to HDD-swap rather than
> > failing the swap write to zram in this case.
> >
> > This patch exports fullness to the user so the user can control it
> > via the knob.
>
> Adding new userspace interfaces requires a pretty strong justification
> and it's unclear to me that this is being met.  In fact the whole
> patchset reads like "we have some problem, don't know how to fix it so
> let's add a userspace knob to make it someone else's problem".

I explained the rationale in 4/5's reply, but if it's not enough or wrong,
please tell me.

>
> > index b13dc993291f..817738d14061 100644
> > --- a/Documentation/ABI/testing/sysfs-block-zram
> > +++ b/Documentation/ABI/testing/sysfs-block-zram
> > @@ -138,3 +138,13 @@ Description:
> > 		amount of memory ZRAM can use to store the compressed data. The
> > 		limit could be changed in run time and "0" means disable the
> > 		limit. No limit is the initial state.  Unit: bytes
> > +
> > +What:		/sys/block/zram<id>/fullness
> > +Date:		August 2014
> > +Contact:	Minchan Kim
> > +Description:
> > +		The fullness file is read/write and specifies how easily
> > +		zram becomes full, so if you set it to a lower value,
> > +		zram reaches the full state more easily compared to a
> > +		higher value. Currently, the initial value is 80% but it
> > +		could be changed.  Unit: percentage
>
> And I don't think that there is sufficient information here for a user
> to be able to work out what to do with this tunable.

I will put more words.
>
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -136,6 +136,37 @@ static ssize_t max_comp_streams_show(struct device *dev,
> > 	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> >  }
> >
> > +static ssize_t fullness_show(struct device *dev,
> > +		struct device_attribute *attr, char *buf)
> > +{
> > +	int val;
> > +	struct zram *zram = dev_to_zram(dev);
> > +
> > +	down_read(&zram->init_lock);
> > +	val = zram->fullness;
> > +	up_read(&zram->init_lock);
>
> Did we really need to take a lock to display a value which became
> out-of-date as soon as we released that lock?
>
> > +	return scnprintf(buf, PAGE_SIZE, "%d\n", val);
> > +}
> > +
> > +static ssize_t fullness_store(struct device *dev,
> > +		struct device_attribute *attr, const char *buf, size_t len)
> > +{
> > +	int err;
> > +	unsigned long val;
> > +	struct zram *zram = dev_to_zram(dev);
> > +
> > +	err = kstrtoul(buf, 10, &val);
> > +	if (err || val > 100)
> > +		return -EINVAL;
>
> This overwrites the kstrtoul() return value.

Will fix.

Thanks for the review, Andrew.

--
Kind regards,
Minchan Kim
Re: [PATCH] kernfs: use stack-buf for small writes.
Hello, Neil.

On Tue, Sep 23, 2014 at 02:46:50PM +1000, NeilBrown wrote:
> seqfile is only safe for reads.  sysfs via kernfs uses seq_read(), so
> there is only a single allocation on the first read.
>
> It doesn't really relate to fixing writes, except to point out that only
> writes need to be "fixed".  Reads already work.

Oh, I meant the buffer the seqfile read op writes to, so it depends on the
fact that the allocation happens only on the first read? That seems
extremely brittle to me, especially for an issue which tends to be
difficult to reproduce.

> Separately:
>
> > Ugh... :( If this can't be avoided at all, I'd much prefer it to be
> > something explicit - a flag marking the file as needing a persistent
> > write buffer which is allocated on open.  "Small" writes on stack
> > feels way too implicit to me.
>
> How about if we add seq_getbuf() and seq_putbuf() to seqfile
> which take a 'struct seq_file' and a size and return the ->buf
> after making sure it is big enough.
> It also claims and releases the seqfile ->lock.
>
> Then we would be using the same buffer for reads and writes.
>
> Does that sound suitable?  It uses existing infrastructure and avoids
> having to identify in advance which attributes it is important for.

I'd much rather keep things direct and make it explicitly allocate r/w
buffer(s) on open and disallow seq_file operations on such files.

Thanks.

--
tejun
Re: [PATCH 3.4 00/45] 3.4.104-rc1 review
Hi Li,

At Tue, 23 Sep 2014 10:27:39 +0800, Zefan Li wrote:
>
> From: Zefan Li
>
> This is the start of the stable review cycle for the 3.4.104 release.
> There are 45 patches in this series, all will be posted as a response
> to this one. If anyone has any issues with these being applied, please
> let me know.
>
> Responses should be made by Thu Sep 25 02:03:31 UTC 2014.
> Anything received after that time might be too late.

This kernel passed my test.

- Test Cases:
  - Build this kernel.
  - Boot this kernel.
  - Build the latest mainline kernel with this kernel.
- Test Tool:
  https://github.com/satoru-takeuchi/test-linux-stable
- Test Result (kernel .config, ktest config and test log):
  http://satoru-takeuchi.org/test-linux-stable/results/-.tar.xz
- Build Environment:
  - OS: Debian Jessie x86_64
  - CPU: Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz x 4
  - memory: 8GB
- Test Target Environment:
  - Debian Jessie x86_64 (KVM guest on the Build Environment)
  - # of vCPU: 2
  - memory: 2GB

Thanks,
Satoru

>
> A combined patch relative to 3.4.103 will be posted as an additional
> response to this. A shortlog and diffstat can be found below.
>
> thanks,
>
> Zefan Li
>
> Aaro Koskinen (1):
>       MIPS: OCTEON: make get_system_type() thread-safe
>
> Alan Douglas (1):
>       xtensa: fix address checks in dma_{alloc,free}_coherent
>
> Andi Kleen (1):
>       slab/mempolicy: always use local policy from interrupt context
>
> Anton Blanchard (1):
>       ibmveth: Fix endian issues with rx_no_buffer statistic
>
> Arjun Sreedharan (1):
>       pata_scc: propagate return value of scc_wait_after_reset
>
> Benjamin Tissoires (1):
>       HID: logitech-dj: prevent false errors to be shown
>
> Brennan Ashton (1):
>       USB: option: add VIA Telecom CDS7 chipset device id
>
> Daniel Mack (1):
>       ASoC: pxa-ssp: drop SNDRV_PCM_FMTBIT_S24_LE
>
> Dave Chiluk (1):
>       stable_kernel_rules: Add pointer to netdev-FAQ for network patches
>
> Fengguang Wu (1):
>       unicore32: select generic atomic64_t support
>
> Florian Fainelli (1):
>       MIPS: perf: Fix build error caused by unused
>             counters_per_cpu_to_total()
>
> Greg KH (1):
>       USB: serial: pl2303: add device id for ztek device
>
> Guan Xuetao (2):
>       UniCore32-bugfix: Remove definitions in asm/bug.h to solve
>             difference between native and cross compiler
>       UniCore32-bugfix: fix mismatch return value of __xchg_bad_pointer
>
> Hans de Goede (1):
>       xhci: Treat not finding the event_seg on COMP_STOP the same as
>             COMP_STOP_INVAL
>
> Huang Rui (1):
>       usb: xhci: amd chipset also needs short TX quirk
>
> James Forshaw (1):
>       USB: whiteheat: Added bounds checking for bulk command response
>
> Jan Kara (2):
>       isofs: Fix unbounded recursion when processing relocated directories
>       ext2: Fix fs corruption in ext2_get_xip_mem()
>
> Jaša Bartelj (1):
>       USB: ftdi_sio: Added PID for new ekey device
>
> Jiri Kosina (4):
>       HID: fix a couple of off-by-ones
>       HID: logitech: perform bounds checking on device_id early enough
>       HID: magicmouse: sanity check report size in raw_event() callback
>       HID: picolcd: sanity check report size in raw_event() callback
>
> Joerg Roedel (1):
>       iommu/amd: Fix cleanup_domain for mass device removal
>
> Johan Hovold (3):
>       USB: ftdi_sio: add Basic Micro ATOM Nano USB2Serial PID
>       USB: serial: fix potential stack buffer overflow
>       USB: serial: fix potential heap buffer overflow
>
> Mark Einon (1):
>       staging: et131x: Fix errors caused by phydev->addr accesses before
>             initialisation
>
> Mark Rutland (2):
>       ARM: 8128/1: abort: don't clear the exclusive monitors
>       ARM: 8129/1: errata: work around Cortex-A15 erratum 830321 using
>             dummy strex
>
> Max Filippov (3):
>       xtensa: replace IOCTL code definitions with constants
>       xtensa: fix TLBTEMP_BASE_2 region handling in fast_second_level_miss
>       xtensa: fix a6 and a7 handling in fast_syscall_xtensa
>
> Michael Cree (2):
>       alpha: Fix fall-out from disintegrating asm/system.h
>       alpha: add io{read,write}{16,32}be functions
>
> Michael S. Tsirkin (1):
>       kvm: iommu: fix the third parameter of kvm_iommu_put_pages
>             (CVE-2014-3601)
>
> NeilBrown (1):
>       md/raid6: avoid data corruption during recovery of double-degraded
>             RAID6
>
> Paul Gortmaker (1):
>       8250_pci: fix warnings in backport of Broadcom TruManage support
>
> Pavel Shilovsky (1):
>       CIFS: Fix wrong directory attributes after rename
>
> Ralf Baechle (1):
>       MIPS: Fix accessing to per-cpu data when flushing the cache
>
> Stefan Kristiansson (1):
>       openrisc: add missing header inclusion
>
> Stephen Hemminger (1):
>       USB: sisusb: add device id for Magic Control USB video
>
> Takashi Iwai (1):
>       ALSA: hda/realtek - Avoid setting wrong COEF on ALC269 & co
>
> Trond Myklebust (1):
>       NFSv4: Fix problems with close in the presence of a delegation
>
>  Documentation/stable_kernel_rules.txt |3 ++
>
Re: [PATCH v1 4/5] zram: add swap full hint
On Mon, Sep 22, 2014 at 02:11:18PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:10 +0900 Minchan Kim wrote:
>
> > This patch implements a SWAP_FULL handler in zram so that the VM can
> > know whether zram is full or not and use it to stop anonymous
> > page reclaim.
> >
> > How to judge fullness is below,
> >
> >   fullness = (100 * used space / total space)
> >
> > It means the higher fullness is, the slower we reach zram full.
> > Now, the default fullness is 80, so it is biased toward more memory
> > consumption rather than early OOM kill.
>
> It's unclear to me why this is being done.  What's wrong with "use it
> until it's full then stop", which is what I assume the current code
> does?  Why add this stuff?  What goes wrong with the current code and
> how does this fix it?
>
> ie: better explanation and justification in the changelogs, please.

My bad. I should have written about the zram allocator's fragmentation
problem.

zsmalloc has various size classes, so it has a fragmentation problem.
For example:

  A page swaps out -> compressed to 32 bytes -> there is an empty slot in
  zsmalloc's 32-byte size class -> successful write.

  Another page swaps out -> compressed to 256 bytes -> no empty slot in
  zsmalloc's 256-byte size class -> zsmalloc should allocate a new zspage,
  but that would be over the limit, so it fails.

The problem is that the swap layer cannot know the compressed size of a
page in advance, so it cannot predict whether a swap-write will succeed,
while it can easily get an empty swap slot since zram's virtual disk size
is large enough.

Given zsmalloc's fragmentation, it would be an *early OOM* if zram said
*full* as soon as it reached the page limit, because it could still have
empty slots in various size classes. IOW, that wouldn't consider the
fragmentation problem, so this patch suggests two conditions to solve it.
	if (total_pages >= zram->limit_pages) {
		compr_pages = atomic64_read(&zram->stats.compr_data_size)
				>> PAGE_SHIFT;
		if ((100 * compr_pages / total_pages) >= zram->fullness)
			return 1;
	}

First of all, the zram-consumed page count should reach the *limit*, and
only then do we consider fullness. If used space is over 80%, we regard it
as full in this implementation, because I want to favor memory usage over
an early OOM kill when I consider zram's popular use case in embedded
systems.

> > Above logic works only when used space of zram hits over the limit,
> > but zram also pretends to be full once 32 consecutive allocation
> > failures happen.  It's a safe guard to prevent a system hang caused
> > by fragment uncertainty.
>
> So allocation requests are of variable size, yes?  If so, the above
> statement should read "32 consecutive allocation attempts for regions
> of size 2 or more slots".  Because a failure of a single-slot
> allocation attempt is an immediate failure.
>
> The 32-in-a-row thing sounds like a hack.  Why can't we do this
> deterministically?  If one request for four slots fails then the next
> one will as well, so why bother retrying?

The problem is that the swap layer cannot predict the final compressed
size in advance without compressing. If the page compresses to a size for
which zsmalloc has an empty slot in the matching size class, the write
will succeed.

> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -43,6 +43,20 @@ static const char *default_compressor = "lzo";
> >  /* Module params (documentation at end) */
> >  static unsigned int num_devices = 1;
> >
> > +/*
> > + * If (100 * used_pages / total_pages) >= ZRAM_FULLNESS_PERCENT),
> > + * we regard it as zram-full.  It means that the higher
> > + * ZRAM_FULLNESS_PERCENT is, the slower we reach zram full.
> > + */
>
> I just don't understand this patch :( To me, the above implies that the
> user who sets 80% has elected to never use 20% of the zram capacity.
> Why on earth would anyone do that?  This changelog doesn't tell me.
Hope my words above make it clear.

>
> > +#define ZRAM_FULLNESS_PERCENT 80
>
> We've had problems in the past where 1% is just too large an increment
> for large systems.  So, do you want fullness_bytes like dirty_bytes?
>
> > @@ -597,10 +613,15 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > 	}
> >
> > 	alloced_pages = zs_get_total_pages(meta->mem_pool);
> > -	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
> > -		zs_free(meta->mem_pool, handle);
> > -		ret = -ENOMEM;
> > -		goto out;
> > +	if (zram->limit_pages) {
> > +		if (alloced_pages > zram->limit_pages) {
>
> This is all a bit racy, isn't it?  pool->pages_allocated and
> zram->limit_pages could be changing under our feet.

limit_pages cannot be changed thanks to init_lock. pool->pages_allocated
can change, but the result of the race is not critical:

1. The swap write fails, so the swap layer can make the page dirty again;
   no problem. Or

2. alloc_fail race so zram
[PATCH 3/3] f2fs: refactor flush_nat_entries to remove costly reorganizing ops
Previously, f2fs tries to reorganize the dirty nat entries into multiple sets according to its nid ranges. This can improve the flushing nat pages, however, if there are a lot of cached nat entries, it becomes a bottleneck. This patch introduces a new set management flow by removing dirty nat list and adding a series of set operations when the nat entry becomes dirty. Signed-off-by: Jaegeuk Kim --- fs/f2fs/f2fs.h | 13 +-- fs/f2fs/node.c | 299 + fs/f2fs/node.h | 9 +- 3 files changed, 162 insertions(+), 159 deletions(-) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 7b1e1d2..94cfdc4 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -164,6 +164,9 @@ struct fsync_inode_entry { #define sit_in_journal(sum, i) (sum->sit_j.entries[i].se) #define segno_in_journal(sum, i) (sum->sit_j.entries[i].segno) +#define MAX_NAT_JENTRIES(sum) (NAT_JOURNAL_ENTRIES - nats_in_cursum(sum)) +#define MAX_SIT_JENTRIES(sum) (SIT_JOURNAL_ENTRIES - sits_in_cursum(sum)) + static inline int update_nats_in_cursum(struct f2fs_summary_block *rs, int i) { int before = nats_in_cursum(rs); @@ -182,9 +185,8 @@ static inline bool __has_cursum_space(struct f2fs_summary_block *sum, int size, int type) { if (type == NAT_JOURNAL) - return nats_in_cursum(sum) + size <= NAT_JOURNAL_ENTRIES; - - return sits_in_cursum(sum) + size <= SIT_JOURNAL_ENTRIES; + return size <= MAX_NAT_JENTRIES(sum); + return size <= MAX_SIT_JENTRIES(sum); } /* @@ -292,11 +294,10 @@ struct f2fs_nm_info { /* NAT cache management */ struct radix_tree_root nat_root;/* root of the nat entry cache */ + struct radix_tree_root nat_set_root;/* root of the nat set cache */ rwlock_t nat_tree_lock; /* protect nat_tree_lock */ - unsigned int nat_cnt; /* the # of cached nat entries */ struct list_head nat_entries; /* cached nat entry list (clean) */ - struct list_head dirty_nat_entries; /* cached nat entry list (dirty) */ - struct list_head nat_entry_set; /* nat entry set list */ + unsigned int nat_cnt; /* the # of cached nat entries */ 
unsigned int dirty_nat_cnt; /* total num of nat entries in set */ /* free node ids management */ diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 21ed91b..f5a21f4 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -123,6 +123,57 @@ static void __del_from_nat_cache(struct f2fs_nm_info *nm_i, struct nat_entry *e) kmem_cache_free(nat_entry_slab, e); } +static void __set_nat_cache_dirty(struct f2fs_nm_info *nm_i, + struct nat_entry *ne) +{ + nid_t set = ne->ni.nid / NAT_ENTRY_PER_BLOCK; + struct nat_entry_set *head; + + if (get_nat_flag(ne, IS_DIRTY)) + return; +retry: + head = radix_tree_lookup(_i->nat_set_root, set); + if (!head) { + head = f2fs_kmem_cache_alloc(nat_entry_set_slab, GFP_ATOMIC); + + INIT_LIST_HEAD(>entry_list); + INIT_LIST_HEAD(>set_list); + head->set = set; + head->entry_cnt = 0; + + if (radix_tree_insert(_i->nat_set_root, set, head)) { + cond_resched(); + goto retry; + } + } + list_move_tail(>list, >entry_list); + nm_i->dirty_nat_cnt++; + head->entry_cnt++; + set_nat_flag(ne, IS_DIRTY, true); +} + +static void __clear_nat_cache_dirty(struct f2fs_nm_info *nm_i, + struct nat_entry *ne) +{ + nid_t set = ne->ni.nid / NAT_ENTRY_PER_BLOCK; + struct nat_entry_set *head; + + head = radix_tree_lookup(_i->nat_set_root, set); + if (head) { + list_move_tail(>list, _i->nat_entries); + set_nat_flag(ne, IS_DIRTY, false); + head->entry_cnt--; + nm_i->dirty_nat_cnt--; + } +} + +static unsigned int __gang_lookup_nat_set(struct f2fs_nm_info *nm_i, + nid_t start, unsigned int nr, struct nat_entry_set **ep) +{ + return radix_tree_gang_lookup(_i->nat_set_root, (void **)ep, + start, nr); +} + bool is_checkpointed_node(struct f2fs_sb_info *sbi, nid_t nid) { struct f2fs_nm_info *nm_i = NM_I(sbi); @@ -1739,79 +1790,6 @@ skip: return err; } -static struct nat_entry_set *grab_nat_entry_set(void) -{ - struct nat_entry_set *nes = - f2fs_kmem_cache_alloc(nat_entry_set_slab, GFP_ATOMIC); - - nes->entry_cnt = 0; - INIT_LIST_HEAD(>set_list); - INIT_LIST_HEAD(>entry_list); - 
return nes; -} - -static void release_nat_entry_set(struct nat_entry_set *nes, -
[PATCH 2/3] f2fs: introduce FITRIM in f2fs_ioctl
This patch introduces FITRIM in f2fs_ioctl. In this case, f2fs will issue small discards and prefree discards as many as possible for the given area. Signed-off-by: Jaegeuk Kim --- fs/f2fs/checkpoint.c| 4 +- fs/f2fs/f2fs.h | 9 +++- fs/f2fs/file.c | 29 fs/f2fs/segment.c | 110 +++- fs/f2fs/super.c | 1 + include/trace/events/f2fs.h | 3 +- 6 files changed, 141 insertions(+), 15 deletions(-) diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c index e401ffd..5d793ba 100644 --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -997,7 +997,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) mutex_lock(>cp_mutex); - if (!sbi->s_dirty) + if (!sbi->s_dirty && cpc->reason != CP_DISCARD) goto out; if (unlikely(f2fs_cp_error(sbi))) goto out; @@ -1020,7 +1020,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) /* write cached NAT/SIT entries to NAT/SIT area */ flush_nat_entries(sbi); - flush_sit_entries(sbi); + flush_sit_entries(sbi, cpc); /* unlock all the fs_lock[] in do_checkpoint() */ do_checkpoint(sbi, cpc); diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 5298924..7b1e1d2 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -99,10 +99,15 @@ enum { enum { CP_UMOUNT, CP_SYNC, + CP_DISCARD, }; struct cp_control { int reason; + __u64 trim_start; + __u64 trim_end; + __u64 trim_minlen; + __u64 trimmed; }; /* @@ -1276,9 +1281,11 @@ void destroy_flush_cmd_control(struct f2fs_sb_info *); void invalidate_blocks(struct f2fs_sb_info *, block_t); void refresh_sit_entry(struct f2fs_sb_info *, block_t, block_t); void clear_prefree_segments(struct f2fs_sb_info *); +void release_discard_addrs(struct f2fs_sb_info *); void discard_next_dnode(struct f2fs_sb_info *, block_t); int npages_for_summary_flush(struct f2fs_sb_info *); void allocate_new_segments(struct f2fs_sb_info *); +int f2fs_trim_fs(struct f2fs_sb_info *, struct fstrim_range *); struct page *get_sum_page(struct f2fs_sb_info *, unsigned int); void write_meta_page(struct 
f2fs_sb_info *, struct page *); void write_node_page(struct f2fs_sb_info *, struct page *, @@ -1295,7 +1302,7 @@ void write_data_summaries(struct f2fs_sb_info *, block_t); void write_node_summaries(struct f2fs_sb_info *, block_t); int lookup_journal_in_cursum(struct f2fs_summary_block *, int, unsigned int, int); -void flush_sit_entries(struct f2fs_sb_info *); +void flush_sit_entries(struct f2fs_sb_info *, struct cp_control *); int build_segment_manager(struct f2fs_sb_info *); void destroy_segment_manager(struct f2fs_sb_info *); int __init create_segment_manager_caches(void); diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c index ac8c680..1184207 100644 --- a/fs/f2fs/file.c +++ b/fs/f2fs/file.c @@ -860,6 +860,35 @@ out: mnt_drop_write_file(filp); return ret; } + case FITRIM: + { + struct super_block *sb = inode->i_sb; + struct request_queue *q = bdev_get_queue(sb->s_bdev); + struct fstrim_range range; + int ret = 0; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + if (!blk_queue_discard(q)) + return -EOPNOTSUPP; + + if (copy_from_user(, (struct fstrim_range __user *)arg, + sizeof(range))) + return -EFAULT; + + range.minlen = max((unsigned int)range.minlen, + q->limits.discard_granularity); + ret = f2fs_trim_fs(F2FS_SB(sb), ); + if (ret < 0) + return ret; + + if (copy_to_user((struct fstrim_range __user *)arg, , + sizeof(range))) + return -EFAULT; + + return 0; + } default: return -ENOTTY; } diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c index 3125a3d..b423005 100644 --- a/fs/f2fs/segment.c +++ b/fs/f2fs/segment.c @@ -386,45 +386,92 @@ void discard_next_dnode(struct f2fs_sb_info *sbi, block_t blkaddr) } } -static void add_discard_addrs(struct f2fs_sb_info *sbi, - unsigned int segno, struct seg_entry *se) +static void add_discard_addrs(struct f2fs_sb_info *sbi, struct cp_control *cpc) { struct list_head *head = _I(sbi)->discard_list; struct discard_entry *new; int entries = SIT_VBLOCK_MAP_SIZE / sizeof(unsigned long); int max_blocks = sbi->blocks_per_seg; + 
struct seg_entry *se = get_seg_entry(sbi, cpc->trim_start); unsigned long *cur_map =
[PATCH 1/3] f2fs: introduce cp_control structure
This patch add a new data structure to control checkpoint parameters. Currently, it presents the reason of checkpoint such as is_umount and normal sync. Signed-off-by: Jaegeuk Kim --- fs/f2fs/checkpoint.c| 16 fs/f2fs/f2fs.h | 11 ++- fs/f2fs/gc.c| 7 +-- fs/f2fs/recovery.c | 5 - fs/f2fs/super.c | 13 ++--- include/trace/events/f2fs.h | 15 ++- 6 files changed, 47 insertions(+), 20 deletions(-) diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c index e519aaf..e401ffd 100644 --- a/fs/f2fs/checkpoint.c +++ b/fs/f2fs/checkpoint.c @@ -826,7 +826,7 @@ static void wait_on_all_pages_writeback(struct f2fs_sb_info *sbi) finish_wait(>cp_wait, ); } -static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) +static void do_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) { struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); struct curseg_info *curseg = CURSEG_I(sbi, CURSEG_WARM_NODE); @@ -894,7 +894,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) ckpt->cp_pack_start_sum = cpu_to_le32(1 + cp_payload_blks + orphan_blocks); - if (is_umount) { + if (cpc->reason == CP_UMOUNT) { set_ckpt_flags(ckpt, CP_UMOUNT_FLAG); ckpt->cp_pack_total_block_count = cpu_to_le32(F2FS_CP_PACKS+ cp_payload_blks + data_sum_blocks + @@ -948,7 +948,7 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) write_data_summaries(sbi, start_blk); start_blk += data_sum_blocks; - if (is_umount) { + if (cpc->reason == CP_UMOUNT) { write_node_summaries(sbi, start_blk); start_blk += NR_CURSEG_NODE_TYPE; } @@ -988,12 +988,12 @@ static void do_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) /* * We guarantee that this checkpoint procedure will not fail. 
*/ -void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) +void write_checkpoint(struct f2fs_sb_info *sbi, struct cp_control *cpc) { struct f2fs_checkpoint *ckpt = F2FS_CKPT(sbi); unsigned long long ckpt_ver; - trace_f2fs_write_checkpoint(sbi->sb, is_umount, "start block_ops"); + trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "start block_ops"); mutex_lock(>cp_mutex); @@ -1004,7 +1004,7 @@ void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) if (block_operations(sbi)) goto out; - trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish block_ops"); + trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish block_ops"); f2fs_submit_merged_bio(sbi, DATA, WRITE); f2fs_submit_merged_bio(sbi, NODE, WRITE); @@ -1023,13 +1023,13 @@ void write_checkpoint(struct f2fs_sb_info *sbi, bool is_umount) flush_sit_entries(sbi); /* unlock all the fs_lock[] in do_checkpoint() */ - do_checkpoint(sbi, is_umount); + do_checkpoint(sbi, cpc); unblock_operations(sbi); stat_inc_cp_count(sbi->stat_info); out: mutex_unlock(>cp_mutex); - trace_f2fs_write_checkpoint(sbi->sb, is_umount, "finish checkpoint"); + trace_f2fs_write_checkpoint(sbi->sb, cpc->reason, "finish checkpoint"); } void init_ino_entry_info(struct f2fs_sb_info *sbi) diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h index 3b70b01..5298924 100644 --- a/fs/f2fs/f2fs.h +++ b/fs/f2fs/f2fs.h @@ -96,6 +96,15 @@ enum { SIT_BITMAP }; +enum { + CP_UMOUNT, + CP_SYNC, +}; + +struct cp_control { + int reason; +}; + /* * For CP/NAT/SIT/SSA readahead */ @@ -1314,7 +1323,7 @@ void update_dirty_page(struct inode *, struct page *); void add_dirty_dir_inode(struct inode *); void remove_dirty_dir_inode(struct inode *); void sync_dirty_dir_inodes(struct f2fs_sb_info *); -void write_checkpoint(struct f2fs_sb_info *, bool); +void write_checkpoint(struct f2fs_sb_info *, struct cp_control *); void init_ino_entry_info(struct f2fs_sb_info *); int __init create_checkpoint_caches(void); void destroy_checkpoint_caches(void); diff --git 
a/fs/f2fs/gc.c b/fs/f2fs/gc.c index 7bf8392..e88fcf6 100644 --- a/fs/f2fs/gc.c +++ b/fs/f2fs/gc.c @@ -694,6 +694,9 @@ int f2fs_gc(struct f2fs_sb_info *sbi) int gc_type = BG_GC; int nfree = 0; int ret = -1; + struct cp_control cpc = { + .reason = CP_SYNC, + }; INIT_LIST_HEAD(); gc_more: @@ -704,7 +707,7 @@ gc_more: if (gc_type == BG_GC && has_not_enough_free_secs(sbi, nfree)) { gc_type = FG_GC; - write_checkpoint(sbi, false); + write_checkpoint(sbi, ); } if (!__get_victim(sbi, , gc_type, NO_CHECK_TYPE)) @@ -729,7 +732,7 @@ gc_more: goto gc_more; if (gc_type == FG_GC) -
Re: [f2fs-dev] [PATCH 2/3] f2fs: fix conditions to remain recovery information in f2fs_sync_file
On Mon, Sep 22, 2014 at 05:20:19PM +0800, Chao Yu wrote:
> > -----Original Message-----
> > From: Huang Ying [mailto:ying.hu...@intel.com]
> > Sent: Monday, September 22, 2014 3:39 PM
> > To: Chao Yu
> > Cc: 'Jaegeuk Kim'; linux-kernel@vger.kernel.org;
> > linux-fsde...@vger.kernel.org; linux-f2fs-de...@lists.sourceforge.net
> > Subject: Re: [f2fs-dev] [PATCH 2/3] f2fs: fix conditions to remain
> > recovery information in f2fs_sync_file
> >
> > On Mon, 2014-09-22 at 15:24 +0800, Chao Yu wrote:
> > > Hi Jaegeuk, Huang,
> > >
> > > > -----Original Message-----
> > > > From: Jaegeuk Kim [mailto:jaeg...@kernel.org]
> > > > Sent: Thursday, September 18, 2014 1:51 PM
> > > > To: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org;
> > > > linux-f2fs-de...@lists.sourceforge.net
> > > > Cc: Jaegeuk Kim; Huang Ying
> > > > Subject: [f2fs-dev] [PATCH 2/3] f2fs: fix conditions to remain
> > > > recovery information in f2fs_sync_file
> > > >
> > > > This patch revisits all the recovery information used during
> > > > f2fs_sync_file.
> > > >
> > > > In this patch, there are three pieces of information used to make
> > > > a decision:
> > > >
> > > > a) IS_CHECKPOINTED,   /* is it checkpointed before? */
> > > > b) HAS_FSYNCED_INODE, /* is the inode fsynced before? */
> > > > c) HAS_LAST_FSYNC,    /* has the latest node fsync mark? */
> > > >
> > > > And, the scenarios for our rule are based on:
> > > >
> > > > [Term] F: fsync_mark, D: dentry_mark
> > > >
> > > > 1. inode(x) | CP | inode(x) | dnode(F)
> > > > 2. inode(x) | CP | inode(F) | dnode(F)
> > > > 3. inode(x) | CP | dnode(F) | inode(x) | inode(F)
> > > > 4. inode(x) | CP | dnode(F) | inode(F)
> > > > 5. CP | inode(x) | dnode(F) | inode(DF)
> > > > 6. CP | inode(DF) | dnode(F)
> > > > 7. CP | dnode(F) | inode(DF)
> > > > 8. CP | dnode(F) | inode(x) | inode(DF)
> > >
> > > Not sure, did we miss these cases:
> > > inode(x) | CP | inode(F) | dnode(x) -> write inode(F)
> > > CP | inode(DF) | dnode(x) -> write inode(F)
> > >
> > > In these cases we will write another inode with the fsync flag
> > > because our last dnode was written out to disk by the bdi-flusher
> > > (HAS_LAST_FSYNC is not marked). But this appended inode is not
> > > useful.
> > >
> > > AFAIK, HAS_LAST_FSYNC (AKA fsync_done) was introduced in commit
> > > 479f40c44ae3 ("f2fs: skip unnecessary node writes during fsync") to
> > > avoid writing multiple unneeded inode pages on multiple redundant
> > > fsync calls. But for now, its role can be taken by
> > > HAS_FSYNCED_INODE. So, can we remove this flag to simplify our
> > > logic of the fsync flow?
> > >
> > > Then in the fsync flow, the rule can be:
> > > If CPed before, there must be an inode(F) written in the warm node
> > > chain;
> >
> > How about
> >
> > inode(x) | CP | dnode(F)
>
> Oh, I missed this one, thanks for reminding me of that.
>
> There is another case:
> inode(x) | CP | dnode(F) | dnode(x) -> write inode(F)
> It seems we will append another unneeded inode(F) in this patch also
> due to no HAS_LAST_FSYNC in the nat entry cache of the inode.

As the current rule for roll-forward recovery, we need inode(F) to find
the latest mark. This can also be used to distinguish fsynced inodes from
writebacked inodes.

>
> > > If not CPed before, there must be an inode(DF) written in the warm
> > > node chain.
> >
> > For example below:
> >
> > 1) checkpoint
> > 2) create "a", change "a"
> > 3) fsync "a"
> > 4) open "a", change "a"
> >
> > Do we want recovery to stop at dnode(F) in step 3) or stop at dnode(x)
> > produced by step 4)?
>
> To my understanding, we will recover all info related to fsynced nodes
> in the warm node chain. So we will proceed to step 4 if the changed
> nodes in step 4 are flushed to the device.

The current rule should stop at 3) fsync "a". It won't recover 4)'s inode,
since it was just writebacked.

If we'd like to recover the whole inode and its data, we should traverse
all the recovery paths from scratch.

Thanks,

> Thanks,
> Yu
>
> > Best Regards,
> > Huang, Ying
> >
> > >
> > > > For example, #3, the three conditions should be changed as follows.
> > > >
> > > >    inode(x) | CP | dnode(F) | inode(x) | inode(F)
> > > > a)    x       o      o          o          o
> > > > b)    x       x      x          x          o
> > > > c)    x       o      o          x          o
> > > >
> > > > If f2fs_sync_file stops          --^,
> > > > it should write inode(F)                --^
> > > >
> > > > So, need_inode_block_update should return true, since
> > > > c) get_nat_flag(e, HAS_LAST_FSYNC), is false.
> > > >
> > > > For example, #8,
> > > >       CP | alloc | dnode(F) | inode(x) | inode(DF)
> > > > a)    o     x        x          x          x
> > > > b)    x              x          x          o
> > > > c)    o              o          x          o
> > > >
> > > > If f2fs_sync_file stops      ---^,
> > > > it should write inode(DF)               --^
> > > >
> > > > Note that, the roll-forward policy should follow this rule, which
Re: [PATCH] Fix the issue that lowmemkiller fell into a cycle that try to kill a task
On 09/23/14 12:18, Greg KH wrote:
> On Tue, Sep 23, 2014 at 10:57:09AM +0800, Hui Zhu wrote:
>> The cause of this issue is that when the free memory size is low and a lot
>> of tasks are trying to shrink the memory, the task that is killed by the
>> lowmemkiller cannot get CPU time to exit itself.
>>
>> Fix this issue by changing the scheduling policy to SCHED_FIFO if a task's
>> flag is TIF_MEMDIE in the lowmemkiller.
>>
>> Signed-off-by: Hui Zhu
>> ---
>>  drivers/staging/android/lowmemorykiller.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
>> index b545d3d..ca1ffac 100644
>> --- a/drivers/staging/android/lowmemorykiller.c
>> +++ b/drivers/staging/android/lowmemorykiller.c
>> @@ -129,6 +129,10 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
>>
>>  	if (test_tsk_thread_flag(p, TIF_MEMDIE) &&
>>  	    time_before_eq(jiffies, lowmem_deathpending_timeout)) {
>> +		struct sched_param param = { .sched_priority = 1 };
>> +
>> +		if (p->policy == SCHED_NORMAL)
>> +			sched_setscheduler(p, SCHED_FIFO, &param);
>
> This seems really specific to a specific scheduler pattern now.  Isn't
> there some other way to resolve this?

I tried to let the task that calls the lowmemkiller sleep some time when it
tries to kill the same task. But that doesn't work. I think the issue is
that the free memory size is too low, which makes more and more tasks come
in and call the lowmemkiller.

Thanks,
Hui

>
> thanks,
>
> greg k-h
>
[PATCH v4 06/12] crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c
From: Behan Webster

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Reviewed-by: Jan-Simon Möller
Acked-by: Herbert Xu
---
 drivers/crypto/omap-sham.c | 28 +++-
 1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/drivers/crypto/omap-sham.c b/drivers/crypto/omap-sham.c
index 710d863..24ef489 100644
--- a/drivers/crypto/omap-sham.c
+++ b/drivers/crypto/omap-sham.c
@@ -949,17 +949,14 @@ static int omap_sham_finish_hmac(struct ahash_request *req)
 	struct omap_sham_hmac_ctx *bctx = tctx->base;
 	int bs = crypto_shash_blocksize(bctx->shash);
 	int ds = crypto_shash_digestsize(bctx->shash);
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(bctx->shash)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, bctx->shash);
 
-	desc.shash.tfm = bctx->shash;
-	desc.shash.flags = 0;	/* not CRYPTO_TFM_REQ_MAY_SLEEP */
+	shash->tfm = bctx->shash;
+	shash->flags = 0;	/* not CRYPTO_TFM_REQ_MAY_SLEEP */
 
-	return crypto_shash_init(&desc.shash) ?:
-	       crypto_shash_update(&desc.shash, bctx->opad, bs) ?:
-	       crypto_shash_finup(&desc.shash, req->result, ds, req->result);
+	return crypto_shash_init(shash) ?:
+	       crypto_shash_update(shash, bctx->opad, bs) ?:
+	       crypto_shash_finup(shash, req->result, ds, req->result);
 }
 
 static int omap_sham_finish(struct ahash_request *req)
@@ -1118,18 +1115,15 @@ static int omap_sham_update(struct ahash_request *req)
 	return omap_sham_enqueue(req, OP_UPDATE);
 }
 
-static int omap_sham_shash_digest(struct crypto_shash *shash, u32 flags,
+static int omap_sham_shash_digest(struct crypto_shash *tfm, u32 flags,
 				  const u8 *data, unsigned int len, u8 *out)
 {
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(shash)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, tfm);
 
-	desc.shash.tfm = shash;
-	desc.shash.flags = flags & CRYPTO_TFM_REQ_MAY_SLEEP;
+	shash->tfm = tfm;
+	shash->flags = flags & CRYPTO_TFM_REQ_MAY_SLEEP;
 
-	return crypto_shash_digest(&desc.shash, data, len, out);
+	return crypto_shash_digest(shash, data, len, out);
 }
 
 static int omap_sham_final_shash(struct ahash_request *req)
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH v4 05/12] crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c
From: Behan Webster

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Reviewed-by: Jan-Simon Möller
Acked-by: Herbert Xu
---
 drivers/crypto/n2_core.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/n2_core.c b/drivers/crypto/n2_core.c
index 7263c10..f8e3207 100644
--- a/drivers/crypto/n2_core.c
+++ b/drivers/crypto/n2_core.c
@@ -445,10 +445,7 @@ static int n2_hmac_async_setkey(struct crypto_ahash *tfm, const u8 *key,
 	struct n2_hmac_ctx *ctx = crypto_ahash_ctx(tfm);
 	struct crypto_shash *child_shash = ctx->child_shash;
 	struct crypto_ahash *fallback_tfm;
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(child_shash)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, child_shash);
 	int err, bs, ds;
 
 	fallback_tfm = ctx->base.fallback_tfm;
@@ -456,15 +453,15 @@ static int n2_hmac_async_setkey(struct crypto_ahash *tfm, const u8 *key,
 	if (err)
 		return err;
 
-	desc.shash.tfm = child_shash;
-	desc.shash.flags = crypto_ahash_get_flags(tfm) &
+	shash->tfm = child_shash;
+	shash->flags = crypto_ahash_get_flags(tfm) &
 		CRYPTO_TFM_REQ_MAY_SLEEP;
 
 	bs = crypto_shash_blocksize(child_shash);
 	ds = crypto_shash_digestsize(child_shash);
 	BUG_ON(ds > N2_HASH_KEY_MAX);
 	if (keylen > bs) {
-		err = crypto_shash_digest(&desc.shash, key, keylen,
+		err = crypto_shash_digest(shash, key, keylen,
 					  ctx->hash_key);
 		if (err)
 			return err;
-- 
1.9.1
Re: [PATCH v5] x86, cpu-hotplug: fix llc shared map unreleased during cpu hotplug
(2014/09/17 16:17), Wanpeng Li wrote:
> BUG: unable to handle kernel NULL pointer dereference at 0004
> IP: [..] find_busiest_group
> PGD 5a9d5067 PUD 13067 PMD 0
> Oops: [#3] SMP
> [...]
> Call Trace:
>  load_balance
>  ? _raw_spin_unlock_irqrestore
>  idle_balance
>  __schedule
>  schedule
>  schedule_timeout
>  ? lock_timer_base
>  schedule_timeout_uninterruptible
>  msleep
>  lock_device_hotplug_sysfs
>  online_store
>  dev_attr_store
>  sysfs_write_file
>  vfs_write
>  SyS_write
>  system_call_fastpath
>
> This bug can be triggered by repeatedly hot-adding and hot-removing a
> large number of Xen domain0's vcpus.
>
> The last level cache (llc) shared map is built during cpu up, and the
> build-sched-domain routine takes advantage of it to set up the sched
> domain cpu topology. However, the llc shared map is not released during
> cpu disable, which leads to an invalid sched domain cpu topology. This
> patch fixes it by releasing the llc shared map correctly during cpu
> disable.
>
> Reviewed-by: Toshi Kani
> Reviewed-by: Yasuaki Ishimatsu
> Tested-by: Linn Crosetto
> Signed-off-by: Wanpeng Li

Yasuaki reported this can happen on our real hardware.
https://lkml.org/lkml/2014/7/22/1018

Our case is here.

==
Here is an example on my system.
My system has 4 sockets, each socket has 15 cores, and HT is enabled.
In this case, each core of the sockets is numbered as follows:

         | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-44, 90-104
Socket#3 | 45-59, 105-119

Then llc_shared_mask of CPU#30 has 0x3fff8001fffc000.

It means that the last level cache of Socket#2 is shared with CPU#30-44
and 90-104.

When hot-removing socket#2 and #3, each core of the sockets is numbered
as follows:

         | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89

But llc_shared_mask is not cleared. So llc_shared_mask of CPU#30 remains
having 0x3fff8001fffc000.

After that, when hot-adding socket#2 and #3, each core of the sockets is
numbered as follows:

         | CPU#
Socket#0 | 0-14 , 60-74
Socket#1 | 15-29, 75-89
Socket#2 | 30-59
Socket#3 | 90-119

Then llc_shared_mask of CPU#30 becomes 0x3fff8000fffc000. It means that
the last level cache of Socket#2 is shared with CPU#30-59 and 90-104. So
the mask has a wrong value.

At first, I cleared the hot-removed CPU number's bit from llc_shared_map
when hot-removing a CPU. But Borislav suggested that the problem will
disappear if a re-added CPU is assigned the same CPU number. And
llc_shared_map must not be changed.
==

So, please.

Thanks,
-Kame
Re: [f2fs-dev] [PATCH 07/10] f2fs: use meta_inode cache to improve roll-forward speed
Hi Chao,

On Mon, Sep 22, 2014 at 10:36:25AM +0800, Chao Yu wrote:
> Hi Jaegeuk,
>
> > -----Original Message-----
> > From: Jaegeuk Kim [mailto:jaeg...@kernel.org]
> > Sent: Monday, September 15, 2014 6:14 AM
> > To: linux-kernel@vger.kernel.org; linux-fsde...@vger.kernel.org;
> > linux-f2fs-de...@lists.sourceforge.net
> > Cc: Jaegeuk Kim
> > Subject: [f2fs-dev] [PATCH 07/10] f2fs: use meta_inode cache to improve
> > roll-forward speed
> >
> > Previously, all the dnode pages had to be read during roll-forward
> > recovery. Even worse, the whole chain was traversed twice.
> > This patch removes those redundant and costly read operations by using
> > the page cache of meta_inode and the readahead function as well.
> >
> > Signed-off-by: Jaegeuk Kim
> > ---
> >  fs/f2fs/checkpoint.c | 11 --
> >  fs/f2fs/f2fs.h       |  5 +++--
> >  fs/f2fs/recovery.c   | 59 +---
> >  fs/f2fs/segment.h    |  5 +++--
> >  4 files changed, 43 insertions(+), 37 deletions(-)
> >
> > diff --git a/fs/f2fs/checkpoint.c b/fs/f2fs/checkpoint.c
> > index 7262d99..d1ed889 100644
> > --- a/fs/f2fs/checkpoint.c
> > +++ b/fs/f2fs/checkpoint.c
> > @@ -82,6 +82,8 @@ static inline int get_max_meta_blks(struct f2fs_sb_info *sbi, int type)
> >  	case META_SSA:
> >  	case META_CP:
> >  		return 0;
> > +	case META_POR:
> > +		return SM_I(sbi)->main_blkaddr + sbi->user_block_count;
>
> Here we will skip virtual over-provision segments, so better to use
> TOTAL_BLKS(sbi).

> >  	default:
> >  		BUG();
> >  	}
> > @@ -90,11 +92,11 @@ static inline int get_max_meta_blks(struct f2fs_sb_info *sbi, int type)
> >  /*
> >   * Readahead CP/NAT/SIT/SSA pages
> >   */
> > -int ra_meta_pages(struct f2fs_sb_info *sbi, int start, int nrpages, int type)
> > +int ra_meta_pages(struct f2fs_sb_info *sbi, block_t start, int nrpages, int type)
> >  {
> >  	block_t prev_blk_addr = 0;
> >  	struct page *page;
> > -	int blkno = start;
> > +	block_t blkno = start;
> >  	int max_blks = get_max_meta_blks(sbi, type);
> >
> >  	struct f2fs_io_info fio = {
> > @@ -128,6 +130,11 @@ int ra_meta_pages(struct f2fs_sb_info *sbi, int start, int nrpages, int type)
> >  		/* get ssa/cp block addr */
> >  		blk_addr = blkno;
> >  		break;
> > +	case META_POR:
> > +		if (unlikely(blkno >= max_blks))
> > +			goto out;
> > +		blk_addr = blkno;
> > +		break;
>
> The real modification in the patch that was merged into the f2fs dev
> branch is as follows:
>
> -		/* get ssa/cp block addr */
> +	case META_POR:
> +		if (blkno >= max_blks || blkno < min_blks)
> +			goto out;
>
> IMHO, it's better to verify the boundary separately for META_{SSA,CP,POR}
> with unlikely.
> What do you think?

Not bad. Could you check the v2 below?
> > > default: > > BUG(); > > } > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > > index 4f84d2a..48d7d46 100644 > > --- a/fs/f2fs/f2fs.h > > +++ b/fs/f2fs/f2fs.h > > @@ -103,7 +103,8 @@ enum { > > META_CP, > > META_NAT, > > META_SIT, > > - META_SSA > > + META_SSA, > > + META_POR, > > }; > > > > /* for the list of ino */ > > @@ -1291,7 +1292,7 @@ void destroy_segment_manager_caches(void); > > */ > > struct page *grab_meta_page(struct f2fs_sb_info *, pgoff_t); > > struct page *get_meta_page(struct f2fs_sb_info *, pgoff_t); > > -int ra_meta_pages(struct f2fs_sb_info *, int, int, int); > > +int ra_meta_pages(struct f2fs_sb_info *, block_t, int, int); > > long sync_meta_pages(struct f2fs_sb_info *, enum page_type, long); > > void add_dirty_inode(struct f2fs_sb_info *, nid_t, int type); > > void remove_dirty_inode(struct f2fs_sb_info *, nid_t, int type); > > diff --git a/fs/f2fs/recovery.c b/fs/f2fs/recovery.c > > index 3736728..6f7fbfa 100644 > > --- a/fs/f2fs/recovery.c > > +++ b/fs/f2fs/recovery.c > > @@ -173,7 +173,7 @@ static int find_fsync_dnodes(struct f2fs_sb_info *sbi, > > struct list_head > > *head) > > { > > unsigned long long cp_ver = cur_cp_version(F2FS_CKPT(sbi)); > > struct curseg_info *curseg; > > - struct page *page; > > + struct page *page = NULL; > > block_t blkaddr; > > int err = 0; > > > > @@ -181,20 +181,19 @@ static int find_fsync_dnodes(struct f2fs_sb_info > > *sbi, struct list_head > > *head) > > curseg = CURSEG_I(sbi, CURSEG_WARM_NODE); > > blkaddr = NEXT_FREE_BLKADDR(sbi, curseg); > > > > - /* read node page */ > > - page = alloc_page(GFP_F2FS_ZERO); > > - if (!page) > > - return -ENOMEM; > > - lock_page(page); > > - > > while (1) { > > struct fsync_inode_entry *entry; > > > > - err = f2fs_submit_page_bio(sbi, page, blkaddr, READ_SYNC); > > - if (err) > > - return err; > > + if (blkaddr <
[PATCH v4 07/12] crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c
From: Behan Webster

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Reviewed-by: Jan-Simon Möller
Acked-by: Herbert Xu
---
 drivers/crypto/qat/qat_common/qat_algs.c | 31 ++-
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/crypto/qat/qat_common/qat_algs.c b/drivers/crypto/qat/qat_common/qat_algs.c
index 59df488..9cabadd 100644
--- a/drivers/crypto/qat/qat_common/qat_algs.c
+++ b/drivers/crypto/qat/qat_common/qat_algs.c
@@ -152,10 +152,7 @@ static int qat_alg_do_precomputes(struct icp_qat_hw_auth_algo_blk *hash,
 				  const uint8_t *auth_key,
 				  unsigned int auth_keylen, uint8_t *auth_state)
 {
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(ctx->hash_tfm)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, ctx->hash_tfm);
 	struct sha1_state sha1;
 	struct sha256_state sha256;
 	struct sha512_state sha512;
@@ -167,12 +164,12 @@ static int qat_alg_do_precomputes(struct icp_qat_hw_auth_algo_blk *hash,
 	__be64 *hash512_state_out;
 	int i, offset;
 
-	desc.shash.tfm = ctx->hash_tfm;
-	desc.shash.flags = 0x0;
+	shash->tfm = ctx->hash_tfm;
+	shash->flags = 0x0;
 
 	if (auth_keylen > block_size) {
 		char buff[SHA512_BLOCK_SIZE];
-		int ret = crypto_shash_digest(&desc.shash, auth_key,
+		int ret = crypto_shash_digest(shash, auth_key,
 					      auth_keylen, buff);
 		if (ret)
 			return ret;
@@ -195,10 +192,10 @@ static int qat_alg_do_precomputes(struct icp_qat_hw_auth_algo_blk *hash,
 		*opad_ptr ^= 0x5C;
 	}
 
-	if (crypto_shash_init(&desc.shash))
+	if (crypto_shash_init(shash))
 		return -EFAULT;
 
-	if (crypto_shash_update(&desc.shash, ipad, block_size))
+	if (crypto_shash_update(shash, ipad, block_size))
 		return -EFAULT;
 
 	hash_state_out = (__be32 *)hash->sha.state1;
@@ -206,19 +203,19 @@ static int qat_alg_do_precomputes(struct icp_qat_hw_auth_algo_blk *hash,
 	switch (ctx->qat_hash_alg) {
 	case ICP_QAT_HW_AUTH_ALGO_SHA1:
-		if (crypto_shash_export(&desc.shash, &sha1))
+		if (crypto_shash_export(shash, &sha1))
 			return -EFAULT;
 		for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
 			*hash_state_out = cpu_to_be32(*(sha1.state + i));
 		break;
 	case ICP_QAT_HW_AUTH_ALGO_SHA256:
-		if (crypto_shash_export(&desc.shash, &sha256))
+		if (crypto_shash_export(shash, &sha256))
 			return -EFAULT;
 		for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
 			*hash_state_out = cpu_to_be32(*(sha256.state + i));
 		break;
 	case ICP_QAT_HW_AUTH_ALGO_SHA512:
-		if (crypto_shash_export(&desc.shash, &sha512))
+		if (crypto_shash_export(shash, &sha512))
 			return -EFAULT;
 		for (i = 0; i < digest_size >> 3; i++, hash512_state_out++)
 			*hash512_state_out = cpu_to_be64(*(sha512.state + i));
@@ -227,10 +224,10 @@ static int qat_alg_do_precomputes(struct icp_qat_hw_auth_algo_blk *hash,
 		return -EFAULT;
 	}
 
-	if (crypto_shash_init(&desc.shash))
+	if (crypto_shash_init(shash))
 		return -EFAULT;
 
-	if (crypto_shash_update(&desc.shash, opad, block_size))
+	if (crypto_shash_update(shash, opad, block_size))
 		return -EFAULT;
 
 	offset = round_up(qat_get_inter_state_size(ctx->qat_hash_alg), 8);
@@ -239,19 +236,19 @@ static int qat_alg_do_precomputes(struct icp_qat_hw_auth_algo_blk *hash,
 	switch (ctx->qat_hash_alg) {
 	case ICP_QAT_HW_AUTH_ALGO_SHA1:
-		if (crypto_shash_export(&desc.shash, &sha1))
+		if (crypto_shash_export(shash, &sha1))
 			return -EFAULT;
 		for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
 			*hash_state_out = cpu_to_be32(*(sha1.state + i));
 		break;
 	case ICP_QAT_HW_AUTH_ALGO_SHA256:
-		if (crypto_shash_export(&desc.shash, &sha256))
+		if (crypto_shash_export(shash, &sha256))
 			return -EFAULT;
 		for (i = 0; i < digest_size >> 2; i++, hash_state_out++)
 			*hash_state_out = cpu_to_be32(*(sha256.state + i));
 		break;
 	case ICP_QAT_HW_AUTH_ALGO_SHA512:
-		if (crypto_shash_export(&desc.shash, &sha512))
+		if (crypto_shash_export(shash, &sha512))
 			return -EFAULT;
[PATCH v4 02/12] btrfs: LLVMLinux: Remove VLAIS
From: Vinícius Tinti

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch instead allocates the appropriate amount of
memory with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Vinícius Tinti
Reviewed-by: Jan-Simon Möller
Reviewed-by: Mark Charlebois
Signed-off-by: Behan Webster
Acked-by: Chris Mason
Acked-by: Herbert Xu
Cc: "David S. Miller"
---
 fs/btrfs/hash.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/hash.c b/fs/btrfs/hash.c
index 85889aa..4bf4d3a 100644
--- a/fs/btrfs/hash.c
+++ b/fs/btrfs/hash.c
@@ -33,18 +33,16 @@ void btrfs_hash_exit(void)
 
 u32 btrfs_crc32c(u32 crc, const void *address, unsigned int length)
 {
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(tfm)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, tfm);
+	u32 *ctx = (u32 *)shash_desc_ctx(shash);
 	int err;
 
-	desc.shash.tfm = tfm;
-	desc.shash.flags = 0;
-	*(u32 *)desc.ctx = crc;
+	shash->tfm = tfm;
+	shash->flags = 0;
+	*ctx = crc;
 
-	err = crypto_shash_update(&desc.shash, address, length);
+	err = crypto_shash_update(shash, address, length);
 	BUG_ON(err);
 
-	return *(u32 *)desc.ctx;
+	return *ctx;
 }
-- 
1.9.1
[PATCH v4 08/12] crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt
From: Jan-Simon Möller

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller
Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Acked-by: Herbert Xu
Cc: pagee...@freemail.hu
Cc: gmazyl...@gmail.com
Cc: "David S. Miller"
---
 drivers/md/dm-crypt.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index cd15e08..fc93b93 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -526,29 +526,26 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 			    u8 *data)
 {
 	struct iv_lmk_private *lmk = &cc->iv_gen_private.lmk;
-	struct {
-		struct shash_desc desc;
-		char ctx[crypto_shash_descsize(lmk->hash_tfm)];
-	} sdesc;
+	SHASH_DESC_ON_STACK(desc, lmk->hash_tfm);
 	struct md5_state md5state;
 	__le32 buf[4];
 	int i, r;
 
-	sdesc.desc.tfm = lmk->hash_tfm;
-	sdesc.desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+	desc->tfm = lmk->hash_tfm;
+	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
 
-	r = crypto_shash_init(&sdesc.desc);
+	r = crypto_shash_init(desc);
 	if (r)
 		return r;
 
 	if (lmk->seed) {
-		r = crypto_shash_update(&sdesc.desc, lmk->seed, LMK_SEED_SIZE);
+		r = crypto_shash_update(desc, lmk->seed, LMK_SEED_SIZE);
 		if (r)
 			return r;
 	}
 
 	/* Sector is always 512B, block size 16, add data of blocks 1-31 */
-	r = crypto_shash_update(&sdesc.desc, data + 16, 16 * 31);
+	r = crypto_shash_update(desc, data + 16, 16 * 31);
 	if (r)
 		return r;
 
@@ -557,12 +554,12 @@ static int crypt_iv_lmk_one(struct crypt_config *cc, u8 *iv,
 	buf[1] = cpu_to_le32((((u64)dmreq->iv_sector >> 32) & 0x00FFFFFF) | 0x80000000);
 	buf[2] = cpu_to_le32(4024);
 	buf[3] = 0;
-	r = crypto_shash_update(&sdesc.desc, (u8 *)buf, sizeof(buf));
+	r = crypto_shash_update(desc, (u8 *)buf, sizeof(buf));
 	if (r)
 		return r;
 
 	/* No MD5 padding here */
-	r = crypto_shash_export(&sdesc.desc, &md5state);
+	r = crypto_shash_export(desc, &md5state);
 	if (r)
 		return r;
 
@@ -679,10 +676,7 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
 	struct iv_tcw_private *tcw = &cc->iv_gen_private.tcw;
 	u64 sector = cpu_to_le64((u64)dmreq->iv_sector);
 	u8 buf[TCW_WHITENING_SIZE];
-	struct {
-		struct shash_desc desc;
-		char ctx[crypto_shash_descsize(tcw->crc32_tfm)];
-	} sdesc;
+	SHASH_DESC_ON_STACK(desc, tcw->crc32_tfm);
 	int i, r;
 
 	/* xor whitening with sector number */
@@ -691,16 +685,16 @@ static int crypt_iv_tcw_whitening(struct crypt_config *cc,
 	crypto_xor(&buf[8], (u8 *)&sector, 8);
 
 	/* calculate crc32 for every 32bit part and xor it */
-	sdesc.desc.tfm = tcw->crc32_tfm;
-	sdesc.desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+	desc->tfm = tcw->crc32_tfm;
+	desc->flags = CRYPTO_TFM_REQ_MAY_SLEEP;
 	for (i = 0; i < 4; i++) {
-		r = crypto_shash_init(&sdesc.desc);
+		r = crypto_shash_init(desc);
 		if (r)
 			goto out;
-		r = crypto_shash_update(&sdesc.desc, &buf[i * 4], 4);
+		r = crypto_shash_update(desc, &buf[i * 4], 4);
 		if (r)
 			goto out;
-		r = crypto_shash_final(&sdesc.desc, &buf[i * 4]);
+		r = crypto_shash_final(desc, &buf[i * 4]);
 		if (r)
 			goto out;
 	}
-- 
1.9.1
Re: [PATCH] kernfs: use stack-buf for small writes.
On Tue, 23 Sep 2014 00:18:17 -0400 Tejun Heo wrote:

> On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote:
> ...
> > Note that reads from a sysfs file are already safe due to the use of
> > seqfile. The first read will allocate a buffer (m->buf) which will
> > be used for all subsequent reads.
>
> Hmmm? How is seqfile safe? Where would the seq op write to?

seqfile is only safe for reads. sysfs via kernfs uses seq_read(), so there
is only a single allocation on the first read.

It isn't really related to fixing writes, except to point out that only
writes need to be "fixed". Reads already work.

Separately:

> Ugh... :( If this can't be avoided at all, I'd much prefer it to be
> something explicit - a flag marking the file as needing a persistent
> write buffer which is allocated on open. "Small" writes on stack
> feels way too implicit to me.

How about if we add seq_getbuf() and seq_putbuf() to seqfile, which take a
'struct seq_file' and a size and return the ->buf after making sure it is
big enough? They would also claim and release the seqfile ->lock.

Then we would be using the same buffer for reads and writes.

Does that sound suitable? It uses existing infrastructure and avoids having
to identify in advance which attributes it is important for.

Thanks,
NeilBrown
[PATCH v4 01/12] crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code
From: Behan Webster

Add a macro which replaces the use of a Variable Length Array In Struct
(VLAIS) with a C99 compliant equivalent. This macro instead allocates the
appropriate amount of memory using a char array.

The new code can be compiled with both gcc and clang.

struct shash_desc contains a flexible array member ctx declared with
CRYPTO_MINALIGN_ATTR, so sizeof(struct shash_desc) aligns the beginning of
the array declared after struct shash_desc with long long.

No trailing padding is required because it is not a struct type that can be
used in an array.

The CRYPTO_MINALIGN_ATTR is required so that desc is aligned with long long,
as would be the case for a struct containing a member with
CRYPTO_MINALIGN_ATTR.

If you want to get to the ctx at the end of the shash_desc as before, you
can do so using shash_desc_ctx(shash).

Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Acked-by: Herbert Xu
Cc: Michał Mirosław
---
 include/crypto/hash.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/crypto/hash.h b/include/crypto/hash.h
index a391955..74b13ec 100644
--- a/include/crypto/hash.h
+++ b/include/crypto/hash.h
@@ -58,6 +58,11 @@ struct shash_desc {
 	void *__ctx[] CRYPTO_MINALIGN_ATTR;
 };
 
+#define SHASH_DESC_ON_STACK(shash, ctx)				  \
+	char __##shash##_desc[sizeof(struct shash_desc) +	  \
+		crypto_shash_descsize(ctx)] CRYPTO_MINALIGN_ATTR; \
+	struct shash_desc *shash = (struct shash_desc *)__##shash##_desc
+
 struct shash_alg {
 	int (*init)(struct shash_desc *desc);
 	int (*update)(struct shash_desc *desc, const u8 *data,
-- 
1.9.1
[PATCH v4 09/12] crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c
From: Jan-Simon Möller

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller
Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Acked-by: Herbert Xu
Cc: pagee...@freemail.hu
---
 crypto/hmac.c | 25 +++-
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/crypto/hmac.c b/crypto/hmac.c
index 8d9544c..e392219 100644
--- a/crypto/hmac.c
+++ b/crypto/hmac.c
@@ -52,20 +52,17 @@ static int hmac_setkey(struct crypto_shash *parent,
 	struct hmac_ctx *ctx = align_ptr(opad + ss,
 					 crypto_tfm_ctx_alignment());
 	struct crypto_shash *hash = ctx->hash;
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(hash)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, hash);
 	unsigned int i;
 
-	desc.shash.tfm = hash;
-	desc.shash.flags = crypto_shash_get_flags(parent) &
-			   CRYPTO_TFM_REQ_MAY_SLEEP;
+	shash->tfm = hash;
+	shash->flags = crypto_shash_get_flags(parent)
+		& CRYPTO_TFM_REQ_MAY_SLEEP;
 
 	if (keylen > bs) {
 		int err;
 
-		err = crypto_shash_digest(&desc.shash, inkey, keylen, ipad);
+		err = crypto_shash_digest(shash, inkey, keylen, ipad);
 		if (err)
 			return err;
 
@@ -81,12 +78,12 @@ static int hmac_setkey(struct crypto_shash *parent,
 		opad[i] ^= 0x5c;
 	}
 
-	return crypto_shash_init(&desc.shash) ?:
-	       crypto_shash_update(&desc.shash, ipad, bs) ?:
-	       crypto_shash_export(&desc.shash, ipad) ?:
-	       crypto_shash_init(&desc.shash) ?:
-	       crypto_shash_update(&desc.shash, opad, bs) ?:
-	       crypto_shash_export(&desc.shash, opad);
+	return crypto_shash_init(shash) ?:
+	       crypto_shash_update(shash, ipad, bs) ?:
+	       crypto_shash_export(shash, ipad) ?:
+	       crypto_shash_init(shash) ?:
+	       crypto_shash_update(shash, opad, bs) ?:
+	       crypto_shash_export(shash, opad);
 }
 
 static int hmac_export(struct shash_desc *pdesc, void *out)
-- 
1.9.1
Re: [PATCH v1 2/5] mm: add full variable in swap_info_struct
On Mon, Sep 22, 2014 at 01:45:22PM -0700, Andrew Morton wrote:
> On Mon, 22 Sep 2014 09:03:08 +0900 Minchan Kim wrote:
>
> > Now, swap leans on !p->highest_bit to indicate that a swap is full.
> > It works well for normal swap because every slot on the swap device
> > is used up when the swap is full, but in the case of zram, swap still
> > sees many empty slots although the backing device (ie, zram) is full
> > since zram's limit is over, so it could cause trouble when swap uses
> > highest_bit to select a new slot via free_cluster.
> >
> > This patch introduces a full variable in swap_info_struct
> > to solve the problem.
> >
> > ...
> >
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -224,6 +224,7 @@ struct swap_info_struct {
> >  	struct swap_cluster_info free_cluster_tail; /* free cluster list tail */
> >  	unsigned int lowest_bit;	/* index of first free in swap_map */
> >  	unsigned int highest_bit;	/* index of last free in swap_map */
> > +	bool	full;			/* whether swap is full or not */
>
> This is protected by swap_info_struct.lock, I worked out.
>
> There's a large comment at swap_info_struct.lock which could be updated.

Sure.

--
Kind regards,
Minchan Kim
[PATCH v4 03/12] crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c
From: Jan-Simon Möller

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller
Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Acked-by: Herbert Xu
---
 drivers/crypto/ccp/ccp-crypto-sha.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-crypto-sha.c b/drivers/crypto/ccp/ccp-crypto-sha.c
index 873f234..9653157 100644
--- a/drivers/crypto/ccp/ccp-crypto-sha.c
+++ b/drivers/crypto/ccp/ccp-crypto-sha.c
@@ -198,10 +198,9 @@ static int ccp_sha_setkey(struct crypto_ahash *tfm, const u8 *key,
 {
 	struct ccp_ctx *ctx = crypto_tfm_ctx(crypto_ahash_tfm(tfm));
 	struct crypto_shash *shash = ctx->u.sha.hmac_tfm;
-	struct {
-		struct shash_desc sdesc;
-		char ctx[crypto_shash_descsize(shash)];
-	} desc;
+
+	SHASH_DESC_ON_STACK(sdesc, shash);
+
 	unsigned int block_size = crypto_shash_blocksize(shash);
 	unsigned int digest_size = crypto_shash_digestsize(shash);
 	int i, ret;
@@ -216,11 +215,11 @@ static int ccp_sha_setkey(struct crypto_ahash *tfm, const u8 *key,
 
 	if (key_len > block_size) {
 		/* Must hash the input key */
-		desc.sdesc.tfm = shash;
-		desc.sdesc.flags = crypto_ahash_get_flags(tfm) &
+		sdesc->tfm = shash;
+		sdesc->flags = crypto_ahash_get_flags(tfm) &
 			CRYPTO_TFM_REQ_MAY_SLEEP;
 
-		ret = crypto_shash_digest(&desc.sdesc, key, key_len,
+		ret = crypto_shash_digest(sdesc, key, key_len,
 					  ctx->u.sha.key);
 		if (ret) {
 			crypto_ahash_set_flags(tfm, CRYPTO_TFM_RES_BAD_KEY_LEN);
-- 
1.9.1
[PATCH v4 10/12] crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c
From: Jan-Simon Möller

Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99
compliant equivalent. This patch allocates the appropriate amount of memory
with a char array, using the SHASH_DESC_ON_STACK macro.

The new code can be compiled with both gcc and clang.

Signed-off-by: Jan-Simon Möller
Signed-off-by: Behan Webster
Reviewed-by: Mark Charlebois
Acked-by: Herbert Xu
Cc: pagee...@freemail.hu
Cc: "David S. Miller"
---
 lib/libcrc32c.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/lib/libcrc32c.c b/lib/libcrc32c.c
index b3131f5..6a08ce7 100644
--- a/lib/libcrc32c.c
+++ b/lib/libcrc32c.c
@@ -41,20 +41,18 @@ static struct crypto_shash *tfm;
 
 u32 crc32c(u32 crc, const void *address, unsigned int length)
 {
-	struct {
-		struct shash_desc shash;
-		char ctx[crypto_shash_descsize(tfm)];
-	} desc;
+	SHASH_DESC_ON_STACK(shash, tfm);
+	u32 *ctx = (u32 *)shash_desc_ctx(shash);
 	int err;
 
-	desc.shash.tfm = tfm;
-	desc.shash.flags = 0;
-	*(u32 *)desc.ctx = crc;
+	shash->tfm = tfm;
+	shash->flags = 0;
+	*ctx = crc;
 
-	err = crypto_shash_update(&desc.shash, address, length);
+	err = crypto_shash_update(shash, address, length);
 	BUG_ON(err);
 
-	return *(u32 *)desc.ctx;
+	return *ctx;
 }
 EXPORT_SYMBOL(crc32c);
-- 
1.9.1
Re: [PATCH v1 1/5] zram: generalize swap_slot_free_notify
Hi Andrew, On Mon, Sep 22, 2014 at 01:41:09PM -0700, Andrew Morton wrote: > On Mon, 22 Sep 2014 09:03:07 +0900 Minchan Kim wrote: > > > Currently, swap_slot_free_notify is used for zram to free > > duplicated copy page for memory efficiency when it knows > > there is no reference to the swap slot. > > > > This patch generalizes it to be able to use for other > > swap hint to communicate with VM. > > > > I really think we need to do a better job of documenting the code. > > > index 94d93b1f8b53..c262bfbeafa9 100644 > > --- a/Documentation/filesystems/Locking > > +++ b/Documentation/filesystems/Locking > > @@ -405,7 +405,7 @@ prototypes: > > void (*unlock_native_capacity) (struct gendisk *); > > int (*revalidate_disk) (struct gendisk *); > > int (*getgeo)(struct block_device *, struct hd_geometry *); > > - void (*swap_slot_free_notify) (struct block_device *, unsigned long); > > + int (*swap_hint) (struct block_device *, unsigned int, void *); > > > > locking rules: > > bd_mutex > > @@ -418,7 +418,7 @@ media_changed: no > > unlock_native_capacity:no > > revalidate_disk: no > > getgeo:no > > -swap_slot_free_notify: no (see below) > > +swap_hint: no (see below) > > This didn't tell anyone anythnig much. Yeb. :( > > > index d78b245bae06..22a37764c409 100644 > > --- a/drivers/block/zram/zram_drv.c > > +++ b/drivers/block/zram/zram_drv.c > > @@ -926,7 +926,8 @@ error: > > bio_io_error(bio); > > } > > > > -static void zram_slot_free_notify(struct block_device *bdev, > > +/* this callback is with swap_lock and sometimes page table lock held */ > > OK, that was useful. > > It's called "page_table_lock". > > Also *which* page_table_lock? current->mm? It depends on ALLOC_SPLIT_PTLOCKS so it could be page->ptl, too. So, it would be better to call it as *ptlock*? Since it's ptlock, it isn't related to which mm struct. What we should make sure is just ptlock which belong to the page table pointed to this swap page. So, I want this. 
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index c262bfbeafa9..19d2726e34f4 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -423,8 +423,8 @@ swap_hint: no (see below) media_changed, unlock_native_capacity and revalidate_disk are called only from check_disk_change(). -swap_slot_free_notify is called with swap_lock and sometimes the page lock -held. +swap_hint is called with swap_info_struct->lock and sometimes the ptlock +of the page table pointed to the swap page. --- file_operations --- > > > +static int zram_slot_free_notify(struct block_device *bdev, > > unsigned long index) > > { > > struct zram *zram; > > > > ... > > > > --- a/include/linux/blkdev.h > > +++ b/include/linux/blkdev.h > > @@ -1609,6 +1609,10 @@ static inline bool > > blk_integrity_is_initialized(struct gendisk *g) > > > > #endif /* CONFIG_BLK_DEV_INTEGRITY */ > > > > +enum swap_blk_hint { > > + SWAP_FREE, > > +}; > > This would be a great place to document SWAP_FREE. Yes, > > > struct block_device_operations { > > int (*open) (struct block_device *, fmode_t); > > void (*release) (struct gendisk *, fmode_t); > > @@ -1624,8 +1628,7 @@ struct block_device_operations { > > void (*unlock_native_capacity) (struct gendisk *); > > int (*revalidate_disk) (struct gendisk *); > > int (*getgeo)(struct block_device *, struct hd_geometry *); > > - /* this callback is with swap_lock and sometimes page table lock held */ > > - void (*swap_slot_free_notify) (struct block_device *, unsigned long); > > + int (*swap_hint)(struct block_device *, unsigned int, void *); > > And this would be a suitable place to document ->swap_hint(). If we consider to be able to add more hints in future so it could be verbose, IMO, it would be better to describe it in enum swap_hint. :) > > - Hint from who to who? Is it the caller providing the callee a hint > or is the caller asking the callee for a hint? > > - What is the meaning of the return value?
> > - What are the meaning of the arguments? Okay. > > Please don't omit the argument names like this. They are useful! How > is a reader to know what that "unsigned int" and "void *" actually > *do*? Yes. > > The second arg-which-doesn't-have-a-name should have had type > swap_blk_hint, yes? Yes. > > swap_blk_hint should be called swap_block_hint. I assume that's what > "blk" means. Why does the name have "block" in there anyway? It has > something to do with disk blocks? How is anyone supposed to work that > out? Yeb, I think we don't need block in name. I will remove it. > > ->swap_hint was converted to return an `int', but all the callers > simply ignore the return value. You're right. All caller doesn't use it in this patch but this
[PATCH v4 11/12] security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c
From: Behan Webster Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99 compliant equivalent. This patch allocates the appropriate amount of memory using a char array using the SHASH_DESC_ON_STACK macro. The new code can be compiled with both gcc and clang. Signed-off-by: Behan Webster Reviewed-by: Mark Charlebois Reviewed-by: Jan-Simon Möller Acked-by: Herbert Xu Cc: t...@linutronix.de --- security/integrity/ima/ima_crypto.c | 47 +++-- 1 file changed, 19 insertions(+), 28 deletions(-) diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c index 0bd7328..e35f5d9 100644 --- a/security/integrity/ima/ima_crypto.c +++ b/security/integrity/ima/ima_crypto.c @@ -380,17 +380,14 @@ static int ima_calc_file_hash_tfm(struct file *file, loff_t i_size, offset = 0; char *rbuf; int rc, read = 0; - struct { - struct shash_desc shash; - char ctx[crypto_shash_descsize(tfm)]; - } desc; + SHASH_DESC_ON_STACK(shash, tfm); - desc.shash.tfm = tfm; - desc.shash.flags = 0; + shash->tfm = tfm; + shash->flags = 0; hash->length = crypto_shash_digestsize(tfm); - rc = crypto_shash_init(&desc.shash); + rc = crypto_shash_init(shash); if (rc != 0) return rc; @@ -420,7 +417,7 @@ static int ima_calc_file_hash_tfm(struct file *file, break; offset += rbuf_len; - rc = crypto_shash_update(&desc.shash, rbuf, rbuf_len); + rc = crypto_shash_update(shash, rbuf, rbuf_len); if (rc) break; } @@ -429,7 +426,7 @@ static int ima_calc_file_hash_tfm(struct file *file, kfree(rbuf); out: if (!rc) - rc = crypto_shash_final(&desc.shash, hash->digest); + rc = crypto_shash_final(shash, hash->digest); return rc; } @@ -487,18 +484,15 @@ static int ima_calc_field_array_hash_tfm(struct ima_field_data *field_data, struct ima_digest_data *hash, struct crypto_shash *tfm) { - struct { - struct shash_desc shash; - char ctx[crypto_shash_descsize(tfm)]; - } desc; + SHASH_DESC_ON_STACK(shash, tfm); int rc, i; - desc.shash.tfm = tfm; - desc.shash.flags = 0; + shash->tfm = tfm; + shash->flags = 0; hash->length = crypto_shash_digestsize(tfm); - rc = crypto_shash_init(&desc.shash); + rc = crypto_shash_init(shash); if (rc != 0) return rc; @@ -508,7 +502,7 @@ static int ima_calc_field_array_hash_tfm(struct ima_field_data *field_data, u32 datalen = field_data[i].len; if (strcmp(td->name, IMA_TEMPLATE_IMA_NAME) != 0) { - rc = crypto_shash_update(&desc.shash, + rc = crypto_shash_update(shash, (const u8 *) &field_data[i].len, sizeof(field_data[i].len)); if (rc) @@ -518,13 +512,13 @@ static int ima_calc_field_array_hash_tfm(struct ima_field_data *field_data, data_to_hash = buffer; datalen = IMA_EVENT_NAME_LEN_MAX + 1; } - rc = crypto_shash_update(&desc.shash, data_to_hash, datalen); + rc = crypto_shash_update(shash, data_to_hash, datalen); if (rc) break; } if (!rc) - rc = crypto_shash_final(&desc.shash, hash->digest); + rc = crypto_shash_final(shash, hash->digest); return rc; } @@ -565,15 -559,12 @@ static int __init ima_calc_boot_aggregate_tfm(char *digest, { u8 pcr_i[TPM_DIGEST_SIZE]; int rc, i; - struct { - struct shash_desc shash; - char ctx[crypto_shash_descsize(tfm)]; - } desc; + SHASH_DESC_ON_STACK(shash, tfm); - desc.shash.tfm = tfm; - desc.shash.flags = 0; + shash->tfm = tfm; + shash->flags = 0; - rc = crypto_shash_init(&desc.shash); + rc = crypto_shash_init(shash); if (rc != 0) return rc; @@ -581,10 +572,10 @@ static int __init ima_calc_boot_aggregate_tfm(char *digest, for (i = TPM_PCR0; i < TPM_PCR8; i++) { ima_pcrread(i, pcr_i); /* now accumulate with current aggregate */ - rc = crypto_shash_update(&desc.shash, pcr_i, TPM_DIGEST_SIZE); + rc = crypto_shash_update(shash, pcr_i, TPM_DIGEST_SIZE); } if (!rc) - crypto_shash_final(&desc.shash, digest); + crypto_shash_final(shash, digest); return rc; } -- 1.9.1
[PATCH v4 00/12] LLVMLinux: Patches to enable the kernel to be compiled with clang/LLVM
From: Behan Webster Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99 compliant equivalent. These patches allocate the appropriate amount of memory using a char array using the SHASH_DESC_ON_STACK macro. There are places in the kernel whose maintainers have previously taken our patches to remove VLAIS from their crypto code. Once this patch set is accepted into mainline, I'll go back and resubmit patches to these maintainers to use this new macro so the same approach is used consistently in all places in the kernel. The LLVMLinux project aims to fully build the Linux kernel using both gcc and clang (the C front end for the LLVM compiler infrastructure project). Behan Webster (6): crypto: LLVMLinux: Add macro to remove use of VLAIS in crypto code crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c crypto: LLVMLinux: Remove VLAIS from crypto/n2_core.c crypto: LLVMLinux: Remove VLAIS from crypto/omap_sham.c crypto: LLVMLinux: Remove VLAIS from crypto/.../qat_algs.c security, crypto: LLVMLinux: Remove VLAIS from ima_crypto.c Jan-Simon Möller (5): crypto: LLVMLinux: Remove VLAIS from crypto/ccp/ccp-crypto-sha.c crypto, dm: LLVMLinux: Remove VLAIS usage from dm-crypt crypto: LLVMLinux: Remove VLAIS usage from crypto/hmac.c crypto: LLVMLinux: Remove VLAIS usage from libcrc32c.c crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c Vinícius Tinti (1): btrfs: LLVMLinux: Remove VLAIS crypto/hmac.c| 25 - crypto/testmgr.c | 14 -- drivers/crypto/ccp/ccp-crypto-sha.c | 13 - drivers/crypto/mv_cesa.c | 41 drivers/crypto/n2_core.c | 11 +++- drivers/crypto/omap-sham.c | 28 --- drivers/crypto/qat/qat_common/qat_algs.c | 31 ++--- drivers/md/dm-crypt.c| 34 ++- fs/btrfs/hash.c | 16 +-- include/crypto/hash.h| 5 lib/libcrc32c.c | 16 +-- security/integrity/ima/ima_crypto.c | 47 +--- 12 files changed, 122 insertions(+), 159 deletions(-) -- 1.9.1
[PATCH v4 12/12] crypto: LLVMLinux: Remove VLAIS usage from crypto/testmgr.c
From: Jan-Simon Möller Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99 compliant equivalent. This patch allocates the appropriate amount of memory using a char array using the SHASH_DESC_ON_STACK macro. The new code can be compiled with both gcc and clang. Signed-off-by: Jan-Simon Möller Signed-off-by: Behan Webster Reviewed-by: Mark Charlebois Acked-by: Herbert Xu Cc: pagee...@freemail.hu --- crypto/testmgr.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/crypto/testmgr.c b/crypto/testmgr.c index ac2b631..b959c0c 100644 --- a/crypto/testmgr.c +++ b/crypto/testmgr.c @@ -1714,16 +1714,14 @@ static int alg_test_crc32c(const struct alg_test_desc *desc, } do { - struct { - struct shash_desc shash; - char ctx[crypto_shash_descsize(tfm)]; - } sdesc; + SHASH_DESC_ON_STACK(shash, tfm); + u32 *ctx = (u32 *)shash_desc_ctx(shash); - sdesc.shash.tfm = tfm; - sdesc.shash.flags = 0; + shash->tfm = tfm; + shash->flags = 0; - *(u32 *)sdesc.ctx = le32_to_cpu(420553207); - err = crypto_shash_final(&sdesc.shash, (u8 *)&val); + *ctx = le32_to_cpu(420553207); + err = crypto_shash_final(shash, (u8 *)&val); if (err) { printk(KERN_ERR "alg: crc32c: Operation failed for " "%s: %d\n", driver, err); -- 1.9.1
[PATCH v4 04/12] crypto: LLVMLinux: Remove VLAIS from crypto/mv_cesa.c
From: Behan Webster Replaced the use of a Variable Length Array In Struct (VLAIS) with a C99 compliant equivalent. This patch allocates the appropriate amount of memory using a char array using the SHASH_DESC_ON_STACK macro. The new code can be compiled with both gcc and clang. Signed-off-by: Behan Webster Reviewed-by: Mark Charlebois Reviewed-by: Jan-Simon Möller Acked-by: Herbert Xu --- drivers/crypto/mv_cesa.c | 41 ++--- 1 file changed, 18 insertions(+), 23 deletions(-) diff --git a/drivers/crypto/mv_cesa.c b/drivers/crypto/mv_cesa.c index 29d0ee5..032c72c 100644 --- a/drivers/crypto/mv_cesa.c +++ b/drivers/crypto/mv_cesa.c @@ -402,26 +402,23 @@ static int mv_hash_final_fallback(struct ahash_request *req) { const struct mv_tfm_hash_ctx *tfm_ctx = crypto_tfm_ctx(req->base.tfm); struct mv_req_hash_ctx *req_ctx = ahash_request_ctx(req); - struct { - struct shash_desc shash; - char ctx[crypto_shash_descsize(tfm_ctx->fallback)]; - } desc; + SHASH_DESC_ON_STACK(shash, tfm_ctx->fallback); int rc; - desc.shash.tfm = tfm_ctx->fallback; - desc.shash.flags = CRYPTO_TFM_REQ_MAY_SLEEP; + shash->tfm = tfm_ctx->fallback; + shash->flags = CRYPTO_TFM_REQ_MAY_SLEEP; if (unlikely(req_ctx->first_hash)) { - crypto_shash_init(&desc.shash); - crypto_shash_update(&desc.shash, req_ctx->buffer, + crypto_shash_init(shash); + crypto_shash_update(shash, req_ctx->buffer, req_ctx->extra_bytes); } else { /* only SHA1 for now */ - rc = mv_hash_import_sha1_ctx(req_ctx, &desc.shash); + rc = mv_hash_import_sha1_ctx(req_ctx, shash); if (rc) goto out; } - rc = crypto_shash_final(&desc.shash, req->result); + rc = crypto_shash_final(shash, req->result); out: return rc; } @@ -794,23 +791,21 @@ static int mv_hash_setkey(struct crypto_ahash *tfm, const u8 * key, ss = crypto_shash_statesize(ctx->base_hash); { - struct { - struct shash_desc shash; - char ctx[crypto_shash_descsize(ctx->base_hash)]; - } desc; + SHASH_DESC_ON_STACK(shash, ctx->base_hash); + unsigned int i; char ipad[ss]; char opad[ss]; - desc.shash.tfm = ctx->base_hash; - desc.shash.flags = crypto_shash_get_flags(ctx->base_hash) & + shash->tfm = ctx->base_hash; + shash->flags = crypto_shash_get_flags(ctx->base_hash) & CRYPTO_TFM_REQ_MAY_SLEEP; if (keylen > bs) { int err; err = - crypto_shash_digest(&desc.shash, key, keylen, ipad); + crypto_shash_digest(shash, key, keylen, ipad); if (err) return err; @@ -826,12 +821,12 @@ static int mv_hash_setkey(struct crypto_ahash *tfm, const u8 * key, opad[i] ^= 0x5c; } - rc = crypto_shash_init(&desc.shash) ? : - crypto_shash_update(&desc.shash, ipad, bs) ? : - crypto_shash_export(&desc.shash, ipad) ? : - crypto_shash_init(&desc.shash) ? : - crypto_shash_update(&desc.shash, opad, bs) ? : - crypto_shash_export(&desc.shash, opad); + rc = crypto_shash_init(shash) ? : + crypto_shash_update(shash, ipad, bs) ? : + crypto_shash_export(shash, ipad) ? : + crypto_shash_init(shash) ? : + crypto_shash_update(shash, opad, bs) ? : + crypto_shash_export(shash, opad); if (rc == 0) mv_hash_init_ivs(ctx, ipad, opad); -- 1.9.1
linux-next: manual merge of the tiny tree with the tip tree
Hi Josh, Today's linux-next merge of the tiny tree got conflicts in arch/x86/kernel/process_32.c and arch/x86/kernel/process_64.c between commits dc56c0f9b870 ("x86, fpu: Shift "fpu_counter = 0" from copy_thread() to arch_dup_task_struct()") and 6f46b3aef003 ("x86: copy_thread: Don't nullify ->ptrace_bps twice") from the tip tree and commits a1cf09f93e66 ("x86: process: Unify 32-bit and 64-bit copy_thread I/O bitmap handling") and e4a191d1e05b ("x86: Support compiling out userspace I/O (iopl and ioperm)") from the tiny tree. I fixed it up (I think - see below) and can carry the fix as necessary (no action is required). -- Cheers, Stephen Rothwells...@canb.auug.org.au diff --cc arch/x86/kernel/process_32.c index 8f3ebfe710d0,e37f006fda6e.. --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@@ -153,7 -153,9 +154,7 @@@ int copy_thread(unsigned long clone_fla childregs->orig_ax = -1; childregs->cs = __KERNEL_CS | get_kernel_rpl(); childregs->flags = X86_EFLAGS_IF | X86_EFLAGS_FIXED; - p->thread.io_bitmap_ptr = NULL; - p->thread.fpu_counter = 0; + clear_thread_io_bitmap(p); - memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps)); return 0; } *childregs = *current_pt_regs(); @@@ -164,22 -166,12 +165,9 @@@ p->thread.ip = (unsigned long) ret_from_fork; task_user_gs(p) = get_user_gs(current_pt_regs()); - p->thread.io_bitmap_ptr = NULL; - p->thread.fpu_counter = 0; + clear_thread_io_bitmap(p); tsk = current; - err = -ENOMEM; - - if (unlikely(test_tsk_thread_flag(tsk, TIF_IO_BITMAP))) { - p->thread.io_bitmap_ptr = kmemdup(tsk->thread.io_bitmap_ptr, - IO_BITMAP_BYTES, GFP_KERNEL); - if (!p->thread.io_bitmap_ptr) { - p->thread.io_bitmap_max = 0; - return -ENOMEM; - } - set_tsk_thread_flag(p, TIF_IO_BITMAP); - } - - err = 0; - memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps)); - /* * Set a new TLS for the child thread? */ diff --cc arch/x86/kernel/process_64.c index 3ed4a68d4013,80f348659edd.. 
--- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@@ -163,7 -164,8 +164,7 @@@ int copy_thread(unsigned long clone_fla p->thread.sp = (unsigned long) childregs; p->thread.usersp = me->thread.usersp; set_tsk_thread_flag(p, TIF_FORK); - p->thread.io_bitmap_ptr = NULL; - p->thread.fpu_counter = 0; + clear_thread_io_bitmap(p); savesegment(gs, p->thread.gsindex); p->thread.gs = p->thread.gsindex ? 0 : me->thread.gs; @@@ -191,17 -193,8 +192,6 @@@ if (sp) childregs->sp = sp; - err = -ENOMEM; - if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) { - p->thread.io_bitmap_ptr = kmemdup(me->thread.io_bitmap_ptr, - IO_BITMAP_BYTES, GFP_KERNEL); - if (!p->thread.io_bitmap_ptr) { - p->thread.io_bitmap_max = 0; - return -ENOMEM; - } - set_tsk_thread_flag(p, TIF_IO_BITMAP); - } - memset(p->thread.ptrace_bps, 0, sizeof(p->thread.ptrace_bps)); -- /* * Set a new TLS for the child thread? */
linux-next: manual merge of the tiny tree with the tip tree
Hi Josh, Today's linux-next merge of the tiny tree got a conflict in arch/x86/kernel/cpu/common.c between commit ce4b1b16502b ("x86/smpboot: Initialize secondary CPU only if master CPU will wait for it") from the tip tree and commit e4a191d1e05b ("x86: Support compiling out userspace I/O (iopl and ioperm)") from the tiny tree. I fixed it up (see below) and can carry the fix as necessary (no action is required). -- Cheers, Stephen Rothwells...@canb.auug.org.au diff --cc arch/x86/kernel/cpu/common.c index 3d05d4699dbd,11e08cefdb6e.. --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@@ -1294,10 -1281,7 +1294,9 @@@ void cpu_init(void struct task_struct *me; struct tss_struct *t; unsigned long v; - int cpu; + int cpu = stack_smp_processor_id(); - int i; + + wait_for_master_cpu(cpu); /* * Load microcode on this cpu if a valid microcode is available.
Re: [PATCH] ath: change logging functions to return void
Joe Perches writes: > The return values are not used by callers of these functions > so change the functions to return void. > > Other miscellanea: > > o add __printf verification to wil6210 logging functions > No format/argument mismatches found > > Signed-off-by: Joe Perches > --- > This change is associated to a desire to eventually > change printk to return void. > > drivers/net/wireless/ath/ath10k/debug.c| 18 +- > drivers/net/wireless/ath/ath10k/debug.h| 6 +++--- > drivers/net/wireless/ath/ath6kl/common.h | 2 +- > drivers/net/wireless/ath/ath6kl/debug.c| 28 > drivers/net/wireless/ath/ath6kl/debug.h| 13 ++--- For ath6kl and ath10k: Acked-by: Kalle Valo > drivers/net/wireless/ath/wil6210/debug.c | 14 -- > drivers/net/wireless/ath/wil6210/wil6210.h | 7 +-- > 7 files changed, 32 insertions(+), 56 deletions(-) John, as this patch also contains a wil6210 change how do you want to handle this? -- Kalle Valo
RE: [PATCH v2] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- > ow...@vger.kernel.org] On Behalf Of Dexuan Cui > Sent: Tuesday, September 23, 2014 13:01 PM > To: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev- > de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; > jasow...@redhat.com > Cc: KY Srinivasan; Haiyang Zhang > Subject: [PATCH v2] Tools: hv: vssdaemon: ignore the EBUSY on multiple > freezing the same partition > > v2: I added "errno = 0;" in the ioctl() typo -- "in the ioctl()" should be "before the ioctl()". Thanks, -- Dexuan
Re: sleeping while atomic in blk_free_devt
> On Sep 22, 2014, at 8:49 PM, Dave Jones wrote: > > Just got this when removing a USB memory stick. > > BUG: sleeping function called from invalid context at block/genhd.c:448 Fixed in for-linus, it's going out tomorrow.
Re: [PATCH] Fix the issue that lowmemkiller fell into a cycle that try to kill a task
On Tue, Sep 23, 2014 at 10:57:09AM +0800, Hui Zhu wrote: > The cause of this issue is when free memory size is low and a lot of task is > trying to shrink the memory, the task that is killed by lowmemkiller cannot > get > CPU to exit itself. > > Fix this issue with change the scheduling policy to SCHED_FIFO if a task's > flag > is TIF_MEMDIE in lowmemkiller. > > Signed-off-by: Hui Zhu > --- > drivers/staging/android/lowmemorykiller.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/drivers/staging/android/lowmemorykiller.c > b/drivers/staging/android/lowmemorykiller.c > index b545d3d..ca1ffac 100644 > --- a/drivers/staging/android/lowmemorykiller.c > +++ b/drivers/staging/android/lowmemorykiller.c > @@ -129,6 +129,10 @@ static unsigned long lowmem_scan(struct shrinker *s, > struct shrink_control *sc) > > if (test_tsk_thread_flag(p, TIF_MEMDIE) && > time_before_eq(jiffies, lowmem_deathpending_timeout)) { > + struct sched_param param = { .sched_priority = 1 }; > + > + if (p->policy == SCHED_NORMAL) > + sched_setscheduler(p, SCHED_FIFO, &param); This seems really specific to a specific scheduler pattern now. Isn't there some other way to resolve this? thanks, greg k-h
Re: [PATCH] kernfs: use stack-buf for small writes.
On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote: ... > Note that reads from a sysfs file are already safe due to the use for > seqfile. The first read will allocate a buffer (m->buf) which will > be used for all subsequent reads. Hmmm? How is seqfile safe? Where would the seq op write to? Thanks. -- tejun
Re: [PATCH] kernfs: use stack-buf for small writes.
Hello, Neil. On Tue, Sep 23, 2014 at 02:06:33PM +1000, NeilBrown wrote: > When mdmon needs to update metadata after a device failure in an array > there are two 'kmalloc' sources that can trigger deadlock if memory is tight > and needs to be written to the array (which cannot be allowed until mdmon > updates the metadata). > One is in O_DIRECT writes which I have patches for. The other is when > writing to the sysfs file to tell md that it is safe to continue. > This simple patch removes the second. Ugh... :( If this can't be avoided at all, I'd much prefer it to be something explicit - a flag marking the file as needing a persistent write buffer which is allocated on open. "Small" writes on stack feels way too implicit to me. Thanks. -- tejun
[PATCH] kernfs: use stack-buf for small writes.
For a write <= 128 characters, don't use kmalloc. mdmon, part of mdadm, will sometimes need to write to a sysfs file in order to allow writes to the array to continue. This is important to support RAID metadata types that the kernel doesn't know about. It is important that this write doesn't block on memory allocation. The safest way to ensure that is to use an on-stack buffer. Writes are always small, typically less than 10 characters. Note that reads from a sysfs file are already safe due to the use for seqfile. The first read will allocate a buffer (m->buf) which will be used for all subsequent reads. Signed-off-by: NeilBrown --- Hi Tejun, I wonder if you would consider this patch. When mdmon needs to update metadata after a device failure in an array there are two 'kmalloc' sources that can trigger deadlock if memory is tight and needs to be written to the array (which cannot be allowed until mdmon updates the metadata). One is in O_DIRECT writes which I have patches for. The other is when writing to the sysfs file to tell md that it is safe to continue. This simple patch removes the second. 
Thanks, NeilBrown diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c index 4429d6d9217f..75b58669ce55 100644 --- a/fs/kernfs/file.c +++ b/fs/kernfs/file.c @@ -269,6 +269,7 @@ static ssize_t kernfs_fop_write(struct file *file, const char __user *user_buf, const struct kernfs_ops *ops; size_t len; char *buf; + char stackbuf[129]; if (of->atomic_write_len) { len = count; @@ -278,7 +279,10 @@ static ssize_t kernfs_fop_write(struct file *file, const char __user *user_buf, len = min_t(size_t, count, PAGE_SIZE); } - buf = kmalloc(len + 1, GFP_KERNEL); + if (len < sizeof(stackbuf)) + buf = stackbuf; + else + buf = kmalloc(len + 1, GFP_KERNEL); if (!buf) return -ENOMEM; @@ -311,7 +315,8 @@ static ssize_t kernfs_fop_write(struct file *file, const char __user *user_buf, if (len > 0) *ppos += len; out_free: - kfree(buf); + if (buf != stackbuf) + kfree(buf); return len; }
[PATCH v2 1/2] cap1106: Add support for various cap11xx devices
Several other variants of the cap1106 device exist with a varying number of capacitance detection channels. Add support for creating the channels dynamically. Signed-off-by: Matt Ranostay --- drivers/input/keyboard/cap1106.c | 64 +++- 1 file changed, 44 insertions(+), 20 deletions(-) diff --git a/drivers/input/keyboard/cap1106.c b/drivers/input/keyboard/cap1106.c index d70b65a..07f9e88 100644 --- a/drivers/input/keyboard/cap1106.c +++ b/drivers/input/keyboard/cap1106.c @@ -55,8 +55,6 @@ #define CAP1106_REG_MANUFACTURER_ID 0xfe #define CAP1106_REG_REVISION 0xff -#define CAP1106_NUM_CHN 6 -#define CAP1106_PRODUCT_ID 0x55 #define CAP1106_MANUFACTURER_ID 0x5d struct cap1106_priv { @@ -64,7 +62,25 @@ struct cap1106_priv { struct input_dev *idev; /* config */ - unsigned short keycodes[CAP1106_NUM_CHN]; + u32 *keycodes; + unsigned int num_channels; }; +struct cap11xx_hw_model { + uint8_t product_id; + unsigned int num_channels; +}; + +enum { + CAP1106, + CAP1126, + CAP1188, +}; + +struct cap11xx_hw_model cap11xx_devices[] = { + [CAP1106] = { .product_id = 0x55, .num_channels = 6 }, + [CAP1126] = { .product_id = 0x53, .num_channels = 6 }, + [CAP1188] = { .product_id = 0x50, .num_channels = 8 }, }; static const struct reg_default cap1106_reg_defaults[] = { @@ -151,7 +167,7 @@ static irqreturn_t cap1106_thread_func(int irq_num, void *data) if (ret < 0) goto out; - for (i = 0; i < CAP1106_NUM_CHN; i++) + for (i = 0; i < priv->num_channels; i++) input_report_key(priv->idev, priv->keycodes[i], status & (1 << i)); @@ -188,14 +204,23 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client, struct device *dev = &i2c_client->dev; struct cap1106_priv *priv; struct device_node *node; struct cap11xx_hw_model *cap = &cap11xx_devices[id->driver_data]; int i, error, irq, gain = 0; unsigned int val, rev; - u32 gain32, keycodes[CAP1106_NUM_CHN]; + u32 gain32; priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL); if (!priv) return -ENOMEM; + BUG_ON(!cap->num_channels); + + priv->num_channels = cap->num_channels; + priv->keycodes = devm_kcalloc(dev, + priv->num_channels, sizeof(u32), GFP_KERNEL); + if (!priv->keycodes) + return -ENOMEM; + priv->regmap = devm_regmap_init_i2c(i2c_client, &cap1106_regmap_config); if (IS_ERR(priv->regmap)) return PTR_ERR(priv->regmap); @@ -204,9 +229,9 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client, if (error) return error; - if (val != CAP1106_PRODUCT_ID) { + if (val != cap->product_id) { dev_err(dev, "Product ID: Got 0x%02x, expected 0x%02x\n", - val, CAP1106_PRODUCT_ID); + val, cap->product_id); return -ENODEV; } @@ -235,17 +260,12 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client, dev_err(dev, "Invalid sensor-gain value %d\n", gain32); } - BUILD_BUG_ON(ARRAY_SIZE(keycodes) != ARRAY_SIZE(priv->keycodes)); - /* Provide some useful defaults */ - for (i = 0; i < ARRAY_SIZE(keycodes); i++) - keycodes[i] = KEY_A + i; + for (i = 0; i < priv->num_channels; i++) + priv->keycodes[i] = KEY_A + i; of_property_read_u32_array(node, "linux,keycodes", - keycodes, ARRAY_SIZE(keycodes)); - - for (i = 0; i < ARRAY_SIZE(keycodes); i++) - priv->keycodes[i] = keycodes[i]; + priv->keycodes, priv->num_channels); error = regmap_update_bits(priv->regmap, CAP1106_REG_MAIN_CONTROL, CAP1106_REG_MAIN_CONTROL_GAIN_MASK, @@ -269,17 +289,17 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client, if (of_property_read_bool(node, "autorepeat")) __set_bit(EV_REP, priv->idev->evbit); - for (i = 0; i < CAP1106_NUM_CHN; i++) + for (i = 0; i < priv->num_channels; i++) __set_bit(priv->keycodes[i], priv->idev->keybit); __clear_bit(KEY_RESERVED, priv->idev->keybit); priv->idev->keycode = priv->keycodes; - priv->idev->keycodesize = sizeof(priv->keycodes[0]); - priv->idev->keycodemax = ARRAY_SIZE(priv->keycodes); + priv->idev->keycodesize = sizeof(u32); + priv->idev->keycodemax = priv->num_channels; priv->idev->id.vendor = CAP1106_MANUFACTURER_ID; - priv->idev->id.product = CAP1106_PRODUCT_ID; + priv->idev->id.product = cap->product_id;
priv->idev->id.version = rev; priv->idev->open = cap1106_input_open; @@ -313,12 +333,16 @@ static
[PATCH v2 0/2] cap1106: add support for cap11xx variants
Changes from v1: * Reworked various devices support to check product id for respective device. * Added check for invalid zero channels. * Renamed active-high option to more clear irq-active-high * Use regmap_update_bits() instead of regmap_write_bits() Matt Ranostay (2): cap1106: Add support for various cap11xx devices cap1106: support for irq-active-high option .../devicetree/bindings/input/cap1106.txt | 4 ++ drivers/input/keyboard/cap1106.c | 70 -- 2 files changed, 55 insertions(+), 19 deletions(-) -- 1.9.1
[PATCH v2 2/2] cap1106: support for irq-active-high option
Some applications need to use the irq-active-high push-pull option. This allows it to be enabled in the device tree child node.

Signed-off-by: Matt Ranostay
---
 Documentation/devicetree/bindings/input/cap1106.txt | 4 ++++
 drivers/input/keyboard/cap1106.c                    | 8 ++++++++
 2 files changed, 12 insertions(+)

diff --git a/Documentation/devicetree/bindings/input/cap1106.txt b/Documentation/devicetree/bindings/input/cap1106.txt
index 4b46390..6f5a143 100644
--- a/Documentation/devicetree/bindings/input/cap1106.txt
+++ b/Documentation/devicetree/bindings/input/cap1106.txt
@@ -26,6 +26,10 @@ Optional properties:
 			Valid values are 1, 2, 4, and 8.
 			By default, a gain of 1 is set.
 
+	microchip,irq-active-high: By default the interrupt pin is active low
+			open drain. This property allows using the active
+			high push-pull output.
+
 	linux,keycodes:	Specifies an array of numeric keycode values to
 			be used for the channels. If this property is
 			omitted, KEY_A, KEY_B, etc are used as
diff --git a/drivers/input/keyboard/cap1106.c b/drivers/input/keyboard/cap1106.c
index 07f9e88..d5ce060 100644
--- a/drivers/input/keyboard/cap1106.c
+++ b/drivers/input/keyboard/cap1106.c
@@ -47,6 +47,7 @@
 #define CAP1106_REG_STANDBY_SENSITIVITY	0x42
 #define CAP1106_REG_STANDBY_THRESH	0x43
 #define CAP1106_REG_CONFIG2		0x44
+#define CAP1106_REG_CONFIG2_ALT_POL	BIT(6)
 #define CAP1106_REG_SENSOR_BASE_CNT(X)	(0x50 + (X))
 #define CAP1106_REG_SENSOR_CALIB	(0xb1 + (X))
 #define CAP1106_REG_SENSOR_CALIB_LSB1	0xb9
@@ -260,6 +261,13 @@ static int cap1106_i2c_probe(struct i2c_client *i2c_client,
 			dev_err(dev, "Invalid sensor-gain value %d\n", gain32);
 	}
 
+	if (of_property_read_bool(node, "microchip,irq-active-high")) {
+		error = regmap_update_bits(priv->regmap, CAP1106_REG_CONFIG2,
+					   CAP1106_REG_CONFIG2_ALT_POL, 0);
+		if (error)
+			return error;
+	}
+
 	/* Provide some useful defaults */
 	for (i = 0; i < priv->num_channels; i++)
 		priv->keycodes[i] = KEY_A + i;
--
1.9.1
RE: [PATCH] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition
> -----Original Message-----
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-ow...@vger.kernel.org] On Behalf Of Dexuan Cui
> Sent: Tuesday, September 23, 2014 2:02 AM
> To: gre...@linuxfoundation.org; linux-kernel@vger.kernel.org; driverdev-de...@linuxdriverproject.org; o...@aepfle.de; a...@canonical.com; jasow...@redhat.com
> Cc: KY Srinivasan; Haiyang Zhang
> Subject: [PATCH] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition
>
> Signed-off-by: Dexuan Cui
> Reviewed-by: K. Y. Srinivasan
> ---
> tools/hv/hv_vss_daemon.c | 21 +

Please use the v2 patch I sent out just now. I added "errno = 0;" before the ioctl() to fix some false warnings.

Thanks,
-- Dexuan
Re: [PATCH V3] xen: eliminate scalability issues from initial mapping setup
On 09/17/2014 04:59 PM, Juergen Gross wrote:

Direct Xen to place the initial P->M table outside of the initial mapping, as otherwise the 1G (implementation) / 2G (theoretical) restriction on the size of the initial mapping limits the amount of memory a domain can be handed initially.

As the initial P->M table is copied rather early during boot to domain private memory and its initial virtual mapping is dropped, the easiest way to avoid virtual address conflicts with other addresses in the kernel is to use a user address area for the virtual address of the initial P->M table. This allows us to just throw away the page tables of the initial mapping after the copy without having to care about address invalidation.

It should be noted that this patch won't enable a pv-domain to use more than 512 GB of RAM. It just enables it to be started with a P->M table covering more memory. This is especially important for being able to boot a Dom0 on a system with more than 512 GB memory.

Signed-off-by: Juergen Gross
Signed-off-by: Jan Beulich

Any Acks/Naks?

Juergen

---
 arch/x86/xen/mmu.c      | 119 +---
 arch/x86/xen/setup.c    |  65 ++
 arch/x86/xen/xen-head.S |   2 +
 3 files changed, 151 insertions(+), 35 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 16fb009..3bd403b 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1198,6 +1198,78 @@ static void __init xen_cleanhighmap(unsigned long vaddr,
 	 * instead of somewhere later and be confusing. */
 	xen_mc_flush();
 }
+
+/*
+ * Make a page range writeable and free it.
+ */
+static void __init xen_free_ro_pages(unsigned long paddr, unsigned long size)
+{
+	void *vaddr = __va(paddr);
+	void *vaddr_end = vaddr + size;
+
+	for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
+		make_lowmem_page_readwrite(vaddr);
+
+	memblock_free(paddr, size);
+}
+
+static void xen_cleanmfnmap_free_pgtbl(void *pgtbl)
+{
+	unsigned long pa = __pa(pgtbl) & PHYSICAL_PAGE_MASK;
+
+	ClearPagePinned(virt_to_page(__va(pa)));
+	xen_free_ro_pages(pa, PAGE_SIZE);
+}
+
+/*
+ * Since it is well isolated we can (and since it is perhaps large we should)
+ * also free the page tables mapping the initial P->M table.
+ */
+static void __init xen_cleanmfnmap(unsigned long vaddr)
+{
+	unsigned long va = vaddr & PMD_MASK;
+	unsigned long pa;
+	pgd_t *pgd = pgd_offset_k(va);
+	pud_t *pud_page = pud_offset(pgd, 0);
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	unsigned int i;
+
+	set_pgd(pgd, __pgd(0));
+	do {
+		pud = pud_page + pud_index(va);
+		if (pud_none(*pud)) {
+			va += PUD_SIZE;
+		} else if (pud_large(*pud)) {
+			pa = pud_val(*pud) & PHYSICAL_PAGE_MASK;
+			xen_free_ro_pages(pa, PUD_SIZE);
+			va += PUD_SIZE;
+		} else {
+			pmd = pmd_offset(pud, va);
+			if (pmd_large(*pmd)) {
+				pa = pmd_val(*pmd) & PHYSICAL_PAGE_MASK;
+				xen_free_ro_pages(pa, PMD_SIZE);
+			} else if (!pmd_none(*pmd)) {
+				pte = pte_offset_kernel(pmd, va);
+				for (i = 0; i < PTRS_PER_PTE; ++i) {
+					if (pte_none(pte[i]))
+						break;
+					pa = pte_pfn(pte[i]) << PAGE_SHIFT;
+					xen_free_ro_pages(pa, PAGE_SIZE);
+				}
+				xen_cleanmfnmap_free_pgtbl(pte);
+			}
+			va += PMD_SIZE;
+			if (pmd_index(va))
+				continue;
+			xen_cleanmfnmap_free_pgtbl(pmd);
+		}
+
+	} while (pud_index(va) || pmd_index(va));
+	xen_cleanmfnmap_free_pgtbl(pud_page);
+}
+
 static void __init xen_pagetable_p2m_copy(void)
 {
 	unsigned long size;
@@ -1217,18 +1289,23 @@ static void __init xen_pagetable_p2m_copy(void)
 
 	/* using __ka address and sticking INVALID_P2M_ENTRY! */
 	memset((void *)xen_start_info->mfn_list, 0xff, size);
 
-	/* We should be in __ka space.
-	 */
-	BUG_ON(xen_start_info->mfn_list < __START_KERNEL_map);
 	addr = xen_start_info->mfn_list;
-	/* We roundup to the PMD, which means that if anybody at this stage is
+	/* We could be in __ka space.
+	 * We roundup to the PMD, which means that if anybody at this stage is
 	 * using the __ka address of xen_start_info or xen_start_info->shared_info
 	 * they are going to crash. Fortunately we have already revectored
 	 * in xen_setup_kernel_pagetable and in
[PATCH v2] Tools: hv: vssdaemon: ignore the EBUSY on multiple freezing the same partition
v2: Added "errno = 0;" before the ioctl().

Signed-off-by: Dexuan Cui
Reviewed-by: K. Y. Srinivasan
---
 tools/hv/hv_vss_daemon.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tools/hv/hv_vss_daemon.c b/tools/hv/hv_vss_daemon.c
index 6a213b8..c1af658 100644
--- a/tools/hv/hv_vss_daemon.c
+++ b/tools/hv/hv_vss_daemon.c
@@ -50,7 +50,35 @@ static int vss_do_freeze(char *dir, unsigned int cmd, char *fs_op)
 	if (fd < 0)
 		return 1;
+
+	/* A successful syscall doesn't set errno to 0. Without this line,
+	 * the below strerror(errno) can accidentally show the errno of the
+	 * previous failed syscall.
+	 */
+	errno = 0;
+
 	ret = ioctl(fd, cmd, 0);
+
+	/*
+	 * If a partition is mounted more than once, only the first
+	 * FREEZE/THAW can succeed and the later ones will get
+	 * EBUSY/EINVAL respectively: there could be 2 cases:
+	 * 1) a user may mount the same partition to different directories
+	 *    by mistake or on purpose;
+	 * 2) the subvolume of btrfs appears to have the same partition
+	 *    mounted more than once.
+	 */
+	if (ret) {
+		if ((cmd == FIFREEZE && errno == EBUSY) ||
+		    (cmd == FITHAW && errno == EINVAL)) {
+			syslog(LOG_INFO, "VSS: %s of %s: %s: ignored\n",
+			       fs_op, dir,
+			       errno == EBUSY ? "EBUSY" : "EINVAL");
+			close(fd);
+			return 0;
+		}
+	}
+
 	syslog(LOG_INFO, "VSS: %s of %s: %s\n", fs_op, dir, strerror(errno));
 	close(fd);
 	return !!ret;
--
1.9.1
Re: [PATCH 3/3] ARM: dts: add rk3288 power-domain node
On 09/23/2014 10:55 AM, jinkun.hong wrote: From: "jinkun.hong" Any summary for rk3288 power controller? Maybe you can say something about how rk3288 TRM described this module. Signed-off-by: Jack Dai Signed-off-by: Wang Caesar Signed-off-by: jinkun.hong --- arch/arm/boot/dts/rk3288.dtsi | 45 + 1 file changed, 45 insertions(+) diff --git a/arch/arm/boot/dts/rk3288.dtsi b/arch/arm/boot/dts/rk3288.dtsi index 3bb5230..714b9d9 100644 --- a/arch/arm/boot/dts/rk3288.dtsi +++ b/arch/arm/boot/dts/rk3288.dtsi @@ -15,6 +15,7 @@ #include #include #include +#include #include "skeleton.dtsi" / { @@ -467,6 +468,50 @@ compatible = "rockchip,rk3288-pmu", "syscon"; reg = <0xff73 0x100>; }; + power: power-controller { + compatible = "rockchip,rk3288-power-controller"; + #power-domain-cells = <1>; + rockchip,pmu = <>; + #address-cells = <1>; + #size-cells = <0>; + + pd_gpu { + reg = ; + clocks = < ACLK_GPU>; + }; + + pd_vio { + reg = ; + clocks = < HCLK_RGA>, < HCLK_VOP0>, + < HCLK_VOP1>, < HCLK_VIO_AHB_ARBI>, + < HCLK_VIO_NIU>, < HCLK_VIP>, + < HCLK_IEP>, < HCLK_ISP>, + < HCLK_VIO2_H2P>, < PCLK_MIPI_DSI0>, + < PCLK_MIPI_DSI1>, < PCLK_MIPI_CSI>, + < PCLK_LVDS_PHY>, < PCLK_EDP_CTRL>, + < PCLK_HDMI_CTRL>, < PCLK_VIO2_H2P>, + < ACLK_VOP0>, < ACLK_IEP>, + < ACLK_VIO0_NIU>, < ACLK_VIP>, + < ACLK_VOP1>, < ACLK_ISP>, + < ACLK_VIO1_NIU>, < ACLK_RGA>, + < ACLK_RGA_NIU>,< SCLK_RGA>, + < DCLK_VOP0>, < DCLK_VOP1>, + < SCLK_EDP_24M>, < SCLK_EDP>, + < SCLK_ISP>, < SCLK_ISP_JPE>, + < SCLK_HDMI_HDCP>, < SCLK_HDMI_CEC>; + }; Some of clock id here is not in upstream or list, I will send the patch including all these clock IDs later, maybe you should mention it in your commit message? 
+ + pd_video { + reg = ; + /* FIXME: add clocks */ + }; + + pd_hevc { + reg = ; + clocks = < ACLK_HEVC>, < HCLK_HEVC>, + < SCLK_HEVC_CABAC>, < SCLK_HEVC_CORE>; + }; + }; sgrf: syscon@ff74 { compatible = "rockchip,rk3288-sgrf", "syscon"; -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() with remap_pfn_range().
Hans, Do you have any more comment on this patch? Best regards, Fancy Fang -Original Message- From: Fang Chen-B47543 Sent: Wednesday, September 10, 2014 3:29 PM To: 'Hans Verkuil'; m.che...@samsung.com; v...@zeniv.linux.org.uk Cc: Guo Shawn-R65073; linux-me...@vger.kernel.org; linux-kernel@vger.kernel.org; Marek Szyprowski Subject: RE: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() with remap_pfn_range(). On the Freescale imx6 platform which belongs to ARM architecture. The driver is our local v4l2 output driver which is not upstream yet unfortunately. Best regards, Fancy Fang -Original Message- From: Hans Verkuil [mailto:hverk...@xs4all.nl] Sent: Wednesday, September 10, 2014 3:21 PM To: Fang Chen-B47543; m.che...@samsung.com; v...@zeniv.linux.org.uk Cc: Guo Shawn-R65073; linux-me...@vger.kernel.org; linux-kernel@vger.kernel.org; Marek Szyprowski Subject: Re: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() with remap_pfn_range(). On 09/10/14 09:14, chen.f...@freescale.com wrote: > It is not a theoretically issue, it is a real case that the mapping failed > issue happens in 3.14.y kernel but not happens in previous 3.10.y kernel. > So I need your confirmation on it. With which driver does this happen? On which architecture? Regards, Hans > > Thanks. > > Best regards, > Fancy Fang > > -Original Message- > From: Hans Verkuil [mailto:hverk...@xs4all.nl] > Sent: Wednesday, September 10, 2014 3:01 PM > To: Fang Chen-B47543; m.che...@samsung.com; v...@zeniv.linux.org.uk > Cc: Guo Shawn-R65073; linux-me...@vger.kernel.org; > linux-kernel@vger.kernel.org; Marek Szyprowski > Subject: Re: [PATCH] [media] videobuf-dma-contig: replace vm_iomap_memory() > with remap_pfn_range(). > > On 09/10/14 07:28, Fancy Fang wrote: >> When user requests V4L2_MEMORY_MMAP type buffers, the videobuf-core >> will assign the corresponding offset to the 'boff' field of the >> videobuf_buffer for each requested buffer sequentially. 
Later, user >> may call mmap() to map one or all of the buffers with the 'offset' >> parameter which is equal to its 'boff' value. Obviously, the 'offset' >> value is only used to find the matched buffer instead of to be the >> real offset from the buffer's physical start address as used by >> vm_iomap_memory(). So, in some case that if the offset is not zero, >> vm_iomap_memory() will fail. > > Is this just a fix for something that can fail theoretically, or do you > actually have a case where this happens? I am very reluctant to make any > changes to videobuf. Drivers should all migrate to vb2. > > I have CC-ed Marek as well since he knows a lot more about this stuff than I > do. > > Regards, > > Hans > >> >> Signed-off-by: Fancy Fang >> --- >> drivers/media/v4l2-core/videobuf-dma-contig.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/media/v4l2-core/videobuf-dma-contig.c >> b/drivers/media/v4l2-core/videobuf-dma-contig.c >> index bf80f0f..8bd9889 100644 >> --- a/drivers/media/v4l2-core/videobuf-dma-contig.c >> +++ b/drivers/media/v4l2-core/videobuf-dma-contig.c >> @@ -305,7 +305,9 @@ static int __videobuf_mmap_mapper(struct videobuf_queue >> *q, >> /* Try to remap memory */ >> size = vma->vm_end - vma->vm_start; >> vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot); >> -retval = vm_iomap_memory(vma, mem->dma_handle, size); >> +retval = remap_pfn_range(vma, vma->vm_start, >> + mem->dma_handle >> PAGE_SHIFT, >> + size, vma->vm_page_prot); >> if (retval) { >> dev_err(q->dev, "mmap: remap failed with error %d. ", >> retval); >> > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 01/13] powerpc/iommu: Check that TCE page size is equal to it_page_size
This checks that the TCE table page size is not bigger than the size of a page we just pinned and whose physical address we are going to put into the table. Otherwise the hardware gets unwanted access to the physical memory between the end of the actual page and the end of the aligned-up TCE page.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/kernel/iommu.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index a10642a..b378f78 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -1059,16 +1060,37 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, unsigned long entry,
 			tce, entry << tbl->it_page_shift, ret); */
 		return -EFAULT;
 	}
+
+	/*
+	 * Check that the TCE table granularity is not bigger than the size of
+	 * a page we just found. Otherwise the hardware can get access to
+	 * a bigger memory chunk than it should.
+	 */
+	if (PageHuge(page)) {
+		struct page *head = compound_head(page);
+		long shift = PAGE_SHIFT + compound_order(head);
+
+		if (shift < tbl->it_page_shift) {
+			ret = -EINVAL;
+			goto put_page_exit;
+		}
+
+	}
+
 	hwaddr = (unsigned long) page_address(page) + offset;
 
 	ret = iommu_tce_build(tbl, entry, hwaddr, direction);
 	if (ret)
-		put_page(page);
+		goto put_page_exit;
 
-	if (ret < 0)
-		pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
+	return 0;
+
+put_page_exit:
+	pr_err("iommu_tce: %s failed ioba=%lx, tce=%lx, ret=%d\n",
 		__func__, entry << tbl->it_page_shift, tce, ret);
+	put_page(page);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode);
--
2.0.0
[PATCH v2 02/13] powerpc/powernv: Make invalidate() a callback
At the moment pnv_pci_ioda_tce_invalidate() gets the PE pointer via container_of(tbl). Since we are going to have to add Dynamic DMA windows and that means having 2 IOMMU tables per PE, this is not going to work. This implements pnv_pci_ioda(1|2)_tce_invalidate as a pnv_ioda_pe callback. This adds a pnv_iommu_table wrapper around iommu_table and stores a pointer to PE there. PNV's ppc_md.tce_build() call uses this to find PE and do the invalidation. This will be used later for Dynamic DMA windows too. This registers invalidate() callbacks for IODA1 and IODA2: - pnv_pci_ioda1_tce_invalidate; - pnv_pci_ioda2_tce_invalidate. Signed-off-by: Alexey Kardashevskiy --- Changes: v4: * changed commit log to explain why this change is needed --- arch/powerpc/platforms/powernv/pci-ioda.c | 35 --- arch/powerpc/platforms/powernv/pci.c | 31 --- arch/powerpc/platforms/powernv/pci.h | 13 +++- 3 files changed, 48 insertions(+), 31 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index df241b1..136e765 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -857,7 +857,7 @@ static void pnv_pci_ioda_dma_dev_setup(struct pnv_phb *phb, struct pci_dev *pdev pe = >ioda.pe_array[pdn->pe_number]; WARN_ON(get_dma_ops(>dev) != _iommu_ops); - set_iommu_table_base_and_group(>dev, >tce32_table); + set_iommu_table_base_and_group(>dev, >tce32.table); } static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb, @@ -884,7 +884,7 @@ static int pnv_pci_ioda_dma_set_mask(struct pnv_phb *phb, } else { dev_info(>dev, "Using 32-bit DMA via iommu\n"); set_dma_ops(>dev, _iommu_ops); - set_iommu_table_base(>dev, >tce32_table); + set_iommu_table_base(>dev, >tce32.table); } *pdev->dev.dma_mask = dma_mask; return 0; @@ -899,9 +899,9 @@ static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, list_for_each_entry(dev, >devices, bus_list) { if (add_to_iommu_group) set_iommu_table_base_and_group(>dev, - 
>tce32_table); + >tce32.table); else - set_iommu_table_base(>dev, >tce32_table); + set_iommu_table_base(>dev, >tce32.table); if (dev->subordinate) pnv_ioda_setup_bus_dma(pe, dev->subordinate, @@ -988,19 +988,6 @@ static void pnv_pci_ioda2_tce_invalidate(struct pnv_ioda_pe *pe, } } -void pnv_pci_ioda_tce_invalidate(struct iommu_table *tbl, -__be64 *startp, __be64 *endp, bool rm) -{ - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, - tce32_table); - struct pnv_phb *phb = pe->phb; - - if (phb->type == PNV_PHB_IODA1) - pnv_pci_ioda1_tce_invalidate(pe, tbl, startp, endp, rm); - else - pnv_pci_ioda2_tce_invalidate(pe, tbl, startp, endp, rm); -} - static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe, unsigned int base, unsigned int segs) @@ -1058,9 +1045,11 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, } /* Setup linux iommu table */ - tbl = >tce32_table; + tbl = >tce32.table; pnv_pci_setup_iommu_table(tbl, addr, TCE32_TABLE_SIZE * segs, base << 28, IOMMU_PAGE_SHIFT_4K); + pe->tce32.pe = pe; + pe->tce32.invalidate_fn = pnv_pci_ioda1_tce_invalidate; /* OPAL variant of P7IOC SW invalidated TCEs */ swinvp = of_get_property(phb->hose->dn, "ibm,opal-tce-kill", NULL); @@ -1097,7 +1086,7 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) { struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, - tce32_table); + tce32.table); uint16_t window_id = (pe->pe_number << 1 ) + 1; int64_t rc; @@ -1142,10 +1131,10 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, pe->tce_bypass_base = 1ull << 59; /* Install set_bypass callback for VFIO */ - pe->tce32_table.set_bypass = pnv_pci_ioda2_set_bypass; + pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass; /* Enable bypass by default */ - pnv_pci_ioda2_set_bypass(>tce32_table, true); + pnv_pci_ioda2_set_bypass(>tce32.table, true); } static void pnv_pci_ioda2_setup_dma_pe(struct 
pnv_phb *phb, @@ -1193,9 +1182,11 @@
[PATCH v2 03/13] powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
Modern IBM POWERPC systems support multiple IOMMU tables per PE, so we need a more reliable way (compared to container_of()) to get a PE pointer from the iommu_table struct pointer used in IOMMU functions.

At the moment the IOMMU group data points to an iommu_table struct. This introduces a spapr_tce_iommu_group struct which keeps an iommu_owner and a spapr_tce_iommu_ops struct.

For IODA, iommu_owner is a pointer to the pnv_ioda_pe struct; for others it is still a pointer to the iommu_table struct. The ops structs correspond to the type which iommu_owner points to.

This defines a get_table() callback which returns an iommu_table by its number.

As the IOMMU group data pointer now points to a variable type instead of iommu_table, the VFIO SPAPR TCE driver is updated to use the new type. This changes the tce_container struct to store iommu_group instead of iommu_table.

So, it was:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to iommu_table via iommu_group_get_iommudata();

now it is:
- iommu_table points to iommu_group via iommu_table::it_group;
- iommu_group points to spapr_tce_iommu_group via iommu_group_get_iommudata();
- spapr_tce_iommu_group points to either (depending on .get_table()):
  - iommu_table;
  - pnv_ioda_pe;

This uses pnv_ioda1_iommu_get_table for both IODA1&2, but IODA2 will get its own pnv_ioda2_iommu_get_table soon and pnv_ioda1_iommu_get_table will then only be used for IODA1.
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 6 ++ arch/powerpc/include/asm/tce.h | 13 +++ arch/powerpc/kernel/iommu.c | 35 ++- arch/powerpc/platforms/powernv/pci-ioda.c | 31 +- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 1 + arch/powerpc/platforms/powernv/pci.c| 2 +- arch/powerpc/platforms/pseries/iommu.c | 10 +- drivers/vfio/vfio_iommu_spapr_tce.c | 148 ++-- 8 files changed, 208 insertions(+), 38 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 42632c7..84ee339 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -108,13 +108,19 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); */ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, int nid); + +struct spapr_tce_iommu_ops; #ifdef CONFIG_IOMMU_API extern void iommu_register_group(struct iommu_table *tbl, +void *iommu_owner, +struct spapr_tce_iommu_ops *ops, int pci_domain_number, unsigned long pe_num); extern int iommu_add_device(struct device *dev); extern void iommu_del_device(struct device *dev); #else static inline void iommu_register_group(struct iommu_table *tbl, + void *iommu_owner, + struct spapr_tce_iommu_ops *ops, int pci_domain_number, unsigned long pe_num) { diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h index 743f36b..9f159eb 100644 --- a/arch/powerpc/include/asm/tce.h +++ b/arch/powerpc/include/asm/tce.h @@ -50,5 +50,18 @@ #define TCE_PCI_READ 0x1 /* read from PCI allowed */ #define TCE_VB_WRITE 0x1 /* write from VB allowed */ +struct spapr_tce_iommu_group; + +struct spapr_tce_iommu_ops { + struct iommu_table *(*get_table)( + struct spapr_tce_iommu_group *data, + int num); +}; + +struct spapr_tce_iommu_group { + void *iommu_owner; + struct spapr_tce_iommu_ops *ops; +}; + #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_TCE_H */ diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 
b378f78..1c5dae7 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -878,24 +878,53 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size, */ static void group_release(void *iommu_data) { - struct iommu_table *tbl = iommu_data; - tbl->it_group = NULL; + kfree(iommu_data); } +static struct iommu_table *spapr_tce_default_get_table( + struct spapr_tce_iommu_group *data, int num) +{ + struct iommu_table *tbl = data->iommu_owner; + + switch (num) { + case 0: + if (tbl->it_size) + return tbl; + /* fallthru */ + default: + return NULL; + } +} + +static struct spapr_tce_iommu_ops spapr_tce_default_ops = { + .get_table = spapr_tce_default_get_table +}; + void iommu_register_group(struct iommu_table *tbl, + void *iommu_owner, struct spapr_tce_iommu_ops *ops, int pci_domain_number, unsigned long pe_num) {
[PATCH v2 04/13] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
At the moment the iommu_table struct has a set_bypass() which enables/ disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code which calls this callback when external IOMMU users such as VFIO are about to get over a PHB. Since the set_bypass() is not really an iommu_table function but PE's function, and we have an ops struct per IOMMU owner, let's move set_bypass() to the spapr_tce_iommu_ops struct. As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and has very little to do with PEs, this moves take_ownership() calls to the VFIO SPAPR TCE driver. This renames set_bypass() to take_ownership() as it is not necessarily just enabling bypassing, it can be something else/more so let's give it a generic name. The bool parameter is inverted. Signed-off-by: Alexey Kardashevskiy Reviewed-by: Gavin Shan --- arch/powerpc/include/asm/iommu.h | 1 - arch/powerpc/include/asm/tce.h| 2 ++ arch/powerpc/kernel/iommu.c | 12 arch/powerpc/platforms/powernv/pci-ioda.c | 20 drivers/vfio/vfio_iommu_spapr_tce.c | 16 5 files changed, 30 insertions(+), 21 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 84ee339..2b0b01d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -77,7 +77,6 @@ struct iommu_table { #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; #endif - void (*set_bypass)(struct iommu_table *tbl, bool enable); }; /* Pure 2^n version of get_order */ diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h index 9f159eb..e6355f9 100644 --- a/arch/powerpc/include/asm/tce.h +++ b/arch/powerpc/include/asm/tce.h @@ -56,6 +56,8 @@ struct spapr_tce_iommu_ops { struct iommu_table *(*get_table)( struct spapr_tce_iommu_group *data, int num); + void (*take_ownership)(struct spapr_tce_iommu_group *data, + bool enable); }; struct spapr_tce_iommu_group { diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 1c5dae7..c2c8d9d 100644 --- 
a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1139,14 +1139,6 @@ int iommu_take_ownership(struct iommu_table *tbl) memset(tbl->it_map, 0xff, sz); iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size); - /* -* Disable iommu bypass, otherwise the user can DMA to all of -* our physical memory via the bypass window instead of just -* the pages that has been explicitly mapped into the iommu -*/ - if (tbl->set_bypass) - tbl->set_bypass(tbl, false); - return 0; } EXPORT_SYMBOL_GPL(iommu_take_ownership); @@ -1161,10 +1153,6 @@ void iommu_release_ownership(struct iommu_table *tbl) /* Restore bit#0 set by iommu_init_table() */ if (tbl->it_offset == 0) set_bit(0, tbl->it_map); - - /* The kernel owns the device now, we can restore the iommu bypass */ - if (tbl->set_bypass) - tbl->set_bypass(tbl, true); } EXPORT_SYMBOL_GPL(iommu_release_ownership); diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 2d32a1c..8cb2f31 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1105,10 +1105,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb *phb, __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs)); } -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable) { - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe, - tce32.table); uint16_t window_id = (pe->pe_number << 1 ) + 1; int64_t rc; @@ -1136,7 +1134,7 @@ static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable) * host side. 
*/ if (pe->pdev) - set_iommu_table_base(>pdev->dev, tbl); + set_iommu_table_base(>pdev->dev, >tce32.table); else pnv_ioda_setup_bus_dma(pe, pe->pbus, false); } @@ -1152,15 +1150,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, /* TVE #1 is selected by PCI address bit 59 */ pe->tce_bypass_base = 1ull << 59; - /* Install set_bypass callback for VFIO */ - pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass; - /* Enable bypass by default */ - pnv_pci_ioda2_set_bypass(>tce32.table, true); + pnv_pci_ioda2_set_bypass(pe, true); +} + +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data, +bool enable) +{ + struct
[PATCH v2 10/13] powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
SPAPR defines an interface to create additional DMA windows dynamically. "Dynamically" means that the window is not allocated before the guest has even started; the guest can request it later. In practice, existing Linux guests check for the capability and, if it is present, create and map a DMA window as big as the entire guest RAM.

This adds 4 callbacks to the spapr_tce_iommu_ops struct:
1. query - ibm,query-pe-dma-window - returns the number/size of windows which can be created (one, any page size);
2. create - ibm,create-pe-dma-window - creates a window;
3. remove - ibm,remove-pe-dma-window - removes a window; removing the default 32bit window is not allowed by this patch, this will be added later if needed;
4. reset - ibm,reset-pe-dma-window - resets the DMA window configuration to the default state; as the default window cannot be removed, it only removes the additional window if one was created.

The next patch will add corresponding ioctls to the VFIO SPAPR TCE driver to provide the necessary support to userspace.
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/tce.h| 22 + arch/powerpc/platforms/powernv/pci-ioda.c | 159 +- arch/powerpc/platforms/powernv/pci.h | 1 + 3 files changed, 181 insertions(+), 1 deletion(-) diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h index e6355f9..23b0362 100644 --- a/arch/powerpc/include/asm/tce.h +++ b/arch/powerpc/include/asm/tce.h @@ -58,6 +58,28 @@ struct spapr_tce_iommu_ops { int num); void (*take_ownership)(struct spapr_tce_iommu_group *data, bool enable); + + /* Dynamic DMA window */ + /* Page size flags for ibm,query-pe-dma-window */ +#define DDW_PGSIZE_4K 0x01 +#define DDW_PGSIZE_64K 0x02 +#define DDW_PGSIZE_16M 0x04 +#define DDW_PGSIZE_32M 0x08 +#define DDW_PGSIZE_64M 0x10 +#define DDW_PGSIZE_128M 0x20 +#define DDW_PGSIZE_256M 0x40 +#define DDW_PGSIZE_16G 0x80 + long (*query)(struct spapr_tce_iommu_group *data, + __u32 *current_windows, + __u32 *windows_available, + __u32 *page_size_mask); + long (*create)(struct spapr_tce_iommu_group *data, + __u32 page_shift, + __u32 window_shift, + struct iommu_table **ptbl); + long (*remove)(struct spapr_tce_iommu_group *data, + struct iommu_table *tbl); + long (*reset)(struct spapr_tce_iommu_group *data); }; struct spapr_tce_iommu_group { diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index 296f49b..a6318cb 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -1154,6 +1154,26 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct pnv_phb *phb, pnv_pci_ioda2_set_bypass(pe, true); } +static struct iommu_table *pnv_ioda2_iommu_get_table( + struct spapr_tce_iommu_group *data, + int num) +{ + struct pnv_ioda_pe *pe = data->iommu_owner; + + switch (num) { + case 0: + if (pe->tce32.table.it_size) + return >tce32.table; + return NULL; + case 1: + if (pe->tce64.table.it_size) + return >tce64.table; + return NULL; + default: + return NULL; + } +} + static void 
pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data, bool enable) { @@ -1162,9 +1182,146 @@ static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data, pnv_pci_ioda2_set_bypass(pe, !enable); } +static long pnv_pci_ioda2_ddw_query(struct spapr_tce_iommu_group *data, + __u32 *current_windows, + __u32 *windows_available, __u32 *page_size_mask) +{ + struct pnv_ioda_pe *pe = data->iommu_owner; + + *windows_available = 2; + *current_windows = 0; + if (pe->tce32.table.it_size) { + --*windows_available; + ++*current_windows; + } + if (pe->tce64.table.it_size) { + --*windows_available; + ++*current_windows; + } + *page_size_mask = + DDW_PGSIZE_4K | + DDW_PGSIZE_64K | + DDW_PGSIZE_16M; + + return 0; +} + +static long pnv_pci_ioda2_ddw_create(struct spapr_tce_iommu_group *data, + __u32 page_shift, __u32 window_shift, + struct iommu_table **ptbl) +{ + struct pnv_ioda_pe *pe = data->iommu_owner; + struct pnv_phb *phb = pe->phb; + struct page *tce_mem = NULL; + void *addr; + long ret; + unsigned long tce_table_size = + (1ULL << (window_shift - page_shift)) * 8; + unsigned order; +
[PATCH] Fix the issue that lowmemorykiller falls into a cycle trying to kill a task
The cause of this issue is that when the free memory size is low and a lot of tasks are trying to shrink the memory, the task that is killed by the lowmemorykiller cannot get CPU time to exit itself. Fix this issue by changing the scheduling policy to SCHED_FIFO if a task's flag is TIF_MEMDIE in the lowmemorykiller. Signed-off-by: Hui Zhu --- drivers/staging/android/lowmemorykiller.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c index b545d3d..ca1ffac 100644 --- a/drivers/staging/android/lowmemorykiller.c +++ b/drivers/staging/android/lowmemorykiller.c @@ -129,6 +129,10 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc) if (test_tsk_thread_flag(p, TIF_MEMDIE) && time_before_eq(jiffies, lowmem_deathpending_timeout)) { + struct sched_param param = { .sched_priority = 1 }; + + if (p->policy == SCHED_NORMAL) + sched_setscheduler(p, SCHED_FIFO, &param); task_unlock(p); rcu_read_unlock(); return 0; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2 07/13] powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
Normally a bitmap from the iommu_table is used to track what TCE entry is in use. Since we are going to use iommu_table without its locks and do xchg() instead, it becomes essential not to put bits which are not implied in the direction flag. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/powernv/pci.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index deddcad..ab79e2d 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -628,10 +628,18 @@ static int pnv_tce_build(struct iommu_table *tbl, long index, long npages, __be64 *tcep, *tces; u64 rpn; - proto_tce = TCE_PCI_READ; // Read allowed - - if (direction != DMA_TO_DEVICE) - proto_tce |= TCE_PCI_WRITE; + switch (direction) { + case DMA_BIDIRECTIONAL: + case DMA_FROM_DEVICE: + proto_tce = TCE_PCI_READ | TCE_PCI_WRITE; + break; + case DMA_TO_DEVICE: + proto_tce = TCE_PCI_READ; + break; + default: + proto_tce = 0; + break; + } tces = tcep = ((__be64 *)tbl->it_base) + index - tbl->it_offset; rpn = __pa(uaddr) >> tbl->it_page_shift; -- 2.0.0
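The mapping from DMA direction to TCE permission bits introduced by this patch can be sketched in plain C (the 0x1/0x2 read/write constants here are for illustration only):

```c
#include <assert.h>
#include <stdint.h>

/* Model of the new pnv_tce_build() permission logic: set only the
 * access bits the DMA direction implies, so DMA_NONE produces a TCE
 * with no read/write bits at all — essential once entries are updated
 * with xchg() and the allocation bitmap is no longer authoritative. */
enum toy_dir {
	TOY_DMA_BIDIRECTIONAL,
	TOY_DMA_TO_DEVICE,
	TOY_DMA_FROM_DEVICE,
	TOY_DMA_NONE
};

#define TOY_TCE_READ  0x1ULL
#define TOY_TCE_WRITE 0x2ULL

static uint64_t toy_proto_tce(enum toy_dir direction)
{
	switch (direction) {
	case TOY_DMA_BIDIRECTIONAL:
	case TOY_DMA_FROM_DEVICE:
		return TOY_TCE_READ | TOY_TCE_WRITE;
	case TOY_DMA_TO_DEVICE:
		return TOY_TCE_READ;
	default:
		return 0;	/* DMA_NONE: entry carries no permissions */
	}
}
```

The old code unconditionally set the read bit, which is exactly what the DMA_NONE case must not do.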
[PATCH v2 13/13] vfio: powerpc/spapr: Enable Dynamic DMA windows
This defines and implements VFIO IOMMU API which lets the userspace create and remove DMA windows. This updates VFIO_IOMMU_SPAPR_TCE_GET_INFO to return the number of available windows and page mask. This adds VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE to allow the user space to create and remove window(s). The VFIO IOMMU driver does basic sanity checks and calls corresponding SPAPR TCE functions. At the moment only IODA2 (POWER8 PCI host bridge) implements them. This advertises VFIO_IOMMU_SPAPR_TCE_FLAG_DDW capability via VFIO_IOMMU_SPAPR_TCE_GET_INFO. This calls platform DDW reset() callback when IOMMU is being disabled to reset the DMA configuration to its original state. Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/vfio_iommu_spapr_tce.c | 135 ++-- include/uapi/linux/vfio.h | 25 ++- 2 files changed, 153 insertions(+), 7 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 0dccbc4..b518891 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -190,18 +190,25 @@ static void tce_iommu_disable(struct tce_container *container) container->enabled = false; - if (!container->grp || !current->mm) + if (!container->grp) return; data = iommu_group_get_iommudata(container->grp); if (!data || !data->iommu_owner || !data->ops->get_table) return; - tbl = data->ops->get_table(data, 0); - if (!tbl) - return; + if (current->mm) { + tbl = data->ops->get_table(data, 0); + if (tbl) + decrement_locked_vm(tbl); - decrement_locked_vm(tbl); + tbl = data->ops->get_table(data, 1); + if (tbl) + decrement_locked_vm(tbl); + } + + if (data->ops->reset) + data->ops->reset(data); } static void *tce_iommu_open(unsigned long arg) @@ -243,7 +250,7 @@ static long tce_iommu_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg) { struct tce_container *container = iommu_data; - unsigned long minsz; + unsigned long minsz, ddwsz; long ret; switch (cmd) { @@ -288,6 +295,28 @@ 
static long tce_iommu_ioctl(void *iommu_data, info.dma32_window_size = tbl->it_size << tbl->it_page_shift; info.flags = 0; + ddwsz = offsetofend(struct vfio_iommu_spapr_tce_info, + page_size_mask); + + if (info.argsz == ddwsz) { + if (data->ops->query && data->ops->create && + data->ops->remove) { + info.flags |= VFIO_IOMMU_SPAPR_TCE_FLAG_DDW; + + ret = data->ops->query(data, + &info.current_windows, + &info.windows_available, + &info.page_size_mask); + if (ret) + return ret; + } else { + info.current_windows = 0; + info.windows_available = 0; + info.page_size_mask = 0; + } + minsz = ddwsz; + } + if (copy_to_user((void __user *)arg, &info, minsz)) return -EFAULT; @@ -412,12 +441,106 @@ static long tce_iommu_ioctl(void *iommu_data, tce_iommu_disable(container); mutex_unlock(&container->lock); return 0; + case VFIO_EEH_PE_OP: if (!container->grp) return -ENODEV; return vfio_spapr_iommu_eeh_ioctl(container->grp, cmd, arg); + + case VFIO_IOMMU_SPAPR_TCE_CREATE: { + struct vfio_iommu_spapr_tce_create create; + struct spapr_tce_iommu_group *data; + struct iommu_table *tbl; + + if (WARN_ON(!container->grp)) + return -ENXIO; + + data = iommu_group_get_iommudata(container->grp); + + minsz = offsetofend(struct vfio_iommu_spapr_tce_create, + start_addr); + + if (copy_from_user(&create, (void __user *)arg, minsz)) + return -EFAULT; + + if (create.argsz < minsz) + return -EINVAL; + + if (create.flags) + return -EINVAL; + + if (!data->ops->create || !data->iommu_owner) + return -ENOSYS; + + BUG_ON(!data || !data->ops || !data->ops->remove); + + ret =
[PATCH v2 09/13] powerpc/pseries/lpar: Enable VFIO
The previous patch introduced the iommu_table_ops::exchange() callback which effectively disabled VFIO on pseries. This implements exchange() for pseries/lpar so VFIO can work in nested guests. Since the exchange() callback returns an old TCE, it has to call H_GET_TCE for every TCE being put to the table, so VFIO performance in guests running under PR KVM is expected to be slower than in guests running under HV KVM or on bare metal hosts. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/platforms/pseries/iommu.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 9a7364f..ae15b5a 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -138,13 +138,14 @@ static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long); static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, + unsigned long *old_tces, enum dma_data_direction direction, struct dma_attrs *attrs) { u64 rc = 0; u64 proto_tce, tce; u64 rpn; - int ret = 0; + int ret = 0, i = 0; long tcenum_start = tcenum, npages_start = npages; rpn = __pa(uaddr) >> TCE_SHIFT; @@ -154,6 +155,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, while (npages--) { tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT; + if (old_tces) + plpar_tce_get((u64)tbl->it_index, (u64)tcenum << 12, + &old_tces[i++]); rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce); if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) { @@ -179,8 +183,9 @@ static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum, static DEFINE_PER_CPU(__be64 *, tce_page); -static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, +static int tce_xchg_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages, unsigned long uaddr, +unsigned long *old_tces, enum dma_data_direction direction, struct dma_attrs *attrs) { @@ -195,6
+200,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) { return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, + old_tces, direction, attrs); } @@ -211,6 +217,7 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, if (!tcep) { local_irq_restore(flags); return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr, + old_tces, direction, attrs); } __get_cpu_var(tce_page) = tcep; @@ -232,6 +239,10 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, for (l = 0; l < limit; l++) { tcep[l] = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT); rpn++; + if (old_tces) + plpar_tce_get((u64)tbl->it_index, + (u64)(tcenum + l) << 12, + &old_tces[tcenum + l]); } rc = plpar_tce_put_indirect((u64)tbl->it_index, @@ -262,6 +273,15 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, return ret; } +static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum, +long npages, unsigned long uaddr, +enum dma_data_direction direction, +struct dma_attrs *attrs) +{ + return tce_xchg_pSeriesLP(tbl, tcenum, npages, uaddr, NULL, + direction, attrs); +} + static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages) { u64 rc; @@ -637,6 +657,7 @@ static void pci_dma_bus_setup_pSeries(struct pci_bus *bus) struct iommu_table_ops iommu_table_lpar_multi_ops = { .set = tce_buildmulti_pSeriesLP, + .exchange = tce_xchg_pSeriesLP, .clear = tce_freemulti_pSeriesLP, .get = tce_get_pSeriesLP }; -- 2.0.0
[PATCH v2 08/13] powerpc/powernv: Release replaced TCE
At the moment writing new TCE value to the IOMMU table fails with EBUSY if there is a valid entry already. However PAPR specification allows the guest to write new TCE value without clearing it first. Another problem this patch is addressing is the use of pool locks for external IOMMU users such as VFIO. The pool locks are to protect DMA page allocator rather than entries and since the host kernel does not control what pages are in use, there is no point in pool locks and exchange()+put_page(oldtce) is sufficient to avoid possible races. This adds an exchange() callback to iommu_table_ops which does the same thing as set() plus it returns replaced TCE(s) so the caller can release the pages afterwards. This makes iommu_tce_build() put pages returned by exchange(). This replaces iommu_clear_tce() with iommu_tce_build which now can call exchange() with TCE==NULL (i.e. clear). This preserves permission bits in TCE in iommu_put_tce_user_mode(). This removes use of pool locks for external IOMMU uses. This disables external IOMMU use (i.e. VFIO) for IOMMUs which do not implement exchange() callback. Therefore the "powernv" platform is the only supported one after this patch. 
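The exchange() idea described above, reduced to its essence in userspace C (a C11 atomic swap stands in for the platform's TCE update here; the kernel code of course operates on real hardware TCE tables):

```c
#include <assert.h>
#include <stdatomic.h>

/* The caller swaps in a new TCE and gets the old value back in one
 * step, so it can mark the replaced page dirty and release it without
 * holding any pool lock: all the state lives in the entry itself.
 * Passing 0 as the new TCE doubles as "clear", which is why a separate
 * iommu_clear_tce() is no longer needed. */
static unsigned long toy_tce_exchange(_Atomic unsigned long *table,
				      long index, unsigned long new_tce)
{
	return atomic_exchange(&table[index], new_tce);
}
```

Compare this with the old set() path, which had to fail with EBUSY on a valid entry because it could not hand the previous value back to the caller.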
Signed-off-by: Alexey Kardashevskiy --- Changes: v2: * added missing __pa() for TCE which was read from the table --- arch/powerpc/include/asm/iommu.h | 8 +++-- arch/powerpc/kernel/iommu.c | 62 arch/powerpc/platforms/powernv/pci.c | 40 +++ 3 files changed, 67 insertions(+), 43 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index c725e4a..8e0537d 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -49,6 +49,12 @@ struct iommu_table_ops { unsigned long uaddr, enum dma_data_direction direction, struct dma_attrs *attrs); + int (*exchange)(struct iommu_table *tbl, + long index, long npages, + unsigned long uaddr, + unsigned long *old_tces, + enum dma_data_direction direction, + struct dma_attrs *attrs); void (*clear)(struct iommu_table *tbl, long index, long npages); unsigned long (*get)(struct iommu_table *tbl, long index); @@ -209,8 +215,6 @@ extern int iommu_tce_put_param_check(struct iommu_table *tbl, unsigned long ioba, unsigned long tce); extern int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, unsigned long hwaddr, enum dma_data_direction direction); -extern unsigned long iommu_clear_tce(struct iommu_table *tbl, - unsigned long entry); extern int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages); extern int iommu_put_tce_user_mode(struct iommu_table *tbl, diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index 678fee8..39ccce7 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1006,43 +1006,11 @@ int iommu_tce_put_param_check(struct iommu_table *tbl, } EXPORT_SYMBOL_GPL(iommu_tce_put_param_check); -unsigned long iommu_clear_tce(struct iommu_table *tbl, unsigned long entry) -{ - unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - - spin_lock(&(pool->lock)); - - oldtce = tbl->it_ops->get(tbl, entry); - if (oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)) - 
tbl->it_ops->clear(tbl, entry, 1); - else - oldtce = 0; - - spin_unlock(&(pool->lock)); - - return oldtce; -} -EXPORT_SYMBOL_GPL(iommu_clear_tce); - int iommu_clear_tces_and_put_pages(struct iommu_table *tbl, unsigned long entry, unsigned long pages) { - unsigned long oldtce; - struct page *page; - for ( ; pages; --pages, ++entry) { - oldtce = iommu_clear_tce(tbl, entry); - if (!oldtce) - continue; - - page = pfn_to_page(oldtce >> PAGE_SHIFT); - WARN_ON(!page); - if (page) { - if (oldtce & TCE_PCI_WRITE) - SetPageDirty(page); - put_page(page); - } + iommu_tce_build(tbl, entry, 0, DMA_NONE); } return 0; @@ -1056,18 +1024,19 @@ EXPORT_SYMBOL_GPL(iommu_clear_tces_and_put_pages); int iommu_tce_build(struct iommu_table *tbl, unsigned long entry, unsigned long hwaddr, enum dma_data_direction direction) { - int ret = -EBUSY; + int ret; unsigned long oldtce; - struct iommu_pool *pool = get_pool(tbl, entry); - spin_lock(&(pool->lock)); + ret = tbl->it_ops->exchange(tbl, entry, 1, hwaddr, &oldtce, + direction, NULL); -
[PATCH v2 11/13] vfio: powerpc/spapr: Move locked_vm accounting to helpers
This moves locked pages accounting to helpers. Later they will be reused for Dynamic DMA windows (DDW). While we are here, update the comment explaining why RLIMIT_MEMLOCK might be required to be bigger than the guest RAM. Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/vfio_iommu_spapr_tce.c | 71 +++-- 1 file changed, 53 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index 1c1a9c4..c9fac97 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -29,6 +29,46 @@ static void tce_iommu_detach_group(void *iommu_data, struct iommu_group *iommu_group); +static long try_increment_locked_vm(struct iommu_table *tbl) +{ + long ret = 0, locked, lock_limit, npages; + + if (!current || !current->mm) + return -ESRCH; /* process exited */ + + npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; + + down_write(&current->mm->mmap_sem); + locked = current->mm->locked_vm + npages; + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n", + rlimit(RLIMIT_MEMLOCK)); + ret = -ENOMEM; + } else { + current->mm->locked_vm += npages; + } + up_write(&current->mm->mmap_sem); + + return ret; +} + +static void decrement_locked_vm(struct iommu_table *tbl) +{ + long npages; + + if (!current || !current->mm) + return; /* process exited */ + + npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; + + down_write(&current->mm->mmap_sem); + if (npages > current->mm->locked_vm) + npages = current->mm->locked_vm; + current->mm->locked_vm -= npages; + up_write(&current->mm->mmap_sem); +} + /* * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation * @@ -86,7 +126,6 @@ static void tce_iommu_take_ownership_notify(struct spapr_tce_iommu_group *data, static int tce_iommu_enable(struct tce_container *container) { int ret = 0; - unsigned long locked, lock_limit, npages; struct iommu_table *tbl; struct spapr_tce_iommu_group
*data; @@ -120,24 +159,23 @@ static int tce_iommu_enable(struct tce_container *container) * Also we don't have a nice way to fail on H_PUT_TCE due to ulimits, * that would effectively kill the guest at random points, much better * enforcing the limit based on the max that the guest can map. +* +* Unfortunately at the moment it counts whole tables, no matter how +* much memory the guest has. I.e. for 4GB guest and 4 IOMMU groups +* each with 2GB DMA window, 8GB will be counted here. The reason for +* this is that we cannot tell here the amount of RAM used by the guest +* as this information is only available from KVM and VFIO is +* KVM agnostic. */ tbl = data->ops->get_table(data, 0); if (!tbl) return -ENXIO; - down_write(&current->mm->mmap_sem); - npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; - locked = current->mm->locked_vm + npages; - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { - pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n", - rlimit(RLIMIT_MEMLOCK)); - ret = -ENOMEM; - } else { - current->mm->locked_vm += npages; - container->enabled = true; - } - up_write(&current->mm->mmap_sem); + ret = try_increment_locked_vm(tbl); + if (ret) + return ret; + + container->enabled = true; return ret; } @@ -163,10 +201,7 @@ static void tce_iommu_disable(struct tce_container *container) if (!tbl) return; - down_write(&current->mm->mmap_sem); - current->mm->locked_vm -= (tbl->it_size << - IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; - up_write(&current->mm->mmap_sem); + decrement_locked_vm(tbl); } static void *tce_iommu_open(unsigned long arg) -- 2.0.0
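The accounting performed by the two helpers can be modelled as follows (a plain sketch: -1 stands in for -ENOMEM, and the limit and counter are bare numbers with no mmap_sem or CAP_IPC_LOCK handling):

```c
#include <assert.h>

/* Sketch of try_increment_locked_vm()/decrement_locked_vm(): refuse an
 * increment that would push locked_vm past the RLIMIT_MEMLOCK-style
 * cap, and clamp the decrement so locked_vm never goes negative. */
struct toy_mm {
	long locked_vm;
};

static int toy_try_increment_locked_vm(struct toy_mm *mm, long npages,
				       long lock_limit)
{
	if (mm->locked_vm + npages > lock_limit)
		return -1;		/* -ENOMEM in the kernel helper */
	mm->locked_vm += npages;
	return 0;
}

static void toy_decrement_locked_vm(struct toy_mm *mm, long npages)
{
	if (npages > mm->locked_vm)
		npages = mm->locked_vm;	/* never underflow the counter */
	mm->locked_vm -= npages;
}
```

Packaging this pair up is what lets the later DDW patch account a second window with the same code path.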
[PATCH v2 05/13] powerpc/iommu: Fix IOMMU ownership control functions
This adds missing locks in iommu_take_ownership()/ iommu_release_ownership(). This marks all pages busy in iommu_table::it_map in order to catch errors if there is an attempt to use this table while ownership over it is taken. This only clears TCE content if there is no page marked busy in it_map. Clearing must be done outside of the table locks as iommu_clear_tce() called from iommu_clear_tces_and_put_pages() does this. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/kernel/iommu.c | 36 +--- 1 file changed, 29 insertions(+), 7 deletions(-) diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c index c2c8d9d..cd80867 100644 --- a/arch/powerpc/kernel/iommu.c +++ b/arch/powerpc/kernel/iommu.c @@ -1126,33 +1126,55 @@ EXPORT_SYMBOL_GPL(iommu_put_tce_user_mode); int iommu_take_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl->it_size + 7) >> 3; + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; + int ret = 0, bit0 = 0; + + spin_lock_irqsave(&tbl->large_pool.lock, flags); + for (i = 0; i < tbl->nr_pools; i++) + spin_lock(&tbl->pools[i].lock); if (tbl->it_offset == 0) - clear_bit(0, tbl->it_map); + bit0 = test_and_clear_bit(0, tbl->it_map); if (!bitmap_empty(tbl->it_map, tbl->it_size)) { pr_err("iommu_tce: it_map is not empty"); - return -EBUSY; + ret = -EBUSY; + if (bit0) + set_bit(0, tbl->it_map); + } else { + memset(tbl->it_map, 0xff, sz); } - memset(tbl->it_map, 0xff, sz); - iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size); + for (i = 0; i < tbl->nr_pools; i++) + spin_unlock(&tbl->pools[i].lock); + spin_unlock_irqrestore(&tbl->large_pool.lock, flags); - return 0; + if (!ret) + iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, + tbl->it_size); + return ret; } EXPORT_SYMBOL_GPL(iommu_take_ownership); void iommu_release_ownership(struct iommu_table *tbl) { - unsigned long sz = (tbl->it_size + 7) >> 3; + unsigned long flags, i, sz = (tbl->it_size + 7) >> 3; iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size); + +
spin_lock_irqsave(&tbl->large_pool.lock, flags); + for (i = 0; i < tbl->nr_pools; i++) + spin_lock(&tbl->pools[i].lock); + memset(tbl->it_map, 0, sz); /* Restore bit#0 set by iommu_init_table() */ if (tbl->it_offset == 0) set_bit(0, tbl->it_map); + + for (i = 0; i < tbl->nr_pools; i++) + spin_unlock(&tbl->pools[i].lock); + spin_unlock_irqrestore(&tbl->large_pool.lock, flags); } EXPORT_SYMBOL_GPL(iommu_release_ownership); -- 2.0.0
[PATCH v2 12/13] vfio: powerpc/spapr: Use it_page_size
This makes use of the it_page_size from the iommu_table struct as page size can differ. This replaces missing IOMMU_PAGE_SHIFT macro in commented debug code as recently introduced IOMMU_PAGE_XXX macros do not include IOMMU_PAGE_SHIFT. Signed-off-by: Alexey Kardashevskiy --- drivers/vfio/vfio_iommu_spapr_tce.c | 36 ++-- 1 file changed, 18 insertions(+), 18 deletions(-) diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c index c9fac97..0dccbc4 100644 --- a/drivers/vfio/vfio_iommu_spapr_tce.c +++ b/drivers/vfio/vfio_iommu_spapr_tce.c @@ -36,7 +36,7 @@ static long try_increment_locked_vm(struct iommu_table *tbl) if (!current || !current->mm) return -ESRCH; /* process exited */ - npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; + npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; down_write(>mm->mmap_sem); locked = current->mm->locked_vm + npages; @@ -60,7 +60,7 @@ static void decrement_locked_vm(struct iommu_table *tbl) if (!current || !current->mm) return; /* process exited */ - npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT; + npages = (tbl->it_size << tbl->it_page_shift) >> PAGE_SHIFT; down_write(>mm->mmap_sem); if (npages > current->mm->locked_vm) @@ -284,8 +284,8 @@ static long tce_iommu_ioctl(void *iommu_data, if (info.argsz < minsz) return -EINVAL; - info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT_4K; - info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT_4K; + info.dma32_window_start = tbl->it_offset << tbl->it_page_shift; + info.dma32_window_size = tbl->it_size << tbl->it_page_shift; info.flags = 0; if (copy_to_user((void __user *)arg, , minsz)) @@ -318,10 +318,6 @@ static long tce_iommu_ioctl(void *iommu_data, VFIO_DMA_MAP_FLAG_WRITE)) return -EINVAL; - if ((param.size & ~IOMMU_PAGE_MASK_4K) || - (param.vaddr & ~IOMMU_PAGE_MASK_4K)) - return -EINVAL; - /* iova is checked by the IOMMU API */ tce = param.vaddr; if (param.flags & VFIO_DMA_MAP_FLAG_READ) @@ -334,21 +330,25 
@@ static long tce_iommu_ioctl(void *iommu_data, return -ENXIO; BUG_ON(!tbl->it_group); + if ((param.size & ~IOMMU_PAGE_MASK(tbl)) || + (param.vaddr & ~IOMMU_PAGE_MASK(tbl))) + return -EINVAL; + ret = iommu_tce_put_param_check(tbl, param.iova, tce); if (ret) return ret; - for (i = 0; i < (param.size >> IOMMU_PAGE_SHIFT_4K); ++i) { + for (i = 0; i < (param.size >> tbl->it_page_shift); ++i) { ret = iommu_put_tce_user_mode(tbl, - (param.iova >> IOMMU_PAGE_SHIFT_4K) + i, + (param.iova >> tbl->it_page_shift) + i, tce); if (ret) break; - tce += IOMMU_PAGE_SIZE_4K; + tce += IOMMU_PAGE_SIZE(tbl); } if (ret) iommu_clear_tces_and_put_pages(tbl, - param.iova >> IOMMU_PAGE_SHIFT_4K, i); + param.iova >> tbl->it_page_shift, i); iommu_flush_tce(tbl); @@ -379,23 +379,23 @@ static long tce_iommu_ioctl(void *iommu_data, if (param.flags) return -EINVAL; - if (param.size & ~IOMMU_PAGE_MASK_4K) - return -EINVAL; - tbl = spapr_tce_find_table(container, data, param.iova); if (!tbl) return -ENXIO; + if (param.size & ~IOMMU_PAGE_MASK(tbl)) + return -EINVAL; + BUG_ON(!tbl->it_group); ret = iommu_tce_clear_param_check(tbl, param.iova, 0, - param.size >> IOMMU_PAGE_SHIFT_4K); + param.size >> tbl->it_page_shift); if (ret) return ret; ret = iommu_clear_tces_and_put_pages(tbl, - param.iova >> IOMMU_PAGE_SHIFT_4K, - param.size >> IOMMU_PAGE_SHIFT_4K); + param.iova >> tbl->it_page_shift, + param.size >> tbl->it_page_shift); iommu_flush_tce(tbl); return ret; -- 2.0.0 -- To unsubscribe
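The page-size-aware conversion that replaces the hard-coded 4K shifts in this patch can be checked with a little arithmetic (PAGE_SHIFT of 12, i.e. 4K system pages, is an assumption of this sketch):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* 4K system pages assumed */

/* locked_vm is accounted in system pages: a window of it_size TCEs,
 * each covering 1 << it_page_shift bytes, pins this many 4K pages. */
static uint64_t toy_locked_npages(uint64_t it_size, unsigned it_page_shift)
{
	return (it_size << it_page_shift) >> TOY_PAGE_SHIFT;
}
```

With 16M IOMMU pages (it_page_shift = 24), a 512-entry table covers 8GB and therefore pins 2097152 system pages; the old IOMMU_PAGE_SHIFT_4K code was only correct for the it_page_shift == 12 case.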
[PATCH v2 06/13] powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
This adds a iommu_table_ops struct and puts pointer to it into the iommu_table struct. This moves tce_build/tce_free/tce_get/tce_flush callbacks from ppc_md to the new struct where they really belong to. This adds an extra @ops parameter to iommu_init_table() to make sure that we do not leave any IOMMU table without iommu_table_ops. @it_ops is initialized in the very beginning as iommu_init_table() calls iommu_table_clear() and the latter uses callbacks already. This does s/tce_build/set/, s/tce_free/clear/ and removes "tce_" prefixes for better readability. This removes tce_xxx_rm handlers from ppc_md as well but does not add them to iommu_table_ops, this will be done later if we decide to support TCE hypercalls in real mode. This always uses tce_buildmulti_pSeriesLP/tce_buildmulti_pSeriesLP as callbacks for pseries. This changes "multi" callbacks to fall back to tce_build_pSeriesLP/tce_free_pSeriesLP if FW_FEATURE_MULTITCE is not present. The reason for this is we still have to support "multitce=off" boot parameter in disable_multitce() and we do not want to walk through all IOMMU tables in the system and replace "multi" callbacks with single ones. 
Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/include/asm/iommu.h| 20 +++- arch/powerpc/include/asm/machdep.h | 25 --- arch/powerpc/kernel/iommu.c | 50 - arch/powerpc/kernel/vio.c | 5 ++- arch/powerpc/platforms/cell/iommu.c | 9 -- arch/powerpc/platforms/pasemi/iommu.c | 8 +++-- arch/powerpc/platforms/powernv/pci-ioda.c | 4 +-- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 3 +- arch/powerpc/platforms/powernv/pci.c| 24 -- arch/powerpc/platforms/powernv/pci.h| 1 + arch/powerpc/platforms/pseries/iommu.c | 42 +--- arch/powerpc/sysdev/dart_iommu.c| 13 12 files changed, 102 insertions(+), 102 deletions(-) diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h index 2b0b01d..c725e4a 100644 --- a/arch/powerpc/include/asm/iommu.h +++ b/arch/powerpc/include/asm/iommu.h @@ -43,6 +43,22 @@ extern int iommu_is_off; extern int iommu_force_on; +struct iommu_table_ops { + int (*set)(struct iommu_table *tbl, + long index, long npages, + unsigned long uaddr, + enum dma_data_direction direction, + struct dma_attrs *attrs); + void (*clear)(struct iommu_table *tbl, + long index, long npages); + unsigned long (*get)(struct iommu_table *tbl, long index); + void (*flush)(struct iommu_table *tbl); +}; + +/* These are used by VIO */ +extern struct iommu_table_ops iommu_table_lpar_multi_ops; +extern struct iommu_table_ops iommu_table_pseries_ops; + /* * IOMAP_MAX_ORDER defines the largest contiguous block * of dma space we can get. 
IOMAP_MAX_ORDER = 13 @@ -77,6 +93,7 @@ struct iommu_table { #ifdef CONFIG_IOMMU_API struct iommu_group *it_group; #endif + struct iommu_table_ops *it_ops; }; /* Pure 2^n version of get_order */ @@ -106,7 +123,8 @@ extern void iommu_free_table(struct iommu_table *tbl, const char *node_name); * structure */ extern struct iommu_table *iommu_init_table(struct iommu_table * tbl, - int nid); + int nid, + struct iommu_table_ops *ops); struct spapr_tce_iommu_ops; #ifdef CONFIG_IOMMU_API diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h index b125cea..1fc824d 100644 --- a/arch/powerpc/include/asm/machdep.h +++ b/arch/powerpc/include/asm/machdep.h @@ -65,31 +65,6 @@ struct machdep_calls { * destroyed as well */ void(*hpte_clear_all)(void); - int (*tce_build)(struct iommu_table *tbl, -long index, -long npages, -unsigned long uaddr, -enum dma_data_direction direction, -struct dma_attrs *attrs); - void(*tce_free)(struct iommu_table *tbl, - long index, - long npages); - unsigned long (*tce_get)(struct iommu_table *tbl, - long index); - void(*tce_flush)(struct iommu_table *tbl); - - /* _rm versions are for real mode use only */ - int (*tce_build_rm)(struct iommu_table *tbl, -long index, -long npages, -unsigned long uaddr, -enum dma_data_direction direction, -
[PATCH v2 00/13] powerpc/iommu/vfio: Enable Dynamic DMA windows
This enables PAPR defined feature called Dynamic DMA windows (DDW). Each Partitionable Endpoint (IOMMU group) has a separate DMA window on a PCI bus where devices are allows to perform DMA. By default there is 1 or 2GB window allocated at the host boot time and these windows are used when an IOMMU group is passed to the userspace (guest). These windows are mapped at zero offset on a PCI bus. Hi-speed devices may suffer from limited size of this window. On the host side a TCE bypass mode is enabled on POWER8 CPU which implements direct mapping of the host memory to a PCI bus at 1<<59. For the guest, PAPR defines a DDW RTAS API which allows the pseries guest to query the hypervisor if it supports DDW and what are the parameters of possible windows. Currently POWER8 supports 2 DMA windows per PE - already mentioned and used small 32bit window and 64bit window which can only start from 1<<59 and can support various page sizes. This patchset reworks PPC IOMMU code and adds necessary structures to extend it to support big windows. When the guest detectes the feature and the PE is capable of 64bit DMA, it does: 1. query to hypervisor about number of available windows and page masks; 2. creates a window with the biggest possible page size (current guests can do 64K or 16MB TCEs); 3. maps the entire guest RAM via H_PUT_TCE* hypercalls 4. switches dma_ops to direct_dma_ops on the selected PE. Once this is done, H_PUT_TCE is not called anymore and the guest gets maximum performance. Please comment. Thanks! 
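The window sizing behind step 2 follows directly from the numbers in the cover letter: one 8-byte TCE per IOMMU page. A quick sanity check of that arithmetic (pure illustration, mirroring the tce_table_size computation in the IODA2 create path):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of TCE table needed for a DMA window of 1 << window_shift
 * bytes backed by IOMMU pages of 1 << page_shift bytes: one 8-byte
 * TCE per IOMMU page. Bigger pages mean a dramatically smaller table. */
static uint64_t toy_tce_table_bytes(unsigned window_shift, unsigned page_shift)
{
	return (1ULL << (window_shift - page_shift)) * 8;
}
```

For example, a 4GB window (window_shift = 32) of 64K pages (page_shift = 16) needs 65536 TCEs, i.e. 512KB of table, while the same window with 16M pages needs only 2KB — which is why the guest picks the biggest page size the mask allows.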
Changes: v2: * added missing __pa() in "powerpc/powernv: Release replaced TCE" * reposted to make some noise :)

Alexey Kardashevskiy (13):
powerpc/iommu: Check that TCE page size is equal to it_page_size
powerpc/powernv: Make invalidate() a callback
powerpc/spapr: vfio: Implement spapr_tce_iommu_ops
powerpc/powernv: Convert/move set_bypass() callback to take_ownership()
powerpc/iommu: Fix IOMMU ownership control functions
powerpc/iommu: Move tce_xxx callbacks from ppc_md to iommu_table
powerpc/powernv: Do not set "read" flag if direction==DMA_NONE
powerpc/powernv: Release replaced TCE
powerpc/pseries/lpar: Enable VFIO
powerpc/powernv: Implement Dynamic DMA windows (DDW) for IODA
vfio: powerpc/spapr: Move locked_vm accounting to helpers
vfio: powerpc/spapr: Use it_page_size
vfio: powerpc/spapr: Enable Dynamic DMA windows

arch/powerpc/include/asm/iommu.h| 35 ++- arch/powerpc/include/asm/machdep.h | 25 -- arch/powerpc/include/asm/tce.h | 37 +++ arch/powerpc/kernel/iommu.c | 213 +-- arch/powerpc/kernel/vio.c | 5 +- arch/powerpc/platforms/cell/iommu.c | 9 +- arch/powerpc/platforms/pasemi/iommu.c | 8 +- arch/powerpc/platforms/powernv/pci-ioda.c | 233 +++++-- arch/powerpc/platforms/powernv/pci-p5ioc2.c | 4 +- arch/powerpc/platforms/powernv/pci.c| 113 +--- arch/powerpc/platforms/powernv/pci.h| 15 +- arch/powerpc/platforms/pseries/iommu.c | 77 -- arch/powerpc/sysdev/dart_iommu.c| 13 +- drivers/vfio/vfio_iommu_spapr_tce.c | 384 +++- include/uapi/linux/vfio.h | 25 +- 15 files changed, 925 insertions(+), 271 deletions(-) -- 2.0.0
Re: [PATCH 3.4 00/45] 3.4.104-rc1 review
On 09/22/2014 07:42 PM, Guenter Roeck wrote:
> On 09/22/2014 07:27 PM, Zefan Li wrote:
>> From: Zefan Li
>>
>> This is the start of the stable review cycle for the 3.4.104 release.
>> There are 45 patches in this series, all will be posted as a response
>> to this one. If anyone has any issues with these being applied, please
>> let me know.
>>
>> Responses should be made by Thu Sep 25 02:03:31 UTC 2014.
>> Anything received after that time might be too late.
>>
>> A combined patch relative to 3.4.103 will be posted as an additional
>> response to this. A shortlog and diffstat can be found below.
>>
>> thanks,
>>
>> Zefan Li
>
> Hi,
>
> did you push the latest patch ? I only see 43 patches in the queue.
>
Never mind, got it now.

Guenter
Re: [PATCH 3.4 00/45] 3.4.104-rc1 review
This is the combined patch for 3.4.104-rc1 relative to 3.4.103.

---

diff --git a/Documentation/stable_kernel_rules.txt b/Documentation/stable_kernel_rules.txt
index b0714d8..8dfb6a5 100644
--- a/Documentation/stable_kernel_rules.txt
+++ b/Documentation/stable_kernel_rules.txt
@@ -29,6 +29,9 @@ Rules on what kind of patches are accepted, and which ones are not, into the

 Procedure for submitting patches to the -stable tree:

+ - If the patch covers files in net/ or drivers/net please follow netdev stable
+   submission guidelines as described in
+   Documentation/networking/netdev-FAQ.txt
 - Send the patch, after verifying that it follows the above rules, to
   sta...@vger.kernel.org.  You must note the upstream commit ID in the
   changelog of your submission, as well as the kernel version you wish
diff --git a/Makefile b/Makefile
index 36f0913..77a9aa6 100644
--- a/Makefile
+++ b/Makefile
@@ -1,7 +1,7 @@
 VERSION = 3
 PATCHLEVEL = 4
-SUBLEVEL = 103
-EXTRAVERSION =
+SUBLEVEL = 104
+EXTRAVERSION = -rc1
 NAME = Saber-toothed Squirrel

 # *DOCUMENTATION*
diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h
index 7a3d38d..5ebab58 100644
--- a/arch/alpha/include/asm/io.h
+++ b/arch/alpha/include/asm/io.h
@@ -489,6 +489,11 @@ extern inline void writeq(u64 b, volatile void __iomem *addr)
 }
 #endif

+#define ioread16be(p) be16_to_cpu(ioread16(p))
+#define ioread32be(p) be32_to_cpu(ioread32(p))
+#define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p))
+#define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p))
+
 #define inb_p		inb
 #define inw_p		inw
 #define inl_p		inl
diff --git a/arch/alpha/oprofile/common.c b/arch/alpha/oprofile/common.c
index a0a5d27..b8ce18f 100644
--- a/arch/alpha/oprofile/common.c
+++ b/arch/alpha/oprofile/common.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include

 #include "op_impl.h"
diff --git a/arch/arm/kernel/entry-header.S b/arch/arm/kernel/entry-header.S
index 9a8531e..9d95a46 100644
--- a/arch/arm/kernel/entry-header.S
+++ b/arch/arm/kernel/entry-header.S
@@ -76,26 +76,21 @@
 #ifndef CONFIG_THUMB2_KERNEL
 	.macro	svc_exit, rpsr
 	msr	spsr_cxsf, \rpsr
-#if defined(CONFIG_CPU_V6)
-	ldr	r0, [sp]
-	strex	r1, r2, [sp]			@ clear the exclusive monitor
-	ldmib	sp, {r1 - pc}^			@ load r1 - pc, cpsr
-#elif defined(CONFIG_CPU_32v6K)
-	clrex					@ clear the exclusive monitor
-	ldmia	sp, {r0 - pc}^			@ load r0 - pc, cpsr
-#else
-	ldmia	sp, {r0 - pc}^			@ load r0 - pc, cpsr
+#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
+	@ We must avoid clrex due to Cortex-A15 erratum #830321
+	sub	r0, sp, #4			@ uninhabited address
+	strex	r1, r2, [r0]			@ clear the exclusive monitor
 #endif
+	ldmia	sp, {r0 - pc}^			@ load r0 - pc, cpsr
 	.endm

 	.macro	restore_user_regs, fast = 0, offset = 0
 	ldr	r1, [sp, #\offset + S_PSR]	@ get calling cpsr
 	ldr	lr, [sp, #\offset + S_PC]!	@ get pc
 	msr	spsr_cxsf, r1			@ save in spsr_svc
-#if defined(CONFIG_CPU_V6)
+#if defined(CONFIG_CPU_V6) || defined(CONFIG_CPU_32v6K)
+	@ We must avoid clrex due to Cortex-A15 erratum #830321
 	strex	r1, r2, [sp]			@ clear the exclusive monitor
-#elif defined(CONFIG_CPU_32v6K)
-	clrex					@ clear the exclusive monitor
 #endif
 	.if	\fast
 	ldmdb	sp, {r1 - lr}^			@ get calling r1 - lr
@@ -123,7 +118,10 @@
 	.macro	svc_exit, rpsr
 	ldr	lr, [sp, #S_SP]			@ top of the stack
 	ldrd	r0, r1, [sp, #S_LR]		@ calling lr and pc
-	clrex					@ clear the exclusive monitor
+
+	@ We must avoid clrex due to Cortex-A15 erratum #830321
+	strex	r2, r1, [sp, #S_LR]		@ clear the exclusive monitor
+
 	stmdb	lr!, {r0, r1, \rpsr}		@ calling lr and rfe context
 	ldmia	sp, {r0 - r12}
 	mov	sp, lr
@@ -132,13 +130,16 @@
 	.endm

 	.macro	restore_user_regs, fast = 0, offset = 0
-	clrex					@ clear the exclusive monitor
 	mov	r2, sp
 	load_user_sp_lr r2, r3, \offset + S_SP	@ calling sp, lr
 	ldr	r1, [sp, #\offset + S_PSR]	@ get calling cpsr
 	ldr	lr, [sp, #\offset + S_PC]	@ get pc
 	add	sp, sp, #\offset + S_SP
 	msr	spsr_cxsf, r1			@ save in spsr_svc
+
+	@ We must avoid clrex due to Cortex-A15 erratum #830321
+	strex	r1, r2, [sp]			@ clear the exclusive monitor
+
 	.if	\fast
 	ldmdb	sp, {r1 - r12}			@ get calling r1 - r12
 	.else
diff --git
[PATCH 2/3] dt-bindings: add document of Rockchip power domain
From: "jinkun.hong"

Signed-off-by: Jack Dai
Signed-off-by: Caesar Wang
Signed-off-by: jinkun.hong
---
 .../bindings/arm/rockchip/power_domain.txt | 48
 1 file changed, 48 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/arm/rockchip/power_domain.txt

diff --git a/Documentation/devicetree/bindings/arm/rockchip/power_domain.txt b/Documentation/devicetree/bindings/arm/rockchip/power_domain.txt
new file mode 100644
index 000..2a80d3f
--- /dev/null
+++ b/Documentation/devicetree/bindings/arm/rockchip/power_domain.txt
@@ -0,0 +1,48 @@
+* Rockchip Power Domains
+
+Rockchip processors include support for multiple power domains which can be
+powered up/down by software based on different application scenes to save power.
+
+Required properties for power domain controller:
+- compatible: should be one of the following.
+	* rockchip,rk3288-power-controller - for rk3288 type power domain.
+- #power-domain-cells: Number of cells in a power-domain specifier.
+	should be 1.
+- rockchip,pmu: phandle referencing a syscon providing the pmu registers
+- #address-cells: should be 1.
+- #size-cells: should be 0.
+
+Required properties for power domain sub nodes:
+- reg: index of the power domain, should use macros in:
+	* include/dt-bindings/power-domain/rk3288.h - for rk3288 type power domain.
+- clocks: phandles to clocks which need to be enabled while power domain
+	switches state.
+
+Example:
+
+	power: power-controller {
+		compatible = "rockchip,rk3288-power-controller";
+		#power-domain-cells = <1>;
+		rockchip,pmu = <>;
+		#address-cells = <1>;
+		#size-cells = <0>;
+
+		pd_gpu {
+			reg = ;
+			clocks = < ACLK_GPU>;
+		};
+	};
+
+Node of a device using power domains must have a power-domains property,
+containing a phandle to the power device node and an index specifying which
+power domain to use.
+The index should use macros in:
+	* include/dt-bindings/power-domain/rk3288.h - for rk3288 type power domain.
+
+Example of the node using power domain:
+
+	node {
+		/* ... */
+		power-domains = < RK3288_PD_GPU>;
+		/* ... */
+	};
--
1.7.9.5