Re: [PATCH v3 8/8] block: Assign a lock_class per gendisk used for wait_for_completion()

2017-10-24 Thread Ingo Molnar

* Byungchul Park  wrote:

> > Isn't lockdep_map a zero size structure that is always defined? If yes then 
> > there's no need for an #ifdef.
> 
> No, a zero size structure for lockdep_map is not provided yet.
> There are two options I can do:
> 
> 1. Add a zero size structure for lockdep_map and remove #ifdef
> 2. Replace CONFIG_LOCKDEP_COMPLETIONS with CONFIG_LOCKDEP here.
> 
> Or something else?
> 
> Which one do you prefer?

Ok, could we try #1 in a new patch and re-spin the simplified block layer
patch on top of that?

The less ugly a debug facility's impact on unrelated kernel code is, the
better - especially when it comes to annotating false positives.

Thanks,

Ingo


[PATCH v4 7/7] block: Assign a lock_class per gendisk used for wait_for_completion()

2017-10-24 Thread Byungchul Park
Darrick posted the following warning and Dave Chinner analyzed it:

> ==
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: GW
> --
> loop0/31693 is trying to acquire lock:
>  (&(&ip->i_mmaplock)->mr_lock){}, at: [] 
> xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
>  ((complete)){+.+.}, at: [] 
> submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)){+.+.}:
>lock_acquire+0xab/0x200
>wait_for_completion_io+0x4e/0x1a0
>submit_bio_wait+0x7f/0xb0
>blkdev_issue_zeroout+0x71/0xa0
>xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
>xfs_bmapi_write+0x374/0x11f0 [xfs]
>xfs_iomap_write_direct+0x2ac/0x430 [xfs]
>xfs_file_iomap_begin+0x20d/0xd50 [xfs]
>iomap_apply+0x43/0xe0
>dax_iomap_rw+0x89/0xf0
>xfs_file_dax_write+0xcc/0x220 [xfs]
>xfs_file_write_iter+0xf0/0x130 [xfs]
>__vfs_write+0xd9/0x150
>vfs_write+0xc8/0x1c0
>SyS_write+0x45/0xa0
>entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #1 (&xfs_nondir_ilock_class){}:
>lock_acquire+0xab/0x200
>down_write_nested+0x4a/0xb0
>xfs_ilock+0x263/0x330 [xfs]
>xfs_setattr_size+0x152/0x370 [xfs]
>xfs_vn_setattr+0x6b/0x90 [xfs]
>notify_change+0x27d/0x3f0
>do_truncate+0x5b/0x90
>path_openat+0x237/0xa90
>do_filp_open+0x8a/0xf0
>do_sys_open+0x11c/0x1f0
>entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #0 (&(&ip->i_mmaplock)->mr_lock){}:
>up_write+0x1c/0x40
>xfs_iunlock+0x1d0/0x310 [xfs]
>xfs_file_fallocate+0x8a/0x310 [xfs]
>loop_queue_work+0xb7/0x8d0
>kthread_worker_fn+0xb9/0x1f0
>
> Chain exists of:
>   &(&ip->i_mmaplock)->mr_lock --> &xfs_nondir_ilock_class --> (complete)
>
>  Possible unsafe locking scenario by crosslock:
>
>CPU0CPU1
>
>   lock(&xfs_nondir_ilock_class);
>   lock((complete));
>lock(&(&ip->i_mmaplock)->mr_lock);
>unlock((complete));
>
>*** DEADLOCK ***

The warning is a false positive, caused by the fact that all
wait_for_completion()s in submit_bio_wait() are waiting with the same
lock class.

However, some bios have nothing to do with each other; for example, this
can happen when using loop devices, between bios of an upper device and a
lower device (the loop device).

The safest way to assign different lock classes to different devices is
to do it for each gendisk. In other words, this patch assigns a
lockdep_map per gendisk and uses it when initializing completion in
submit_bio_wait().

Of course, this might be too conservative. But it is the safest approach
for now, and block layer experts can refine it later.
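
The mechanics are that struct gendisk grows a lockdep_map and alloc_disk_node()
becomes a macro wrapping __alloc_disk_node(), so every call site gets its own
lock class. A rough sketch of the approach (the class name string below is
only illustrative, not the exact hunk):

	/* sketch: one lock class per alloc_disk_node() call site */
	#define alloc_disk_node(minors, node_id)				\
	({									\
		static struct lock_class_key __key;				\
		struct gendisk *__disk = __alloc_disk_node(minors, node_id);	\
										\
		if (__disk)							\
			lockdep_init_map(&__disk->lockdep_map,			\
					 "(completion)disk", &__key, 0);	\
		__disk;								\
	})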

Reported-by: Darrick J. Wong 
Analyzed-by: Dave Chinner 
Signed-off-by: Byungchul Park 
---
 block/bio.c   |  2 +-
 block/genhd.c | 10 ++
 include/linux/genhd.h | 24 ++--
 3 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 99d0ca5..a3cb1d1 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -935,7 +935,7 @@ static void submit_bio_wait_endio(struct bio *bio)
  */
 int submit_bio_wait(struct bio *bio)
 {
-   DECLARE_COMPLETION_ONSTACK(done);
+   DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
 
	bio->bi_private = &done;
bio->bi_end_io = submit_bio_wait_endio;
diff --git a/block/genhd.c b/block/genhd.c
index dd305c6..630c0da 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1354,13 +1354,7 @@ dev_t blk_lookup_devt(const char *name, int partno)
 }
 EXPORT_SYMBOL(blk_lookup_devt);
 
-struct gendisk *alloc_disk(int minors)
-{
-   return alloc_disk_node(minors, NUMA_NO_NODE);
-}
-EXPORT_SYMBOL(alloc_disk);
-
-struct gendisk *alloc_disk_node(int minors, int node_id)
+struct gendisk *__alloc_disk_node(int minors, int node_id)
 {
struct gendisk *disk;
struct disk_part_tbl *ptbl;
@@ -1411,7 +1405,7 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
}
return disk;
 }
-EXPORT_SYMBOL(alloc_disk_node);
+EXPORT_SYMBOL(__alloc_disk_node);
 
 struct kobject *get_disk(struct gendisk *disk)
 {
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6d85a75..f6ec6a2 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -206,6 +206,9 @@ struct gendisk {
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
int node_id;
struct badblocks *bb;
+#ifdef CONFIG_LOCKDEP
+   struct lockdep_map lockdep_map;
+#endif
 };
 
 static inline struct gendisk 

[PATCH v4 1/7] block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait

2017-10-24 Thread Byungchul Park
From: Christoph Hellwig 

Simplify the code by getting rid of the submit_bio_ret structure.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5f5472e..99d0ca5 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -917,17 +917,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct 
iov_iter *iter)
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
-struct submit_bio_ret {
-   struct completion event;
-   int error;
-};
-
 static void submit_bio_wait_endio(struct bio *bio)
 {
-   struct submit_bio_ret *ret = bio->bi_private;
-
-   ret->error = blk_status_to_errno(bio->bi_status);
-   complete(&ret->event);
+   complete(bio->bi_private);
 }
 
 /**
@@ -943,16 +935,15 @@ static void submit_bio_wait_endio(struct bio *bio)
  */
 int submit_bio_wait(struct bio *bio)
 {
-   struct submit_bio_ret ret;
+   DECLARE_COMPLETION_ONSTACK(done);
 
-   init_completion(&ret.event);
-   bio->bi_private = &ret;
+   bio->bi_private = &done;
bio->bi_end_io = submit_bio_wait_endio;
bio->bi_opf |= REQ_SYNC;
submit_bio(bio);
-   wait_for_completion_io(&ret.event);
+   wait_for_completion_io(&done);
 
-   return ret.error;
+   return blk_status_to_errno(bio->bi_status);
 }
 EXPORT_SYMBOL(submit_bio_wait);
 
-- 
1.9.1



[PATCH v4 6/7] workqueue: Remove unnecessary acquisitions wrt workqueue flush

2017-10-24 Thread Byungchul Park
The workqueue code added manual lockdep acquisitions to catch deadlock
cases. Now that crossrelease has been introduced, some of those are
redundant, since wait_for_completion() already performs the acquisition
itself. Remove them.
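
Conceptually, with CONFIG_LOCKDEP_COMPLETIONS the wait itself already records
the dependency, roughly like this (a sketch of the idea, not the exact
implementation):

	void wait_for_completion(struct completion *x)
	{
		complete_acquire(x);	/* lockdep sees the waiter take x->map */
		/* ... block until complete() signals x ... */
		complete_release(x);
	}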

Signed-off-by: Byungchul Park 
---
 include/linux/workqueue.h |  4 ++--
 kernel/workqueue.c| 19 +++
 2 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index f3c47a0..1455b5e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -218,7 +218,7 @@ static inline void destroy_delayed_work_on_stack(struct 
delayed_work *work) { }
\
__init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT();   \
-   lockdep_init_map(&(_work)->lockdep_map, #_work, &__key, 0); \
+   lockdep_init_map(&(_work)->lockdep_map, "(complete)"#_work, &__key, 0); \
INIT_LIST_HEAD(&(_work)->entry);\
(_work)->func = (_func);\
} while (0)
@@ -399,7 +399,7 @@ enum {
static struct lock_class_key __key; \
const char *__lock_name;\
\
-   __lock_name = #fmt#args;\
+   __lock_name = "(complete)"#fmt#args;\
\
__alloc_workqueue_key((fmt), (flags), (max_active), \
  &__key, __lock_name, ##args); \
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c77fdf6..ee05d19 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2496,15 +2496,8 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
	INIT_WORK_ONSTACK(&barr->work, wq_barrier_func);
	__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(&barr->work));
 
-   /*
-* Explicitly init the crosslock for wq_barrier::done, make its lock
-* key a subkey of the corresponding work. As a result we won't
-* build a dependency between wq_barrier::done and unrelated work.
-*/
-   lockdep_init_map_crosslock((struct lockdep_map *)&barr->done.map,
-  "(complete)wq_barr::done",
-  target->lockdep_map.key, 1);
-   __init_completion(&barr->done);
+   init_completion_map(&barr->done, &target->lockdep_map);
+
barr->task = current;
 
/*
@@ -2610,16 +2603,13 @@ void flush_workqueue(struct workqueue_struct *wq)
struct wq_flusher this_flusher = {
.list = LIST_HEAD_INIT(this_flusher.list),
.flush_color = -1,
-   .done = COMPLETION_INITIALIZER_ONSTACK(this_flusher.done),
+   .done = COMPLETION_INITIALIZER_ONSTACK_MAP(this_flusher.done, wq->lockdep_map),
};
int next_color;
 
if (WARN_ON(!wq_online))
return;
 
-   lock_map_acquire(&wq->lockdep_map);
-   lock_map_release(&wq->lockdep_map);
-
mutex_lock(>mutex);
 
/*
@@ -2882,9 +2872,6 @@ bool flush_work(struct work_struct *work)
if (WARN_ON(!wq_online))
return false;
 
-   lock_map_acquire(&work->lockdep_map);
-   lock_map_release(&work->lockdep_map);
-
	if (start_flush_work(work, &barr)) {
		wait_for_completion(&barr.done);
		destroy_work_on_stack(&barr.work);
-- 
1.9.1



[PATCH v4 2/7] locking/lockdep: Add a boot parameter allowing unwind in cross-release and disable it by default

2017-10-24 Thread Byungchul Park
Johan Hovold reported a heavy performance regression caused by lockdep
cross-release:

 > Boot time (from "Linux version" to login prompt) had in fact doubled
 > since 4.13 where it took 17 seconds (with my current config) compared to
 > the 35 seconds I now see with 4.14-rc4.
 >
 > I quick bisect pointed to lockdep and specifically the following commit:
 >
 >  28a903f63ec0 ("locking/lockdep: Handle non(or multi)-acquisition
 > of a crosslock")
 >
 > which I've verified is the commit which doubled the boot time (compared
 > to 28a903f63ec0^) (added by lockdep crossrelease series [1]).

Currently cross-release performs unwind on every acquisition, but that
is very expensive.

This patch makes unwinding optional, disables it by default, and in that
case only records acquire_ip.

Full stack traces are sometimes required for full analysis, in which
case a boot parameter, crossrelease_fullstack, can be specified.

On my qemu Ubuntu machine (x86_64, 4 cores, 512M), the regression was
fixed. We measure boot times with 'perf stat --null --repeat 10 $QEMU',
where $QEMU launches a kernel with init=/bin/true:

1. No lockdep enabled:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.756558155 seconds time elapsed( +-  0.09% )

2. Lockdep enabled:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.968710420 seconds time elapsed( +-  0.12% )

3. Lockdep enabled + cross-release enabled:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   3.153839636 seconds time elapsed( +-  0.31% )

4. Lockdep enabled + cross-release enabled + this patch applied:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.963669551 seconds time elapsed( +-  0.11% )

I.e. lockdep cross-release performance is now indistinguishable from
vanilla lockdep.

Reported-by: Johan Hovold 
Suggested-by: Thomas Gleixner 
Signed-off-by: Byungchul Park 
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 kernel/locking/lockdep.c| 19 +--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index ead7f40..4107b01 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -709,6 +709,9 @@
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
 
+   crossrelease_fullstack
+   [KNL] Allow to record full stack trace in cross-release
+
cryptomgr.notests
 [KNL] Disable crypto self-tests
 
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e36e652..160b5d6 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -76,6 +76,15 @@
 #define lock_stat 0
 #endif
 
+static int crossrelease_fullstack;
+static int __init allow_crossrelease_fullstack(char *str)
+{
+   crossrelease_fullstack = 1;
+   return 0;
+}
+
+early_param("crossrelease_fullstack", allow_crossrelease_fullstack);
+
 /*
  * lockdep_lock: protects the lockdep graph, the hashes and the
  *   class/list/hash allocators.
@@ -4863,8 +4872,14 @@ static void add_xhlock(struct held_lock *hlock)
xhlock->trace.nr_entries = 0;
xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
xhlock->trace.entries = xhlock->trace_entries;
-   xhlock->trace.skip = 3;
-   save_stack_trace(&xhlock->trace);
+
+   if (crossrelease_fullstack) {
+   xhlock->trace.skip = 3;
+   save_stack_trace(&xhlock->trace);
+   } else {
+   xhlock->trace.nr_entries = 1;
+   xhlock->trace.entries[0] = hlock->acquire_ip;
+   }
 }
 
 static inline int same_context_xhlock(struct hist_lock *xhlock)
-- 
1.9.1



[PATCH v4 3/7] locking/lockdep: Remove the BROKEN flag from CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS

2017-10-24 Thread Byungchul Park
Now that the performance regression is fixed, re-enable
CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y.

Signed-off-by: Byungchul Park 
---
 lib/Kconfig.debug | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3db9167..4bef610 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1138,8 +1138,8 @@ config PROVE_LOCKING
select DEBUG_MUTEXES
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_LOCK_ALLOC
-   select LOCKDEP_CROSSRELEASE if BROKEN
-   select LOCKDEP_COMPLETIONS if BROKEN
+   select LOCKDEP_CROSSRELEASE
+   select LOCKDEP_COMPLETIONS
select TRACE_IRQFLAGS
default n
help
-- 
1.9.1



[PATCH v4 5/7] completion: Add support for initializing completion with lockdep_map

2017-10-24 Thread Byungchul Park
Sometimes we want to initialize completions with separate lockdep maps so
that lock classes can be assigned as desired. For example, the workqueue
code needs to manage lockdep maps directly, since only that code knows how
to classify its lockdep maps properly.

Provide additional macros that initialize completions in that way.
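
As a usage sketch (the workqueue patch later in this series does essentially
this), a completion can be keyed to an existing lockdep map instead of getting
a per-call-site class of its own:

	/* give barr->done the same lock class as the work it flushes */
	init_completion_map(&barr->done, &target->lockdep_map);

	/* on-stack variant, keyed to a gendisk's lockdep map */
	DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);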

Signed-off-by: Byungchul Park 
---
 include/linux/completion.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/completion.h b/include/linux/completion.h
index cae5400..02f8cde 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -49,6 +49,13 @@ static inline void complete_release_commit(struct completion 
*x)
	lock_commit_crosslock((struct lockdep_map *)&x->map);
 }
 
+#define init_completion_map(x, m)  \
+do {   \
+   lockdep_init_map_crosslock((struct lockdep_map *)&(x)->map, \
+   (m)->name, (m)->key, 0); \
+   __init_completion(x);   \
+} while (0)
+
 #define init_completion(x) \
 do {   \
static struct lock_class_key __key; \
@@ -58,6 +65,7 @@ static inline void complete_release_commit(struct completion 
*x)
__init_completion(x);   \
 } while (0)
 #else
+#define init_completion_map(x, m) __init_completion(x)
 #define init_completion(x) __init_completion(x)
 static inline void complete_acquire(struct completion *x) {}
 static inline void complete_release(struct completion *x) {}
@@ -73,6 +81,9 @@ static inline void complete_release_commit(struct completion 
*x) {}
{ 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
 #endif
 
+#define COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) \
+   (*({ init_completion_map(&(work), &(map)); &(work); }))
+
 #define COMPLETION_INITIALIZER_ONSTACK(work) \
	(*({ init_completion(&work); &work; }))
 
@@ -102,8 +113,11 @@ static inline void complete_release_commit(struct 
completion *x) {}
 #ifdef CONFIG_LOCKDEP
 # define DECLARE_COMPLETION_ONSTACK(work) \
struct completion work = COMPLETION_INITIALIZER_ONSTACK(work)
+# define DECLARE_COMPLETION_ONSTACK_MAP(work, map) \
+   struct completion work = COMPLETION_INITIALIZER_ONSTACK_MAP(work, map)
 #else
 # define DECLARE_COMPLETION_ONSTACK(work) DECLARE_COMPLETION(work)
+# define DECLARE_COMPLETION_ONSTACK_MAP(work, map) DECLARE_COMPLETION(work)
 #endif
 
 /**
-- 
1.9.1



[PATCH v4 0/7] cross-release: Enhance performance and fix false positives

2017-10-24 Thread Byungchul Park
There are two things I didn't apply as Ingo suggested, since I didn't
understand his intention exactly:

   1. Adding an 'Analyzed-by' tag to the 2nd patch
   2. Using an inline function instead of a #define in the 7th patch

Let me know if the above should still be applied.

Changes from v3
- Exclude a patch removing white space
- Enhance commit messages as Ingo suggested
- Re-design patches adding a boot param and a Kconfig allowing unwind
- Simplify a patch assigning lock classes to genhds as Ingo suggested
- Add proper tags in commit messages e.g. reported-by and analyzed-by

Changes from v2
- Combine 2 series, fixing false positives and enhancing performance
- Add Christoph Hellwig's patch simplifying submit_bio_wait() code
- Add 2 more 'init with lockdep map' macros for completion
- Rename init_completion_with_map() to init_completion_map()

Changes from v1
- Fix kconfig description as Ingo suggested
- Fix commit message writing out CONFIG_ variable
- Introduce a new kernel parameter, crossrelease_fullstack
- Replace the number with the output of *perf*
- Separate a patch removing white space

Byungchul Park (6):
  locking/lockdep: Add a boot parameter allowing unwind in cross-release
and disable it by default
  locking/lockdep: Remove the BROKEN flag from
CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS
  locking/lockdep: Introduce
CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK
  completion: Add support for initializing completion with lockdep_map
  workqueue: Remove unnecessary acquisitions wrt workqueue flush
  block: Assign a lock_class per gendisk used for wait_for_completion()

Christoph Hellwig (1):
  block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait

 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 block/bio.c | 19 +--
 block/genhd.c   | 10 ++
 include/linux/completion.h  | 14 ++
 include/linux/genhd.h   | 24 ++--
 include/linux/workqueue.h   |  4 ++--
 kernel/locking/lockdep.c| 23 +--
 kernel/workqueue.c  | 19 +++
 lib/Kconfig.debug   | 19 +--
 9 files changed, 89 insertions(+), 46 deletions(-)

-- 
1.9.1



Re: [PATCH] [PATCH V2] bcache: update bucket_in_use in real time

2017-10-24 Thread Michael Lyle
Hi Junhui---

I have applied the patch [with a small whitespace edit] to my tree.  I
will try to get some test mileage on it.

Reviewed-by: Michael Lyle 

Thanks--

Mike

On 10/24/2017 07:26 PM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> bucket_in_use is updated in gc thread which triggered by invalidating or
> writing sectors_to_gc dirty data, It's a long interval. Therefore, when we
> use it to compare with the threshold, it is often not timely, which leads
> to inaccurate judgment and often results in bucket depletion.
> 
> We have send a patch before, by the means of updating bucket_in_use
> periodically In gc thread, which Coly thought that would lead high
> latency, In this patch, we add avail_nbuckets to record the count of
> available buckets, and we calculate bucket_in_use when alloc or free
> bucket in real time.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/alloc.c  | 10 ++
>  drivers/md/bcache/bcache.h |  1 +
>  drivers/md/bcache/btree.c  | 17 ++---
>  drivers/md/bcache/btree.h  |  2 +-
>  4 files changed, 22 insertions(+), 8 deletions(-)
>  mode change 100644 => 100755 drivers/md/bcache/alloc.c
>  mode change 100644 => 100755 drivers/md/bcache/bcache.h
>  mode change 100644 => 100755 drivers/md/bcache/btree.c
>  mode change 100644 => 100755 drivers/md/bcache/btree.h
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> old mode 100644
> new mode 100755
> index ca4abe1..6e036ff
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -439,6 +439,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned 
> reserve, bool wait)
>   b->prio = INITIAL_PRIO;
>   }
>  
> + if(ca->set->avail_nbuckets > 0) {
> + ca->set->avail_nbuckets--;
> + bch_update_bucket_in_use(ca->set, &ca->set->gc_stats);
> + }
> +
>   return r;
>  }
>  
> @@ -446,6 +451,11 @@ void __bch_bucket_free(struct cache *ca, struct bucket 
> *b)
>  {
>   SET_GC_MARK(b, 0);
>   SET_GC_SECTORS_USED(b, 0);
> + 
> + if(ca->set->avail_nbuckets < ca->set->nbuckets) {
> + ca->set->avail_nbuckets++;
> + bch_update_bucket_in_use(ca->set, &ca->set->gc_stats);
> + }
>  }
>  
>  void bch_bucket_free(struct cache_set *c, struct bkey *k)
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> old mode 100644
> new mode 100755
> index dee542f..275b29c
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -580,6 +580,7 @@ struct cache_set {
>   uint8_t need_gc;
>   struct gc_stat  gc_stats;
>   size_t  nbuckets;
> + size_t  avail_nbuckets;
>  
>   struct task_struct  *gc_thread;
>   /* Where in the btree gc currently is */
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> old mode 100644
> new mode 100755
> index 866dcf7..d8865e6
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1240,6 +1240,11 @@ void bch_initial_mark_key(struct cache_set *c, int 
> level, struct bkey *k)
>   __bch_btree_mark_key(c, level, k);
>  }
>  
> +void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats)
> +{
> + stats->in_use = (c->nbuckets - c->avail_nbuckets) * 100 / c->nbuckets;
> +}
> +
>  static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
>  {
>   uint8_t stale = 0;
> @@ -1651,9 +1656,8 @@ static void btree_gc_start(struct cache_set *c)
>   mutex_unlock(&c->bucket_lock);
>  }
>  
> -static size_t bch_btree_gc_finish(struct cache_set *c)
> +static void bch_btree_gc_finish(struct cache_set *c)
>  {
> - size_t available = 0;
>   struct bucket *b;
>   struct cache *ca;
>   unsigned i;
> @@ -1690,6 +1694,7 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
>   }
>   rcu_read_unlock();
>  
> + c->avail_nbuckets = 0;
>   for_each_cache(ca, c, i) {
>   uint64_t *i;
>  
> @@ -1711,18 +1716,16 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
>   BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
>  
>   if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
> - available++;
> + c->avail_nbuckets++;
>   }
>   }
>  
>   mutex_unlock(&c->bucket_lock);
> - return available;
>  }
>  
>  static void bch_btree_gc(struct cache_set *c)
>  {
>   int ret;
> - unsigned long available;
>   struct gc_stat stats;
>   struct closure writes;
>   struct btree_op op;
> @@ -1745,14 +1748,14 @@ static void bch_btree_gc(struct cache_set *c)
>   pr_warn("gc failed!");
>   } while (ret);
>  
> - available = bch_btree_gc_finish(c);
> + bch_btree_gc_finish(c);
>   wake_up_allocators(c);
>  
>   bch_time_stats_update(&c->btree_gc_time, start_time);

[PATCH] [PATCH V2] bcache: update bucket_in_use in real time

2017-10-24 Thread tang . junhui
From: Tang Junhui 

bucket_in_use is updated in the gc thread, which is triggered by invalidating
or writing sectors_to_gc of dirty data; that is a long interval. Therefore,
when we compare it with the threshold, it is often stale, which leads to
inaccurate judgments and often results in bucket depletion.

We sent a patch before that updated bucket_in_use periodically in the gc
thread, which Coly thought would lead to high latency. In this patch, we add
avail_nbuckets to record the count of available buckets, and we recalculate
bucket_in_use in real time whenever a bucket is allocated or freed.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/alloc.c  | 10 ++
 drivers/md/bcache/bcache.h |  1 +
 drivers/md/bcache/btree.c  | 17 ++---
 drivers/md/bcache/btree.h  |  2 +-
 4 files changed, 22 insertions(+), 8 deletions(-)
 mode change 100644 => 100755 drivers/md/bcache/alloc.c
 mode change 100644 => 100755 drivers/md/bcache/bcache.h
 mode change 100644 => 100755 drivers/md/bcache/btree.c
 mode change 100644 => 100755 drivers/md/bcache/btree.h

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
old mode 100644
new mode 100755
index ca4abe1..6e036ff
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -439,6 +439,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned reserve, 
bool wait)
b->prio = INITIAL_PRIO;
}
 
+   if(ca->set->avail_nbuckets > 0) {
+   ca->set->avail_nbuckets--;
+   bch_update_bucket_in_use(ca->set, &ca->set->gc_stats);
+   }
+
return r;
 }
 
@@ -446,6 +451,11 @@ void __bch_bucket_free(struct cache *ca, struct bucket *b)
 {
SET_GC_MARK(b, 0);
SET_GC_SECTORS_USED(b, 0);
+   
+   if(ca->set->avail_nbuckets < ca->set->nbuckets) {
+   ca->set->avail_nbuckets++;
+   bch_update_bucket_in_use(ca->set, &ca->set->gc_stats);
+   }
 }
 
 void bch_bucket_free(struct cache_set *c, struct bkey *k)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
old mode 100644
new mode 100755
index dee542f..275b29c
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -580,6 +580,7 @@ struct cache_set {
uint8_t need_gc;
struct gc_stat  gc_stats;
size_t  nbuckets;
+   size_t  avail_nbuckets;
 
struct task_struct  *gc_thread;
/* Where in the btree gc currently is */
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
old mode 100644
new mode 100755
index 866dcf7..d8865e6
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1240,6 +1240,11 @@ void bch_initial_mark_key(struct cache_set *c, int 
level, struct bkey *k)
__bch_btree_mark_key(c, level, k);
 }
 
+void bch_update_bucket_in_use(struct cache_set *c, struct gc_stat *stats)
+{
+   stats->in_use = (c->nbuckets - c->avail_nbuckets) * 100 / c->nbuckets;
+}
+
 static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
 {
uint8_t stale = 0;
@@ -1651,9 +1656,8 @@ static void btree_gc_start(struct cache_set *c)
	mutex_unlock(&c->bucket_lock);
 }
 
-static size_t bch_btree_gc_finish(struct cache_set *c)
+static void bch_btree_gc_finish(struct cache_set *c)
 {
-   size_t available = 0;
struct bucket *b;
struct cache *ca;
unsigned i;
@@ -1690,6 +1694,7 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
}
rcu_read_unlock();
 
+   c->avail_nbuckets = 0;
for_each_cache(ca, c, i) {
uint64_t *i;
 
@@ -1711,18 +1716,16 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
 
if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
-   available++;
+   c->avail_nbuckets++;
}
}
 
	mutex_unlock(&c->bucket_lock);
-   return available;
 }
 
 static void bch_btree_gc(struct cache_set *c)
 {
int ret;
-   unsigned long available;
struct gc_stat stats;
struct closure writes;
struct btree_op op;
@@ -1745,14 +1748,14 @@ static void bch_btree_gc(struct cache_set *c)
pr_warn("gc failed!");
} while (ret);
 
-   available = bch_btree_gc_finish(c);
+   bch_btree_gc_finish(c);
wake_up_allocators(c);
 
	bch_time_stats_update(&c->btree_gc_time, start_time);
 
stats.key_bytes *= sizeof(uint64_t);
stats.data  <<= 9;
-   stats.in_use= (c->nbuckets - available) * 100 / c->nbuckets;
+   bch_update_bucket_in_use(c, &stats);
	memcpy(&c->gc_stats, &stats, sizeof(struct gc_stat));
 
trace_bcache_gc_end(c);
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
old mode 100644
new mode 100755
index 73da1f5..4073aca
--- 

Re: Re: [PATCH] update bucket_in_use in real time

2017-10-24 Thread Michael Lyle
On Tue, Oct 24, 2017 at 6:21 PM,   wrote:
>>> static void bch_btree_gc_finish(struct cache_set *c)
>> available should be removed and this function should return 0.
> I have changed this function to a void type, so nothing need return.

Sorry, I misread that. :)

>>> + stats.in_use= (c->nbuckets - c->avail_nbuckets) * 
>>> 100 / c->nbuckets;
>>
>> This should instead call bch_update_bucket_in_use to avoid code duplication.
> OK, I will modify it and resend a v2 patch later.

OK, that's my only complaint.  Will be glad to add v2 to my staging tree.

>
>> Overall I think this is a good piece of progress.
> Thanks for your review and comments again.
>
> Tang Junhui

Thanks!

Mike


Re: Re: [PATCH] update bucket_in_use in real time

2017-10-24 Thread tang . junhui
From: Tang Junhui 

Thanks to Mike and Coly's comments.

>> + if(ca->set->avail_nbuckets > 0) {
>> + ca->set->avail_nbuckets--;
>> + bch_update_bucket_in_use(ca->set);
>> + }
>> +
> 
> I am not sure this needs an atomic.  All accesses to this variable are
>> done with the bucket_lock held, so that seems correct.  Is this right?
Yes, you are right.

>> static void bch_btree_gc_finish(struct cache_set *c)
> available should be removed and this function should return 0.
I have changed this function to a void type, so nothing needs to be returned.

>> + stats.in_use= (c->nbuckets - c->avail_nbuckets) * 
>> 100 / c->nbuckets;
> 
> This should instead call bch_update_bucket_in_use to avoid code duplication.
OK, I will modify it and resend a v2 patch later. 
  
> Overall I think this is a good piece of progress.
Thanks for your review and comments again.

Tang Junhui


Re: [PATCH] nbd: handle interrupted sendmsg with a sndtimeo set

2017-10-24 Thread Jens Axboe
On 10/24/2017 01:57 PM, Josef Bacik wrote:
> From: Josef Bacik 
> 
> If you do not set sk_sndtimeo you will get -ERESTARTSYS if there is a
> pending signal when you enter sendmsg, which we handle properly.
> However if you set a timeout for your commands we'll set sk_sndtimeo to
> that timeout, which means that sendmsg will start returning -EINTR
> instead of -ERESTARTSYS.  Fix this by checking either cases and doing
> the correct thing.

Added for 4.14, thanks Josef.

-- 
Jens Axboe



[PATCH] elevator: lookup mq vs non-mq elevators

2017-10-24 Thread Jens Axboe
If an IO scheduler is selected via elevator= and it doesn't match
the driver in question wrt blk-mq support, then we fail to boot.

The elevator= parameter is deprecated and only supported for
non-mq devices. Augment the elevator lookup API so that we
pass in whether we're looking for an mq-capable scheduler or not,
so that we only ever return a valid type for the queue in
question.
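
The matching condition used below, !(mq ^ e->uses_mq), simply means "the queue
and the elevator agree on blk-mq support"; a tiny stand-alone illustration:

	#include <stdbool.h>
	#include <stdio.h>

	/* mimics the check in elevator_find(): accept an elevator only when
	 * the queue and the elevator agree on blk-mq support */
	static bool elevator_matches(bool q_is_mq, bool e_uses_mq)
	{
		return !(q_is_mq ^ e_uses_mq);
	}

	int main(void)
	{
		printf("%d\n", elevator_matches(true, true));   /* 1: mq queue, mq scheduler */
		printf("%d\n", elevator_matches(false, false)); /* 1: legacy queue, legacy scheduler */
		printf("%d\n", elevator_matches(true, false));  /* 0: rejected */
		return 0;
	}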

Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=196695
Signed-off-by: Jens Axboe 

diff --git a/block/elevator.c b/block/elevator.c
index 153926a90901..318cf00ea213 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -83,12 +83,13 @@ bool elv_bio_merge_ok(struct request *rq, struct bio *bio)
 }
 EXPORT_SYMBOL(elv_bio_merge_ok);
 
-static struct elevator_type *elevator_find(const char *name)
+static struct elevator_type *elevator_find(const char *name, bool mq)
 {
struct elevator_type *e;
 
	list_for_each_entry(e, &elv_list, list) {
-   if (!strcmp(e->elevator_name, name))
+   if (!strcmp(e->elevator_name, name) &&
+   !(mq ^ e->uses_mq))
return e;
}
 
@@ -100,25 +101,25 @@ static void elevator_put(struct elevator_type *e)
module_put(e->elevator_owner);
 }
 
-static struct elevator_type *elevator_get(const char *name, bool try_loading)
+static struct elevator_type *elevator_get(struct request_queue *q,
+ const char *name, bool try_loading)
 {
struct elevator_type *e;
 
	spin_lock(&elv_list_lock);
 
-   e = elevator_find(name);
+   e = elevator_find(name, q->mq_ops != NULL);
if (!e && try_loading) {
	spin_unlock(&elv_list_lock);
	request_module("%s-iosched", name);
	spin_lock(&elv_list_lock);
-   e = elevator_find(name);
+   e = elevator_find(name, q->mq_ops != NULL);
}
 
if (e && !try_module_get(e->elevator_owner))
e = NULL;
 
	spin_unlock(&elv_list_lock);
-
return e;
 }
 
@@ -144,8 +145,12 @@ void __init load_default_elevator_module(void)
if (!chosen_elevator[0])
return;
 
+   /*
+* Boot parameter is deprecated, we haven't supported that for MQ
+* so only look for non-mq schedulers from here.
+*/
	spin_lock(&elv_list_lock);
-   e = elevator_find(chosen_elevator);
+   e = elevator_find(chosen_elevator, false);
	spin_unlock(&elv_list_lock);
 
if (!e)
@@ -202,7 +207,7 @@ int elevator_init(struct request_queue *q, char *name)
q->boundary_rq = NULL;
 
if (name) {
-   e = elevator_get(name, true);
+   e = elevator_get(q, name, true);
if (!e)
return -EINVAL;
}
@@ -214,7 +219,7 @@ int elevator_init(struct request_queue *q, char *name)
 * allowed from async.
 */
if (!e && !q->mq_ops && *chosen_elevator) {
-   e = elevator_get(chosen_elevator, false);
+   e = elevator_get(q, chosen_elevator, false);
if (!e)
printk(KERN_ERR "I/O scheduler %s not found\n",
chosen_elevator);
@@ -229,17 +234,17 @@ int elevator_init(struct request_queue *q, char *name)
 */
if (q->mq_ops) {
if (q->nr_hw_queues == 1)
-   e = elevator_get("mq-deadline", false);
+   e = elevator_get(q, "mq-deadline", false);
if (!e)
return 0;
} else
-   e = elevator_get(CONFIG_DEFAULT_IOSCHED, false);
+   e = elevator_get(q, CONFIG_DEFAULT_IOSCHED, false);
 
if (!e) {
printk(KERN_ERR
"Default I/O scheduler not found. " \
"Using noop.\n");
-   e = elevator_get("noop", false);
+   e = elevator_get(q, "noop", false);
}
}
 
@@ -905,7 +910,7 @@ int elv_register(struct elevator_type *e)
 
/* register, don't allow duplicate names */
	spin_lock(&elv_list_lock);
-   if (elevator_find(e->elevator_name)) {
+   if (elevator_find(e->elevator_name, e->uses_mq)) {
	spin_unlock(&elv_list_lock);
if (e->icq_cache)
kmem_cache_destroy(e->icq_cache);
@@ -1066,7 +1071,7 @@ static int __elevator_change(struct request_queue *q, 
const char *name)
return elevator_switch(q, NULL);
 
strlcpy(elevator_name, name, sizeof(elevator_name));
-   e = elevator_get(strstrip(elevator_name), true);
+   e = elevator_get(q, strstrip(elevator_name), true);
if (!e)
return -EINVAL;
 


Re: [PATCH] update bucket_in_use in real time

2017-10-24 Thread Coly Li
On 2017/10/25 2:50 AM, Michael Lyle wrote:
> Hi---
> 
> On 10/24/2017 01:57 AM, tang.jun...@zte.com.cn wrote:
>> From: Tang Junhui 
>>
>> bucket_in_use is updated in gc thread which triggered by invalidating or
>> writing sectors_to_gc dirty data, It's a long interval. Therefore, when we
>> use it to compare with the threshold, it is often not timely, which leads
>> to inaccurate judgment and often results in bucket depletion.
>>
>> We have send a patch before, by the means of updating bucket_in_use
>> periodically In gc thread, which Coly thought that would lead high
>> latency, In this patch, we add avail_nbuckets to record the count of
>> available buckets, and we calculate bucket_in_use when alloc or free
>> bucket in real time.
>>
>> Signed-off-by: Tang Junhui 
>> ---
>>  drivers/md/bcache/alloc.c  | 10 ++
>>  drivers/md/bcache/bcache.h |  1 +
>>  drivers/md/bcache/btree.c  | 17 ++---
>>  drivers/md/bcache/btree.h  |  1 +
>>  4 files changed, 22 insertions(+), 7 deletions(-)
>>  mode change 100644 => 100755 drivers/md/bcache/alloc.c
>>  mode change 100644 => 100755 drivers/md/bcache/bcache.h
>>  mode change 100644 => 100755 drivers/md/bcache/btree.c
>>  mode change 100644 => 100755 drivers/md/bcache/btree.h
>>
>> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
>> old mode 100644
>> new mode 100755
>> index ca4abe1..89a5e35
>> --- a/drivers/md/bcache/alloc.c
>> +++ b/drivers/md/bcache/alloc.c
>> @@ -439,6 +439,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned 
>> reserve, bool wait)
>>  b->prio = INITIAL_PRIO;
>>  }
>>  
>> +if(ca->set->avail_nbuckets > 0) {
>> +ca->set->avail_nbuckets--;
>> +bch_update_bucket_in_use(ca->set);
>> +}
>> +
> 
> I am not sure this needs an atomic.  All accesses to this variable are
> done with the bucket_lock held, so that seems correct.  Is this right?
> 

Hi Mike and Junhui,

Aha! mutex bucket_lock is held in bch_bucket_alloc_set(), which is 2
layers up in the caller :-) In sequence, __bch_bucket_free() has
bucket_lock held too. They are safe.

Since there is mutex protected avai_nbuckets, this patch is good to me.
Nothing needs to be changed. Thanks for Mike's double check.

Reviewed-by: Coly Li 

Thanks.

-- 
Coly Li


Re: [PATCH v3 4/8] lockdep: Add a kernel parameter, crossrelease_fullstack

2017-10-24 Thread Byungchul Park
On Tue, Oct 24, 2017 at 12:08:58PM +0200, Ingo Molnar wrote:
> This is really unnecessarily complex.

I mis-understood your suggestion. I will change it.

> The proper logic is to introduce the crossrelease_fullstack boot parameter, 
> and to 
> also have a Kconfig option that enables it: 
> 
>   CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK=y
> 
> No #ifdefs please - just an "if ()" branch dependent on the current value of 
> crossrelease_fullstack.

Ok. I will.

Thanks,
Byungchul

> Thanks,
> 
>   Ingo


Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN

2017-10-24 Thread Mike Snitzer
On Mon, Oct 23 2017 at 10:51am -0400,
Christoph Hellwig  wrote:

> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as user
> facing block device.  This will be useful for the NVMe multipath
> implementation.

Having the NVMe driver go to such lengths to hide its resources from
upper layers is certainly the work of an evil genius experiencing some
serious territorial issues.  Not sugar-coating it.. you wouldn't.

I kept meaning to reply to your earlier iterations on this series to
ask: can we please get a CONFIG_NVME_MULTIPATHING knob to make it so
that the NVMe driver doesn't implicitly consume (and hide) all
per-controller devices?

Ah well.  There is only one correct way to do NVMe multipathing after
all right?


Re: [PATCH] block: Invalidate cache on discard v2

2017-10-24 Thread Omar Sandoval
On Wed, Mar 22, 2017 at 11:29:25PM +0400, Dmitry Monakhov wrote:
> It is reasonable to drop the page cache on discard, otherwise those pages
> may be written out by writeback a second later, so thin provision devices
> will not be happy. This also seems to be a security leak in the secure
> discard case.
> 
> Also add a check for the queue_discard flag at an early stage.
> 
> Signed-off-by: Dmitry Monakhov 
> Reviewed-by: Christoph Hellwig 
> 
> ---
>  block/ioctl.c | 14 ++
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/block/ioctl.c b/block/ioctl.c
> index 7b88820..336610d 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -202,10 +202,16 @@ static int blk_ioctl_discard(struct block_device *bdev, 
> fmode_t mode,
>  {
>   uint64_t range[2];
>   uint64_t start, len;
> + struct request_queue *q = bdev_get_queue(bdev);
> + struct address_space *mapping = bdev->bd_inode->i_mapping;
> +
>  
>   if (!(mode & FMODE_WRITE))
>   return -EBADF;
>  
> + if (!blk_queue_discard(q))
> + return -EOPNOTSUPP;
> +
>   if (copy_from_user(range, (void __user *)arg, sizeof(range)))
>   return -EFAULT;
>  
> @@ -216,12 +222,12 @@ static int blk_ioctl_discard(struct block_device *bdev, 
> fmode_t mode,
>   return -EINVAL;
>   if (len & 511)
>   return -EINVAL;
> - start >>= 9;
> - len >>= 9;
>  
> - if (start + len > (i_size_read(bdev->bd_inode) >> 9))
> + if (start + len > i_size_read(bdev->bd_inode))
>   return -EINVAL;
> - return blkdev_issue_discard(bdev, start, len, GFP_KERNEL, flags);
> + truncate_inode_pages_range(mapping, start, start + len);
> + return blkdev_issue_discard(bdev, start >> 9, len >> 9,
> + GFP_KERNEL, flags);
>  }
>  
>  static int blk_ioctl_zeroout(struct block_device *bdev, fmode_t mode,
> -- 
> 2.9.3
> 

Hey, Jens, Dmitry didn't send this patch to linux-block, so it looks like
you missed it. Christoph reviewed it and Nishita sent up a blktest for it,
so we should probably pull it in.


[PATCH] nbd: handle interrupted sendmsg with a sndtimeo set

2017-10-24 Thread Josef Bacik
From: Josef Bacik 

If you do not set sk_sndtimeo you will get -ERESTARTSYS if there is a
pending signal when you enter sendmsg, which we handle properly.
However if you set a timeout for your commands we'll set sk_sndtimeo to
that timeout, which means that sendmsg will start returning -EINTR
instead of -ERESTARTSYS.  Fix this by checking for either case and doing
the correct thing.

Cc: sta...@vger.kernel.org
Fixes: dc88e34d69d8 ("nbd: set sk->sk_sndtimeo for our sockets")
Reported-and-tested-by: Daniel Xu 
Signed-off-by: Josef Bacik 
---
 drivers/block/nbd.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 528e6f6951cc..8c3aa7ae79c8 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -378,6 +378,15 @@ static int sock_xmit(struct nbd_device *nbd, int index, 
int send,
return result;
 }
 
+/*
+ * Different settings for sk->sk_sndtimeo can result in different return values
+ * if there is a signal pending when we enter sendmsg, because reasons?
+ */
+static inline int was_interrupted(int result)
+{
+   return result == -ERESTARTSYS || result == -EINTR;
+}
+
 /* always call with the tx_lock held */
 static int nbd_send_cmd(struct nbd_device *nbd, struct nbd_cmd *cmd, int index)
 {
@@ -450,7 +459,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct 
nbd_cmd *cmd, int index)
	result = sock_xmit(nbd, index, 1, &from,
			(type == NBD_CMD_WRITE) ? MSG_MORE : 0, &sent);
if (result <= 0) {
-   if (result == -ERESTARTSYS) {
+   if (was_interrupted(result)) {
/* If we havne't sent anything we can just return BUSY,
 * however if we have sent something we need to make
 * sure we only allow this req to be sent until we are
@@ -494,7 +503,7 @@ static int nbd_send_cmd(struct nbd_device *nbd, struct 
nbd_cmd *cmd, int index)
}
		result = sock_xmit(nbd, index, 1, &from, flags, &sent);
if (result <= 0) {
-   if (result == -ERESTARTSYS) {
+   if (was_interrupted(result)) {
/* We've already sent the header, we
 * have no choice but to set pending and
 * return BUSY.
-- 
2.7.5



Re: [PATCH] update bucket_in_use in real time

2017-10-24 Thread Michael Lyle
Hi---

On 10/24/2017 01:57 AM, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> bucket_in_use is updated in gc thread which triggered by invalidating or
> writing sectors_to_gc dirty data, It's a long interval. Therefore, when we
> use it to compare with the threshold, it is often not timely, which leads
> to inaccurate judgment and often results in bucket depletion.
> 
> We have send a patch before, by the means of updating bucket_in_use
> periodically In gc thread, which Coly thought that would lead high
> latency, In this patch, we add avail_nbuckets to record the count of
> available buckets, and we calculate bucket_in_use when alloc or free
> bucket in real time.
> 
> Signed-off-by: Tang Junhui 
> ---
>  drivers/md/bcache/alloc.c  | 10 ++
>  drivers/md/bcache/bcache.h |  1 +
>  drivers/md/bcache/btree.c  | 17 ++---
>  drivers/md/bcache/btree.h  |  1 +
>  4 files changed, 22 insertions(+), 7 deletions(-)
>  mode change 100644 => 100755 drivers/md/bcache/alloc.c
>  mode change 100644 => 100755 drivers/md/bcache/bcache.h
>  mode change 100644 => 100755 drivers/md/bcache/btree.c
>  mode change 100644 => 100755 drivers/md/bcache/btree.h
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> old mode 100644
> new mode 100755
> index ca4abe1..89a5e35
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -439,6 +439,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned 
> reserve, bool wait)
>   b->prio = INITIAL_PRIO;
>   }
>  
> + if(ca->set->avail_nbuckets > 0) {
> + ca->set->avail_nbuckets--;
> + bch_update_bucket_in_use(ca->set);
> + }
> +

I am not sure this needs an atomic.  All accesses to this variable are
done with the bucket_lock held, so that seems correct.  Is this right?

>   return r;
>  }
>  
> @@ -446,6 +451,11 @@ void __bch_bucket_free(struct cache *ca, struct bucket 
> *b)
>  {
>   SET_GC_MARK(b, 0);
>   SET_GC_SECTORS_USED(b, 0);
> + 
> + if(ca->set->avail_nbuckets < ca->set->nbuckets) {
> + ca->set->avail_nbuckets++;
> + bch_update_bucket_in_use(ca->set);
> + }
>  }
>  
>  void bch_bucket_free(struct cache_set *c, struct bkey *k)
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> old mode 100644
> new mode 100755
> index dee542f..275b29c
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -580,6 +580,7 @@ struct cache_set {
>   uint8_t need_gc;
>   struct gc_stat  gc_stats;
>   size_t  nbuckets;
> + size_t  avail_nbuckets;
>  
>   struct task_struct  *gc_thread;
>   /* Where in the btree gc currently is */
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> old mode 100644
> new mode 100755
> index 866dcf7..1ccf0c3
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1240,6 +1240,11 @@ void bch_initial_mark_key(struct cache_set *c, int 
> level, struct bkey *k)
>   __bch_btree_mark_key(c, level, k);
>  }
>  
> +void bch_update_bucket_in_use(struct cache_set *c)
> +{
> + c->gc_stats.in_use = (c->nbuckets - c->avail_nbuckets) * 100 / 
> c->nbuckets;
> +}
> +
>  static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
>  {
>   uint8_t stale = 0;
> @@ -1651,9 +1656,8 @@ static void btree_gc_start(struct cache_set *c)
>   mutex_unlock(&c->bucket_lock);
>  }
>  
> -static size_t bch_btree_gc_finish(struct cache_set *c)
> +static void bch_btree_gc_finish(struct cache_set *c)
>  {
> - size_t available = 0;
>   struct bucket *b;
>   struct cache *ca;
>   unsigned i;
> @@ -1690,6 +1694,7 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
>   }
>   rcu_read_unlock();
>  
> + c->avail_nbuckets = 0;
>   for_each_cache(ca, c, i) {
>   uint64_t *i;
>  
> @@ -1711,18 +1716,16 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
>   BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
>  
>   if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
> - available++;
> + c->avail_nbuckets++;

available should be removed and this function should return 0.

>   }
>   }
>  
>   mutex_unlock(&c->bucket_lock);
> - return available;
>  }
>  
>  static void bch_btree_gc(struct cache_set *c)
>  {
>   int ret;
> - unsigned long available;
>   struct gc_stat stats;
>   struct closure writes;
>   struct btree_op op;
> @@ -1745,14 +1748,14 @@ static void bch_btree_gc(struct cache_set *c)
>   pr_warn("gc failed!");
>   } while (ret);
>  
> - available = bch_btree_gc_finish(c);
> + bch_btree_gc_finish(c);
>   wake_up_allocators(c);
>  
>   bch_time_stats_update(&c->btree_gc_time, start_time);
>  
>   

Re: [PATCH 03/17] block: provide a direct_make_request helper

2017-10-24 Thread Javier González
> On 23 Oct 2017, at 16.51, Christoph Hellwig  wrote:
> 
> This helper allows reinserting a bio into a new queue without much
> overhead, but requires all queue limits to be the same for the upper
> and lower queues, and it does not provide any recursion preventions.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> ---
> block/blk-core.c   | 34 ++
> include/linux/blkdev.h |  1 +
> 2 files changed, 35 insertions(+)
> 


Reviewed-by: Javier González 





Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues

2017-10-24 Thread Jens Axboe
On 10/23/2017 01:17 AM, Martin K. Petersen wrote:
> 
> Christoph,
> 
>>> Yes, I expected the bsg bits to go through Jens' tree.
>>
>> Ok, then I misremembered it, and we'll have to delay the remaining
>> patches until the next merge window, as they depend on the previous
>> ones.
> 
> I don't mind taking them through SCSI if Jens agrees.

I'm fine with that, as long as the last issue Benjamin brought up has
been fixed up.

-- 
Jens Axboe



Re: [PATCH 9/9] bsg: split handling of SCSI CDBs vs transport requeues

2017-10-24 Thread Benjamin Block
> +static int bsg_transport_complete_rq(struct request *rq, struct sg_io_v4 
> *hdr)
> +{
> + struct bsg_job *job = blk_mq_rq_to_pdu(rq);
> + int ret = 0;
> +
> + /*
> +  * The assignments below don't make much sense, but are kept for
> +  * bug by bug backwards compatibility:
> +  */
> + hdr->device_status = job->result & 0xff;
> + hdr->transport_status = host_byte(job->result);
> + hdr->driver_status = driver_byte(job->result);
> + hdr->info = 0;
> + if (hdr->device_status || hdr->transport_status || hdr->driver_status)
> + hdr->info |= SG_INFO_CHECK;
> + hdr->response_len = 0;
> +
> + if (job->result < 0) {
> + /* we're only returning the result field in the reply */
> + job->reply_len = sizeof(u32);
> + ret = job->result;
> + }
> +
> + if (job->reply_len && hdr->response) {
> + int len = min(hdr->max_response_len, job->reply_len);
> +
> + if (copy_to_user(ptr64(hdr->response), job->reply, len))
> + ret = -EFAULT;
> + else
> + hdr->response_len = len;
> + }
> +
> + /* we assume all request payload was transferred, residual == 0 */
> + hdr->dout_resid = 0;
> +
> + if (rq->next_rq) {
> + unsigned int rsp_len = blk_rq_bytes(rq->next_rq);
> +
> + if (WARN_ON(job->reply_payload_rcv_len > rsp_len))

This gives me a lot of new warnings when running my tests on zFCP, none
of which happen when I run on 4.13.

After browsing the source a bit, I figured this is because, in contrast
to the old code, blk_rq_bytes() is now called after the block request has
already been finished and blk_update_bidi_request()+blk_update_request()
have already run. They update rq->__data_len, and thus the call here to
get rsp_len returns a wrong value. Hence the warnings, and the following
calculation is wrong as well.

I figure you can just replace this call to blk_rq_bytes() with
'job->reply_payload.payload_len'. It's essentially the same value, but the
latter is not changed after the job is committed, as far as I could see in
the source. Drivers make use of it, but only for reading, as far as I could
make out after browsing the code for a bit.

I did a quick test with that change in place and that seems to work fine
now. As far as my tests go, they behave as they did before.
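
A sketch of that substitution, for illustration only (untested):

	if (rq->next_rq) {
		/* length captured when the job was set up, not the byte
		 * count that blk_update_request() has since modified */
		unsigned int rsp_len = job->reply_payload.payload_len;

		if (WARN_ON(job->reply_payload_rcv_len > rsp_len))
			hdr->din_resid = 0;
		else
			hdr->din_resid = rsp_len - job->reply_payload_rcv_len;
	} else {
		hdr->din_resid = 0;
	}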


Beste Grüße / Best regards,
  - Benjamin Block

> + hdr->din_resid = 0;
> + else
> + hdr->din_resid = rsp_len - job->reply_payload_rcv_len;
> + } else {
> + hdr->din_resid = 0;
> + }
> +
> + return ret;
> +}
> +

--
Linux on z Systems Development / IBM Systems & Technology Group
  IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Martina Koederitz /Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294



[PATCH] lightnvm: Convert timers to use timer_setup()

2017-10-24 Thread Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.
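
For reference, from_timer() is essentially a container_of() wrapper, so the
callback recovers its enclosing structure from the timer_list pointer. A rough
stand-alone sketch of the pattern (user-space stand-ins, not the kernel
definitions; struct pblk_like is made up):

	#include <stddef.h>
	#include <stdio.h>

	#define container_of(ptr, type, member) \
		((type *)((char *)(ptr) - offsetof(type, member)))
	/* the kernel's from_timer() expands to roughly this (typeof is GNU C) */
	#define from_timer(var, timer_ptr, field) \
		container_of(timer_ptr, typeof(*var), field)

	struct timer_list { int dummy; };

	struct pblk_like {
		int id;
		struct timer_list wtimer;
	};

	/* new-style callback: gets the timer pointer, not an unsigned long */
	static void write_timer_fn(struct timer_list *t)
	{
		struct pblk_like *pblk = from_timer(pblk, t, wtimer);

		printf("timer fired for object %d\n", pblk->id);
	}

	int main(void)
	{
		struct pblk_like p = { .id = 42 };

		write_timer_fn(&p.wtimer);	/* simulate the timer firing */
		return 0;
	}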

Cc: Matias Bjorling 
Cc: linux-block@vger.kernel.org
Signed-off-by: Kees Cook 
---
 drivers/lightnvm/pblk-core.c | 4 ++--
 drivers/lightnvm/pblk-gc.c   | 6 +++---
 drivers/lightnvm/pblk-init.c | 2 +-
 drivers/lightnvm/pblk-rl.c   | 6 +++---
 drivers/lightnvm/pblk.h  | 2 +-
 drivers/lightnvm/rrpc.c  | 6 +++---
 6 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/pblk-core.c b/drivers/lightnvm/pblk-core.c
index 81501644fb15..8841eb66c902 100644
--- a/drivers/lightnvm/pblk-core.c
+++ b/drivers/lightnvm/pblk-core.c
@@ -229,9 +229,9 @@ static void pblk_write_kick(struct pblk *pblk)
	mod_timer(&pblk->wtimer, jiffies + msecs_to_jiffies(1000));
 }
 
-void pblk_write_timer_fn(unsigned long data)
+void pblk_write_timer_fn(struct timer_list *t)
 {
-   struct pblk *pblk = (struct pblk *)data;
+   struct pblk *pblk = from_timer(pblk, t, wtimer);
 
/* kick the write thread every tick to flush outstanding data */
pblk_write_kick(pblk);
diff --git a/drivers/lightnvm/pblk-gc.c b/drivers/lightnvm/pblk-gc.c
index 6090d28f7995..85ff64fe5ba2 100644
--- a/drivers/lightnvm/pblk-gc.c
+++ b/drivers/lightnvm/pblk-gc.c
@@ -428,9 +428,9 @@ void pblk_gc_kick(struct pblk *pblk)
	mod_timer(&gc->gc_timer, jiffies + msecs_to_jiffies(GC_TIME_MSECS));
 }
 
-static void pblk_gc_timer(unsigned long data)
+static void pblk_gc_timer(struct timer_list *t)
 {
-   struct pblk *pblk = (struct pblk *)data;
+   struct pblk *pblk = from_timer(pblk, t, gc.gc_timer);
 
pblk_gc_kick(pblk);
 }
@@ -569,7 +569,7 @@ int pblk_gc_init(struct pblk *pblk)
goto fail_free_writer_kthread;
}
 
-   setup_timer(&gc->gc_timer, pblk_gc_timer, (unsigned long)pblk);
+   timer_setup(&gc->gc_timer, pblk_gc_timer, 0);
	mod_timer(&gc->gc_timer, jiffies + msecs_to_jiffies(GC_TIME_MSECS));
 
gc->gc_active = 0;
diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 1b0f61233c21..c0ae85b514f5 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -833,7 +833,7 @@ static int pblk_lines_init(struct pblk *pblk)
 
 static int pblk_writer_init(struct pblk *pblk)
 {
-   setup_timer(&pblk->wtimer, pblk_write_timer_fn, (unsigned long)pblk);
+   timer_setup(&pblk->wtimer, pblk_write_timer_fn, 0);
	mod_timer(&pblk->wtimer, jiffies + msecs_to_jiffies(100));
 
pblk->writer_ts = kthread_create(pblk_write_ts, pblk, "pblk-writer-t");
diff --git a/drivers/lightnvm/pblk-rl.c b/drivers/lightnvm/pblk-rl.c
index 2e6a5361baf0..1210deeb6f43 100644
--- a/drivers/lightnvm/pblk-rl.c
+++ b/drivers/lightnvm/pblk-rl.c
@@ -178,9 +178,9 @@ int pblk_rl_sysfs_rate_show(struct pblk_rl *rl)
return rl->rb_user_max;
 }
 
-static void pblk_rl_u_timer(unsigned long data)
+static void pblk_rl_u_timer(struct timer_list *t)
 {
-   struct pblk_rl *rl = (struct pblk_rl *)data;
+   struct pblk_rl *rl = from_timer(rl, t, u_timer);
 
/* Release user I/O state. Protect from GC */
	smp_store_release(&rl->rb_user_active, 0);
@@ -221,7 +221,7 @@ void pblk_rl_init(struct pblk_rl *rl, int budget)
	atomic_set(&rl->rb_gc_cnt, 0);
	atomic_set(&rl->rb_space, -1);
 
-   setup_timer(&rl->u_timer, pblk_rl_u_timer, (unsigned long)rl);
+   timer_setup(&rl->u_timer, pblk_rl_u_timer, 0);
 
rl->rb_user_active = 0;
rl->rb_gc_active = 0;
diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 67e623bd5c2d..e097dea4a1ea 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -789,7 +789,7 @@ void pblk_map_rq(struct pblk *pblk, struct nvm_rq *rqd, 
unsigned int sentry,
  * pblk write thread
  */
 int pblk_write_ts(void *data);
-void pblk_write_timer_fn(unsigned long data);
+void pblk_write_timer_fn(struct timer_list *t);
 void pblk_write_should_kick(struct pblk *pblk);
 
 /*
diff --git a/drivers/lightnvm/rrpc.c b/drivers/lightnvm/rrpc.c
index 267f01ae87e4..0993c14be860 100644
--- a/drivers/lightnvm/rrpc.c
+++ b/drivers/lightnvm/rrpc.c
@@ -267,9 +267,9 @@ static void rrpc_gc_kick(struct rrpc *rrpc)
 /*
  * timed GC every interval.
  */
-static void rrpc_gc_timer(unsigned long data)
+static void rrpc_gc_timer(struct timer_list *t)
 {
-   struct rrpc *rrpc = (struct rrpc *)data;
+   struct rrpc *rrpc = from_timer(rrpc, t, gc_timer);
 
rrpc_gc_kick(rrpc);
	mod_timer(&rrpc->gc_timer, jiffies + msecs_to_jiffies(10));
@@ -1063,7 +1063,7 @@ static int rrpc_gc_init(struct rrpc *rrpc)
if (!rrpc->kgc_wq)
return -ENOMEM;
 
-   setup_timer(&rrpc->gc_timer, rrpc_gc_timer, (unsigned long)rrpc);
+   timer_setup(&rrpc->gc_timer, rrpc_gc_timer, 0);
 
return 0;
 }
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH v2] block: fix peeking requests during PM

2017-10-24 Thread Christoph Hellwig
On Mon, Oct 23, 2017 at 06:43:16PM +0800, Ming Lei wrote:
> On Fri, Oct 20, 2017 at 04:45:23PM +0200, Christoph Hellwig wrote:
> > We need to look for an active PM request until the next softbarrier
> > instead of looking for the first non-PM request.  Otherwise any cause
> > of request reordering might starve the PM request(s).
> 
> Hi Christoph,
> 
> Could you share with us how the starvation is caused? It looks like all PM
> requests are run via scsi_execute(), and the caller is always
> waiting for the completion of the PM request.
> 
> Especially since the PM request is run after the SCSI device is in the
> quiesce state; during that time, no normal (non-preempt) request can
> enter the queue once the "make quiesce safe" patchset is merged.

I thought I stated that the issue is more theoretical, and the
real reason is to refactor this code for later additions.

Seems like that was only in the cover letter of the previous version,
though.


Re: [PATCH] lightnvm: pblk: remove testing leftover

2017-10-24 Thread Jens Axboe
On 10/24/2017 07:56 AM, Javier González wrote:
> fixes: 8bd400204bd5 ("lightnvm: pblk: cleanup unused and static functions")
> Signed-off-by: Javier González 

Added, thanks.

-- 
Jens Axboe



fix pblk testing leftover

2017-10-24 Thread Javier González
Hi Jens,

I just noticed a testing leftover that made it in one of this window's
patches. Can you pick this up (and eventually squash it)?

Thanks,
Javier

Javier González (1):
  lightnvm: pblk: remove testing leftover

 drivers/lightnvm/pblk.h | 5 -
 1 file changed, 5 deletions(-)

-- 
2.7.4



[PATCH] lightnvm: pblk: remove testing leftover

2017-10-24 Thread Javier González
fixes: 8bd400204bd5 ("lightnvm: pblk: cleanup unused and static functions")
Signed-off-by: Javier González 
---
 drivers/lightnvm/pblk.h | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/lightnvm/pblk.h b/drivers/lightnvm/pblk.h
index 6b64288de6f7..90961033a79f 100644
--- a/drivers/lightnvm/pblk.h
+++ b/drivers/lightnvm/pblk.h
@@ -860,11 +860,6 @@ int pblk_rl_is_limit(struct pblk_rl *rl);
 int pblk_sysfs_init(struct gendisk *tdisk);
 void pblk_sysfs_exit(struct gendisk *tdisk);
 
-static inline void test(size_t a)
-{
-   a += 1;
-}
-
 static inline void *pblk_malloc(size_t size, int type, gfp_t flags)
 {
if (type == PBLK_KMALLOC_META)
-- 
2.7.4



Re: [PATCH] virtio_blk: Fix an SG_IO regression

2017-10-24 Thread Jens Axboe
On 10/24/2017 12:04 AM, Bart Van Assche wrote:
> Avoid that submitting an SG_IO ioctl triggers a kernel oops that
> is preceded by:
> 
> usercopy: kernel memory overwrite attempt detected to (null) () (6 
> bytes)
> kernel BUG at mm/usercopy.c:72!
> 
> Additionally, make it easier to diagnose crashes like the one
> fixed by this patch.

This really should have been two patches... Anyway, applied for 4.14.

-- 
Jens Axboe



Re: [PATCH] update bucket_in_use in real time

2017-10-24 Thread Coly Li
On 2017/10/24 下午4:57, tang.jun...@zte.com.cn wrote:
> From: Tang Junhui 
> 
> bucket_in_use is updated in the gc thread, which is triggered by
> invalidating or writing sectors_to_gc dirty data. That is a long
> interval, so when we use it to compare against the threshold it is often
> stale, which leads to inaccurate judgment and often results in bucket
> depletion.
> 
> We sent a patch before that updated bucket_in_use periodically in the gc
> thread, which Coly thought would lead to high latency. In this patch, we
> add avail_nbuckets to record the count of available buckets, and we
> recalculate bucket_in_use when allocating or freeing a bucket, in real
> time.
> 
> Signed-off-by: Tang Junhui 

Hi Junhui,

Thanks for the patch, this is a start to solve gc latency. Please check
my comments in line.


> ---
>  drivers/md/bcache/alloc.c  | 10 ++
>  drivers/md/bcache/bcache.h |  1 +
>  drivers/md/bcache/btree.c  | 17 ++---
>  drivers/md/bcache/btree.h  |  1 +
>  4 files changed, 22 insertions(+), 7 deletions(-)
>  mode change 100644 => 100755 drivers/md/bcache/alloc.c
>  mode change 100644 => 100755 drivers/md/bcache/bcache.h
>  mode change 100644 => 100755 drivers/md/bcache/btree.c
>  mode change 100644 => 100755 drivers/md/bcache/btree.h
> 
> diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
> old mode 100644
> new mode 100755
> index ca4abe1..89a5e35
> --- a/drivers/md/bcache/alloc.c
> +++ b/drivers/md/bcache/alloc.c
> @@ -439,6 +439,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned 
> reserve, bool wait)
>   b->prio = INITIAL_PRIO;
>   }
>  
> + if(ca->set->avail_nbuckets > 0) {
> + ca->set->avail_nbuckets--;
> + bch_update_bucket_in_use(ca->set);
> + }
> +

ca->set->avail_nbuckets-- is a read-modify-write operation, which means
concurrent updates can overwrite each other. Imagine (n is the in-memory
counter, v is each thread's local copy):

      in-memory      thread 1            thread 2
        n = 4
                     v = n   (v <- 4)
        n = 4                            v = n   (v <- 4)
                     v--     (v <- 3)
        n = 4                            v--     (v <- 3)
                     n = v   (n <- 3)
        n = 3                            n = v   (n <- 3)
        n = 3

In this case the correct n value should be 2, but the n value in memory is
3. Using the atomic routine atomic_dec() will solve this issue.
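
As an illustration only (the helper names below are made up, and the exact
placement is an assumption, not part of the patch), the atomic_t variant of
these two hunks could look like:

#include <linux/atomic.h>

/* assumes 'atomic_t avail_nbuckets;' in struct cache_set */

/* bch_bucket_alloc() path: one bucket leaves the available pool */
static inline void avail_bucket_taken(struct cache_set *c)
{
	if (atomic_read(&c->avail_nbuckets) > 0) {
		atomic_dec(&c->avail_nbuckets);	/* atomic RMW, no lost update */
		bch_update_bucket_in_use(c);
	}
}

/* __bch_bucket_free() path: one bucket returns to the pool */
static inline void avail_bucket_freed(struct cache_set *c)
{
	if ((size_t)atomic_read(&c->avail_nbuckets) < c->nbuckets) {
		atomic_inc(&c->avail_nbuckets);
		bch_update_bucket_in_use(c);
	}
}

Note that the check and the increment/decrement are still two separate steps
here; only the counter update itself becomes atomic, which is what the
comment above is asking for.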

>   return r;
>  }
>  
> @@ -446,6 +451,11 @@ void __bch_bucket_free(struct cache *ca, struct bucket 
> *b)
>  {
>   SET_GC_MARK(b, 0);
>   SET_GC_SECTORS_USED(b, 0);
> + 
> + if(ca->set->avail_nbuckets < ca->set->nbuckets) {
> + ca->set->avail_nbuckets++;
> + bch_update_bucket_in_use(ca->set);
> + }

Here we should use atomic_inc() on avail_nbuckets; the reason is the same
as in bch_bucket_alloc().


>  }
>  
>  void bch_bucket_free(struct cache_set *c, struct bkey *k)
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> old mode 100644
> new mode 100755
> index dee542f..275b29c
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -580,6 +580,7 @@ struct cache_set {
>   uint8_t need_gc;
>   struct gc_stat  gc_stats;
>   size_t  nbuckets;
> + size_t  avail_nbuckets;
>

Here avail_nbuckets should be atomic_t.

>   struct task_struct  *gc_thread;
>   /* Where in the btree gc currently is */
> diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
> old mode 100644
> new mode 100755
> index 866dcf7..1ccf0c3
> --- a/drivers/md/bcache/btree.c
> +++ b/drivers/md/bcache/btree.c
> @@ -1240,6 +1240,11 @@ void bch_initial_mark_key(struct cache_set *c, int 
> level, struct bkey *k)
>   __bch_btree_mark_key(c, level, k);
>  }
>  
> +void bch_update_bucket_in_use(struct cache_set *c)
> +{
> + c->gc_stats.in_use = (c->nbuckets - c->avail_nbuckets) * 100 / 
> c->nbuckets;
> +}

Here we should use atomic_read() on c->avail_nbuckets.

> +
>  static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
>  {
>   uint8_t stale = 0;
> @@ -1651,9 +1656,8 @@ static void btree_gc_start(struct cache_set *c)
>   mutex_unlock(>bucket_lock);
>  }
>  
> -static size_t bch_btree_gc_finish(struct cache_set *c)
> +static void bch_btree_gc_finish(struct cache_set *c)
>  {
> - size_t available = 0;
>   struct bucket *b;
>   struct cache *ca;
>   unsigned i;
> @@ -1690,6 +1694,7 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
>   }
>   rcu_read_unlock();
>  
> + c->avail_nbuckets = 0;
>   for_each_cache(ca, c, i) {
>   uint64_t *i;
>  
> @@ -1711,18 +1716,16 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
>   BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
>  
>   if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
> - available++;
> +   

Re: [PATCH v3 8/8] block: Assign a lock_class per gendisk used for wait_for_completion()

2017-10-24 Thread Ingo Molnar

* Byungchul Park  wrote:

> Darrick and Dave Chinner posted the following warning:
> 
> > ==
> > WARNING: possible circular locking dependency detected
> > 4.14.0-rc1-fixes #1 Tainted: GW
> > --
> > loop0/31693 is trying to acquire lock:
> >  (&(>i_mmaplock)->mr_lock){}, at: [] 
> > xfs_ilock+0x23c/0x330 [xfs]
> >
> > but now in release context of a crosslock acquired at the following:
> >  ((complete)){+.+.}, at: [] 
> > submit_bio_wait+0x7f/0xb0
> >
> > which lock already depends on the new lock.
> >
> > the existing dependency chain (in reverse order) is:
> >
> > -> #2 ((complete)){+.+.}:
> >lock_acquire+0xab/0x200
> >wait_for_completion_io+0x4e/0x1a0
> >submit_bio_wait+0x7f/0xb0
> >blkdev_issue_zeroout+0x71/0xa0
> >xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
> >xfs_bmapi_write+0x374/0x11f0 [xfs]
> >xfs_iomap_write_direct+0x2ac/0x430 [xfs]
> >xfs_file_iomap_begin+0x20d/0xd50 [xfs]
> >iomap_apply+0x43/0xe0
> >dax_iomap_rw+0x89/0xf0
> >xfs_file_dax_write+0xcc/0x220 [xfs]
> >xfs_file_write_iter+0xf0/0x130 [xfs]
> >__vfs_write+0xd9/0x150
> >vfs_write+0xc8/0x1c0
> >SyS_write+0x45/0xa0
> >entry_SYSCALL_64_fastpath+0x1f/0xbe
> >
> > -> #1 (_nondir_ilock_class){}:
> >lock_acquire+0xab/0x200
> >down_write_nested+0x4a/0xb0
> >xfs_ilock+0x263/0x330 [xfs]
> >xfs_setattr_size+0x152/0x370 [xfs]
> >xfs_vn_setattr+0x6b/0x90 [xfs]
> >notify_change+0x27d/0x3f0
> >do_truncate+0x5b/0x90
> >path_openat+0x237/0xa90
> >do_filp_open+0x8a/0xf0
> >do_sys_open+0x11c/0x1f0
> >entry_SYSCALL_64_fastpath+0x1f/0xbe
> >
> > -> #0 (&(>i_mmaplock)->mr_lock){}:
> >up_write+0x1c/0x40
> >xfs_iunlock+0x1d0/0x310 [xfs]
> >xfs_file_fallocate+0x8a/0x310 [xfs]
> >loop_queue_work+0xb7/0x8d0
> >kthread_worker_fn+0xb9/0x1f0
> >
> > Chain exists of:
> >   &(>i_mmaplock)->mr_lock --> _nondir_ilock_class --> 
> > (complete)
> >
> >  Possible unsafe locking scenario by crosslock:
> >
> >CPU0CPU1
> >
> >   lock(_nondir_ilock_class);
> >   lock((complete));
> >lock(&(>i_mmaplock)->mr_lock);
> >unlock((complete));
> >
> >*** DEADLOCK ***
> 
> The warning is a false positive, caused by the fact that all
> wait_for_completion()s in submit_bio_wait() are waiting with the same
> lock class.
> 
> However, some bios have nothing to do with others, for example, the case
> might happen while using loop devices, between bios of an upper device
> and a lower device(=loop device).
> 
> The safest way to assign different lock classes to different devices is
> to do it for each gendisk. In other words, this patch assigns a
> lockdep_map per gendisk and uses it when initializing completion in
> submit_bio_wait().
> 
> Of course, it might be too conservative. But making it safest for now,
> to be extended by block layer experts later, is good at the moment.
> 
> Signed-off-by: Byungchul Park 
> ---
>  block/bio.c   |  2 +-
>  block/genhd.c | 13 +
>  include/linux/genhd.h | 22 --
>  3 files changed, 26 insertions(+), 11 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 5e901bf..cc60213 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -935,7 +935,7 @@ static void submit_bio_wait_endio(struct bio *bio)
>   */
>  int submit_bio_wait(struct bio *bio)
>  {
> - DECLARE_COMPLETION_ONSTACK(done);
> + DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
>  
>   bio->bi_private = 
>   bio->bi_end_io = submit_bio_wait_endio;
> diff --git a/block/genhd.c b/block/genhd.c
> index dd305c6..f195d22 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -1354,13 +1354,7 @@ dev_t blk_lookup_devt(const char *name, int partno)
>  }
>  EXPORT_SYMBOL(blk_lookup_devt);
>  
> -struct gendisk *alloc_disk(int minors)
> -{
> - return alloc_disk_node(minors, NUMA_NO_NODE);
> -}
> -EXPORT_SYMBOL(alloc_disk);
> -
> -struct gendisk *alloc_disk_node(int minors, int node_id)
> +struct gendisk *__alloc_disk_node(int minors, int node_id, struct 
> lock_class_key *key, const char *lock_name)
>  {
>   struct gendisk *disk;
>   struct disk_part_tbl *ptbl;
> @@ -1409,9 +1403,12 @@ struct gendisk *alloc_disk_node(int minors, int 
> node_id)
>   disk_to_dev(disk)->type = _type;
>   device_initialize(disk_to_dev(disk));
>   }
> +
> + lockdep_init_map(>lockdep_map, lock_name, key, 0);

lockdep_init_map() depends on CONFIG_DEBUG_LOCK_ALLOC IIRC, but the data 
structure 
change you made depends on CONFIG_LOCKDEP_COMPLETIONS:


Re: [PATCH v3 4/8] lockdep: Add a kernel parameter, crossrelease_fullstack

2017-10-24 Thread Ingo Molnar

* Byungchul Park  wrote:

> Make whether to allow recording full stack, in cross-release feature,
> switchable at boot time via a kernel parameter, 'crossrelease_fullstack'.
> In case of a splat with no stack trace, one could just reboot and set
> the kernel parameter to get the full data without having to recompile
> the kernel.
> 
> Change CONFIG_CROSSRELEASE_STACK_TRACE default from N to Y, and
> introduce the new kernel parameter.
> 
> Signed-off-by: Byungchul Park 
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  3 +++
>  kernel/locking/lockdep.c| 18 --
>  lib/Kconfig.debug   |  5 +++--
>  3 files changed, 22 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt 
> b/Documentation/admin-guide/kernel-parameters.txt
> index ead7f40..4107b01 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -709,6 +709,9 @@
>   It will be ignored when crashkernel=X,high is not used
>   or memory reserved is below 4G.
>  
> + crossrelease_fullstack
> + [KNL] Allow to record full stack trace in cross-release
> +
>   cryptomgr.notests
>  [KNL] Disable crypto self-tests
>  
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 5c2ddf2..feba887 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -76,6 +76,15 @@
>  #define lock_stat 0
>  #endif
>  
> +static int crossrelease_fullstack;
> +static int __init allow_crossrelease_fullstack(char *str)
> +{
> + crossrelease_fullstack = 1;
> + return 0;
> +}
> +
> +early_param("crossrelease_fullstack", allow_crossrelease_fullstack);
> +
>  /*
>   * lockdep_lock: protects the lockdep graph, the hashes and the
>   *   class/list/hash allocators.
> @@ -4864,8 +4873,13 @@ static void add_xhlock(struct held_lock *hlock)
>   xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
>   xhlock->trace.entries = xhlock->trace_entries;
>  #ifdef CONFIG_CROSSRELEASE_STACK_TRACE
> - xhlock->trace.skip = 3;
> - save_stack_trace(>trace);
> + if (crossrelease_fullstack) {
> + xhlock->trace.skip = 3;
> + save_stack_trace(>trace);
> + } else {
> + xhlock->trace.nr_entries = 1;
> + xhlock->trace.entries[0] = hlock->acquire_ip;
> + }
>  #else
>   xhlock->trace.nr_entries = 1;
>   xhlock->trace.entries[0] = hlock->acquire_ip;
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index fe8fceb..132536d 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1228,7 +1228,7 @@ config LOCKDEP_COMPLETIONS
>  config CROSSRELEASE_STACK_TRACE
>   bool "Record more than one entity of stack trace in crossrelease"
>   depends on LOCKDEP_CROSSRELEASE
> - default n
> + default y
>   help
>The lockdep "cross-release" feature needs to record stack traces
>(of calling functions) for all acquisitions, for eventual later
> @@ -1238,7 +1238,8 @@ config CROSSRELEASE_STACK_TRACE
>full analysis. This option turns on the saving of the full stack
>trace entries.
>  
> -  If unsure, say N.
> +  To make the feature actually on, set "crossrelease_fullstack"
> +  kernel parameter, too.
>  
>  config DEBUG_LOCKDEP
>   bool "Lock dependency engine debugging"

This is really unnecessarily complex.

The proper logic is to introduce the crossrelease_fullstack boot parameter, and 
to 
also have a Kconfig option that enables it: 

CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK=y

No #ifdefs please - just an "if ()" branch dependent on the current value of 
crossrelease_fullstack.
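
A minimal sketch of that wiring, reusing the Kconfig symbol named above (the
helper function is hypothetical and the exact placement is an assumption):

/* kernel/locking/lockdep.c */
static int crossrelease_fullstack =
	IS_ENABLED(CONFIG_BOOTPARAM_LOCKDEP_CROSSRELEASE_FULLSTACK);

static int __init allow_crossrelease_fullstack(char *str)
{
	crossrelease_fullstack = 1;
	return 0;
}
early_param("crossrelease_fullstack", allow_crossrelease_fullstack);

/* hypothetical helper called from add_xhlock() */
static void xhlock_save_trace(struct held_lock *hlock, struct hist_lock *xhlock)
{
	if (crossrelease_fullstack) {	/* plain branch, no #ifdef */
		xhlock->trace.skip = 3;
		save_stack_trace(&xhlock->trace);
	} else {
		xhlock->trace.nr_entries = 1;
		xhlock->trace.entries[0] = hlock->acquire_ip;
	}
}

The Kconfig option then only seeds the default; the boot parameter can still
turn full stack recording on for a single boot without rebuilding.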

Thanks,

Ingo


Re: [PATCH v3 7/8] genhd.h: Remove trailing white space

2017-10-24 Thread Ingo Molnar

* Byungchul Park  wrote:

> Trailing whitespace is not accepted in the kernel coding style. Remove
> it.
> 
> Signed-off-by: Byungchul Park 
> ---
>  include/linux/genhd.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index ea652bf..6d85a75 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -3,7 +3,7 @@
>  
>  /*
>   *   genhd.h Copyright (C) 1992 Drew Eckhardt
> - *   Generic hard disk header file by  
> + *   Generic hard disk header file by
>   *   Drew Eckhardt
>   *
>   *   
> @@ -471,7 +471,7 @@ struct bsd_disklabel {
>   __s16   d_type; /* drive type */
>   __s16   d_subtype;  /* controller/d_type specific */
>   chard_typename[16]; /* type name, e.g. "eagle" */
> - chard_packname[16]; /* pack identifier */ 
> + chard_packname[16]; /* pack identifier */
>   __u32   d_secsize;  /* # of bytes per sector */
>   __u32   d_nsectors; /* # of data sectors per track */
>   __u32   d_ntracks;  /* # of tracks per cylinder */

This patch should not be part of this lockdep series - please send it to Jens 
separately - who might or might not apply it, depending on the subsystem's 
policy 
regarding whitespace-only patches.

Thanks,

Ingo


Re: [PATCH v3 2/8] lockdep: Introduce CROSSRELEASE_STACK_TRACE and make it not unwind as default

2017-10-24 Thread Ingo Molnar

Cannot pick up this series yet, but I have enhanced the changelog to:

=>
Subject: locking/lockdep: Introduce CONFIG_CROSSRELEASE_STACK_TRACE and make it 
not unwind by default
From: Byungchul Park 
Date: Tue, 24 Oct 2017 18:38:03 +0900

Johan Hovold reported a heavy performance regression caused by
lockdep cross-release:

 > Boot time (from "Linux version" to login prompt) had in fact doubled
 > since 4.13 where it took 17 seconds (with my current config) compared to
 > the 35 seconds I now see with 4.14-rc4.
 >
 > I quick bisect pointed to lockdep and specifically the following commit:
 >
 >  28a903f63ec0 ("locking/lockdep: Handle non(or multi)-acquisition
 > of a crosslock")
 >
 > which I've verified is the commit which doubled the boot time (compared
 > to 28a903f63ec0^) (added by lockdep crossrelease series [1]).

Currently crossrelease performs unwind on every acquisition, but that is
very expensive.

This patch makes unwind optional and disables it by default and only records
acquire_ip.

Full stack traces are sometimes required for full analysis, in which
case CROSSRELEASE_STACK_TRACE can be enabled.

On my qemu Ubuntu machine (x86_64, 4 cores, 512M), the regression was
fixed. We measure boot times with 'perf stat --null --repeat 10 $QEMU',
where $QEMU launches a kernel with init=/bin/true:

1. No lockdep enabled:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.756558155 seconds time elapsed( +-  0.09% )

2. Lockdep enabled:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.968710420 seconds time elapsed( +-  0.12% )

3. Lockdep enabled + crossrelease enabled:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   3.153839636 seconds time elapsed( +-  0.31% )

4. Lockdep enabled + crossrelease enabled + this patch applied:

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.963669551 seconds time elapsed( +-  0.11% )

I.e. lockdep-crossrelease performance is now indistinguishable
from vanilla lockdep.



Re: [PATCH v3 3/8] lockdep: Remove BROKEN flag of LOCKDEP_CROSSRELEASE

2017-10-24 Thread Ingo Molnar

better changelog:

>
Subject: locking/lockdep: Remove the BROKEN flag from 
CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS
From: Byungchul Park 
Date: Tue, 24 Oct 2017 18:38:04 +0900

Now that the performance regression is fixed, re-enable
CONFIG_LOCKDEP_CROSSRELEASE=y and CONFIG_LOCKDEP_COMPLETIONS=y.




Re: [PATCH v3 2/8] lockdep: Introduce CROSSRELEASE_STACK_TRACE and make it not unwind as default

2017-10-24 Thread Ingo Molnar

* Byungchul Park  wrote:

> Johan Hovold reported a performance regression caused by crossrelease:

Please add Reported-by and Analyzed-by tags - you didn't even Cc: Johan!

Thanks,

Ingo


Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart

2017-10-24 Thread Ming Lei
On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:
> Hi Ming,
> 
> On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei  wrote:
> > On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:
> >> Hi all,
> >>
> >> the patch below fixes queue stalling when a shared hctx is marked for
> >> restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays
> >> zero.  The root cause is that hctxs are shared between queues, but
> >> 'shared_hctx_restart' belongs to a particular queue, which in fact may
> >> not need to be restarted; thus we return from blk_mq_sched_restart()
> >> and leave the shared hctx of another queue never restarted.
> >>
> >> The fix is to make the shared_hctx_restart counter belong not to the
> >> queue, but to the tags, so that the counter reflects the real number of
> >> shared hctxs that need to be restarted.
> >>
> >> During tests, 1 hctx (set->nr_hw_queues) was used and all stalled
> >> requests were noticed in the dd->fifo_list of the mq-deadline scheduler.
> >>
> >> A seemingly possible sequence of events:
> >>
> >> 1. Request A of queue A is inserted into dd->fifo_list of the scheduler.
> >>
> >> 2. Request B of queue A bypasses scheduler and goes directly to
> >>hctx->dispatch.
> >>
> >> 3. Request C of queue B is inserted.
> >>
> >> 4. blk_mq_sched_dispatch_requests() is invoked; since hctx->dispatch is
> >>not empty (request B is in the list), hctx is only marked for the next
> >>restart and request A is left in the list (see the comment "So it's
> >>best to leave them there for as long as we can. Mark the hw queue as
> >>needing a restart in that case." in blk-mq-sched.c)
> >>
> >> 5. Eventually request B is completed/freed and blk_mq_sched_restart() is
> >>called, but by chance hctx from queue B is chosen for restart and 
> >> request C
> >>gets a chance to be dispatched.
> >>
> >> 6. Eventually request C is completed/freed and blk_mq_sched_restart() is
> >>called, but shared_hctx_restart for queue B is zero and we return
> >>without attempting to restart the hctx from queue A; thus request A is
> >>stuck forever.
> >>
> >> But a stalling queue is not the only problem with blk_mq_sched_restart().
> >> My tests show that those loops through all queues and hctxs can be very
> >> costly, even with the shared_hctx_restart counter, which aims to fix the
> >> performance issue.  For my tests I create 128 devices with 64 hctxs each,
> >> which share the same tag set.
> >
> > Hi Roman,
> >
> > I also find the performance issue with RESTART for TAG_SHARED.
> >
> > But from my analysis, RESTART isn't needed for TAG_SHARED
> > because SCSI-MQ has handled the RESTART by itself already, so
> > could you test the patch in following link posted days ago to
> > see if it fixes your issue?
> 
> I can say without any testing: it fixes all the issues :) You've
> reverted
> 
>8e8320c9315c ("blk-mq: fix performance regression with shared tags")
>6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
> 
> with one major difference: you do not handle shared tags in a special
> way and restart only requested hctx, instead of iterating over all hctxs
> in a queue.

Hi Roman,

Yeah, that is unnecessary as I explained in detail in the commit log and
introduces lots of overhead, and I can see IOPS is improved by 20%~30% in
my SCSI_DEBUG test(only 8 luns) by removing the blk-mq restart for TAG_SHARED.

Also it isn't needed for BLK_STS_RESOURCE returned from .get_budget().

I will post out V3 soon. 

> 
> Firstly I have to say that the queue stalling issue (#1) and the
> performance issue (#2) were observed on our in-house RDMA driver IBNBD:
> https://lwn.net/Articles/718181/ and I've never tried to reproduce
> them on SCSI-MQ.

OK, got it. Then you can handle it in SCSI-MQ's way, since that approach
has been used by non-MQ for a long time. Or you need to do nothing if your
driver is SCSI based.

> 
> Then your patch breaks RR restarts, which were introduced by this commit
> 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
> The basic idea of that patch seems nice, but because of the possibly large
> number of queues and hctxs the patch requires reimplementation.  Eventually
> you should get fast hctx restart, but in RR fashion. To my understanding
> that does not contradict your patch.

Firstly, this way has been run/verified for a long time in non-mq, and no
one has complained about it before.

Secondly please see scsi_target_queue_ready() and scsi_host_queue_ready(),
the sdev is added into the 'starved_list' in FIFO style, which is still fair.

So I don't think it is an issue to remove the RR restarts.

Thanks,
Ming


[PATCH v3 1/8] block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait

2017-10-24 Thread Byungchul Park
From: Christoph Hellwig 

Simplify the code by getting rid of the submit_bio_ret structure.

Signed-off-by: Christoph Hellwig 
---
 block/bio.c | 19 +--
 1 file changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 101c2a9..5e901bf 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -917,17 +917,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct 
iov_iter *iter)
 }
 EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
-struct submit_bio_ret {
-   struct completion event;
-   int error;
-};
-
 static void submit_bio_wait_endio(struct bio *bio)
 {
-   struct submit_bio_ret *ret = bio->bi_private;
-
-   ret->error = blk_status_to_errno(bio->bi_status);
-   complete(>event);
+   complete(bio->bi_private);
 }
 
 /**
@@ -943,16 +935,15 @@ static void submit_bio_wait_endio(struct bio *bio)
  */
 int submit_bio_wait(struct bio *bio)
 {
-   struct submit_bio_ret ret;
+   DECLARE_COMPLETION_ONSTACK(done);
 
-   init_completion();
-   bio->bi_private = 
+   bio->bi_private = 
bio->bi_end_io = submit_bio_wait_endio;
bio->bi_opf |= REQ_SYNC;
submit_bio(bio);
-   wait_for_completion_io();
+   wait_for_completion_io();
 
-   return ret.error;
+   return blk_status_to_errno(bio->bi_status);
 }
 EXPORT_SYMBOL(submit_bio_wait);
 
-- 
1.9.1



[PATCH v3 6/8] lockdep: Remove unnecessary acquisitions wrt workqueue flush

2017-10-24 Thread Byungchul Park
The workqueue added manual acquisitions to catch deadlock cases.
Now crossrelease was introduced, some of those are redundant, since
wait_for_completion() already includes the acquisition for itself.
Removed it.

Signed-off-by: Byungchul Park 
---
 include/linux/workqueue.h |  4 ++--
 kernel/workqueue.c| 19 +++
 2 files changed, 5 insertions(+), 18 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index f3c47a0..1455b5e 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -218,7 +218,7 @@ static inline void destroy_delayed_work_on_stack(struct 
delayed_work *work) { }
\
__init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT();   \
-   lockdep_init_map(&(_work)->lockdep_map, #_work, &__key, 0); \
+   lockdep_init_map(&(_work)->lockdep_map, "(complete)"#_work, 
&__key, 0); \
INIT_LIST_HEAD(&(_work)->entry);\
(_work)->func = (_func);\
} while (0)
@@ -399,7 +399,7 @@ enum {
static struct lock_class_key __key; \
const char *__lock_name;\
\
-   __lock_name = #fmt#args;\
+   __lock_name = "(complete)"#fmt#args;\
\
__alloc_workqueue_key((fmt), (flags), (max_active), \
  &__key, __lock_name, ##args); \
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c77fdf6..ee05d19 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2496,15 +2496,8 @@ static void insert_wq_barrier(struct pool_workqueue *pwq,
INIT_WORK_ONSTACK(>work, wq_barrier_func);
__set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(>work));
 
-   /*
-* Explicitly init the crosslock for wq_barrier::done, make its lock
-* key a subkey of the corresponding work. As a result we won't
-* build a dependency between wq_barrier::done and unrelated work.
-*/
-   lockdep_init_map_crosslock((struct lockdep_map *)>done.map,
-  "(complete)wq_barr::done",
-  target->lockdep_map.key, 1);
-   __init_completion(>done);
+   init_completion_map(>done, >lockdep_map);
+
barr->task = current;
 
/*
@@ -2610,16 +2603,13 @@ void flush_workqueue(struct workqueue_struct *wq)
struct wq_flusher this_flusher = {
.list = LIST_HEAD_INIT(this_flusher.list),
.flush_color = -1,
-   .done = COMPLETION_INITIALIZER_ONSTACK(this_flusher.done),
+   .done = COMPLETION_INITIALIZER_ONSTACK_MAP(this_flusher.done, 
wq->lockdep_map),
};
int next_color;
 
if (WARN_ON(!wq_online))
return;
 
-   lock_map_acquire(>lockdep_map);
-   lock_map_release(>lockdep_map);
-
mutex_lock(>mutex);
 
/*
@@ -2882,9 +2872,6 @@ bool flush_work(struct work_struct *work)
if (WARN_ON(!wq_online))
return false;
 
-   lock_map_acquire(>lockdep_map);
-   lock_map_release(>lockdep_map);
-
if (start_flush_work(work, )) {
wait_for_completion();
destroy_work_on_stack();
-- 
1.9.1



[PATCH v3 8/8] block: Assign a lock_class per gendisk used for wait_for_completion()

2017-10-24 Thread Byungchul Park
Darrick and Dave Chinner posted the following warning:

> ==
> WARNING: possible circular locking dependency detected
> 4.14.0-rc1-fixes #1 Tainted: GW
> --
> loop0/31693 is trying to acquire lock:
>  (&(>i_mmaplock)->mr_lock){}, at: [] 
> xfs_ilock+0x23c/0x330 [xfs]
>
> but now in release context of a crosslock acquired at the following:
>  ((complete)){+.+.}, at: [] 
> submit_bio_wait+0x7f/0xb0
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 ((complete)){+.+.}:
>lock_acquire+0xab/0x200
>wait_for_completion_io+0x4e/0x1a0
>submit_bio_wait+0x7f/0xb0
>blkdev_issue_zeroout+0x71/0xa0
>xfs_bmapi_convert_unwritten+0x11f/0x1d0 [xfs]
>xfs_bmapi_write+0x374/0x11f0 [xfs]
>xfs_iomap_write_direct+0x2ac/0x430 [xfs]
>xfs_file_iomap_begin+0x20d/0xd50 [xfs]
>iomap_apply+0x43/0xe0
>dax_iomap_rw+0x89/0xf0
>xfs_file_dax_write+0xcc/0x220 [xfs]
>xfs_file_write_iter+0xf0/0x130 [xfs]
>__vfs_write+0xd9/0x150
>vfs_write+0xc8/0x1c0
>SyS_write+0x45/0xa0
>entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #1 (_nondir_ilock_class){}:
>lock_acquire+0xab/0x200
>down_write_nested+0x4a/0xb0
>xfs_ilock+0x263/0x330 [xfs]
>xfs_setattr_size+0x152/0x370 [xfs]
>xfs_vn_setattr+0x6b/0x90 [xfs]
>notify_change+0x27d/0x3f0
>do_truncate+0x5b/0x90
>path_openat+0x237/0xa90
>do_filp_open+0x8a/0xf0
>do_sys_open+0x11c/0x1f0
>entry_SYSCALL_64_fastpath+0x1f/0xbe
>
> -> #0 (&(>i_mmaplock)->mr_lock){}:
>up_write+0x1c/0x40
>xfs_iunlock+0x1d0/0x310 [xfs]
>xfs_file_fallocate+0x8a/0x310 [xfs]
>loop_queue_work+0xb7/0x8d0
>kthread_worker_fn+0xb9/0x1f0
>
> Chain exists of:
>   &(>i_mmaplock)->mr_lock --> _nondir_ilock_class --> 
> (complete)
>
>  Possible unsafe locking scenario by crosslock:
>
>CPU0CPU1
>
>   lock(_nondir_ilock_class);
>   lock((complete));
>lock(&(>i_mmaplock)->mr_lock);
>unlock((complete));
>
>*** DEADLOCK ***

The warning is a false positive, caused by the fact that all
wait_for_completion()s in submit_bio_wait() are waiting with the same
lock class.

However, some bios have nothing to do with others, for example, the case
might happen while using loop devices, between bios of an upper device
and a lower device(=loop device).

The safest way to assign different lock classes to different devices is
to do it for each gendisk. In other words, this patch assigns a
lockdep_map per gendisk and uses it when initializing completion in
submit_bio_wait().

Of course, it might be too conservative. But making it safest for now,
to be extended by block layer experts later, is good at the moment.

Signed-off-by: Byungchul Park 
---
 block/bio.c   |  2 +-
 block/genhd.c | 13 +
 include/linux/genhd.h | 22 --
 3 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 5e901bf..cc60213 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -935,7 +935,7 @@ static void submit_bio_wait_endio(struct bio *bio)
  */
 int submit_bio_wait(struct bio *bio)
 {
-   DECLARE_COMPLETION_ONSTACK(done);
+   DECLARE_COMPLETION_ONSTACK_MAP(done, bio->bi_disk->lockdep_map);
 
bio->bi_private = 
bio->bi_end_io = submit_bio_wait_endio;
diff --git a/block/genhd.c b/block/genhd.c
index dd305c6..f195d22 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1354,13 +1354,7 @@ dev_t blk_lookup_devt(const char *name, int partno)
 }
 EXPORT_SYMBOL(blk_lookup_devt);
 
-struct gendisk *alloc_disk(int minors)
-{
-   return alloc_disk_node(minors, NUMA_NO_NODE);
-}
-EXPORT_SYMBOL(alloc_disk);
-
-struct gendisk *alloc_disk_node(int minors, int node_id)
+struct gendisk *__alloc_disk_node(int minors, int node_id, struct 
lock_class_key *key, const char *lock_name)
 {
struct gendisk *disk;
struct disk_part_tbl *ptbl;
@@ -1409,9 +1403,12 @@ struct gendisk *alloc_disk_node(int minors, int node_id)
disk_to_dev(disk)->type = _type;
device_initialize(disk_to_dev(disk));
}
+
+   lockdep_init_map(>lockdep_map, lock_name, key, 0);
+
return disk;
 }
-EXPORT_SYMBOL(alloc_disk_node);
+EXPORT_SYMBOL(__alloc_disk_node);
 
 struct kobject *get_disk(struct gendisk *disk)
 {
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 6d85a75..9832e3c 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -206,6 +206,9 @@ struct gendisk {
 #endif /* CONFIG_BLK_DEV_INTEGRITY */
int node_id;
struct badblocks *bb;
+#ifdef 

[PATCH v3 7/8] genhd.h: Remove trailing white space

2017-10-24 Thread Byungchul Park
Trailing whitespace is not accepted in the kernel coding style. Remove
it.

Signed-off-by: Byungchul Park 
---
 include/linux/genhd.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index ea652bf..6d85a75 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -3,7 +3,7 @@
 
 /*
  * genhd.h Copyright (C) 1992 Drew Eckhardt
- * Generic hard disk header file by  
+ * Generic hard disk header file by
  * Drew Eckhardt
  *
  * 
@@ -471,7 +471,7 @@ struct bsd_disklabel {
__s16   d_type; /* drive type */
__s16   d_subtype;  /* controller/d_type specific */
chard_typename[16]; /* type name, e.g. "eagle" */
-   chard_packname[16]; /* pack identifier */ 
+   chard_packname[16]; /* pack identifier */
__u32   d_secsize;  /* # of bytes per sector */
__u32   d_nsectors; /* # of data sectors per track */
__u32   d_ntracks;  /* # of tracks per cylinder */
-- 
1.9.1



[PATCH v3 3/8] lockdep: Remove BROKEN flag of LOCKDEP_CROSSRELEASE

2017-10-24 Thread Byungchul Park
Now that the performance regression is fixed, re-enable
CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS.

Signed-off-by: Byungchul Park 
---
 lib/Kconfig.debug | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 90ea784..fe8fceb 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1138,8 +1138,8 @@ config PROVE_LOCKING
select DEBUG_MUTEXES
select DEBUG_RT_MUTEXES if RT_MUTEXES
select DEBUG_LOCK_ALLOC
-   select LOCKDEP_CROSSRELEASE if BROKEN
-   select LOCKDEP_COMPLETIONS if BROKEN
+   select LOCKDEP_CROSSRELEASE
+   select LOCKDEP_COMPLETIONS
select TRACE_IRQFLAGS
default n
help
-- 
1.9.1



[PATCH v3 5/8] completion: Add support for initializing completion with lockdep_map

2017-10-24 Thread Byungchul Park
Sometimes, we want to initialize completions with separate lockdep maps
to assign lock classes as desired. For example, the workqueue code needs
to directly manage lockdep maps, since only that code is aware of how to
classify lockdep maps properly.

Provide additional macros initializing completions in that way.

Signed-off-by: Byungchul Park 
---
 include/linux/completion.h | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/include/linux/completion.h b/include/linux/completion.h
index cae5400..02f8cde 100644
--- a/include/linux/completion.h
+++ b/include/linux/completion.h
@@ -49,6 +49,13 @@ static inline void complete_release_commit(struct completion 
*x)
lock_commit_crosslock((struct lockdep_map *)>map);
 }
 
+#define init_completion_map(x, m)  \
+do {   \
+   lockdep_init_map_crosslock((struct lockdep_map *)&(x)->map, \
+   (m)->name, (m)->key, 0);
\
+   __init_completion(x);   \
+} while (0)
+
 #define init_completion(x) \
 do {   \
static struct lock_class_key __key; \
@@ -58,6 +65,7 @@ static inline void complete_release_commit(struct completion 
*x)
__init_completion(x);   \
 } while (0)
 #else
+#define init_completion_map(x, m) __init_completion(x)
 #define init_completion(x) __init_completion(x)
 static inline void complete_acquire(struct completion *x) {}
 static inline void complete_release(struct completion *x) {}
@@ -73,6 +81,9 @@ static inline void complete_release_commit(struct completion 
*x) {}
{ 0, __WAIT_QUEUE_HEAD_INITIALIZER((work).wait) }
 #endif
 
+#define COMPLETION_INITIALIZER_ONSTACK_MAP(work, map) \
+   (*({ init_completion_map(&(work), &(map)); &(work); }))
+
 #define COMPLETION_INITIALIZER_ONSTACK(work) \
(*({ init_completion();  }))
 
@@ -102,8 +113,11 @@ static inline void complete_release_commit(struct 
completion *x) {}
 #ifdef CONFIG_LOCKDEP
 # define DECLARE_COMPLETION_ONSTACK(work) \
struct completion work = COMPLETION_INITIALIZER_ONSTACK(work)
+# define DECLARE_COMPLETION_ONSTACK_MAP(work, map) \
+   struct completion work = COMPLETION_INITIALIZER_ONSTACK_MAP(work, map)
 #else
 # define DECLARE_COMPLETION_ONSTACK(work) DECLARE_COMPLETION(work)
+# define DECLARE_COMPLETION_ONSTACK_MAP(work, map) DECLARE_COMPLETION(work)
 #endif
 
 /**
-- 
1.9.1



[PATCH v3 2/8] lockdep: Introduce CROSSRELEASE_STACK_TRACE and make it not unwind as default

2017-10-24 Thread Byungchul Park
Johan Hovold reported a performance regression caused by crossrelease:

> Boot time (from "Linux version" to login prompt) had in fact doubled
> since 4.13 where it took 17 seconds (with my current config) compared to
> the 35 seconds I now see with 4.14-rc4.
>
> I quick bisect pointed to lockdep and specifically the following commit:
>
>   28a903f63ec0 ("locking/lockdep: Handle non(or multi)-acquisition
>  of a crosslock")
>
> which I've verified is the commit which doubled the boot time (compared
> to 28a903f63ec0^) (added by lockdep crossrelease series [1]).

Currently crossrelease performs unwind on every acquisition. But that
overloads systems too much. So this patch makes unwind optional and sets
it to N by default; instead, it normally records only acquire_ip. Of
course, unwind is sometimes required for full analysis. In that case, we
can set CROSSRELEASE_STACK_TRACE to Y and use it.

On my qemu Ubuntu machine (x86_64, 4 cores, 512M), the regression was
fixed. Boot times were measured with 'perf stat --null --repeat 10
$QEMU', where $QEMU launches a kernel with init=/bin/true:

1. No lockdep enabled

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.756558155 seconds time elapsed( +-  0.09% )

2. Lockdep enabled

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.968710420 seconds time elapsed( +-  0.12% )

3. Lockdep enabled + crossrelease enabled

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   3.153839636 seconds time elapsed( +-  0.31% )

4. Lockdep enabled + crossrelease enabled + this patch applied

 Performance counter stats for 'qemu_booting_time.sh bzImage' (10 runs):

   2.963669551 seconds time elapsed( +-  0.11% )

Signed-off-by: Byungchul Park 
---
 include/linux/lockdep.h  |  4 
 kernel/locking/lockdep.c |  5 +
 lib/Kconfig.debug| 15 +++
 3 files changed, 24 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index bfa8e0b..70358b5 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -278,7 +278,11 @@ struct held_lock {
 };
 
 #ifdef CONFIG_LOCKDEP_CROSSRELEASE
+#ifdef CONFIG_CROSSRELEASE_STACK_TRACE
 #define MAX_XHLOCK_TRACE_ENTRIES 5
+#else
+#define MAX_XHLOCK_TRACE_ENTRIES 1
+#endif
 
 /*
  * This is for keeping locks waiting for commit so that true dependencies
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e36e652..5c2ddf2 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -4863,8 +4863,13 @@ static void add_xhlock(struct held_lock *hlock)
xhlock->trace.nr_entries = 0;
xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
xhlock->trace.entries = xhlock->trace_entries;
+#ifdef CONFIG_CROSSRELEASE_STACK_TRACE
xhlock->trace.skip = 3;
save_stack_trace(>trace);
+#else
+   xhlock->trace.nr_entries = 1;
+   xhlock->trace.entries[0] = hlock->acquire_ip;
+#endif
 }
 
 static inline int same_context_xhlock(struct hist_lock *xhlock)
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3db9167..90ea784 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1225,6 +1225,21 @@ config LOCKDEP_COMPLETIONS
 A deadlock caused by wait_for_completion() and complete() can be
 detected by lockdep using crossrelease feature.
 
+config CROSSRELEASE_STACK_TRACE
+   bool "Record more than one entity of stack trace in crossrelease"
+   depends on LOCKDEP_CROSSRELEASE
+   default n
+   help
+The lockdep "cross-release" feature needs to record stack traces
+(of calling functions) for all acquisitions, for eventual later
+use during analysis. By default only a single caller is recorded,
+because the unwind operation can be very expensive with deeper
+stack chains. However, sometimes deeper traces are required for
+full analysis. This option turns on the saving of the full stack
+trace entries.
+
+If unsure, say N.
+
 config DEBUG_LOCKDEP
bool "Lock dependency engine debugging"
depends on DEBUG_KERNEL && LOCKDEP
-- 
1.9.1



[PATCH v3 4/8] lockdep: Add a kernel parameter, crossrelease_fullstack

2017-10-24 Thread Byungchul Park
Make whether to record the full stack trace in the cross-release feature
switchable at boot time via a kernel parameter, 'crossrelease_fullstack'.
In case of a splat with no stack trace, one could just reboot and set
the kernel parameter to get the full data without having to recompile
the kernel.

Change CONFIG_CROSSRELEASE_STACK_TRACE default from N to Y, and
introduce the new kernel parameter.

Signed-off-by: Byungchul Park 
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 kernel/locking/lockdep.c| 18 --
 lib/Kconfig.debug   |  5 +++--
 3 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index ead7f40..4107b01 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -709,6 +709,9 @@
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
 
+   crossrelease_fullstack
+   [KNL] Allow to record full stack trace in cross-release
+
cryptomgr.notests
 [KNL] Disable crypto self-tests
 
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 5c2ddf2..feba887 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -76,6 +76,15 @@
 #define lock_stat 0
 #endif
 
+static int crossrelease_fullstack;
+static int __init allow_crossrelease_fullstack(char *str)
+{
+   crossrelease_fullstack = 1;
+   return 0;
+}
+
+early_param("crossrelease_fullstack", allow_crossrelease_fullstack);
+
 /*
  * lockdep_lock: protects the lockdep graph, the hashes and the
  *   class/list/hash allocators.
@@ -4864,8 +4873,13 @@ static void add_xhlock(struct held_lock *hlock)
xhlock->trace.max_entries = MAX_XHLOCK_TRACE_ENTRIES;
xhlock->trace.entries = xhlock->trace_entries;
 #ifdef CONFIG_CROSSRELEASE_STACK_TRACE
-   xhlock->trace.skip = 3;
-   save_stack_trace(>trace);
+   if (crossrelease_fullstack) {
+   xhlock->trace.skip = 3;
+   save_stack_trace(>trace);
+   } else {
+   xhlock->trace.nr_entries = 1;
+   xhlock->trace.entries[0] = hlock->acquire_ip;
+   }
 #else
xhlock->trace.nr_entries = 1;
xhlock->trace.entries[0] = hlock->acquire_ip;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index fe8fceb..132536d 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1228,7 +1228,7 @@ config LOCKDEP_COMPLETIONS
 config CROSSRELEASE_STACK_TRACE
bool "Record more than one entity of stack trace in crossrelease"
depends on LOCKDEP_CROSSRELEASE
-   default n
+   default y
help
 The lockdep "cross-release" feature needs to record stack traces
 (of calling functions) for all acquisitions, for eventual later
@@ -1238,7 +1238,8 @@ config CROSSRELEASE_STACK_TRACE
 full analysis. This option turns on the saving of the full stack
 trace entries.
 
-If unsure, say N.
+To make the feature actually on, set "crossrelease_fullstack"
+kernel parameter, too.
 
 config DEBUG_LOCKDEP
bool "Lock dependency engine debugging"
-- 
1.9.1



[PATCH v3 0/8] cross-release: enhence performance and fix false positives

2017-10-24 Thread Byungchul Park
Changes from v2
- Combine 2 series, fixing false positives and enhancing performance
- Add Christoph Hellwig's patch simplifying the submit_bio_wait() code
- Add 2 more 'init with lockdep map' macros for completion
- Rename init_completion_with_map() to init_completion_map()

Changes from v1
- Fix kconfig description as Ingo suggested
- Fix commit message writing out CONFIG_ variable
- Introduce a new kernel parameter, crossrelease_fullstack
- Replace the number with the output of *perf*
- Separate a patch removing white space

Byungchul Park (7):
  lockdep: Introduce CROSSRELEASE_STACK_TRACE and make it not unwind as
default
  lockdep: Remove BROKEN flag of LOCKDEP_CROSSRELEASE
  lockdep: Add a kernel parameter, crossrelease_fullstack
  completion: Add support for initializing completion with lockdep_map
  lockdep: Remove unnecessary acquisitions wrt workqueue flush
  genhd.h: Remove trailing white space
  block: Assign a lock_class per gendisk used for wait_for_completion()

Christoph Hellwig (1):
  block: use DECLARE_COMPLETION_ONSTACK in submit_bio_wait

 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 block/bio.c | 19 +-
 block/genhd.c   | 13 +
 include/linux/completion.h  | 14 +
 include/linux/genhd.h   | 26 +
 include/linux/lockdep.h |  4 
 include/linux/workqueue.h   |  4 ++--
 kernel/locking/lockdep.c| 23 --
 kernel/workqueue.c  | 19 +++---
 lib/Kconfig.debug   | 20 +--
 10 files changed, 97 insertions(+), 48 deletions(-)

-- 
1.9.1



Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-10-24 Thread Adrian Hunter
On 24/10/17 10:39, Ulf Hansson wrote:
> [...]
> 
>>> However, you have completely ignored mine, Linus' and Bartlomiej's
>>> comments that we want the blkmq port to be a separate patch (or patches)
>>> and the CMDQ patches made on top of that. This worries me, because it
>>> seems like our messages don't reach you.
>>
>> Rubbish!  I gave a very good reason for keeping the CQE code in - it is
>> designed to work together.  I also pointed out that it is trivial to see the
>> CQE code and that it is all '+' lines anyway.
> 
> You gave reasons, but none of us bought them.
> 
>>
>> But not one question in response!  Where is a single example of why it is
>> difficult like it is.  Where are the questions!  Not even a request for
>> documentation!  How I am supposed to know what you do or don't understand if
>> you don't ask any questions! There is no evidence that you guys have read a
>> single line!
> 
> I have and I have also tested it, finding it not working. As reported.
> 
> However, I have also told you that I am having a *hard time* reviewing
> it, because it implements both blkmq and CMDQ in the same patch, so the
> code changes get complex.
> 
>>
>> So, what are your plans for the patches?  What don't you understand?
> 
> I have told you this several time, so has Linus and Bartlomiej.
> 
> If you can split it up such that the blkmq support comes first, then I
> can review/test and pick it up.

I have done the split, but of course the code is just the same.


[PATCH] update bucket_in_use in real time

2017-10-24 Thread tang . junhui
From: Tang Junhui 

bucket_in_use is updated in the gc thread, which is triggered by
invalidating or writing sectors_to_gc dirty data. That is a long interval,
so when we use it to compare against the threshold it is often stale, which
leads to inaccurate judgment and often results in bucket depletion.

We sent a patch before that updated bucket_in_use periodically in the gc
thread, which Coly thought would lead to high latency. In this patch, we
add avail_nbuckets to record the count of available buckets, and we
recalculate bucket_in_use when allocating or freeing a bucket, in real
time.

Signed-off-by: Tang Junhui 
---
 drivers/md/bcache/alloc.c  | 10 ++
 drivers/md/bcache/bcache.h |  1 +
 drivers/md/bcache/btree.c  | 17 ++---
 drivers/md/bcache/btree.h  |  1 +
 4 files changed, 22 insertions(+), 7 deletions(-)
 mode change 100644 => 100755 drivers/md/bcache/alloc.c
 mode change 100644 => 100755 drivers/md/bcache/bcache.h
 mode change 100644 => 100755 drivers/md/bcache/btree.c
 mode change 100644 => 100755 drivers/md/bcache/btree.h

diff --git a/drivers/md/bcache/alloc.c b/drivers/md/bcache/alloc.c
old mode 100644
new mode 100755
index ca4abe1..89a5e35
--- a/drivers/md/bcache/alloc.c
+++ b/drivers/md/bcache/alloc.c
@@ -439,6 +439,11 @@ long bch_bucket_alloc(struct cache *ca, unsigned reserve, 
bool wait)
b->prio = INITIAL_PRIO;
}
 
+   if(ca->set->avail_nbuckets > 0) {
+   ca->set->avail_nbuckets--;
+   bch_update_bucket_in_use(ca->set);
+   }
+
return r;
 }
 
@@ -446,6 +451,11 @@ void __bch_bucket_free(struct cache *ca, struct bucket *b)
 {
SET_GC_MARK(b, 0);
SET_GC_SECTORS_USED(b, 0);
+   
+   if(ca->set->avail_nbuckets < ca->set->nbuckets) {
+   ca->set->avail_nbuckets++;
+   bch_update_bucket_in_use(ca->set);
+   }
 }
 
 void bch_bucket_free(struct cache_set *c, struct bkey *k)
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
old mode 100644
new mode 100755
index dee542f..275b29c
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -580,6 +580,7 @@ struct cache_set {
uint8_t need_gc;
struct gc_stat  gc_stats;
size_t  nbuckets;
+   size_t  avail_nbuckets;
 
struct task_struct  *gc_thread;
/* Where in the btree gc currently is */
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
old mode 100644
new mode 100755
index 866dcf7..1ccf0c3
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -1240,6 +1240,11 @@ void bch_initial_mark_key(struct cache_set *c, int 
level, struct bkey *k)
__bch_btree_mark_key(c, level, k);
 }
 
+void bch_update_bucket_in_use(struct cache_set *c)
+{
+   c->gc_stats.in_use = (c->nbuckets - c->avail_nbuckets) * 100 / 
c->nbuckets;
+}
+
 static bool btree_gc_mark_node(struct btree *b, struct gc_stat *gc)
 {
uint8_t stale = 0;
@@ -1651,9 +1656,8 @@ static void btree_gc_start(struct cache_set *c)
mutex_unlock(>bucket_lock);
 }
 
-static size_t bch_btree_gc_finish(struct cache_set *c)
+static void bch_btree_gc_finish(struct cache_set *c)
 {
-   size_t available = 0;
struct bucket *b;
struct cache *ca;
unsigned i;
@@ -1690,6 +1694,7 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
}
rcu_read_unlock();
 
+   c->avail_nbuckets = 0;
for_each_cache(ca, c, i) {
uint64_t *i;
 
@@ -1711,18 +1716,16 @@ static size_t bch_btree_gc_finish(struct cache_set *c)
BUG_ON(!GC_MARK(b) && GC_SECTORS_USED(b));
 
if (!GC_MARK(b) || GC_MARK(b) == GC_MARK_RECLAIMABLE)
-   available++;
+   c->avail_nbuckets++;
}
}
 
mutex_unlock(>bucket_lock);
-   return available;
 }
 
 static void bch_btree_gc(struct cache_set *c)
 {
int ret;
-   unsigned long available;
struct gc_stat stats;
struct closure writes;
struct btree_op op;
@@ -1745,14 +1748,14 @@ static void bch_btree_gc(struct cache_set *c)
pr_warn("gc failed!");
} while (ret);
 
-   available = bch_btree_gc_finish(c);
+   bch_btree_gc_finish(c);
wake_up_allocators(c);
 
bch_time_stats_update(>btree_gc_time, start_time);
 
stats.key_bytes *= sizeof(uint64_t);
stats.data  <<= 9;
-   stats.in_use= (c->nbuckets - available) * 100 / c->nbuckets;
+   stats.in_use= (c->nbuckets - c->avail_nbuckets) * 100 / c->nbuckets;
memcpy(>gc_stats, , sizeof(struct gc_stat));
 
trace_bcache_gc_end(c);
diff --git a/drivers/md/bcache/btree.h b/drivers/md/bcache/btree.h
old mode 100644
new mode 100755
index 73da1f5..dabcde8
--- 

[PATCH V12 0/5] mmc: Add Command Queue support

2017-10-24 Thread Adrian Hunter
Hi

Here is V12 of the hardware command queue patches without the software
command queue patches, now using blk-mq and now with blk-mq support for
non-CQE I/O.

HW CMDQ offers 25% - 50% better random multi-threaded I/O.  I see a slight
2% drop in sequential read speed but no change to sequential write.

Non-CQE blk-mq showed a 3% decrease in sequential read performance.  This
seemed to be coming from the inferior latency of running work items compared
with a dedicated thread.  Hacking blk-mq workqueue to be unbound reduced the
performance degradation from 3% to 1%.
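
For context, the "unbound" hack presumably amounts to a one-line change in
how the block layer allocates its dispatch workqueue; a sketch, assuming
kblockd is the workqueue whose scheduling latency is at issue:

#include <linux/workqueue.h>

static struct workqueue_struct *kblockd_workqueue;

static int __init blk_wq_sketch_init(void)
{
	/* WQ_UNBOUND lets work items run on any idle CPU right away,
	 * trading cache locality for lower dispatch latency */
	kblockd_workqueue = alloc_workqueue("kblockd",
			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
	return kblockd_workqueue ? 0 : -ENOMEM;
}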

While we should look at changing blk-mq to give better workqueue performance,
a bigger gain is likely to be made by adding a new host API to enable the
next already-prepared request to be issued directly from within ->done()
callback of the current request.


Changes since V11:
  Split "mmc: block: Add CQE and blk-mq support" into 2 patches

Changes since V10:
  mmc: core: Remove unnecessary host claim
  mmc: core: Introduce host claiming by context
  mmc: core: Add support for handling CQE requests
  mmc: mmc: Enable Command Queuing
  mmc: mmc: Enable CQE's
  mmc: block: Use local variables in mmc_blk_data_prep()
  mmc: block: Prepare CQE data
  mmc: block: Factor out mmc_setup_queue()
  mmc: core: Add parameter use_blk_mq
  mmc: core: Export mmc_start_bkops()
  mmc: core: Export mmc_start_request()
  mmc: core: Export mmc_retune_hold_now() and mmc_retune_release()
Dropped because they have been applied
  mmc: block: Add CQE and blk-mq support
Extend blk-mq support for asynchronous read / writes to all host
controllers including those that require polling. The direct
completion path is still available but depends on a new capability
flag.
Drop blk-mq support for synchronous read / writes.

Venkat Gopalakrishnan (1):
  mmc: cqhci: support for command queue enabled host

Changes since V9:
  mmc: block: Add CQE and blk-mq support
- reinstate mq support for REQ_OP_DRV_IN/OUT that was removed because
it was incorrectly assumed to be handled by the rpmb character device
- don't check for rpmb block device anymore
  mmc: cqhci: support for command queue enabled host
Fix cqhci_set_irqs() as per Haibo Chen

Changes since V8:
Re-based
  mmc: core: Introduce host claiming by context
Slightly simplified as per Ulf
  mmc: core: Export mmc_retune_hold_now() and mmc_retune_release()
New patch.
  mmc: block: Add CQE and blk-mq support
Fix missing ->post_req() on the error path

Changes since V7:
Re-based
  mmc: core: Introduce host claiming by context
Slightly simplified
  mmc: core: Add parameter use_blk_mq
New patch.
  mmc: core: Remove unnecessary host claim
New patch.
  mmc: core: Export mmc_start_bkops()
New patch.
  mmc: core: Export mmc_start_request()
New patch.
  mmc: block: Add CQE and blk-mq support
Add blk-mq support for non_CQE requests

Changes since V6:
  mmc: core: Introduce host claiming by context
New patch.
  mmc: core: Move mmc_start_areq() declaration
Dropped because it has been applied
  mmc: block: Fix block status codes
Dropped because it has been applied
  mmc: host: Add CQE interface
Dropped because it has been applied
  mmc: core: Turn off CQE before sending commands
Dropped because it has been applied
  mmc: block: Factor out mmc_setup_queue()
New patch.
  mmc: block: Add CQE support
Drop legacy support and add blk-mq support

Changes since V5:
Re-based
  mmc: core: Add mmc_retune_hold_now()
Dropped because it has been applied
  mmc: core: Add members to mmc_request and mmc_data for CQE's
Dropped because it has been applied
  mmc: core: Move mmc_start_areq() declaration
New patch at Ulf's request
  mmc: block: Fix block status codes
Another un-related patch
  mmc: host: Add CQE interface
Move recovery_notifier() callback to struct mmc_request
  mmc: core: Add support for handling CQE requests
Roll __mmc_cqe_request_done() into mmc_cqe_request_done()
Move function declarations requested by Ulf
  mmc: core: Remove unused MMC_CAP2_PACKED_CMD
Dropped because it has been applied
  mmc: block: Add CQE support
Add explanation to commit message
Adjustment for changed recovery_notifier() callback
  mmc: cqhci: support for command queue enabled host
Adjustment for changed recovery_notifier() callback
  mmc: sdhci-pci: Add CQHCI support for Intel GLK
Add DCMD capability for Intel controllers except GLK

Changes since V4:
  mmc: core: Add mmc_retune_hold_now()
Add explanation to commit message.
  mmc: host: Add CQE interface
Add 

[PATCH V12 1/5] mmc: core: Add parameter use_blk_mq

2017-10-24 Thread Adrian Hunter
Until mmc has blk-mq support fully implemented and tested, add a
parameter use_blk_mq, defaulting to false unless the config option
MMC_MQ_DEFAULT is selected.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/Kconfig  | 11 +++
 drivers/mmc/core/core.c  |  7 +++
 drivers/mmc/core/core.h  |  2 ++
 drivers/mmc/core/host.c  |  2 ++
 drivers/mmc/core/host.h  |  4 
 include/linux/mmc/host.h |  1 +
 6 files changed, 27 insertions(+)

diff --git a/drivers/mmc/Kconfig b/drivers/mmc/Kconfig
index ec21388311db..98202934bd29 100644
--- a/drivers/mmc/Kconfig
+++ b/drivers/mmc/Kconfig
@@ -12,6 +12,17 @@ menuconfig MMC
  If you want MMC/SD/SDIO support, you should say Y here and
  also to your specific host controller driver.
 
+config MMC_MQ_DEFAULT
+   bool "MMC: use blk-mq I/O path by default"
+   depends on MMC && BLOCK
+   ---help---
+ This option enables the new blk-mq based I/O path for MMC block
+ devices by default.  With the option the mmc_core.use_blk_mq
+ module/boot option defaults to Y, without it to N, but it can
+ still be overridden either way.
+
+ If unsure say N.
+
 if MMC
 
 source "drivers/mmc/core/Kconfig"
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 1f0f44f4dd5f..5df03cb73be7 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -66,6 +66,13 @@
 bool use_spi_crc = 1;
 module_param(use_spi_crc, bool, 0);
 
+#ifdef CONFIG_MMC_MQ_DEFAULT
+bool mmc_use_blk_mq = true;
+#else
+bool mmc_use_blk_mq = false;
+#endif
+module_param_named(use_blk_mq, mmc_use_blk_mq, bool, S_IWUSR | S_IRUGO);
+
 static int mmc_schedule_delayed_work(struct delayed_work *work,
 unsigned long delay)
 {
diff --git a/drivers/mmc/core/core.h b/drivers/mmc/core/core.h
index 71e6c6d7ceb7..8c5dd8d31400 100644
--- a/drivers/mmc/core/core.h
+++ b/drivers/mmc/core/core.h
@@ -35,6 +35,8 @@ struct mmc_bus_ops {
int (*reset)(struct mmc_host *);
 };
 
+extern bool mmc_use_blk_mq;
+
 void mmc_attach_bus(struct mmc_host *host, const struct mmc_bus_ops *ops);
 void mmc_detach_bus(struct mmc_host *host);
 
diff --git a/drivers/mmc/core/host.c b/drivers/mmc/core/host.c
index 35a9e4fd1a9f..62ef6cb0ece4 100644
--- a/drivers/mmc/core/host.c
+++ b/drivers/mmc/core/host.c
@@ -404,6 +404,8 @@ struct mmc_host *mmc_alloc_host(int extra, struct device 
*dev)
 
host->fixed_drv_type = -EINVAL;
 
+   host->use_blk_mq = mmc_use_blk_mq;
+
return host;
 }
 
diff --git a/drivers/mmc/core/host.h b/drivers/mmc/core/host.h
index fb689a1065ed..6eaf558e62d6 100644
--- a/drivers/mmc/core/host.h
+++ b/drivers/mmc/core/host.h
@@ -74,6 +74,10 @@ static inline bool mmc_card_hs400es(struct mmc_card *card)
return card->host->ios.enhanced_strobe;
 }
 
+static inline bool mmc_host_use_blk_mq(struct mmc_host *host)
+{
+   return host->use_blk_mq;
+}
 
 #endif
 
diff --git a/include/linux/mmc/host.h b/include/linux/mmc/host.h
index e7743eca1021..ce2075d6f429 100644
--- a/include/linux/mmc/host.h
+++ b/include/linux/mmc/host.h
@@ -380,6 +380,7 @@ struct mmc_host {
unsigned intdoing_retune:1; /* re-tuning in progress */
unsigned intretune_now:1;   /* do re-tuning at next req */
unsigned intretune_paused:1; /* re-tuning is temporarily 
disabled */
+   unsigned intuse_blk_mq:1;   /* use blk-mq */
 
int rescan_disable; /* disable card detection */
int rescan_entered; /* used with nonremovable 
devices */
-- 
1.9.1



[PATCH V12 5/5] mmc: sdhci-pci: Add CQHCI support for Intel GLK

2017-10-24 Thread Adrian Hunter
Add CQHCI initialization and implement CQHCI operations for Intel GLK.

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/host/Kconfig  |   1 +
 drivers/mmc/host/sdhci-pci-core.c | 155 +-
 2 files changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 3092b7085cb5..2b02a9788bb6 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -81,6 +81,7 @@ config MMC_SDHCI_BIG_ENDIAN_32BIT_BYTE_SWAPPER
 config MMC_SDHCI_PCI
tristate "SDHCI support on PCI bus"
depends on MMC_SDHCI && PCI
+   select MMC_CQHCI
help
  This selects the PCI Secure Digital Host Controller Interface.
  Most controllers found today are PCI devices.
diff --git a/drivers/mmc/host/sdhci-pci-core.c 
b/drivers/mmc/host/sdhci-pci-core.c
index 3e4f04fd5175..110c634cfb43 100644
--- a/drivers/mmc/host/sdhci-pci-core.c
+++ b/drivers/mmc/host/sdhci-pci-core.c
@@ -30,6 +30,8 @@
 #include 
 #include 
 
+#include "cqhci.h"
+
 #include "sdhci.h"
 #include "sdhci-pci.h"
 
@@ -116,6 +118,28 @@ int sdhci_pci_resume_host(struct sdhci_pci_chip *chip)
 
return 0;
 }
+
+static int sdhci_cqhci_suspend(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = cqhci_suspend(chip->slots[0]->host->mmc);
+   if (ret)
+   return ret;
+
+   return sdhci_pci_suspend_host(chip);
+}
+
+static int sdhci_cqhci_resume(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = sdhci_pci_resume_host(chip);
+   if (ret)
+   return ret;
+
+   return cqhci_resume(chip->slots[0]->host->mmc);
+}
 #endif
 
 #ifdef CONFIG_PM
@@ -166,8 +190,48 @@ static int sdhci_pci_runtime_resume_host(struct 
sdhci_pci_chip *chip)
 
return 0;
 }
+
+static int sdhci_cqhci_runtime_suspend(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = cqhci_suspend(chip->slots[0]->host->mmc);
+   if (ret)
+   return ret;
+
+   return sdhci_pci_runtime_suspend_host(chip);
+}
+
+static int sdhci_cqhci_runtime_resume(struct sdhci_pci_chip *chip)
+{
+   int ret;
+
+   ret = sdhci_pci_runtime_resume_host(chip);
+   if (ret)
+   return ret;
+
+   return cqhci_resume(chip->slots[0]->host->mmc);
+}
 #endif
 
+static u32 sdhci_cqhci_irq(struct sdhci_host *host, u32 intmask)
+{
+   int cmd_error = 0;
+   int data_error = 0;
+
+   if (!sdhci_cqe_irq(host, intmask, &cmd_error, &data_error))
+   return intmask;
+
+   cqhci_irq(host->mmc, intmask, cmd_error, data_error);
+
+   return 0;
+}
+
+static void sdhci_pci_dumpregs(struct mmc_host *mmc)
+{
+   sdhci_dumpregs(mmc_priv(mmc));
+}
+
 /*\
  *   *
  * Hardware specific quirk handling  *
@@ -583,6 +647,18 @@ static void sdhci_intel_voltage_switch(struct sdhci_host 
*host)
.voltage_switch = sdhci_intel_voltage_switch,
 };
 
+static const struct sdhci_ops sdhci_intel_glk_ops = {
+   .set_clock  = sdhci_set_clock,
+   .set_power  = sdhci_intel_set_power,
+   .enable_dma = sdhci_pci_enable_dma,
+   .set_bus_width  = sdhci_set_bus_width,
+   .reset  = sdhci_reset,
+   .set_uhs_signaling  = sdhci_set_uhs_signaling,
+   .hw_reset   = sdhci_pci_hw_reset,
+   .voltage_switch = sdhci_intel_voltage_switch,
+   .irq= sdhci_cqhci_irq,
+};
+
 static void byt_read_dsm(struct sdhci_pci_slot *slot)
 {
struct intel_host *intel_host = sdhci_pci_priv(slot);
@@ -612,12 +688,80 @@ static int glk_emmc_probe_slot(struct sdhci_pci_slot 
*slot)
 {
int ret = byt_emmc_probe_slot(slot);
 
+   slot->host->mmc->caps2 |= MMC_CAP2_CQE;
+
if (slot->chip->pdev->device != PCI_DEVICE_ID_INTEL_GLK_EMMC) {
slot->host->mmc->caps2 |= MMC_CAP2_HS400_ES,
slot->host->mmc_host_ops.hs400_enhanced_strobe =
intel_hs400_enhanced_strobe;
+   slot->host->mmc->caps2 |= MMC_CAP2_CQE_DCMD;
+   }
+
+   return ret;
+}
+
+static void glk_cqe_enable(struct mmc_host *mmc)
+{
+   struct sdhci_host *host = mmc_priv(mmc);
+   u32 reg;
+
+   /*
+* CQE gets stuck if it sees Buffer Read Enable bit set, which can be
+* the case after tuning, so ensure the buffer is drained.
+*/
+   reg = sdhci_readl(host, SDHCI_PRESENT_STATE);
+   while (reg & SDHCI_DATA_AVAILABLE) {
+   sdhci_readl(host, SDHCI_BUFFER);
+   reg = sdhci_readl(host, SDHCI_PRESENT_STATE);
+   }
+
+   sdhci_cqe_enable(mmc);
+}
+
+static const struct cqhci_host_ops glk_cqhci_ops = {
+   .enable 

[PATCH V12 3/5] mmc: block: Add CQE support

2017-10-24 Thread Adrian Hunter
Add CQE support to the block driver, including:
- optionally using DCMD for flush requests
- "manually" issuing discard requests
- issuing read / write requests to the CQE
- supporting block-layer timeouts
- handling recovery
- supporting re-tuning

CQE offers 25% - 50% better random multi-threaded I/O.  There is a slight
(e.g. 2%) drop in sequential read speed but no observable change to sequential
write.

CQE automatically sends the commands to complete requests.  However, it only
supports reads / writes and so-called "direct commands" (DCMD).  Furthermore,
DCMD is limited to one command at a time, but discards require 3 commands.
That makes issuing discards through CQE very awkward, and some CQEs don't
support DCMD anyway.  So for discards, the existing non-CQE approach is
taken, where the mmc core code issues the 3 commands one at a time, i.e.
mmc_erase().  Where DCMD is used is for issuing flushes.
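
As an illustrative aside, a hedged sketch (not part of the patch) of the
routing just described: flushes go out as DCMD when the host advertises
MMC_CAP2_CQE_DCMD, discards keep the existing synchronous mmc_erase() route
via mmc_blk_issue_discard_rq(), and reads/writes become CQE tasks.  The
helpers marked "hypothetical" are illustration names, not the patch's API:

    static void mmc_cqe_dispatch_sketch(struct mmc_queue *mq,
                                        struct request *req)
    {
            struct mmc_host *host = mq->card->host;

            switch (req_op(req)) {
            case REQ_OP_FLUSH:
                    if (host->caps2 & MMC_CAP2_CQE_DCMD)
                            cqe_issue_flush_dcmd(mq, req);   /* hypothetical: one DCMD at a time */
                    else
                            issue_flush_sync(mq, req);       /* hypothetical: synchronous flush */
                    break;
            case REQ_OP_DISCARD:
                    /* 3 commands per discard: stay on the non-CQE path */
                    mmc_blk_issue_discard_rq(mq, req);
                    break;
            default:
                    cqe_issue_rw(mq, req);                   /* hypothetical: read/write task */
                    break;
            }
    }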

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 150 ++-
 drivers/mmc/core/block.h |   2 +
 drivers/mmc/core/queue.c | 134 --
 drivers/mmc/core/queue.h |  15 +
 4 files changed, 294 insertions(+), 7 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index 002446e8dc5d..5a1fa678667b 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -112,6 +112,7 @@ struct mmc_blk_data {
 #define MMC_BLK_WRITE  BIT(1)
 #define MMC_BLK_DISCARDBIT(2)
 #define MMC_BLK_SECDISCARD BIT(3)
+#define MMC_BLK_CQE_RECOVERY   BIT(4)
 
/*
 * Only set in main mmc_blk_data associated
@@ -1713,6 +1714,138 @@ static void mmc_blk_data_prep(struct mmc_queue *mq, 
struct mmc_queue_req *mqrq,
*do_data_tag_p = do_data_tag;
 }
 
+#define MMC_CQE_RETRIES 2
+
+static void mmc_blk_cqe_complete_rq(struct mmc_queue *mq, struct request *req)
+{
+   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+   struct mmc_request *mrq = &mqrq->brq.mrq;
+   struct request_queue *q = req->q;
+   struct mmc_host *host = mq->card->host;
+   unsigned long flags;
+   bool put_card;
+   int err;
+
+   mmc_cqe_post_req(host, mrq);
+
+   if (mrq->cmd && mrq->cmd->error)
+   err = mrq->cmd->error;
+   else if (mrq->data && mrq->data->error)
+   err = mrq->data->error;
+   else
+   err = 0;
+
+   if (err) {
+   if (mqrq->retries++ < MMC_CQE_RETRIES)
+   blk_mq_requeue_request(req, true);
+   else
+   blk_mq_end_request(req, BLK_STS_IOERR);
+   } else if (mrq->data) {
+   if (blk_update_request(req, BLK_STS_OK, 
mrq->data->bytes_xfered))
+   blk_mq_requeue_request(req, true);
+   else
+   __blk_mq_end_request(req, BLK_STS_OK);
+   } else {
+   blk_mq_end_request(req, BLK_STS_OK);
+   }
+
+   spin_lock_irqsave(q->queue_lock, flags);
+
+   mq->in_flight[mmc_issue_type(mq, req)] -= 1;
+
+   put_card = mmc_tot_in_flight(mq) == 0;
+
+   mmc_cqe_check_busy(mq);
+
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (!mq->cqe_busy)
+   blk_mq_run_hw_queues(q, true);
+
+   if (put_card)
+   mmc_put_card(mq->card, &mq->ctx);
+}
+
+void mmc_blk_cqe_recovery(struct mmc_queue *mq)
+{
+   struct mmc_card *card = mq->card;
+   struct mmc_host *host = card->host;
+   int err;
+
+   pr_debug("%s: CQE recovery start\n", mmc_hostname(host));
+
+   err = mmc_cqe_recovery(host);
+   if (err)
+   mmc_blk_reset(mq->blkdata, host, MMC_BLK_CQE_RECOVERY);
+   else
+   mmc_blk_reset_success(mq->blkdata, MMC_BLK_CQE_RECOVERY);
+
+   pr_debug("%s: CQE recovery done\n", mmc_hostname(host));
+}
+
+static void mmc_blk_cqe_req_done(struct mmc_request *mrq)
+{
+   struct mmc_queue_req *mqrq = container_of(mrq, struct mmc_queue_req,
+ brq.mrq);
+   struct request *req = mmc_queue_req_to_req(mqrq);
+   struct request_queue *q = req->q;
+   struct mmc_queue *mq = q->queuedata;
+
+   /*
+* Block layer timeouts race with completions which means the normal
+* completion path cannot be used during recovery.
+*/
+   if (mq->in_recovery)
+   mmc_blk_cqe_complete_rq(mq, req);
+   else
+   blk_mq_complete_request(req);
+}
+
+static int mmc_blk_cqe_start_req(struct mmc_host *host, struct mmc_request 
*mrq)
+{
+   mrq->done   = mmc_blk_cqe_req_done;
+   mrq->recovery_notifier  = mmc_cqe_recovery_notifier;
+
+   return mmc_cqe_start_req(host, mrq);
+}
+
+static struct mmc_request *mmc_blk_cqe_prep_dcmd(struct mmc_queue_req *mqrq,
+

[PATCH V12 4/5] mmc: cqhci: support for command queue enabled host

2017-10-24 Thread Adrian Hunter
From: Venkat Gopalakrishnan 

This patch adds CMDQ support for command-queue compatible
hosts.

Command queue was added in the eMMC 5.1 specification. This
enables the controller to process up to 32 requests at
a time.
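
As an aside, those 32 outstanding requests correspond to the 32 task
descriptor slots the driver below defines (NUM_SLOTS), with the last slot
(DCMD_SLOT, 31) reserved for direct commands, so the usable read/write queue
depth is effectively one less whenever DCMD is in use.  A trivial sketch of
that accounting, assuming those definitions from cqhci.c:

    /* sketch only; mirrors DCMD_SLOT/NUM_SLOTS from cqhci.c below */
    static inline unsigned int cqe_rw_qdepth_sketch(bool dcmd_enabled)
    {
            return dcmd_enabled ? NUM_SLOTS - 1 : NUM_SLOTS;
    }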

Adrian Hunter contributed renaming to cqhci, recovery, suspend
and resume, cqhci_off, cqhci_wait_for_idle, and external timeout
handling.

Signed-off-by: Asutosh Das 
Signed-off-by: Sujit Reddy Thumma 
Signed-off-by: Konstantin Dorfman 
Signed-off-by: Venkat Gopalakrishnan 
Signed-off-by: Subhash Jadavani 
Signed-off-by: Ritesh Harjani 
Signed-off-by: Adrian Hunter 
---
 drivers/mmc/host/Kconfig  |   13 +
 drivers/mmc/host/Makefile |1 +
 drivers/mmc/host/cqhci.c  | 1150 +
 drivers/mmc/host/cqhci.h  |  240 ++
 4 files changed, 1404 insertions(+)
 create mode 100644 drivers/mmc/host/cqhci.c
 create mode 100644 drivers/mmc/host/cqhci.h

diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
index 567028c9219a..3092b7085cb5 100644
--- a/drivers/mmc/host/Kconfig
+++ b/drivers/mmc/host/Kconfig
@@ -857,6 +857,19 @@ config MMC_SUNXI
  This selects support for the SD/MMC Host Controller on
  Allwinner sunxi SoCs.
 
+config MMC_CQHCI
+   tristate "Command Queue Host Controller Interface support"
+   depends on HAS_DMA
+   help
+ This selects the Command Queue Host Controller Interface (CQHCI)
+ support present in host controllers of Qualcomm Technologies, Inc
+ amongst others.
+ This controller supports eMMC devices with command queue support.
+
+ If you have a controller with this interface, say Y or M here.
+
+ If unsure, say N.
+
 config MMC_TOSHIBA_PCI
tristate "Toshiba Type A SD/MMC Card Interface Driver"
depends on PCI
diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
index ab61a3e39c0b..de140e3ef402 100644
--- a/drivers/mmc/host/Makefile
+++ b/drivers/mmc/host/Makefile
@@ -91,6 +91,7 @@ obj-$(CONFIG_MMC_SDHCI_ST)+= sdhci-st.o
 obj-$(CONFIG_MMC_SDHCI_MICROCHIP_PIC32)+= sdhci-pic32.o
 obj-$(CONFIG_MMC_SDHCI_BRCMSTB)+= sdhci-brcmstb.o
 obj-$(CONFIG_MMC_SDHCI_OMAP)   += sdhci-omap.o
+obj-$(CONFIG_MMC_CQHCI)+= cqhci.o
 
 ifeq ($(CONFIG_CB710_DEBUG),y)
CFLAGS-cb710-mmc+= -DDEBUG
diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
new file mode 100644
index ..159270e947cf
--- /dev/null
+++ b/drivers/mmc/host/cqhci.c
@@ -0,0 +1,1150 @@
+/* Copyright (c) 2015, The Linux Foundation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 and
+ * only version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#include "cqhci.h"
+
+#define DCMD_SLOT 31
+#define NUM_SLOTS 32
+
+struct cqhci_slot {
+   struct mmc_request *mrq;
+   unsigned int flags;
+#define CQHCI_EXTERNAL_TIMEOUT BIT(0)
+#define CQHCI_COMPLETEDBIT(1)
+#define CQHCI_HOST_CRC BIT(2)
+#define CQHCI_HOST_TIMEOUT BIT(3)
+#define CQHCI_HOST_OTHER   BIT(4)
+};
+
+static inline u8 *get_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->desc_base + (tag * cq_host->slot_sz);
+}
+
+static inline u8 *get_link_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   u8 *desc = get_desc(cq_host, tag);
+
+   return desc + cq_host->task_desc_len;
+}
+
+static inline dma_addr_t get_trans_desc_dma(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->trans_desc_dma_base +
+   (cq_host->mmc->max_segs * tag *
+cq_host->trans_desc_len);
+}
+
+static inline u8 *get_trans_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   return cq_host->trans_desc_base +
+   (cq_host->trans_desc_len * cq_host->mmc->max_segs * tag);
+}
+
+static void setup_trans_desc(struct cqhci_host *cq_host, u8 tag)
+{
+   u8 *link_temp;
+   dma_addr_t trans_temp;
+
+   link_temp = get_link_desc(cq_host, tag);
+   trans_temp = get_trans_desc_dma(cq_host, tag);
+
+   memset(link_temp, 0, cq_host->link_desc_len);
+   if (cq_host->link_desc_len > 8)
+   *(link_temp + 8) = 0;
+
+   if (tag == DCMD_SLOT && (cq_host->mmc->caps2 & MMC_CAP2_CQE_DCMD)) {
+   *link_temp = CQHCI_VALID(0) 

[PATCH V12 2/5] mmc: block: Add blk-mq support

2017-10-24 Thread Adrian Hunter
Define and use a blk-mq queue. Discards and flushes are processed
synchronously, but reads and writes asynchronously. In order to support
slow DMA unmapping, DMA unmapping is not done until after the next request
is started. That means the request is not completed until then. If there is
no next request then the completion is done by queued work.
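
As an illustrative aside, a sketch (not part of the patch, all names
hypothetical) of the completion scheme just described: the finished request
is parked until the next one has been started, so its slow DMA unmap overlaps
the new transfer, and a work item completes it when nothing follows:

    struct mq_sketch {
            struct request          *complete_req;   /* done, DMA still mapped */
            struct work_struct      complete_work;   /* used when no request follows */
    };

    static void mq_issue_next_sketch(struct mq_sketch *mq, struct request *next)
    {
            struct request *prev = mq->complete_req;

            mq->complete_req = NULL;

            if (next)
                    start_transfer(next);              /* hypothetical: kick the HW first */
            else if (prev)
                    schedule_work(&mq->complete_work); /* complete from the work item */

            if (prev && next) {
                    unmap_dma(prev);                   /* hypothetical slow unmap */
                    blk_mq_end_request(prev, BLK_STS_OK);
            }
    }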

Signed-off-by: Adrian Hunter 
---
 drivers/mmc/core/block.c | 655 ++-
 drivers/mmc/core/block.h |  10 +
 drivers/mmc/core/queue.c | 302 --
 drivers/mmc/core/queue.h |  41 +++
 include/linux/mmc/host.h |   1 +
 5 files changed, 979 insertions(+), 30 deletions(-)

diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
index ea80ff4cd7f9..002446e8dc5d 100644
--- a/drivers/mmc/core/block.c
+++ b/drivers/mmc/core/block.c
@@ -1264,7 +1264,10 @@ static void mmc_blk_issue_drv_op(struct mmc_queue *mq, 
struct request *req)
break;
}
mq_rq->drv_op_result = ret;
-   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
+   if (req->mq_ctx)
+   blk_mq_end_request(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
+   else
+   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
 }
 
 static void mmc_blk_issue_discard_rq(struct mmc_queue *mq, struct request *req)
@@ -1307,7 +1310,10 @@ static void mmc_blk_issue_discard_rq(struct mmc_queue 
*mq, struct request *req)
else
mmc_blk_reset_success(md, type);
 fail:
-   blk_end_request(req, status, blk_rq_bytes(req));
+   if (req->mq_ctx)
+   blk_mq_end_request(req, status);
+   else
+   blk_end_request(req, status, blk_rq_bytes(req));
 }
 
 static void mmc_blk_issue_secdiscard_rq(struct mmc_queue *mq,
@@ -1377,7 +1383,10 @@ static void mmc_blk_issue_secdiscard_rq(struct mmc_queue 
*mq,
if (!err)
mmc_blk_reset_success(md, type);
 out:
-   blk_end_request(req, status, blk_rq_bytes(req));
+   if (req->mq_ctx)
+   blk_mq_end_request(req, status);
+   else
+   blk_end_request(req, status, blk_rq_bytes(req));
 }
 
 static void mmc_blk_issue_flush(struct mmc_queue *mq, struct request *req)
@@ -1387,7 +1396,10 @@ static void mmc_blk_issue_flush(struct mmc_queue *mq, 
struct request *req)
int ret = 0;
 
ret = mmc_flush_cache(card);
-   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
+   if (req->mq_ctx)
+   blk_mq_end_request(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
+   else
+   blk_end_request_all(req, ret ? BLK_STS_IOERR : BLK_STS_OK);
 }
 
 /*
@@ -1413,15 +1425,18 @@ static inline void mmc_apply_rel_rw(struct 
mmc_blk_request *brq,
}
 }
 
-#define CMD_ERRORS \
-   (R1_OUT_OF_RANGE |  /* Command argument out of range */ \
-R1_ADDRESS_ERROR | /* Misaligned address */\
+#define CMD_ERRORS_EXCL_OOR\
+   (R1_ADDRESS_ERROR | /* Misaligned address */\
 R1_BLOCK_LEN_ERROR |   /* Transferred block length incorrect */\
 R1_WP_VIOLATION |  /* Tried to write to protected block */ \
 R1_CARD_ECC_FAILED |   /* Card ECC failed */   \
 R1_CC_ERROR |  /* Card controller error */ \
 R1_ERROR)  /* General/unknown error */
 
+#define CMD_ERRORS \
+   (CMD_ERRORS_EXCL_OOR |  \
+R1_OUT_OF_RANGE)   /* Command argument out of range */ \
+
 static void mmc_blk_eval_resp_error(struct mmc_blk_request *brq)
 {
u32 val;
@@ -1766,6 +1781,632 @@ static void mmc_blk_rw_rq_prep(struct mmc_queue_req 
*mqrq,
mqrq->areq.err_check = mmc_blk_err_check;
 }
 
+#define MMC_MAX_RETRIES5
+#define MMC_DATA_RETRIES   2
+#define MMC_NO_RETRIES (MMC_MAX_RETRIES + 1)
+
+/* Single sector read during recovery */
+static void mmc_blk_ss_read(struct mmc_queue *mq, struct request *req)
+{
+   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
+   blk_status_t status;
+
+   while (1) {
+   mmc_blk_rw_rq_prep(mqrq, mq->card, 1, mq);
+
+   mmc_wait_for_req(mq->card->host, &mqrq->brq.mrq);
+
+   /*
+* Not expecting command errors, so just give up in that case.
+* If there are retries remaining, the request will get
+* requeued.
+*/
+   if (mqrq->brq.cmd.error)
+   return;
+
+   if (blk_rq_bytes(req) <= 512)
+   break;
+
+   status = mqrq->brq.data.error ? BLK_STS_IOERR : BLK_STS_OK;
+
+   blk_update_request(req, status, 512);
+   }
+
+   

Re: [PATCH 04/17] block: add a blk_steal_bios helper

2017-10-24 Thread Max Gurtovoy



On 10/23/2017 5:51 PM, Christoph Hellwig wrote:

This helper allows stealing the uncompleted bios from a request so
that they can be reissued on another path.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
---
  block/blk-core.c   | 20 
  include/linux/blkdev.h |  2 ++
  2 files changed, 22 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8c80f39f5fe..e804529e65a5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2767,6 +2767,26 @@ struct request *blk_fetch_request(struct request_queue 
*q)
  }
  EXPORT_SYMBOL(blk_fetch_request);
  
+/*

+ * Steal bios from a request.  The request must not have been partially
+ * completed before.
+ */


Maybe we can add to the comment that "list" is the destination for the 
stolen bio.



+void blk_steal_bios(struct bio_list *list, struct request *rq)
+{
+   if (rq->bio) {
+   if (list->tail)
+   list->tail->bi_next = rq->bio;
+   else
+   list->head = rq->bio;
+   list->tail = rq->biotail;


if list->tail != NULL don't we lose the "list->tail->bi_next = rq->bio;" 
assignment after assigning "list->tail = rq->biotail;" ?



+   }
+
+   rq->bio = NULL;


we can add this NULL assignment inside the big "if", but I'm not sure 
regarding the next 2 assignments.

Anyway not a big deal.


+   rq->biotail = NULL;
+   rq->__data_len = 0;
+}
+EXPORT_SYMBOL_GPL(blk_steal_bios);
+
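
As an aside on the question above: no assignment is lost, because the old
tail's ->bi_next is written while list->tail still points at it, and only
then is list->tail advanced to the request's last bio.  A commented
restatement of the splice (same logic as the patch, renamed to mark it as an
annotation rather than the patch itself):

    void blk_steal_bios_annotated(struct bio_list *list, struct request *rq)
    {
            if (rq->bio) {
                    if (list->tail)
                            list->tail->bi_next = rq->bio;  /* old tail -> first stolen bio */
                    else
                            list->head = rq->bio;           /* list was empty */
                    list->tail = rq->biotail;               /* new tail = last stolen bio */
            }

            /* detach the bios from the request unconditionally */
            rq->bio = NULL;
            rq->biotail = NULL;
            rq->__data_len = 0;
    }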




Re: [PATCH 17/17] nvme: also expose the namespace identification sysfs files for mpath nodes

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> We do this by adding a helper that returns the ns_head for a device that
> can belong to either the per-controller or per-subsystem block device
> nodes, and otherwise reuse all the existing code.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Keith Busch 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 69 
> +---
>  1 file changed, 42 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 22c06cd3bef0..334735db90c8 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -76,6 +76,7 @@ static DEFINE_IDA(nvme_instance_ida);
>  static dev_t nvme_chr_devt;
>  static struct class *nvme_class;
>  static struct class *nvme_subsys_class;
> +static const struct attribute_group nvme_ns_id_attr_group;
>  
>  static __le32 nvme_get_log_dw10(u8 lid, size_t size)
>  {
> @@ -329,6 +330,8 @@ static void nvme_free_ns_head(struct kref *ref)
>   struct nvme_ns_head *head =
>   container_of(ref, struct nvme_ns_head, ref);
>  
> + sysfs_remove_group(&disk_to_dev(head->disk)->kobj,
> +&nvme_ns_id_attr_group);
>   del_gendisk(head->disk);
>   blk_set_queue_dying(head->disk->queue);
>   /* make sure all pending bios are cleaned up */
> @@ -1925,7 +1928,7 @@ static struct nvme_subsystem 
> *__nvme_find_get_subsystem(const char *subsysnqn)
>  static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl 
> *id)
>  {
>   struct nvme_subsystem *subsys, *found;
> - int ret = -ENOMEM;
> + int ret;
>  
>   subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
>   if (!subsys)
> @@ -2267,12 +2270,22 @@ static ssize_t nvme_sysfs_rescan(struct device *dev,
>  }
>  static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
>  
> +static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
> +{
> + struct gendisk *disk = dev_to_disk(dev);
> +
> + if (disk->fops == &nvme_fops)
> + return nvme_get_ns_from_dev(dev)->head;
> + else
> + return disk->private_data;
> +}
> +
>  static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
> - char *buf)
> + char *buf)
>  {
> - struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> - struct nvme_ns_ids *ids = &ns->head->ids;
> - struct nvme_subsystem *subsys = ns->ctrl->subsys;
> + struct nvme_ns_head *head = dev_to_ns_head(dev);
> + struct nvme_ns_ids *ids = &head->ids;
> + struct nvme_subsystem *subsys = head->subsys;
>   int serial_len = sizeof(subsys->serial);
>   int model_len = sizeof(subsys->model);
>  
> @@ -2294,23 +2307,21 @@ static ssize_t wwid_show(struct device *dev, struct 
> device_attribute *attr,
>  
>   return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
>   serial_len, subsys->serial, model_len, subsys->model,
> - ns->head->ns_id);
> + head->ns_id);
>  }
>  static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
>  
>  static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
> -   char *buf)
> + char *buf)
>  {
> - struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> - return sprintf(buf, "%pU\n", ns->head->ids.nguid);
> + return sprintf(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
>  }
>  static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
>  
>  static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> - char *buf)
> + char *buf)
>  {
> - struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> - struct nvme_ns_ids *ids = &ns->head->ids;
> + struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
>  
>   /* For backward compatibility expose the NGUID to userspace if
>* we have no UUID set
> @@ -2325,22 +2336,20 @@ static ssize_t uuid_show(struct device *dev, struct 
> device_attribute *attr,
>  static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
>  
>  static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
> - char *buf)
> + char *buf)
>  {
> - struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> - return sprintf(buf, "%8phd\n", ns->head->ids.eui64);
> + return sprintf(buf, "%8phd\n", dev_to_ns_head(dev)->ids.eui64);
>  }
>  static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
>  
>  static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
> - char *buf)
> + char *buf)
>  {
> - struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
> - return 

Re: [PATCH 16/17] nvme: implement multipath access to nvme subsystems

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controllers path to the name space still exists
> inside the kernel, but is hidden from userspace.  The character device
> nodes are still available on a per-controller basis.  A new link from
> the sysfs directory for the subsystem allows to find all controllers
> for a given subsystem.
> 
> Currently we will always send I/O to the first available path, this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/core.c | 398 
> ---
>  drivers/nvme/host/nvme.h |  15 +-
>  drivers/nvme/host/pci.c  |   2 +
>  3 files changed, 355 insertions(+), 60 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1db26729bd89..22c06cd3bef0 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -102,6 +102,20 @@ static int nvme_reset_ctrl_sync(struct nvme_ctrl *ctrl)
>   return ret;
>  }
>  
> +static void nvme_failover_req(struct request *req)
> +{
> + struct nvme_ns *ns = req->q->queuedata;
> + unsigned long flags;
> +
> + spin_lock_irqsave(&ns->head->requeue_lock, flags);
> + blk_steal_bios(&ns->head->requeue_list, req);
> + spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
> + blk_mq_end_request(req, 0);
> +
> + nvme_reset_ctrl(ns->ctrl);
> + kblockd_schedule_work(&ns->head->requeue_work);
> +}
> +
>  static blk_status_t nvme_error_status(struct request *req)
>  {
>   switch (nvme_req(req)->status & 0x7ff) {
> @@ -129,6 +143,53 @@ static blk_status_t nvme_error_status(struct request 
> *req)
>   }
>  }
>  
> +static bool nvme_req_needs_failover(struct request *req)
> +{
> + if (!(req->cmd_flags & REQ_NVME_MPATH))
> + return false;
> +
> + switch (nvme_req(req)->status & 0x7ff) {
> + /*
> +  * Generic command status:
> +  */
> + case NVME_SC_INVALID_OPCODE:
> + case NVME_SC_INVALID_FIELD:
> + case NVME_SC_INVALID_NS:
> + case NVME_SC_LBA_RANGE:
> + case NVME_SC_CAP_EXCEEDED:
> + case NVME_SC_RESERVATION_CONFLICT:
> + return false;
> +
> + /*
> +  * I/O command set specific error.  Unfortunately these values are
> +  * reused for fabrics commands, but those should never get here.
> +  */
> + case NVME_SC_BAD_ATTRIBUTES:
> + case NVME_SC_INVALID_PI:
> + case NVME_SC_READ_ONLY:
> + case NVME_SC_ONCS_NOT_SUPPORTED:
> + WARN_ON_ONCE(nvme_req(req)->cmd->common.opcode ==
> + nvme_fabrics_command);
> + return false;
> +
> + /*
> +  * Media and Data Integrity Errors:
> +  */
> + case NVME_SC_WRITE_FAULT:
> + case NVME_SC_READ_ERROR:
> + case NVME_SC_GUARD_CHECK:
> + case NVME_SC_APPTAG_CHECK:
> + case NVME_SC_REFTAG_CHECK:
> + case NVME_SC_COMPARE_FAILED:
> + case NVME_SC_ACCESS_DENIED:
> + case NVME_SC_UNWRITTEN_BLOCK:
> + return false;
> + }
> +
> + /* Everything else could be a path failure, so should be retried */
> + return true;
> +}
> +
>  static inline bool nvme_req_needs_retry(struct request *req)
>  {
>   if (blk_noretry_request(req))
> @@ -143,6 +204,11 @@ static inline bool nvme_req_needs_retry(struct request 
> *req)
>  void nvme_complete_rq(struct request *req)
>  {
>   if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
> + if (nvme_req_needs_failover(req)) {
> + nvme_failover_req(req);
> + return;
> + }
> +
>   nvme_req(req)->retries++;
>   blk_mq_requeue_request(req, true);
>   return;
Sure this works? nvme_req_needs_retry() checks blk_noretry_request():

static inline bool nvme_req_needs_retry(struct request *req)
{
if (blk_noretry_request(req))
return false;

which has:
#define blk_noretry_request(rq) \
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))

The 

Re: [PATCH 15/17] nvme: track shared namespaces

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Introduce a new struct nvme_ns_head that holds information about an actual
> namespace, unlike struct nvme_ns, which only holds the per-controller
> namespace information.  For private namespaces there is a 1:1 relation of
> the two, but for shared namespaces this lets us discover all the paths to
> it.  For now only the identifiers are moved to the new structure, but most
> of the information in struct nvme_ns should eventually move over.
> 
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Keith Busch 
> Reviewed-by: Javier González 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 190 
> +--
>  drivers/nvme/host/lightnvm.c |  14 ++--
>  drivers/nvme/host/nvme.h |  21 -
>  3 files changed, 190 insertions(+), 35 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 12/17] nvme: check for a live controller in nvme_dev_open

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This is a much more sensible check than just the admin queue.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index a56a1e0432e7..df525ab42fcd 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -1891,7 +1891,7 @@ static int nvme_dev_open(struct inode *inode, struct 
> file *file)
>   struct nvme_ctrl *ctrl =
>   container_of(inode->i_cdev, struct nvme_ctrl, cdev);
>  
> - if (!ctrl->admin_q)
> + if (ctrl->state != NVME_CTRL_LIVE)
>   return -EWOULDBLOCK;
>   file->private_data = ctrl;
>   return 0;
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 11/17] nvme: get rid of nvme_ctrl_list

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Use the core chrdev code to set up the link between the character device
> and the nvme controller.  This allows us to get rid of the global list
> of all controllers, and also ensures that we have both a reference to
> the controller and the transport module before the open method of the
> character device is called.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 76 
> ++--
>  drivers/nvme/host/nvme.h |  3 +-
>  2 files changed, 18 insertions(+), 61 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 10/17] nvme: switch controller refcounting to use struct device

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Instead of allocating a separate struct device for the character device
> handle, embed it into struct nvme_ctrl and use it for the main controller
> refcounting.  This removes double refcounting and gets us an automatic
> reference for the character device operations.  We keep ctrl->device as a
> pointer for now to avoid chaning printks all over, but in the future we
> could look into message printing helpers that take a controller structure
> similar to what other subsystems do.
> 
> Note the delete_ctrl operation always already has a reference (either
> through sysfs due this change, or because every open file on the
> /dev/nvme-fabrics node has a reference) when it is entered now, so we
> don't need to do the unless_zero variant there.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/core.c   | 43 ++-
>  drivers/nvme/host/fc.c |  8 ++--
>  drivers/nvme/host/nvme.h   | 12 +++-
>  drivers/nvme/host/pci.c|  2 +-
>  drivers/nvme/host/rdma.c   |  5 ++---
>  drivers/nvme/target/loop.c |  2 +-
>  6 files changed, 39 insertions(+), 33 deletions(-)
Round of applause for this :-)

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 09/17] nvme: simplify nvme_open

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> Now that we are protected against lookup vs free races for the namespace
> by using kref_get_unless_zero we don't need the hack of NULLing out the
> disk private data during removal.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 40 ++--
>  1 file changed, 10 insertions(+), 30 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 08/17] nvme: use kref_get_unless_zero in nvme_find_get_ns

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> For kref_get_unless_zero to protect against lookup vs free races we need
> to use it in all places where we aren't guaranteed to already hold a
> reference.  There is no such guarantee in nvme_find_get_ns, so switch to
> kref_get_unless_zero in this function.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Johannes Thumshirn 
> ---
>  drivers/nvme/host/core.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 07/17] block: add a poll_fn callback to struct request_queue

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> So that we can also poll non blk-mq queues.  Mostly needed for
> the NVMe multipath code, but could also be useful elsewhere.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/blk-core.c | 11 +++
>  block/blk-mq.c   | 14 +-
>  drivers/nvme/target/io-cmd.c |  2 +-
>  fs/block_dev.c   |  4 ++--
>  fs/direct-io.c   |  2 +-
>  fs/iomap.c   |  2 +-
>  include/linux/blkdev.h   |  4 +++-
>  mm/page_io.c |  2 +-
>  8 files changed, 25 insertions(+), 16 deletions(-)
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 06/17] block: introduce GENHD_FL_HIDDEN

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> With this flag a driver can create a gendisk that can be used for I/O
> submission inside the kernel, but which is not registered as a
> user-facing block device.  This will be useful for the NVMe multipath
> implementation.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  block/genhd.c | 57 
> +++
>  include/linux/genhd.h |  1 +
>  2 files changed, 40 insertions(+), 18 deletions(-)
> 
Can we have some information in sysfs to figure out if a gendisk is
hidden? I'd hate having to do an inverse lookup in /proc/partitions;
it's always hard to prove that something is _not_ present.
And we already present various information (like disk_removable_show()),
so it's not without precedent.
And it would make integration with systemd/udev easier.
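
As an illustrative aside, a minimal sketch of the attribute being asked for,
assuming the GENHD_FL_HIDDEN flag introduced by this patch; it is not part of
the series as posted:

    static ssize_t disk_hidden_show(struct device *dev,
                                    struct device_attribute *attr, char *buf)
    {
            struct gendisk *disk = dev_to_disk(dev);

            return sprintf(buf, "%d\n", !!(disk->flags & GENHD_FL_HIDDEN));
    }
    static DEVICE_ATTR(hidden, S_IRUGO, disk_hidden_show, NULL);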

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 05/17] block: don't look at the struct device dev_t in disk_devt

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> The hidden gendisks introduced in the next patch need to keep the dev
> field in their struct device empty so that udev won't try to create
> block device nodes for them.  To support that rewrite disk_devt to
> look at the major and first_minor fields in the gendisk itself instead
> of looking into the struct device.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Johannes Thumshirn 
> ---
>  block/genhd.c | 4 
>  include/linux/genhd.h | 2 +-
>  2 files changed, 1 insertion(+), 5 deletions(-)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 04/17] block: add a blk_steal_bios helper

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This helper allows stealing the uncompleted bios from a request so
> that they can be reissued on another path.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> ---
>  block/blk-core.c   | 20 
>  include/linux/blkdev.h |  2 ++
>  2 files changed, 22 insertions(+)
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH 03/17] block: provide a direct_make_request helper

2017-10-24 Thread Hannes Reinecke
On 10/23/2017 04:51 PM, Christoph Hellwig wrote:
> This helper allows reinserting a bio into a new queue without much
> overhead, but requires all queue limits to be the same for the upper
> and lower queues, and it does not provide any recursion prevention.
> 
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Sagi Grimberg 
> ---
>  block/blk-core.c   | 34 ++
>  include/linux/blkdev.h |  1 +
>  2 files changed, 35 insertions(+)
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes Reinecke    Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH V8 00/14] mmc: Add Command Queue support

2017-10-24 Thread Adrian Hunter
On 24/10/17 08:37, Ulf Hansson wrote:
> + Bartlomiej
> 
> [...]
> 
>> So my conclusion is, let's start a as you suggested, by not completing
>> the request in ->done() as to maintain existing behavior. Then we can
>> address optimizations on top, which very likely will involve doing
>> changes to host drivers as well.
>
> Have you tested the latest version now?
>

 Ping?
>>>
>>> Still ping?
>>
>> How is your silence in any way an acceptable way to execute your
>> responsibilities as maintainer!
> 
> Seriously? You posted the new version Oct 13.

When you constantly go silent it looks like you are deliberately delaying
the patches - which you have a long history of doing.  Are you deliberately
delaying again?

What are you plans for the patches?

Why do you think mmc work is so unimportant it can be left to whenever you
deign to get around to it?

> 
> I had to make some late minute travel decisions, so unfortunate I
> won't be able to test this on HW from this Friday.
> 
> However, you have completely ignored mine, Linus and Bartlomiej's
> comments that we want the blk-mq port to be a separate patch (or patches)
> and then make the CMDQ patches on top. This worries me, because it
> seems like our messages don't reach you.

Rubbish!  I gave a very good reason for keeping the CQE code in - it is
designed to work together.  I also pointed out that it is trivial to see the
CQE code and that it is all '+' lines anyway.

But not one question in response!  Where is a single example of why it is
difficult like it is.  Where are the questions!  Not even a request for
documentation!  How am I supposed to know what you do or don't understand if
you don't ask any questions! There is no evidence that you guys have read a
single line!

So, what are your plans for the patches?  What don't you understand?