Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Hannes Reinecke
Hi all,

[ .. ]
>>
>> Could you share your patch for enabling global_tags/MQ on
>> megaraid_sas so that I can reproduce your test?
>>
>>> See the perf top data below. "bt_iter" is consuming 4 times more CPU.
>>
>> Could you share what the IOPS/CPU utilization effect is after applying
>> the patch V2? And your test script?
> Regarding CPU utilization, I need to test one more time. Currently the
> system is in use.
> 
> I ran the below fio test on a total of 24 expander-attached SSDs.
> 
> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64 --bs=4k
> --ioengine=libaio --rw=randread
> 
> Performance dropped from 1.6 M IOPs to 770K IOPs.
> 
This is basically what we've seen with earlier iterations.

>>
>> In theory, it shouldn't, because the HBA only supports HBA wide tags,
>> that means the allocation has to share a HBA wide sbitmap no matter if
>> global tags is used or not.
>>
>> Anyway, I will take a look at the performance test and data.
>>
>>
>> Thanks,
>> Ming
> 
> 
> Megaraid_sas version of shared tag set.
> 
Whee; thanks for that.

I've just finished a patchset moving megaraid_sas_fusion to embedded
commands (and cutting down the size of 'struct megasas_cmd_fusion' by
half :-), so that will come in handy.

Will give it a spin.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: WARNING in kmalloc_slab (3)

2018-02-06 Thread Dmitry Vyukov
On Tue, Dec 12, 2017 at 10:22 PM, Eric Biggers  wrote:
> On Mon, Dec 04, 2017 at 12:26:32PM +0300, Dan Carpenter wrote:
>> On Mon, Dec 04, 2017 at 09:18:05AM +0100, Dmitry Vyukov wrote:
>> > On Mon, Dec 4, 2017 at 9:14 AM, Dan Carpenter  
>> > wrote:
>> > > On Sun, Dec 03, 2017 at 12:16:08PM -0800, Eric Biggers wrote:
>> > >> Looks like BLKTRACESETUP doesn't limit the '.buf_nr' parameter, 
>> > >> allowing anyone
>> > >> who can open a block device to cause an extremely large kmalloc.  
>> > >> Here's a
>> > >> simplified reproducer:
>> > >>
>> > >
>> > > There are lots of places which allow people to allocate as much as they
>> > > want.  With Syzcaller, you might want to just hard code a __GFP_NOWARN
>> > > in to disable it.
>> >
>> > Hi,
>> >
>> > Hard code it where?
>>
>> My idea was to just make warn_alloc() a no-op.
>>
>> >
>> > User-controllable allocations are supposed to use __GFP_NOWARN.
>>
>> No that's not right.  What we don't want is unprivileged users to use
>> all the memory and we don't want unprivileged users to spam
>> /var/log/messages.  But you have to have slightly elevated permissions
>> to open block devices right?  The warning is helpful.  Admins should
>> "don't do that" if they don't want the warning.
>
> WARN_ON() should only be used for kernel bugs.  printk can be a different 
> story.
> If it's a "userspace shouldn't do this" kind of thing, then if there is any
> message at all it should be a rate-limited printk that actually explains what
> the problem is, not a random WARN_ON() that can only be interpreted by kernel
> developers.
>
> And yes, the fact that anyone with read access to any block device, even
> e.g. a loop device, can cause the kernel to do an unbounded kmalloc *is*
> a bug.  It
> needs to have a reasonable limit.  It is not a problem on all systems, but on
> some systems "the admin" might give users read access to some block devices.
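For illustration, the kind of bounded, self-explanatory check being argued
for might look like this (a hypothetical sketch with a made-up limit; the
actual fix referenced below instead caps the allocation at KMALLOC_MAX_SIZE
inside kernel/relay.c):

#include <linux/printk.h>

static int validate_buf_nr(unsigned int buf_nr)
{
	const unsigned int max_buf_nr = 4096;	/* made-up limit for the example */

	if (buf_nr == 0 || buf_nr > max_buf_nr) {
		/* rate-limited and explains the problem, instead of a WARN_ON() */
		pr_warn_ratelimited("blktrace: buf_nr %u out of range (1..%u)\n",
				    buf_nr, max_buf_nr);
		return -EINVAL;
	}
	return 0;
}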



#syz fix: kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE


[PATCH] blk-mq-debugfs: Show more request state information

2018-02-06 Thread Bart Van Assche
Since commit 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages
from blk-mq") blk_rq_is_complete() only reports whether or not a
request has completed for legacy queues. Hence modify the
blk-mq-debugfs code such that it shows the blk-mq request state
again.

Fixes: 634f9e4631a8 ("blk-mq: remove REQ_ATOM_COMPLETE usages from blk-mq")
Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
---
 block/blk-mq-debugfs.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 43084d64aa41..c62a3704515b 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -294,6 +294,13 @@ static const char *const rqf_name[] = {
 };
 #undef RQF_NAME
 
+static const char *blk_mq_rq_state_name[4] = {
+	[MQ_RQ_IDLE]		= "idle",
+	[MQ_RQ_IN_FLIGHT]	= "in_flight",
+	[MQ_RQ_COMPLETE]	= "complete",
+	[3]			= "(?)",
+};
+
 int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
 {
const struct blk_mq_ops *const mq_ops = rq->q->mq_ops;
@@ -310,7 +317,7 @@ int __blk_mq_debugfs_rq_show(struct seq_file *m, struct request *rq)
seq_puts(m, ", .rq_flags=");
blk_flags_show(m, (__force unsigned int)rq->rq_flags, rqf_name,
   ARRAY_SIZE(rqf_name));
-   seq_printf(m, ", complete=%d", blk_rq_is_complete(rq));
+   seq_printf(m, ", .state=%s", blk_mq_rq_state_name[blk_mq_rq_state(rq)]);
seq_printf(m, ", .tag=%d, .internal_tag=%d", rq->tag,
   rq->internal_tag);
if (mq_ops->show_rq)
-- 
2.16.1



[PATCH v2] blk-mq: Fix race between resetting the timer and completion handling

2018-02-06 Thread Bart Van Assche
The following race can occur between the code that resets the timer
and completion handling:
- The code that handles BLK_EH_RESET_TIMER resets aborted_gstate.
- A completion occurs and blk_mq_complete_request() calls
  __blk_mq_complete_request().
- The timeout code calls blk_add_timer() and that function sets the
  request deadline and adjusts the timer.
- __blk_mq_complete_request() frees the request tag.
- The timer fires and the timeout handler gets called for a freed
  request.

Since I have not found any approach to fix this race that does not
require the use of atomic instructions to manipulate the request state,
fix this race as follows:
- Introduce a spinlock to protect request state and deadline changes.
- Use the deadline instead of the request generation to detect whether
  or not a request timer got fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
  of the lowest two bits of 'gstate' (see the sketch after this list).
- Remove all request member variables that became superfluous due to
  this change: gstate, aborted_gstate, gstate_seq and aborted_gstate_sync.
- Remove the state information that became superfluous due to this patch,
  namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.
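A minimal sketch of that encoding, with assumed helper names (the real
patch may differ in detail):

/* Assumed sketch: the two low bits of rq->__deadline hold the blk-mq
 * request state, the remaining bits hold the jiffies deadline. */
#define RQ_STATE_MASK	0x3UL

static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
{
	return rq->__deadline & RQ_STATE_MASK;
}

static inline void blk_mq_rq_set_state(struct request *rq,
				       enum mq_rq_state state)
{
	rq->__deadline = (rq->__deadline & ~RQ_STATE_MASK) | state;
}

static inline unsigned long blk_rq_deadline(struct request *rq)
{
	return rq->__deadline & ~RQ_STATE_MASK;
}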

This patch fixes the following kernel crash:

BUG: unable to handle kernel NULL pointer dereference at (null)
Oops: [#1] PREEMPT SMP
CPU: 2 PID: 151 Comm: kworker/2:1H Tainted: G W 4.15.0-dbg+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
Workqueue: kblockd blk_mq_timeout_work
RIP: 0010:scsi_times_out+0x17/0x2c0 [scsi_mod]
Call Trace:
blk_mq_terminate_expired+0x42/0x80
bt_iter+0x3d/0x50
blk_mq_queue_tag_busy_iter+0xe9/0x200
blk_mq_timeout_work+0x181/0x2e0
process_one_work+0x21c/0x6d0
worker_thread+0x35/0x380
kthread+0x117/0x130
ret_from_fork+0x24/0x30

Fixes: 1d9bd5161ba3 ("blk-mq: replace timeout synchronization with a RCU and generation based scheme")
Signed-off-by: Bart Van Assche 
Cc: Tejun Heo 
---
 block/blk-core.c   |   3 +-
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c | 178 +++--
 block/blk-mq.h |  25 ++-
 block/blk-timeout.c|   1 -
 block/blk.h|   4 +-
 include/linux/blkdev.h |  28 ++--
 7 files changed, 53 insertions(+), 187 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d0d104268f1a..2e3cf04f12a8 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -126,8 +126,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->start_time = jiffies;
set_start_time_ns(rq);
rq->part = NULL;
-   seqcount_init(&rq->gstate_seq);
-   u64_stats_init(&rq->aborted_gstate_sync);
+   spin_lock_init(&rq->state_lock);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 21cbc1f071c6..43084d64aa41 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -290,7 +290,6 @@ static const char *const rqf_name[] = {
RQF_NAME(STATS),
RQF_NAME(SPECIAL_PAYLOAD),
RQF_NAME(ZONE_WRITE_LOCKED),
-   RQF_NAME(MQ_TIMEOUT_EXPIRED),
RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index df93102e2149..e17a2536ac50 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -309,7 +309,6 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
rq->special = NULL;
/* tag was already set */
rq->extra_len = 0;
-   rq->__deadline = 0;
 
INIT_LIST_HEAD(&rq->timeout_list);
rq->timeout = 0;
@@ -531,8 +530,7 @@ static void __blk_mq_complete_request(struct request *rq)
bool shared = false;
int cpu;
 
-   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+   WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
if (rq->internal_tag != -1)
blk_mq_sched_completed_request(rq);
@@ -581,34 +579,26 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old: Old request state.
+ * @new: New request state.
+ */
+static bool blk_mq_change_rq_state(struct request *rq, enum mq_rq_state old,
+  enum mq_rq_state new)
 {
unsigned long flags;
+   bool changed_state = false;
 
-   /*
-* blk_mq_rq_aborted_gstate() is used from the completion path and
-* can thus be called from irq 
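(The archived patch is truncated here. Based on the declarations above, the
body of blk_mq_change_rq_state() presumably ends roughly like the following
sketch, assuming the rq->state_lock introduced earlier in the patch; the
helper name blk_mq_rq_update_state is taken from the hunk above:)

	spin_lock_irqsave(&rq->state_lock, flags);
	if (blk_mq_rq_state(rq) == old) {
		blk_mq_rq_update_state(rq, new);	/* assumed helper */
		changed_state = true;
	}
	spin_unlock_irqrestore(&rq->state_lock, flags);

	return changed_state;
}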

Re: [PATCH V2 5/8] scsi: introduce force_blk_mq

2018-02-06 Thread Ming Lei
On Tue, Feb 06, 2018 at 12:20:43PM -0800, Omar Sandoval wrote:
> On Mon, Feb 05, 2018 at 11:20:32PM +0800, Ming Lei wrote:

...
> > shost->use_blk_mq = scsi_use_blk_mq;
> 
> Not sure if this is a patch formatting issue, but this old line wasn't
> deleted.

Good catch, the old line needs to be removed.

Thanks,
Ming


Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-06 Thread Ming Lei
On Tue, Feb 06, 2018 at 04:18:20PM -0700, Jens Axboe wrote:
> On 2/5/18 8:20 AM, Ming Lei wrote:
...
> 
> GLOBAL implies that it's, strangely enough, global. That isn't really the
> case. Why not call this BLK_MQ_F_HOST_TAGS or something like that? I'd
> welcome better names, but global doesn't seem to be a great choice.
> 
> BLK_MQ_F_SET_TAGS?

Good point, I am fine with either BLK_MQ_F_HOST_TAGS or BLK_MQ_F_SET_TAGS,
will update in V3.

Thanks,
Ming


Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-06 Thread Ming Lei
On Tue, Feb 06, 2018 at 12:33:36PM -0800, Omar Sandoval wrote:
> On Mon, Feb 05, 2018 at 11:20:29PM +0800, Ming Lei wrote:
..
> >  
> > +   /* need to restart all hw queues for global tags */
> > +   if (hctx->flags & BLK_MQ_F_GLOBAL_TAGS) {
> > +   struct blk_mq_hw_ctx *hctx2;
> > +   int i;
> > +
> > +   queue_for_each_hw_ctx(hctx->queue, hctx2, i)
> > +   if (blk_mq_run_hw_queue(hctx2, true))
> > +   return true;
> 
> Is it intentional that we stop after the first hw queue does work? That
> seems fine but it's a little confusing because the comment claims we
> restart everything.

Good catch, will update comment in V3.

Thanks,
Ming


Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-06 Thread Jens Axboe
On 2/5/18 8:20 AM, Ming Lei wrote:
> Quite a few HBAs(such as HPSA, megaraid, mpt3sas, ..) support multiple
> reply queues, but tags is often HBA wide.
> 
> These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> for automatic affinity assignment.
> 
> Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> for some irq vectors, this can't be avoided even though the allocation
> is improved.
> 
> So all these drivers have to avoid to ask HBA to complete request in
> reply queue which hasn't online CPUs assigned, and HPSA has been broken
> with v4.15+:
> 
>   https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> 
> This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
> hw queue by mapping each reply queue into hctx, but one tricky thing is
> the HBA wide(instead of hw queue wide) tags.
> 
> This patch is based on the following Hannes's patch:
> 
>   https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> One big difference with Hannes's is that this patch only makes the tags 
> sbitmap
> and active_queues data structure HBA wide, and others are kept as NUMA 
> locality,
> such as request, hctx, tags, ...
> 
> The following patch will support global tags on null_blk, also the performance
> data is provided, no obvious performance loss is observed when the whole
> hw queue depth is same.

GLOBAL implies that it's, strangely enough, global. That isn't really the
case. Why not call this BLK_MQ_F_HOST_TAGS or something like that? I'd
welcome better names, but global doesn't seem to be a great choice.

BLK_MQ_F_SET_TAGS?

-- 
Jens Axboe



Re: [PATCH V2 0/8] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Jens Axboe
On 2/5/18 8:20 AM, Ming Lei wrote:
> Hi All,
> 
> This patchset supports global tags which was started by Hannes originally:
> 
>   https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> Also introduce 'force_blk_mq' and 'host_tagset' to 'struct scsi_host_template',
> so that drivers can avoid supporting two IO paths (legacy and blk-mq),
> especially since recent discussion mentioned that SCSI_MQ will be enabled
> by default soon.
> 
>   https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> 
> With the above changes, it should be easier to convert SCSI drivers'
> reply queue into blk-mq's hctx; then the automatic irq affinity issue can
> be solved easily. Please see the detailed description in the commit log and
> the 8th patch for converting HPSA.
> 
> Also drivers may require to complete request on the submission CPU
> for avoiding hard/soft deadlock, which can be done easily with blk_mq
> too.
> 
>   https://marc.info/?t=15160185141&r=1&w=2
> 
> The 6th patch uses the introduced 'force_blk_mq' to fix virtio_scsi
> so that IO hang issue can be avoided inside legacy IO path, this issue is
> a bit generic, at least HPSA/virtio-scsi are found broken with v4.15+.
> 
> The 7th & 8th patch fixes HPSA's IO issue which is caused by the reply
> queue selection algorithm.
> 
> Laurence has verified that this patchset makes HPSA work with the Linus
> tree.

Do you have any performance numbers for this patchset? I'd be curious
to know how big the hit is.

-- 
Jens Axboe



Re: [PATCH v3] block: Add should_fail_bio() for bpf error injection

2018-02-06 Thread Jens Axboe
On 2/6/18 3:05 PM, Howard McLauchlan wrote:
> The classic error injection mechanism, should_fail_request() does not
> support use cases where more information is required (from the entire
> struct bio, for example).
> 
> To that end, this patch introduces should_fail_bio(), which calls
> should_fail_request() under the hood but provides a convenient
> place for kprobes to hook into if they require the entire struct bio.
> This patch also replaces some existing calls to should_fail_request()
> with should_fail_bio() with no degradation in performance.

This seems better, applied, thanks.

-- 
Jens Axboe



[PATCH v3] block: Add should_fail_bio() for bpf error injection

2018-02-06 Thread Howard McLauchlan
The classic error injection mechanism, should_fail_request() does not
support use cases where more information is required (from the entire
struct bio, for example).

To that end, this patch introduces should_fail_bio(), which calls
should_fail_request() under the hood but provides a convenient
place for kprobes to hook into if they require the entire struct bio.
This patch also replaces some existing calls to should_fail_request()
with should_fail_bio() with no degradation in performance.

Signed-off-by: Howard McLauchlan 
---
 block/blk-core.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d0d104268f1a..2d1a7bbe0634 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -2083,6 +2084,14 @@ static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part)
return false;
 }
 
+static noinline int should_fail_bio(struct bio *bio)
+{
+   if (should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size))
+   return -EIO;
+   return 0;
+}
+ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
@@ -2174,7 +2183,7 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
 
-   if (should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size))
+   if (should_fail_bio(bio))
goto end_io;
 
if (!bio->bi_partno) {
-- 
2.14.1
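As the commit message says, should_fail_bio() mainly exists as a hook point
for BPF-based error injection. A hypothetical BPF program driving it might
look like the sketch below (assumes CONFIG_BPF_KPROBE_OVERRIDE and a
libbpf/samples-style build; the header names and build environment are
assumptions, not part of the patch):

#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>
#include "bpf_helpers.h"

/* Force should_fail_bio() to return -EIO for every bio, relying on the
 * ALLOW_ERROR_INJECTION() annotation added by the patch above. */
SEC("kprobe/should_fail_bio")
int override_should_fail_bio(struct pt_regs *ctx)
{
	bpf_override_return(ctx, -EIO);
	return 0;
}

char _license[] SEC("license") = "GPL";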



Re: [PATCH V2 4/8] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-06 Thread Omar Sandoval
On Mon, Feb 05, 2018 at 11:20:31PM +0800, Ming Lei wrote:
> This patch introduces the parameter 'g_global_tags' so that we can
> test this feature with null_blk easily.
> 
> No obvious performance drop is seen with global_tags when the whole hw
> queue depth is kept the same:
> 
> 1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 
> submit_queues=4 hw_queue_depth=1
> 
> 2) 'global_tags', global hw queue depth is 4, and 4 hw queues
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 
> submit_queues=4 hw_queue_depth=4
> 
> 3) fio test done in above two settings:
>fio --bs=4k --size=512G  --rw=randread --norandommap --direct=1 
> --ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1  
> --name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 
> --name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3
> 
> 1M IOPS can be reached in both above tests which is done in one VM.
> 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Reviewed-by: Hannes Reinecke 

The module parameter is just called "global_tags", not "g_global_tags",
right? The subject should say the former.

Otherwise,

Reviewed-by: Omar Sandoval 

> Signed-off-by: Ming Lei 
> ---
>  drivers/block/null_blk.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
> index 287a09611c0f..ad0834efad42 100644
> --- a/drivers/block/null_blk.c
> +++ b/drivers/block/null_blk.c
> @@ -163,6 +163,10 @@ static int g_submit_queues = 1;
>  module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
>  MODULE_PARM_DESC(submit_queues, "Number of submission queues");
>  
> +static int g_global_tags = 0;
> +module_param_named(global_tags, g_global_tags, int, S_IRUGO);
> +MODULE_PARM_DESC(global_tags, "All submission queues share one tags");
> +
>  static int g_home_node = NUMA_NO_NODE;
>  module_param_named(home_node, g_home_node, int, S_IRUGO);
>  MODULE_PARM_DESC(home_node, "Home node for the device");
> @@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
>   set->flags = BLK_MQ_F_SHOULD_MERGE;
>   if (g_no_sched)
>   set->flags |= BLK_MQ_F_NO_SCHED;
> + if (g_global_tags)
> + set->flags |= BLK_MQ_F_GLOBAL_TAGS;
>   set->driver_data = NULL;
>  
>   if ((nullb && nullb->dev->blocking) || g_blocking)
> -- 
> 2.9.5
> 


Re: [PATCH V2 1/8] blk-mq: tags: define several fields of tags as pointer

2018-02-06 Thread Omar Sandoval
On Mon, Feb 05, 2018 at 11:20:28PM +0800, Ming Lei wrote:
> This patch changes tags->breserved_tags, tags->bitmap_tags and
> tags->active_queues as pointer, and prepares for supporting global tags.
> 
> No functional change.
> 
> Tested-by: Laurence Oberman 
> Reviewed-by: Hannes Reinecke 

Assuming it builds :)

Reviewed-by: Omar Sandoval 

> Cc: Mike Snitzer 
> Cc: Christoph Hellwig 
> Signed-off-by: Ming Lei 
> ---
>  block/bfq-iosched.c|  4 ++--
>  block/blk-mq-debugfs.c | 10 +-
>  block/blk-mq-tag.c | 48 ++--
>  block/blk-mq-tag.h | 10 +++---
>  block/blk-mq.c |  2 +-
>  block/kyber-iosched.c  |  2 +-
>  6 files changed, 42 insertions(+), 34 deletions(-)
> 
> diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
> index 47e6ec7427c4..1e1211814a57 100644
> --- a/block/bfq-iosched.c
> +++ b/block/bfq-iosched.c
> @@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
>   WARN_ON_ONCE(1);
>   return;
>   }
> - bt = &tags->breserved_tags;
> + bt = tags->breserved_tags;
>   } else
> - bt = &tags->bitmap_tags;
> + bt = tags->bitmap_tags;
>  
>   if (unlikely(bfqd->sb_shift != bt->sb.shift))
>   bfq_update_depths(bfqd, bt);
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 21cbc1f071c6..0dfafa4b655a 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
>   seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
>   seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
>   seq_printf(m, "active_queues=%d\n",
> -atomic_read(&tags->active_queues));
> +atomic_read(tags->active_queues));
>  
>   seq_puts(m, "\nbitmap_tags:\n");
> - sbitmap_queue_show(&tags->bitmap_tags, m);
> + sbitmap_queue_show(tags->bitmap_tags, m);
>  
>   if (tags->nr_reserved_tags) {
>   seq_puts(m, "\nbreserved_tags:\n");
> - sbitmap_queue_show(&tags->breserved_tags, m);
> + sbitmap_queue_show(tags->breserved_tags, m);
>   }
>  }
>  
> @@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
>   if (res)
>   goto out;
>   if (hctx->tags)
> - sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
> + sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
>   mutex_unlock(&q->sysfs_lock);
>  
>  out:
> @@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
>   if (res)
>   goto out;
>   if (hctx->sched_tags)
> - sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
> + sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
>   mutex_unlock(&q->sysfs_lock);
>  
>  out:
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 336dde07b230..571797dc36cb 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
>   if (!tags)
>   return true;
>  
> - return sbitmap_any_bit_clear(&tags->bitmap_tags.sb);
> + return sbitmap_any_bit_clear(&tags->bitmap_tags->sb);
>  }
>  
>  /*
> @@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>  {
>   if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
>   !test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
> - atomic_inc(&hctx->tags->active_queues);
> + atomic_inc(hctx->tags->active_queues);
>  
>   return true;
>  }
> @@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
>   */
>  void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
>  {
> - sbitmap_queue_wake_all(&tags->bitmap_tags);
> + sbitmap_queue_wake_all(tags->bitmap_tags);
>   if (include_reserve)
> - sbitmap_queue_wake_all(&tags->breserved_tags);
> + sbitmap_queue_wake_all(tags->breserved_tags);
>  }
>  
>  /*
> @@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
>   if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
>   return;
>  
> - atomic_dec(&tags->active_queues);
> + atomic_dec(tags->active_queues);
>  
>   blk_mq_tag_wakeup_all(tags, false);
>  }
> @@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
>   if (bt->sb.depth == 1)
>   return true;
>  
> - users = atomic_read(&hctx->tags->active_queues);
> + users = atomic_read(hctx->tags->active_queues);
>   if (!users)
>   return true;
>  
> @@ -117,10 +117,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
>   WARN_ON_ONCE(1);
>   return BLK_MQ_TAG_FAIL;
>   

Re: [PATCH v2] block: Add should_fail_bio() for bpf error injection

2018-02-06 Thread Jens Axboe
On 2/6/18 2:11 PM, Howard McLauchlan wrote:
> The classic error injection mechanism, should_fail_request() does not
> support use cases where more information is required (from the entire
> struct bio, for example).
> 
> To that end, this patch introduces should_fail_bio(), which calls
> should_fail_request() under the hood but provides a convenient
> place for kprobes to hook into if they require the entire struct bio.
> This patch also replaces some existing calls to should_fail_request()
> with should_fail_bio() with no degradation in performance.

It fails:

block/blk-core.c:2091:1: warning: data definition has no type or storage class
 BPF_ALLOW_ERROR_INJECTION(should_fail_bio);
 ^
block/blk-core.c:2091:1: error: type defaults to ‘int’ in declaration of 
‘BPF_ALLOW_ERROR_INJECTION’ [-Werror=implicit-int]
block/blk-core.c:2091:1: warning: parameter names (without types) in function 
declaration
cc1: some warnings being treated as errors

-- 
Jens Axboe



[PATCH v2] block: Add should_fail_bio() for bpf error injection

2018-02-06 Thread Howard McLauchlan
The classic error injection mechanism, should_fail_request() does not
support use cases where more information is required (from the entire
struct bio, for example).

To that end, this patch introduces should_fail_bio(), which calls
should_fail_request() under the hood but provides a convenient
place for kprobes to hook into if they require the entire struct bio.
This patch also replaces some existing calls to should_fail_request()
with should_fail_bio() with no degradation in performance.

Reviewed-by: Omar Sandoval 
Signed-off-by: Howard McLauchlan 
---
V2: rebased on master for 4.15

 block/blk-core.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index d0d104268f1a..560a0260a8dc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define CREATE_TRACE_POINTS
 #include 
@@ -2083,6 +2084,12 @@ static inline bool bio_check_ro(struct bio *bio, struct hd_struct *part)
return false;
 }
 
+static noinline bool should_fail_bio(struct bio *bio)
+{
+   return should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size);
+}
+BPF_ALLOW_ERROR_INJECTION(should_fail_bio);
+
 /*
  * Remap block n of partition p to block n+start(p) of the disk.
  */
@@ -2174,7 +2181,7 @@ generic_make_request_checks(struct bio *bio)
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
 
-   if (should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size))
+   if (should_fail_bio(bio))
goto end_io;
 
if (!bio->bi_partno) {
-- 
2.14.1



Re: [PATCH V2 2/8] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-06 Thread Omar Sandoval
On Mon, Feb 05, 2018 at 11:20:29PM +0800, Ming Lei wrote:
> Quite a few HBAs(such as HPSA, megaraid, mpt3sas, ..) support multiple
> reply queues, but tags is often HBA wide.
> 
> These HBAs have switched to use pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
> for automatic affinity assignment.
> 
> Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> for some irq vectors, this can't be avoided even though the allocation
> is improved.
> 
> So all these drivers have to avoid to ask HBA to complete request in
> reply queue which hasn't online CPUs assigned, and HPSA has been broken
> with v4.15+:
> 
>   https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> 
> This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
> hw queue by mapping each reply queue into hctx, but one tricky thing is
> the HBA wide(instead of hw queue wide) tags.
> 
> This patch is based on the following Hannes's patch:
> 
>   https://marc.info/?l=linux-block&m=149132580511346&w=2
> 
> One big difference with Hannes's is that this patch only makes the tags 
> sbitmap
> and active_queues data structure HBA wide, and others are kept as NUMA 
> locality,
> such as request, hctx, tags, ...
> 
> The following patch will support global tags on null_blk, also the performance
> data is provided, no obvious performance loss is observed when the whole
> hw queue depth is same.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  block/blk-mq-debugfs.c |  1 +
>  block/blk-mq-sched.c   | 13 -
>  block/blk-mq-tag.c | 23 ++-
>  block/blk-mq-tag.h |  5 -
>  block/blk-mq.c | 29 -
>  block/blk-mq.h |  3 ++-
>  include/linux/blk-mq.h |  2 ++
>  7 files changed, 63 insertions(+), 13 deletions(-)
> 
> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> index 0dfafa4b655a..0f0fafe03f5d 100644
> --- a/block/blk-mq-debugfs.c
> +++ b/block/blk-mq-debugfs.c
> @@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
>   HCTX_FLAG_NAME(SHOULD_MERGE),
>   HCTX_FLAG_NAME(TAG_SHARED),
>   HCTX_FLAG_NAME(SG_MERGE),
> + HCTX_FLAG_NAME(GLOBAL_TAGS),
>   HCTX_FLAG_NAME(BLOCKING),
>   HCTX_FLAG_NAME(NO_SCHED),
>  };
> diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> index 55c0a745b427..385bbec73804 100644
> --- a/block/blk-mq-sched.c
> +++ b/block/blk-mq-sched.c
> @@ -81,6 +81,17 @@ static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
>   } else
>   clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
>  
> + /* need to restart all hw queues for global tags */
> + if (hctx->flags & BLK_MQ_F_GLOBAL_TAGS) {
> + struct blk_mq_hw_ctx *hctx2;
> + int i;
> +
> + queue_for_each_hw_ctx(hctx->queue, hctx2, i)
> + if (blk_mq_run_hw_queue(hctx2, true))
> + return true;

Is it intentional that we stop after the first hw queue does work? That
seems fine but it's a little confusing because the comment claims we
restart everything.

> + return false;
> + }
> +


Re: [PATCH V2 5/8] scsi: introduce force_blk_mq

2018-02-06 Thread Omar Sandoval
On Mon, Feb 05, 2018 at 11:20:32PM +0800, Ming Lei wrote:
> From the scsi driver's view, it is a bit troublesome to support both blk-mq
> and non-blk-mq at the same time, especially when drivers need to support
> multiple hw queues.
> 
> This patch introduces 'force_blk_mq' to scsi_host_template so that drivers
> can provide blk-mq only support, and driver code can avoid the trouble
> of supporting both.
> 
> This patch may clean up drivers a lot by providing blk-mq only support,
> especially since it is easier to convert multiple reply queues into blk_mq's
> MQ for the following purposes:
> 
> 1) use blk_mq's multiple hw queues to deal with irq vectors whose affinity
> consists only of offline CPUs[1]:
> 
>   [1] https://marc.info/?l=linux-kernel&m=151748144730409&w=2
> 
> Now 84676c1f21e8ff5(genirq/affinity: assign vectors to all possible CPUs)
> has been merged to V4.16-rc, and it is easy to allocate all offline CPUs
> for some irq vectors, this can't be avoided even though the allocation
> is improved.
> 
> So all these drivers have to avoid to ask HBA to complete request in
> reply queue which hasn't online CPUs assigned.
> 
> This issue can be solved generically and easily via blk_mq(scsi_mq) multiple
> hw queue by mapping each reply queue into hctx.
> 
> 2) some drivers[2] need to complete requests on the submission CPU to
> avoid hard/soft lockups, which is easily done with blk_mq, so it is not
> necessary to reinvent wheels for solving the problem.
> 
>   [2] https://marc.info/?t=15160185141&r=1&w=2
> 
> Solving the above issues for the non-MQ path may not be easy, or may
> introduce unnecessary work, especially as we plan to enable SCSI_MQ soon,
> as discussed recently[3]:
> 
>   [3] https://marc.info/?l=linux-scsi&m=151727684915589&w=2
> 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Mike Snitzer 
> Reviewed-by: Hannes Reinecke 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hosts.c | 1 +
>  include/scsi/scsi_host.h | 3 +++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index fe3a0da3ec97..c75cebd7911d 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
>   shost->dma_boundary = 0x;
>  
>   shost->use_blk_mq = scsi_use_blk_mq;

Not sure if this is a patch formatting issue, but this old line wasn't
deleted.

> + shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
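For illustration, a driver opting into blk-mq via the new field might look
like this (a hypothetical template, not taken from the series):

/* Hypothetical driver host template: with force_blk_mq set, the driver
 * only has to implement and test the blk-mq I/O path. */
static struct scsi_host_template my_hba_template = {
	.module		= THIS_MODULE,
	.name		= "my_hba",
	.force_blk_mq	= 1,	/* field introduced by this patch */
	/* ... the usual ops (.queuecommand, .this_id, ...) ... */
};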


Re: [PATCH 2/2] block, char_dev: Use correct format specifier for unsigned ints

2018-02-06 Thread Srivatsa S. Bhat
On 2/6/18 2:24 AM, Greg KH wrote:
> On Mon, Feb 05, 2018 at 06:25:27PM -0800, Srivatsa S. Bhat wrote:
>> From: Srivatsa S. Bhat 
>>
>> register_blkdev() and __register_chrdev_region() treat the major
>> number as an unsigned int. So print it the same way to avoid
>> absurd error statements such as:
>> "... major requested (-1) is greater than the maximum (511) ..."
>> (and also fix off-by-one bugs in the error prints).
>>
>> While at it, also update the comment describing register_blkdev().
>>
>> Signed-off-by: Srivatsa S. Bhat 
>> ---
>>
>>  block/genhd.c |   19 +++
>>  fs/char_dev.c |4 ++--
>>  2 files changed, 13 insertions(+), 10 deletions(-)
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 88a53c1..21a18e5 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -308,19 +308,22 @@ void blkdev_show(struct seq_file *seqf, off_t offset)
>>  /**
>>   * register_blkdev - register a new block device
>>   *
>> - * @major: the requested major device number [1..255]. If @major = 0, try to
>> - * allocate any unused major number.
>> + * @major: the requested major device number [1..BLKDEV_MAJOR_MAX-1]. If
>> + * @major = 0, try to allocate any unused major number.
>>   * @name: the name of the new block device as a zero terminated string
>>   *
>>   * The @name must be unique within the system.
>>   *
>>   * The return value depends on the @major input parameter:
>>   *
>> - *  - if a major device number was requested in range [1..255] then the
>> - *    function returns zero on success, or a negative error code
>> + *  - if a major device number was requested in range [1..BLKDEV_MAJOR_MAX-1]
>> + *    then the function returns zero on success, or a negative error code
>>   *  - if any unused major number was requested with @major = 0 parameter
>>   *then the return value is the allocated major number in range
>> - *[1..255] or a negative error code otherwise
>> + *[1..BLKDEV_MAJOR_MAX-1] or a negative error code otherwise
>> + *
>> + * See Documentation/admin-guide/devices.txt for the list of allocated
>> + * major numbers.
>>   */
>>  int register_blkdev(unsigned int major, const char *name)
>>  {
>> @@ -347,8 +350,8 @@ int register_blkdev(unsigned int major, const char *name)
>>  }
>>  
>>  if (major >= BLKDEV_MAJOR_MAX) {
>> -pr_err("register_blkdev: major requested (%d) is greater than 
>> the maximum (%d) for %s\n",
>> -   major, BLKDEV_MAJOR_MAX, name);
>> +pr_err("register_blkdev: major requested (%u) is greater than 
>> the maximum (%u) for %s\n",
>> +   major, BLKDEV_MAJOR_MAX-1, name);
>>  
>>  ret = -EINVAL;
>>  goto out;
>> @@ -375,7 +378,7 @@ int register_blkdev(unsigned int major, const char *name)
>>  ret = -EBUSY;
>>  
>>  if (ret < 0) {
>> -printk("register_blkdev: cannot get major %d for %s\n",
>> +printk("register_blkdev: cannot get major %u for %s\n",
>> major, name);
>>  kfree(p);
>>  }
>> diff --git a/fs/char_dev.c b/fs/char_dev.c
>> index 33c9385..a279c58 100644
>> --- a/fs/char_dev.c
>> +++ b/fs/char_dev.c
>> @@ -121,8 +121,8 @@ __register_chrdev_region(unsigned int major, unsigned int baseminor,
>>  }
>>  
>>  if (major >= CHRDEV_MAJOR_MAX) {
>> -pr_err("CHRDEV \"%s\" major requested (%d) is greater than the 
>> maximum (%d)\n",
>> -   name, major, CHRDEV_MAJOR_MAX);
>> +pr_err("CHRDEV \"%s\" major requested (%u) is greater than the 
>> maximum (%u)\n",
>> +   name, major, CHRDEV_MAJOR_MAX-1);
>>  ret = -EINVAL;
>>  goto out;
>>  }
> 
> Thanks for both of these patches, if Al doesn't grab them, I will after
> 4.16-rc1 comes out.
> 

Sounds great! Thank you!

Regards,
Srivatsa


Re: [PATCH] block: Add should_fail_bio() for bpf error injection

2018-02-06 Thread Jens Axboe
On 2/6/18 12:27 PM, Omar Sandoval wrote:
> On Wed, Jan 24, 2018 at 03:22:58PM -0800, Howard McLauchlan wrote:
>> The classic error injection mechanism, should_fail_request() does not
>> support use cases where more information is required (from the entire
>> struct bio, for example).
>>
>> To that end, this patch introduces should_fail_bio(), which calls
>> should_fail_request() under the hood but provides a convenient
>> place for kprobes to hook into if they require the entire struct bio.
>> This patch also replaces some existing calls to should_fail_request()
>> with should_fail_bio() with no degradation in performance.
> 
> Reviewed-by: Omar Sandoval 
> 
> Jens, can we pick this up for 4.16? The necessary BPF support has been
> merged into Linus' tree.

I don't see why not - but the patch no longer applies. Howard, can you
respin it?

-- 
Jens Axboe



Re: [PATCH] block: Add should_fail_bio() for bpf error injection

2018-02-06 Thread Omar Sandoval
On Wed, Jan 24, 2018 at 03:22:58PM -0800, Howard McLauchlan wrote:
> The classic error injection mechanism, should_fail_request() does not
> support use cases where more information is required (from the entire
> struct bio, for example).
> 
> To that end, this patch introduces should_fail_bio(), which calls
> should_fail_request() under the hood but provides a convenient
> place for kprobes to hook into if they require the entire struct bio.
> This patch also replaces some existing calls to should_fail_request()
> with should_fail_bio() with no degradation in performance.

Reviewed-by: Omar Sandoval 

Jens, can we pick this up for 4.16? The necessary BPF support has been
merged into Linus' tree.

> Signed-off-by: Howard McLauchlan 
> ---
>  block/blk-core.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b8881750a3ac..5e73c996d338 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define CREATE_TRACE_POINTS
>  #include 
> @@ -2050,6 +2051,12 @@ static inline bool should_fail_request(struct hd_struct *part,
>  
>  #endif /* CONFIG_FAIL_MAKE_REQUEST */
>  
> +static noinline bool should_fail_bio(struct bio *bio)
> +{
> + return should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size);
> +}
> +BPF_ALLOW_ERROR_INJECTION(should_fail_bio);
> +
>  /*
>   * Remap block n of partition p to block n+start(p) of the disk.
>   */
> @@ -2141,7 +2148,7 @@ generic_make_request_checks(struct bio *bio)
>   if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
>   goto not_supported;
>  
> - if (should_fail_request(&bio->bi_disk->part0, bio->bi_iter.bi_size))
> + if (should_fail_bio(bio))
>   goto end_io;
>  
>   if (blk_partition_remap(bio))
> -- 
> 2.14.1
> 


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Oleksandr Natalenko

Hi.

06.02.2018 15:50, Paolo Valente wrote:

Could you please do a
gdb /block/bfq-iosched.o # or vmlinux.o if bfq is builtin
list *(bfq_finish_requeue_request+0x54)
list *(bfq_put_queue+0x10b)
for me?


Fresh crashes and gdb output are given below. A side note: it is harder
to trigger things on a slower machine, so clearly there is some
timing-bounded race condition here.


[  134.276548] BUG: unable to handle kernel NULL pointer dereference at (null)

[  134.283699] IP: blk_flush_complete_seq+0x20a/0x300
[  134.288163] PGD 0 P4D 0
[  134.291284] Oops: 0002 [#1] PREEMPT SMP PTI
[  134.293842] Modules linked in: bochs_drm ttm nls_iso8859_1 kvm_intel 
nls_cp437 vfat fat drm_kms_helper kvm drm irqbypass psmouse iTCO_wdt 
ppdev iTCO_vendor_support input_leds led_class i2c_i801 parport_pc 
joydev intel_agp parport intel_gtt mousedev lpc_ich rtc_cmos syscopyarea 
evdev sysfillrect agpgart qemu_fw_cfg mac_hid sysimgblt fb_sys_fops 
sch_fq_codel ip_tables x_tables xfs dm_thin_pool dm_persistent_data 
dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_crypt algif_skcipher 
af_alg dm_mod hid_generic usbhid raid10 hid md_mod sr_mod sd_mod cdrom 
uhci_hcd ehci_pci serio_raw crct10dif_pclmul crc32_pclmul atkbd 
crc32c_intel libps2 ghash_clmulni_intel pcbc xhci_pci xhci_hcd ehci_hcd 
aesni_intel aes_x86_64 crypto_simd glue_helper cryptd ahci libahci 
libata usbcore usb_common i8042 serio virtio_scsi
[  134.340606]  scsi_mod virtio_blk virtio_net virtio_pci virtio_ring 
virtio
[  134.345803] CPU: 0 PID: 178 Comm: kworker/0:1H Not tainted 4.15.0-pf2 
#1
[  134.350309] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
0.0.0 02/06/2015

[  134.355106] Workqueue: kblockd blk_mq_run_work_fn
[  134.359034] RIP: 0010:blk_flush_complete_seq+0x20a/0x300
[  134.367647] RSP: :88000f803ce8 EFLAGS: 00010082
[  134.371632] RAX: 88000d9755c0 RBX: 88000d9755a0 RCX: 
88000c9b39a8
[  134.375675] RDX:  RSI: 88000d9755d0 RDI: 
88000c9b3900
[  134.381068] RBP: 88000d21a990 R08: 88000d9755b0 R09: 

[  134.386302] R10: 8800058ff100 R11: 0002000b R12: 

[  134.396915] R13: 88000d9755f0 R14: 0046 R15: 
88000d9755a0
[  134.401140] FS:  () GS:88000f80() 
knlGS:

[  134.407361] CS:  0010 DS:  ES:  CR0: 80050033
[  134.412384] CR2:  CR3: 04008006 CR4: 
001606f0

[  134.416913] Call Trace:
[  134.420251]  <IRQ>
[  134.427731]  mq_flush_data_end_io+0xb3/0xf0
[  134.431848]  scsi_end_request+0x90/0x1e0 [scsi_mod]
[  134.436424]  scsi_io_completion+0x237/0x650 [scsi_mod]
[  134.440109]  __blk_mq_complete_request+0xc4/0x150
[  134.444517]  ? scsi_mq_get_budget+0x110/0x110 [scsi_mod]
[  134.449603]  ata_scsi_qc_complete+0x8d/0x430 [libata]
[  134.458487]  ata_qc_complete_multiple+0x8d/0xe0 [libata]
[  134.461726]  ahci_handle_port_interrupt+0xc9/0x5b0 [libahci]
[  134.466420]  ahci_handle_port_intr+0x54/0xb0 [libahci]
[  134.470128]  ahci_single_level_irq_intr+0x3b/0x60 [libahci]
[  134.473327]  __handle_irq_event_percpu+0x44/0x1e0
[  134.476700]  handle_irq_event_percpu+0x30/0x70
[  134.480227]  handle_irq_event+0x37/0x60
[  134.490341]  handle_edge_irq+0x107/0x1c0
[  134.492876]  handle_irq+0x1f/0x30
[  134.495497]  do_IRQ+0x4d/0xe0
[  134.497963]  common_interrupt+0xa2/0xa2
[  134.500877]  </IRQ>
[  134.503129] RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x40
[  134.506782] RSP: :c9307d30 EFLAGS: 0293 ORIG_RAX: 
ffdb
[  134.511845] RAX: 0001 RBX: 88000db04000 RCX: 
0008
[  134.523019] RDX: 0100 RSI: 0293 RDI: 
0293
[  134.527968] RBP: 0293 R08:  R09: 
0040
[  134.532289] R10: 008e66bf R11: 0002000b R12: 

[  134.536376] R13: 88000d26a000 R14: 88000b99ac48 R15: 
88000d26a000

[  134.541046]  ata_scsi_queuecmd+0xa0/0x210 [libata]
[  134.544363]  scsi_dispatch_cmd+0xe8/0x260 [scsi_mod]
[  134.552883]  scsi_queue_rq+0x4cf/0x560 [scsi_mod]
[  134.556811]  blk_mq_dispatch_rq_list+0x8f/0x4c0
[  134.559741]  blk_mq_sched_dispatch_requests+0x105/0x190
[  134.563253]  __blk_mq_run_hw_queue+0x80/0x90
[  134.565540]  process_one_work+0x1df/0x420
[  134.568041]  worker_thread+0x2b/0x3d0
[  134.571032]  ? process_one_work+0x420/0x420
[  134.573964]  kthread+0x113/0x130
[  134.584370]  ? kthread_create_on_node+0x70/0x70
[  134.587355]  ret_from_fork+0x35/0x40
[  134.589796] Code: 39 d0 0f 84 8f 00 00 00 48 8b 97 b0 00 00 00 49 c1 
e0 04 45 31 e4 48 8b b7 a8 00 00 00 49 01 d8 48 8d 8f a8 00 00 00 48 89 
56 08 <48> 89 32 49 8b 50 18 49 89 48 18 48 89 87 a8 00 00 00 48 89 97
[  134.598881] RIP: blk_flush_complete_seq+0x20a/0x300 RSP: 
88000f803ce8

[  134.601812] CR2: 
[  134.603728] ---[ end trace fc6d0cdf33d29717 ]---
[  134.612349] Kernel panic - not syncing: 

Re: [PATCH rfc 2/5] irq-am: add some debugfs exposure on tuning state

2018-02-06 Thread kbuild test robot
Hi Sagi,

I love your patch! Yet something to improve:

[auto build test ERROR on v4.15]
[also build test ERROR on next-20180206]
[cannot apply to linus/master rdma/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Sagi-Grimberg/irq-am-Introduce-library-implementing-generic-adaptive-moderation/20180206-224501
config: sh-allmodconfig (attached as .config)
compiler: sh4-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sh 

All error/warnings (new ones prefixed by >>):

>> lib/irq-am.c:17:8: error: type defaults to 'int' in declaration of 
>> 'DEFINE_IDA' [-Werror=implicit-int]
static DEFINE_IDA(am_ida);
   ^~
>> lib/irq-am.c:17:1: warning: parameter names (without types) in function 
>> declaration
static DEFINE_IDA(am_ida);
^~
   lib/irq-am.c: In function 'irq_am_cleanup':
>> lib/irq-am.c:262:2: error: implicit declaration of function 
>> 'ida_simple_remove'; did you mean 'simple_rename'? 
>> [-Werror=implicit-function-declaration]
 ida_simple_remove(&am_ida, am->id);
 ^
 simple_rename
>> lib/irq-am.c:262:21: error: 'am_ida' undeclared (first use in this function)
 ida_simple_remove(&am_ida, am->id);
^~
   lib/irq-am.c:262:21: note: each undeclared identifier is reported only once 
for each function it appears in
   lib/irq-am.c: In function 'irq_am_init':
>> lib/irq-am.c:276:11: error: implicit declaration of function 
>> 'ida_simple_get'; did you mean 'simple_open'? 
>> [-Werror=implicit-function-declaration]
 am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
  ^~
  simple_open
   lib/irq-am.c:276:27: error: 'am_ida' undeclared (first use in this function)
 am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
  ^~
   lib/irq-am.c: At top level:
   lib/irq-am.c:17:8: warning: 'DEFINE_IDA' declared 'static' but never defined 
[-Wunused-function]
static DEFINE_IDA(am_ida);
   ^~
   cc1: some warnings being treated as errors

vim +17 lib/irq-am.c

16  
  > 17  static DEFINE_IDA(am_ida);
18  
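All of the undeclared symbols in the report (DEFINE_IDA, ida_simple_get,
ida_simple_remove) are declared in a single header, so the likely fix is
simply the following include (an assumption; the patch source itself is not
shown here):

/* at the top of lib/irq-am.c */
#include <linux/idr.h>	/* DEFINE_IDA(), ida_simple_get(), ida_simple_remove() */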

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH 2/2] block, char_dev: Use correct format specifier for unsigned ints

2018-02-06 Thread Logan Gunthorpe



On 05/02/18 07:25 PM, Srivatsa S. Bhat wrote:

From: Srivatsa S. Bhat 

register_blkdev() and __register_chrdev_region() treat the major
number as an unsigned int. So print it the same way to avoid
absurd error statements such as:
"... major requested (-1) is greater than the maximum (511) ..."
(and also fix off-by-one bugs in the error prints).

While at it, also update the comment describing register_blkdev().

Signed-off-by: Srivatsa S. Bhat 


Reviewed-by: Logan Gunthorpe 


Re: [PATCH 1/2] char_dev: Fix off-by-one bugs in find_dynamic_major()

2018-02-06 Thread Logan Gunthorpe

Thanks!

On 05/02/18 07:25 PM, Srivatsa S. Bhat wrote:

From: Srivatsa S. Bhat 

CHRDEV_MAJOR_DYN_END and CHRDEV_MAJOR_DYN_EXT_END are valid major
numbers. So fix the loop iteration to include them in the search for
free major numbers.

While at it, also remove a redundant if condition ("cd->major != i"),
as it will never be true.

Signed-off-by: Srivatsa S. Bhat 


Reviewed-by: Logan Gunthorpe 


Logan
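For context, the off-by-one sits in the bound of the descending search loop;
a sketch of the corrected shape (assumed, since the diff itself is not quoted
in this reply):

/* Sketch: CHRDEV_MAJOR_DYN_END is itself a valid dynamic major, so the
 * descending search must include it (">=" rather than ">"). */
static int find_dynamic_major(void)
{
	int i;

	for (i = ARRAY_SIZE(chrdevs) - 1; i >= CHRDEV_MAJOR_DYN_END; i--) {
		if (chrdevs[i] == NULL)		/* free slot -> usable major */
			return i;
	}
	return -EBUSY;
}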


Re: [LSF/MM TOPIC] get_user_pages() and filesystems

2018-02-06 Thread Jan Kara
Hello,

On Fri 02-02-18 15:04:11, Liu Bo wrote:
> On Thu, Jan 25, 2018 at 12:57:27PM +0100, Jan Kara wrote:
> > Hello,
> > 
> > this is about a problem I have identified last month and for which I still
> > don't have good solution. Some discussion of the problem happened here [1]
> > where also technical details are posted but culprit of the problem is
> > relatively simple: Lots of places in kernel (fs code, writeback logic,
> > stable-pages framework for DIF/DIX) assume that file pages in page cache
> > can be modified either via write(2), truncate(2), fallocate(2) or similar
> > code paths explicitely manipulating with file space or via a writeable
> > mapping into page tables. In particular we assume that if we block all the
> > above paths by taking proper locks, block page faults, and unmap (/ map
> > read-only) the page, it cannot be modified. But this assumption is violated
> > by get_user_pages() users (such as direct IO or RDMA drivers - and we've
> > got reports from such users of weird things happening).
> > 
> > The problem with GUP users is that they acquire page reference (at that
> > point page is writeably mapped into page tables) and some time in future
> > (which can be quite far in case of RDMA) page contents gets modified and
> > page marked dirty.
> 
> I got a question here, when you say 'page contents gets modified', do
> you mean that GUP users modify the page content?

Yes.
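For concreteness, the pattern under discussion is roughly the following (a
sketch of assumed driver-side usage with the v4.15-era get_user_pages_fast()
signature; addr, data and len are placeholders, and this is not code from any
particular driver):

	struct page *page;

	/* pin the page while it is still writeably mapped into page tables... */
	if (get_user_pages_fast(addr, 1, 1 /* write */, &page) != 1)
		return -EFAULT;

	/* ...and possibly much later (e.g. on RDMA completion), long after the
	 * filesystem believes it has write-protected the page: */
	memcpy(page_address(page), data, len);
	set_page_dirty_lock(page);
	put_page(page);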

> I have another story about GUP users who use direct-IO, qemu sometimes
> doesn't work well with btrfs when checksum enabled and reports
> checksum failures when guest OS doesn't use stable pages, where it is
> not GUP users but the original file/mapping that may be changing the
> page content in flight.

OK, but that is kind of expected, isn't it? The whole purpose of 'stable
pages' is exactly to avoid modifying pages while IO is in flight. So if a
device image is backed by storage (a filesystem in this case) which checksums
data, qemu should present it to the guest as a block device supporting
DIF/DIX and thus requiring stable pages...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules

2018-02-06 Thread Jason Gunthorpe
On Tue, Feb 06, 2018 at 01:01:23PM +0100, Roman Penyaev wrote:

> >> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device
> >> *dev)
> >> +{
> >> +   int err;
> >> +
> >> +   d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
> >> +   if (IS_ERR(d->pd))
> >> +   return PTR_ERR(d->pd);
> >> +   d->dev = dev;
> >> +   d->lkey = d->pd->local_dma_lkey;
> >> +   d->rkey = d->pd->unsafe_global_rkey;
> >> +
> >> +   err = ibtrs_query_device(d);
> >> +   if (unlikely(err))
> >> +   ib_dealloc_pd(d->pd);
> >> +
> >> +   return err;
> >> +}
> >
> >
> > I must say that this makes me frustrated.. We stopped doing these
> > sorts of things a long time ago. No way we can even consider accepting
> > the unsafe use of the global rkey exposing the entire memory space for
> > remote access permissions.
> >
> > Sorry for being blunt, but this protocol design which makes a conscious
> > decision to expose unconditionally is broken by definition.
> 
> I suppose we can also afford the same trick which nvme does: provide
> register_always module argument, can we?  That can be also interesting
> to measure the performance difference.

I can be firmer than Sagi - new code that has IB_PD_UNSAFE_GLOBAL_RKEY
in it can not be accepted upstream.

Once you get that fixed, you should go and read my past comments on
how to properly order dma mapping and completions and fix that
too. Then redo your performance..

Jason
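For reference, the immediate change being demanded is to stop requesting the
unsafe global rkey when allocating the PD; a minimal sketch (a real
conversion also has to replace uses of d->rkey with per-I/O memory
registration, e.g. FRWR):

	d->pd = ib_alloc_pd(dev, 0);	/* no IB_PD_UNSAFE_GLOBAL_RKEY */
	if (IS_ERR(d->pd))
		return PTR_ERR(d->pd);
	d->lkey = d->pd->local_dma_lkey;
	/* remote access now requires registering MRs per I/O instead of d->rkey */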


Re: [PATCH rfc 2/5] irq-am: add some debugfs exposure on tuning state

2018-02-06 Thread kbuild test robot
Hi Sagi,

I love your patch! Yet something to improve:

[auto build test ERROR on v4.15]
[also build test ERROR on next-20180206]
[cannot apply to linus/master rdma/for-next]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Sagi-Grimberg/irq-am-Introduce-library-implementing-generic-adaptive-moderation/20180206-224501
config: i386-randconfig-s0-201805 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   lib/irq-am.c:17:8: error: type defaults to 'int' in declaration of 
'DEFINE_IDA' [-Werror=implicit-int]
static DEFINE_IDA(am_ida);
   ^~
   lib/irq-am.c:17:1: warning: parameter names (without types) in function 
declaration
static DEFINE_IDA(am_ida);
^~
   lib/irq-am.c: In function 'irq_am_cleanup':
>> lib/irq-am.c:262:2: error: implicit declaration of function 
>> 'ida_simple_remove' [-Werror=implicit-function-declaration]
 ida_simple_remove(&am_ida, am->id);
 ^
   lib/irq-am.c:262:21: error: 'am_ida' undeclared (first use in this function)
 ida_simple_remove(&am_ida, am->id);
^~
   lib/irq-am.c:262:21: note: each undeclared identifier is reported only once 
for each function it appears in
   lib/irq-am.c: In function 'irq_am_init':
>> lib/irq-am.c:276:11: error: implicit declaration of function 
>> 'ida_simple_get' [-Werror=implicit-function-declaration]
 am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
  ^~
   lib/irq-am.c:276:27: error: 'am_ida' undeclared (first use in this function)
 am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
  ^~
   lib/irq-am.c: At top level:
   lib/irq-am.c:17:8: warning: 'DEFINE_IDA' declared 'static' but never defined 
[-Wunused-function]
static DEFINE_IDA(am_ida);
   ^~
   cc1: some warnings being treated as errors

vim +/ida_simple_remove +262 lib/irq-am.c

   257  
   258  void irq_am_cleanup(struct irq_am *am)
   259  {
   260  flush_work(>work);
   261  irq_am_deregister_debugfs(am);
  > 262  ida_simple_remove(&am_ida, am->id);
   263  }
   264  EXPORT_SYMBOL_GPL(irq_am_cleanup);
   265  
   266  void irq_am_init(struct irq_am *am, unsigned int nr_events,
   267  unsigned short nr_levels, unsigned short start_level, irq_am_fn 
*fn)
   268  {
   269  memset(am, 0, sizeof(*am));
   270  am->state = IRQ_AM_START_MEASURING;
   271  am->tune_state = IRQ_AM_GOING_UP;
   272  am->nr_levels = nr_levels;
   273  am->nr_events = nr_events;
   274  am->curr_level = start_level;
   275  am->program = fn;
  > 276  am->id = ida_simple_get(&am_ida, 0, 0, GFP_KERNEL);
   277  WARN_ON(am->id < 0);
   278  INIT_WORK(>work, irq_am_program_moderation_work);
   279  if (irq_am_register_debugfs(am))
   280  pr_warn("irq-am %d failed to register debugfs\n", 
am->id);
   281  }
   282  EXPORT_SYMBOL_GPL(irq_am_init);
   283  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH 1/6] genhd: Fix leaked module reference for NVME devices

2018-02-06 Thread Jan Kara
Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop the module reference
which is also acquired by get_disk(). Drop the reference as necessary.

Arguably the function naming here is misleading as put_disk() is *not*
the counterpart of get_disk() but let's fix that in the follow up
commit since that will be more intrusive.

Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6
CC: Christoph Hellwig 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/genhd.c b/block/genhd.c
index 96a66f671720..22f4fc340a3a 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -802,7 +802,10 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   struct module *owner = disk->fops->owner;
+
put_disk(disk);
+   module_put(owner);
disk = NULL;
}
return disk;
-- 
2.13.6



[PATCH 5/6] genhd: Fix BUG in blkdev_open()

2018-02-06 Thread Jan Kara
When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:

CPU0                            CPU1                            CPU2
del_gendisk()
  bdev_unhash_inode(part1);

                                blkdev_open(part1, O_EXCL)      blkdev_open(part1, O_EXCL)
                                  bdev = bd_acquire()             bdev = bd_acquire()
                                  blkdev_get(bdev)
                                    bd_start_claiming(bdev)
                                      - finds old inode 'whole'
                                      bd_prepare_to_claim() -> 0
  bdev_unhash_inode(whole);
<device removed>
<new device under same number created>
                                                                  blkdev_get(bdev);
                                                                    bd_start_claiming(bdev)
                                                                      - finds new inode 'whole'
                                                                      bd_prepare_to_claim()
                                                                        - this also succeeds as we have
                                                                          different 'whole' here...
                                                                        - bad things happen now as we
                                                                          have two exclusive openers of
                                                                          the same bdev

The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.

We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).

Reported-and-analyzed-by: Hou Tao 
Signed-off-by: Jan Kara 
---
 block/genhd.c | 21 -
 include/linux/genhd.h |  1 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/genhd.c b/block/genhd.c
index 64c323549a22..c6f68c332bfe 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -703,6 +703,11 @@ void del_gendisk(struct gendisk *disk)
blk_integrity_del(disk);
disk_del_events(disk);
 
+   /*
+* Block lookups of the disk until all bdevs are unhashed and the
+* disk is marked as dead (GENHD_FL_UP cleared).
+*/
+   down_write(&disk->lookup_sem);
/* invalidate stuff */
disk_part_iter_init(&piter, disk,
 DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
@@ -717,6 +722,7 @@ void del_gendisk(struct gendisk *disk)
bdev_unhash_inode(disk_devt(disk));
set_capacity(disk, 0);
disk->flags &= ~GENHD_FL_UP;
+   up_write(&disk->lookup_sem);
 
if (!(disk->flags & GENHD_FL_HIDDEN))
sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");
@@ -801,9 +807,21 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
spin_unlock_bh(&ext_devt_lock);
}
 
-   if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
+   if (!disk)
+   return NULL;
+
+   /*
+* Synchronize with del_gendisk() to not return disk that is being
+* destroyed.
+*/
+   down_read(&disk->lookup_sem);
+   if (unlikely((disk->flags & GENHD_FL_HIDDEN) ||
+!(disk->flags & GENHD_FL_UP))) {
+   up_read(&disk->lookup_sem);
put_disk_and_module(disk);
disk = NULL;
+   } else {
+   up_read(&disk->lookup_sem);
}
return disk;
 }
@@ -1403,6 +1421,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
kfree(disk);
return NULL;
}
+   init_rwsem(&disk->lookup_sem);
disk->node_id = node_id;
if (disk_expand_part_tbl(disk, 0)) {
free_part_stats(&disk->part0);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 07b715cdeb93..7b548253eaef 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -198,6 +198,7 @@ struct gendisk {
void *private_data;
 
int flags;
+   struct rw_semaphore lookup_sem;
struct kobject *slave_dir;
 
struct timer_rand_state *random;
-- 
2.13.6
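
Condensed from the diff above, the synchronization protocol the patch
introduces looks like this (a simplified sketch, not the verbatim kernel
code; the GENHD_FL_HIDDEN check is omitted for brevity):

    /* del_gendisk(): make the disk unfindable atomically */
    down_write(&disk->lookup_sem);
    /* ... unhash partition and whole-disk bdev inodes ... */
    disk->flags &= ~GENHD_FL_UP;
    up_write(&disk->lookup_sem);

    /* get_gendisk(): never return a disk that is (being) deleted */
    down_read(&disk->lookup_sem);
    if (!(disk->flags & GENHD_FL_UP)) {
        up_read(&disk->lookup_sem);
        put_disk_and_module(disk);
        return NULL;            /* open fails instead of racing */
    }
    up_read(&disk->lookup_sem);
    return disk;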



[PATCH 0/6] block: Fix races in bdev - gendisk handling

2018-02-06 Thread Jan Kara
Hello,

these patches fix races, in the association of a block device inode with its
corresponding gendisk, that happen when devices are frequently destroyed and
recreated. Generally, when such a race happens it results in use-after-free
issues, block device page cache inconsistencies, or other problems. I have
verified that these patches fix use-after-free issues that could be
reproduced by frequent creation
and destruction of loop device. I was not able to reproduce races reported by
Hou Tao [1] related to gendisk-blkdev association (at least I was not able to
hit any issues with stressing using scsi_debug in various ways with or without
patches). My patches should fix those but it would be great if Tao could verify
that. Any comments and review welcome!

Honza

[1] https://www.spinics.net/lists/linux-block/msg20015.html


[PATCH 2/6] genhd: Rename get_disk() to get_disk_and_module()

2018-02-06 Thread Jan Kara
Rename get_disk() to get_disk_and_module() to make clear what the
function does. It's not a great name but at least it is now clear that
put_disk() is not its counterpart.

Signed-off-by: Jan Kara 
---
 block/genhd.c   | 10 --
 drivers/block/amiflop.c |  2 +-
 drivers/block/ataflop.c |  2 +-
 drivers/block/brd.c |  2 +-
 drivers/block/floppy.c  |  2 +-
 drivers/block/loop.c|  2 +-
 drivers/block/swim.c|  2 +-
 drivers/block/z2ram.c   |  2 +-
 drivers/ide/ide-probe.c |  2 +-
 include/linux/genhd.h   |  2 +-
 10 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 22f4fc340a3a..058cf91f8d32 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -547,7 +547,7 @@ static int exact_lock(dev_t devt, void *data)
 {
struct gendisk *p = data;
 
-   if (!get_disk(p))
+   if (!get_disk_and_module(p))
return -1;
return 0;
 }
@@ -794,7 +794,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
 
spin_lock_bh(&ext_devt_lock);
part = idr_find(&ext_devt_idr, blk_mangle_minor(MINOR(devt)));
-   if (part && get_disk(part_to_disk(part))) {
+   if (part && get_disk_and_module(part_to_disk(part))) {
*partno = part->partno;
disk = part_to_disk(part);
}
@@ -1441,7 +1441,7 @@ struct gendisk *__alloc_disk_node(int minors, int node_id)
 }
 EXPORT_SYMBOL(__alloc_disk_node);
 
-struct kobject *get_disk(struct gendisk *disk)
+struct kobject *get_disk_and_module(struct gendisk *disk)
 {
struct module *owner;
struct kobject *kobj;
@@ -1459,15 +1459,13 @@ struct kobject *get_disk(struct gendisk *disk)
return kobj;
 
 }
-
-EXPORT_SYMBOL(get_disk);
+EXPORT_SYMBOL(get_disk_and_module);
 
 void put_disk(struct gendisk *disk)
 {
if (disk)
kobject_put(&disk_to_dev(disk)->kobj);
 }
-
 EXPORT_SYMBOL(put_disk);
 
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index e5aa62fcf5a8..3aaf6af3ec23 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1758,7 +1758,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (unit[drive].type->code == FD_NODRIVE)
return NULL;
*part = 0;
-   return get_disk(unit[drive].gendisk);
+   return get_disk_and_module(unit[drive].gendisk);
 }
 
 static int __init amiga_floppy_probe(struct platform_device *pdev)
diff --git a/drivers/block/ataflop.c b/drivers/block/ataflop.c
index 8bc3b9fd8dd2..dfb2c2622e5a 100644
--- a/drivers/block/ataflop.c
+++ b/drivers/block/ataflop.c
@@ -1917,7 +1917,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (drive >= FD_MAX_UNITS || type > NUM_DISK_MINORS)
return NULL;
*part = 0;
-   return get_disk(unit[drive].disk);
+   return get_disk_and_module(unit[drive].disk);
 }
 
 static int __init atari_floppy_init (void)
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 8028a3a7e7fd..deea78e485da 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -456,7 +456,7 @@ static struct kobject *brd_probe(dev_t dev, int *part, void 
*data)
 
mutex_lock(_devices_mutex);
brd = brd_init_one(MINOR(dev) / max_part, &new);
-   kobj = brd ? get_disk(brd->brd_disk) : NULL;
+   kobj = brd ? get_disk_and_module(brd->brd_disk) : NULL;
mutex_unlock(_devices_mutex);
 
if (new)
diff --git a/drivers/block/floppy.c b/drivers/block/floppy.c
index eae484acfbbc..8ec7235fc93b 100644
--- a/drivers/block/floppy.c
+++ b/drivers/block/floppy.c
@@ -4505,7 +4505,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
if (((*part >> 2) & 0x1f) >= ARRAY_SIZE(floppy_type))
return NULL;
*part = 0;
-   return get_disk(disks[drive]);
+   return get_disk_and_module(disks[drive]);
 }
 
 static int __init do_floppy_init(void)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d5fe720cf149..87855b5123a6 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1922,7 +1922,7 @@ static struct kobject *loop_probe(dev_t dev, int *part, 
void *data)
if (err < 0)
kobj = NULL;
else
-   kobj = get_disk(lo->lo_disk);
+   kobj = get_disk_and_module(lo->lo_disk);
mutex_unlock(&loop_index_mutex);
 
*part = 0;
diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 84434d3ea19b..64e066eba72e 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -799,7 +799,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
return NULL;
 
*part = 0;
-   return get_disk(swd->unit[drive].disk);
+   return get_disk_and_module(swd->unit[drive].disk);
 }
 
 static int swim_add_floppy(struct swim_priv *swd, 

[PATCH 6/6] blockdev: Avoid two active bdev inodes for one device

2018-02-06 Thread Jan Kara
When blkdev_open() races with device removal and creation it can happen
that unhashed bdev inode gets associated with newly created gendisk
like:

CPU0                            CPU1
blkdev_open()
  bdev = bd_acquire()
                                del_gendisk()
                                  bdev_unhash_inode(bdev);
                                remove device
                                create new device with the same number
  __blkdev_get()
    disk = get_gendisk()
      - gets reference to gendisk of the new device

Now another blkdev_open() will not find original 'bdev' as it got
unhashed, create a new one and associate it with the same 'disk' at
which point problems start as we have two independent page caches for
one device.

Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired gendisk reference. That way we make sure gendisk can
get associated only with visible bdev inodes.

Signed-off-by: Jan Kara 
---
 fs/block_dev.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index fe41a76769fa..fe09ef9c21f3 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1058,6 +1058,27 @@ static int bd_prepare_to_claim(struct block_device *bdev,
return 0;
 }
 
+static struct gendisk *bdev_get_gendisk(struct block_device *bdev, int *partno)
+{
+   struct gendisk *disk = get_gendisk(bdev->bd_dev, partno);
+
+   if (!disk)
+   return NULL;
+   /*
+* Now that we hold gendisk reference we make sure bdev we looked up is
+* not stale. If it is, it means device got removed and created before
+* we looked up gendisk and we fail open in such case. Associating
+* unhashed bdev with newly created gendisk could lead to two bdevs
+* (and thus two independent caches) being associated with one device
+* which is bad.
+*/
+   if (inode_unhashed(bdev->bd_inode)) {
+   put_disk_and_module(disk);
+   return NULL;
+   }
+   return disk;
+}
+
 /**
  * bd_start_claiming - start claiming a block device
  * @bdev: block device of interest
@@ -1094,7 +1115,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
 * @bdev might not have been initialized properly yet, look up
 * and grab the outer block device the hard way.
 */
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
return ERR_PTR(-ENXIO);
 
@@ -1429,7 +1450,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
  restart:
 
ret = -ENXIO;
-   disk = get_gendisk(bdev->bd_dev, &partno);
+   disk = bdev_get_gendisk(bdev, &partno);
if (!disk)
goto out;
 
-- 
2.13.6



[PATCH 3/6] genhd: Add helper put_disk_and_module()

2018-02-06 Thread Jan Kara
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is open-coded in several places.

Signed-off-by: Jan Kara 
---
 block/blk-cgroup.c| 11 ++-
 block/genhd.c | 20 
 fs/block_dev.c| 19 +--
 include/linux/genhd.h |  1 +
 4 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4117524ca45b..c2033a232a44 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -812,7 +812,6 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
struct gendisk *disk;
struct request_queue *q;
struct blkcg_gq *blkg;
-   struct module *owner;
unsigned int major, minor;
int key_len, part, ret;
char *body;
@@ -904,9 +903,7 @@ int blkg_conf_prep(struct blkcg *blkcg, const struct 
blkcg_policy *pol,
spin_unlock_irq(q->queue_lock);
rcu_read_unlock();
 fail:
-   owner = disk->fops->owner;
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
/*
 * If queue was bypassing, we should retry.  Do so after a
 * short msleep().  It isn't strictly necessary but queue
@@ -931,13 +928,9 @@ EXPORT_SYMBOL_GPL(blkg_conf_prep);
 void blkg_conf_finish(struct blkg_conf_ctx *ctx)
__releases(ctx->disk->queue->queue_lock) __releases(rcu)
 {
-   struct module *owner;
-
spin_unlock_irq(ctx->disk->queue->queue_lock);
rcu_read_unlock();
-   owner = ctx->disk->fops->owner;
-   put_disk(ctx->disk);
-   module_put(owner);
+   put_disk_and_module(ctx->disk);
 }
 EXPORT_SYMBOL_GPL(blkg_conf_finish);
 
diff --git a/block/genhd.c b/block/genhd.c
index 058cf91f8d32..64c323549a22 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -802,10 +802,7 @@ struct gendisk *get_gendisk(dev_t devt, int *partno)
}
 
if (disk && unlikely(disk->flags & GENHD_FL_HIDDEN)) {
-   struct module *owner = disk->fops->owner;
-
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
disk = NULL;
}
return disk;
@@ -1468,6 +1465,21 @@ void put_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(put_disk);
 
+/*
+ * This is a counterpart of get_disk_and_module() and thus also of
+ * get_gendisk().
+ */
+void put_disk_and_module(struct gendisk *disk)
+{
+   if (disk) {
+   struct module *owner = disk->fops->owner;
+
+   put_disk(disk);
+   module_put(owner);
+   }
+}
+EXPORT_SYMBOL(put_disk_and_module);
+
 static void set_disk_ro_uevent(struct gendisk *gd, int ro)
 {
char event[] = "DISK_RO=1";
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4a181fcb5175..1dbbf847911a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -,8 +,7 @@ static struct block_device *bd_start_claiming(struct 
block_device *bdev,
else
whole = bdgrab(bdev);
 
-   module_put(disk->fops->owner);
-   put_disk(disk);
+   put_disk_and_module(disk);
if (!whole)
return ERR_PTR(-ENOMEM);
 
@@ -1407,7 +1406,6 @@ static void __blkdev_put(struct block_device *bdev, 
fmode_t mode, int for_part);
 static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
 {
struct gendisk *disk;
-   struct module *owner;
int ret;
int partno;
int perm = 0;
@@ -1433,7 +1431,6 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
disk = get_gendisk(bdev->bd_dev, &partno);
if (!disk)
goto out;
-   owner = disk->fops->owner;
 
disk_block_events(disk);
mutex_lock_nested(&bdev->bd_mutex, for_part);
@@ -1463,8 +1460,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
bdev->bd_queue = NULL;
mutex_unlock(&bdev->bd_mutex);
disk_unblock_events(disk);
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
goto restart;
}
}
@@ -1525,8 +1521,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
goto out_unlock_bdev;
}
/* only one opener holds refs to the module and disk */
-   put_disk(disk);
-   module_put(owner);
+   put_disk_and_module(disk);
}
bdev->bd_openers++;
if (for_part)
@@ -1546,8 +1541,7 @@ static int __blkdev_get(struct block_device *bdev, 
fmode_t mode, int for_part)
  out_unlock_bdev:
mutex_unlock(&bdev->bd_mutex);

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Bart Van Assche
On Tue, 2018-02-06 at 14:12 +0100, Roman Penyaev wrote:
> On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg  wrote:
> > [ ... ]
> > - srp/scst comparison is really not fair having it in legacy request
> >   mode. Can you please repeat it and report a bug to either linux-rdma
> >   or to the scst mailing list?
> 
> Yep, I can retest with mq.
> 
> > - Your latency measurements are surprisingly high for a null target
> >   device (even for low end nvme device actually) regardless of the
> >   transport implementation.
> 
> Hm, network configuration?  These are results on machines dedicated
> to our team for testing in one of our datacenters. Nothing special
> in configuration.

Hello Roman,

I agree that the latency numbers are way too high for a null target device.
Last time I measured latency for the SRP protocol against an SCST target
+ null block driver at the target side and ConnectX-3 adapters I measured a
latency of about 14 microseconds. That's almost 100 times less than the
measurement results in https://www.spinics.net/lists/linux-rdma/msg48799.html.

Something else I would like to understand better is how much of the latency
gap between NVMeOF/SRP and IBNBD can be closed without changing the wire
protocol. Was e.g. support for immediate data present in the NVMeOF and/or
SRP drivers used on your test setup? Are you aware that the NVMeOF target
driver calls page_alloc() from the hot path but that there are plans to
avoid these calls in the hot path by using a caching mechanism similar to
the SGV cache in SCST? Are you aware that a significant latency reduction
can be achieved by changing the SCST SGV cache from a global into a per-CPU
cache? Regarding the SRP measurements: have you tried to set the
never_register kernel module parameter to true? I'm asking this because I
think that mode is most similar to how the IBNBD initiator driver works.

Thanks,

Bart.

Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Ming Lei
Hi Kashyap,

On Tue, Feb 06, 2018 at 07:57:35PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Tuesday, February 6, 2018 6:02 PM
> > To: Kashyap Desai
> > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > Hi Kashyap,
> >
> > On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > > > -Original Message-
> > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > Sent: Tuesday, February 6, 2018 1:35 PM
> > > > To: Kashyap Desai
> > > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun
> > > > Easi; Omar
> > > Sandoval;
> > > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > > Peter
> > > > Rivera; Paolo Bonzini; Laurence Oberman
> > > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > > introduce force_blk_mq
> > > >
> > > > Hi Kashyap,
> > > >
> > > > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > > > We still have more than one reply queue ending up completion
> > > > > > > one
> > > CPU.
> > > > > >
> > > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that
> > > > > > means smp_affinity_enable has to be set as 1, but seems it is
> > > > > > the default
> > > > > setting.
> > > > > >
> > > > > > Please see kernel/irq/affinity.c, especially
> > > > > > irq_calc_affinity_vectors()
> > > > > which
> > > > > > figures out an optimal number of vectors, and the computation is
> > > > > > based
> > > > > on
> > > > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > > > mapped to some of reply queues, these queues won't be active(no
> > > > > > request submitted
> > > > > to
> > > > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > > > sure
> > > > > that
> > > > > > more than one irq vector won't be handled by one same CPU, and
> > > > > > the irq vector spread is done in irq_create_affinity_masks().
> > > > > >
> > > > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver
> > > > > > > via module parameter to simulate the issue. We need more
> > > > > > > number of Online CPU than reply-queue.
> > > > > >
> > > > > > IMO, you don't need to simulate the issue,
> > > > > > pci_alloc_irq_vectors(
> > > > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > > > returned
> > > > > irq
> > > > > > vector number, num_possible_cpus()/num_online_cpus() and each
> > > > > > irq vector's affinity assignment.
> > > > > >
> > > > > > > We may see completion redirected to original CPU because of
> > > > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep
> > > > > > > one CPU busy in local ISR routine.
> > > > > >
> > > > > > Could you dump each irq vector's affinity assignment of your
> > > > > > megaraid in
> > > > > your
> > > > > > test?
> > > > >
> > > > > To quickly reproduce, I restricted to single MSI-x vector on
> > > > > megaraid_sas driver.  System has total 16 online CPUs.
> > > >
> > > > I suggest you don't do the restriction of single MSI-x vector, and
> > > > just
> > > use the
> > > > device supported number of msi-x vector.
> > >
> > > Hi Ming,  CPU lock up is seen even though it is not single msi-x
> vector.
> > > Actual scenario need some specific topology and server for overnight
> test.
> > > Issue can be seen on servers which has more than 16 logical CPUs and
> > > Thunderbolt series MR controller which supports at max 16 MSIx
> vectors.
> > > >
> > > > >
> > > > > Output of affinity hints.
> > > > > kernel version:
> > > > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018
> > > > > x86_64
> > > > > x86_64
> > > > > x86_64 GNU/Linux
> > > > > PCI name is 83:00.0, dump its irq affinity:
> > > > > irq 105, cpu list 0-3,8-11
> > > >
> > > > In this case, which CPU is selected for handling the interrupt is
> > > decided by
> > > > interrupt controller, and it is easy to cause CPU overload if
> > > > interrupt
> > > controller
> > > > always selects one same CPU to handle the irq.
> > > >
> > > > >
> > > > > Affinity mask is created properly, but only CPU-0 is overloaded
> > > > > with interrupt processing.
> > > > >
> > > > > # numactl --hardware
> > > > > available: 2 nodes (0-1)
> > > > > node 0 cpus: 0 1 2 3 8 9 10 11
> > > > > node 0 size: 47861 MB
> > > > > node 0 free: 46516 MB
> > > > > node 1 cpus: 4 5 6 7 12 13 14 15
> > > > > node 1 size: 64491 MB
> > > > > node 1 free: 62805 MB
> > > > > node distances:
> > > > > node   0   1
> > > > >   0:  10  21
> > > > >   1:  21  10
> > > > >
> > > > > Output of  
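
For readers following the affinity discussion: below is a minimal sketch of
how a driver obtains managed, pre-spread MSI-X vectors and inspects the
resulting masks. It is a generic illustration with hypothetical names, not
megaraid_sas code:

    #include <linux/kernel.h>
    #include <linux/pci.h>

    static int hba_setup_reply_queues(struct pci_dev *pdev, int max_queues)
    {
        int i, nvec;

        /* The PCI core spreads the vectors over the possible CPUs and
         * marks them managed; irq_create_affinity_masks() does the math. */
        nvec = pci_alloc_irq_vectors(pdev, 1, max_queues,
                                     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
        if (nvec < 0)
            return nvec;

        for (i = 0; i < nvec; i++) {
            const struct cpumask *mask = pci_irq_get_affinity(pdev, i);

            if (mask)
                pr_info("reply queue %d served by CPUs %*pbl\n",
                        i, cpumask_pr_args(mask));
        }
        return nvec;
    }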

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Bart Van Assche
On Tue, 2018-02-06 at 10:44 +0100, Danil Kipnis wrote:
> the configuration (which devices can be accessed by a particular
> client) can happen also after the kernel target module is loaded. The
> directory in  is a module parameter and is fixed. It
> contains for example "/ibnbd_devices/". But a particular client X
> would be able to only access the devices located in the subdirectory
> "/ibnbd_devices/client_x/". (The sessionname here is client_x) One can
> add or remove the devices from that directory (those are just symlinks
> to /dev/xxx) at any time - before or after the server module is
> loaded. But you are right, we need something additional in order to be
> able to specify which devices a client can access writable and which
> readonly. May be another subdirectories "wr" and "ro" for each client:
> those under /ibnbd_devices/client_x/ro/ can only be read by client_x
> and those in /ibnbd_devices/client_x/wr/ can also be written to?

Please use a standard kernel filesystem (sysfs or configfs) instead of
reinventing it.

Thanks,

Bart.
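
For reference, a minimal configfs skeleton along those lines could look as
follows. All ibnbd_* names are hypothetical illustrations, not the actual
driver code; per-client "ro"/"wr" policy would become attributes or
subgroups hung off client_type:

    #include <linux/configfs.h>
    #include <linux/err.h>
    #include <linux/module.h>
    #include <linux/slab.h>

    static struct config_item_type client_type = {
        .ct_owner = THIS_MODULE,
    };

    /* userspace does: mkdir /sys/kernel/config/ibnbd/client_x */
    static struct config_group *ibnbd_make_client(struct config_group *parent,
                                                  const char *name)
    {
        struct config_group *grp = kzalloc(sizeof(*grp), GFP_KERNEL);

        if (!grp)
            return ERR_PTR(-ENOMEM);
        /* a real implementation also needs item_ops with a ->release()
         * so the allocation is freed again on rmdir */
        config_group_init_type_name(grp, name, &client_type);
        return grp;
    }

    static struct configfs_group_operations ibnbd_group_ops = {
        .make_group = ibnbd_make_client,
    };

    static struct config_item_type ibnbd_root_type = {
        .ct_group_ops = &ibnbd_group_ops,
        .ct_owner     = THIS_MODULE,
    };

    static struct configfs_subsystem ibnbd_subsys = {
        .su_group = {
            .cg_item = {
                .ci_namebuf = "ibnbd",
                .ci_type    = &ibnbd_root_type,
            },
        },
    };

    static int __init ibnbd_cfs_init(void)
    {
        config_group_init(&ibnbd_subsys.su_group);
        mutex_init(&ibnbd_subsys.su_mutex);
        return configfs_register_subsystem(&ibnbd_subsys);
    }
    module_init(ibnbd_cfs_init);
    MODULE_LICENSE("GPL");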

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Holger Hoffstätte
On 02/06/18 15:55, Paolo Valente wrote:
> 
> 
>> Il giorno 06 feb 2018, alle ore 14:40, Holger Hoffstätte 
>>  ha scritto:
>>
>>
>> The plot thickens!
>>
> 
> Yep, the culprit seems clearer, though ...
> 
>> Just as I was about to post that I didn't have any problems - because
>> I didn't have any - I decided to do a second test, activated bfq on my
>> workstation, on a hunch typed "sync" and .. the machine locked up, hard.
>>
>> Rebooted, activated bfq, typed sync..sync hangs. Luckily this time
>> a second terminal was still alive, so I could capture a trace for
>> your enjoyment:
>>
>> Feb  6 14:28:17 ragnarok kernel: io scheduler bfq registered
>> Feb  6 14:28:20 ragnarok kernel: BUG: unable to handle kernel NULL pointer 
>> dereference at 0030
>> Feb  6 14:28:20 ragnarok kernel: IP: bfq_put_queue+0x10b/0x130 [bfq]
>> Feb  6 14:28:20 ragnarok kernel: PGD 0 P4D 0 
>> Feb  6 14:28:20 ragnarok kernel: Oops:  [#1] SMP PTI
>> Feb  6 14:28:20 ragnarok kernel: Modules linked in: bfq lz4 lz4_compress 
>> lz4_decompress nfs lockd grace sunrpc autofs4 sch_fq_codel it87 hwmon_vid 
>> x86_pkg_temp_thermal snd_hda_codec_realtek coretemp radeon crc32_pclmul 
>> snd_hda_codec_generic crc32c_intel pcbc snd_hda_codec_hdmi i2c_algo_bit 
>> aesni_intel drm_kms_helper aes_x86_64 uvcvideo syscopyarea crypto_simd 
>> snd_hda_intel sysfillrect cryptd snd_usb_audio sysimgblt videobuf2_vmalloc 
>> glue_helper fb_sys_fops snd_hda_codec snd_hwdep videobuf2_memops ttm 
>> videobuf2_v4l2 snd_usbmidi_lib videobuf2_core snd_rawmidi snd_hda_core drm 
>> snd_seq_device videodev snd_pcm i2c_i801 usbhid snd_timer i2c_core snd 
>> backlight soundcore r8169 parport_pc mii parport
>> Feb  6 14:28:20 ragnarok kernel: CPU: 0 PID: 4 Comm: kworker/0:0H Not 
>> tainted 4.14.18 #1
>> Feb  6 14:28:20 ragnarok kernel: Hardware name: Gigabyte Technology Co., 
>> Ltd. P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
>> Feb  6 14:28:20 ragnarok kernel: Workqueue: kblockd blk_mq_requeue_work
>> Feb  6 14:28:20 ragnarok kernel: task: 88060395a1c0 task.stack: 
>> c9044000
>> Feb  6 14:28:20 ragnarok kernel: RIP: 0010:bfq_put_queue+0x10b/0x130 [bfq]
>> Feb  6 14:28:20 ragnarok kernel: RSP: 0018:c9047ca0 EFLAGS: 00010286
>> Feb  6 14:28:20 ragnarok kernel: RAX: 0008 RBX: 8806023db690 
>> RCX: 
>> Feb  6 14:28:20 ragnarok kernel: RDX:  RSI: 880601bb39b0 
>> RDI: 880601a56400
>> Feb  6 14:28:20 ragnarok kernel: RBP: 01bb3980 R08: 0053 
>> R09: 8806023db690
>> Feb  6 14:28:20 ragnarok kernel: R10: 1dd0f11e R11: 080a011b 
>> R12: 880601a56400
>> Feb  6 14:28:20 ragnarok kernel: R13: 8806023dbed0 R14: 0053 
>> R15: 
>> Feb  6 14:28:20 ragnarok kernel: FS:  () 
>> GS:88061f40() knlGS:
>> Feb  6 14:28:20 ragnarok kernel: CS:  0010 DS:  ES:  CR0: 
>> 80050033
>> Feb  6 14:28:20 ragnarok kernel: CR2: 0030 CR3: 0200a002 
>> CR4: 000606f0
>> Feb  6 14:28:20 ragnarok kernel: Call Trace:
>> Feb  6 14:28:20 ragnarok kernel:  bfq_finish_requeue_request+0x4b/0x370 [bfq]
>> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_requeue_request+0x57/0x130
>> Feb  6 14:28:20 ragnarok kernel:  blk_mq_dispatch_rq_list+0x1b3/0x510
>> Feb  6 14:28:20 ragnarok kernel:  ? __bfq_bfqd_reset_in_service+0x20/0x70 
>> [bfq]
>> Feb  6 14:28:20 ragnarok kernel:  ? bfq_bfqq_expire+0x212/0x740 [bfq]
>> Feb  6 14:28:20 ragnarok kernel:  blk_mq_sched_dispatch_requests+0xf0/0x170
>> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_run_hw_queue+0x4e/0x90
>> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_delay_run_hw_queue+0x73/0x80
>> Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queue+0x53/0x150
>> Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queues+0x3a/0x50
>> Feb  6 14:28:20 ragnarok kernel:  blk_mq_requeue_work+0x104/0x110
>> Feb  6 14:28:20 ragnarok kernel:  process_one_work+0x1d4/0x3d0
>> Feb  6 14:28:20 ragnarok kernel:  worker_thread+0x2b/0x3c0
>> Feb  6 14:28:20 ragnarok kernel:  ? process_one_work+0x3d0/0x3d0
>> Feb  6 14:28:20 ragnarok kernel:  kthread+0x117/0x130
>> Feb  6 14:28:20 ragnarok kernel:  ? kthread_create_on_node+0x40/0x40
>> Feb  6 14:28:20 ragnarok kernel:  ret_from_fork+0x1f/0x30
>> Feb  6 14:28:20 ragnarok kernel: Code: c1 e8 06 83 e0 01 48 83 f8 01 45 19 
>> f6 e8 ce 3a 00 00 41 83 e6 ee 48 89 c7 41 83 c6 53 e8 9e 3a 00 00 49 89 d9 
>> 45 89 f0 44 89 f9 <48> 8b 70 28 48 c7 c2 d8 00 25 a0 55 4c 89 ef e8 11 ba ea 
>> e0 8b 
>> Feb  6 14:28:20 ragnarok kernel: RIP: bfq_put_queue+0x10b/0x130 [bfq] RSP: 
>> c9047ca0
>> Feb  6 14:28:20 ragnarok kernel: CR2: 0030
>> Feb  6 14:28:20 ragnarok kernel: ---[ end trace 8b782ace30a4e7d8 ]---
>>
> 
> Same request: please
> gdb <kernel build dir>/block/bfq-iosched.o
> list *(bfq_finish_requeue_request+0x4b)
> list *(bfq_put_queue+0x10b)

(gdb) list 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Oleksandr Natalenko

06.02.2018 15:50, Paolo Valente wrote:
> Could you please do a
> gdb <kernel build dir>/block/bfq-iosched.o # or vmlinux.o if bfq is builtin
> list *(bfq_finish_requeue_request+0x54)
> list *(bfq_put_queue+0x10b)
> for me?


Yes. Just give me some time to recompile the kernel with minimal debug
info enabled. I'll then post the list/disasm output and new stack traces.


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 14:40, Holger Hoffstätte 
>  ha scritto:
> 
> 
> The plot thickens!
> 

Yep, the culprit seems clearer, though ...

> Just as I was about to post that I didn't have any problems - because
> I didn't have any - I decided to do a second test, activated bfq on my
> workstation, on a hunch typed "sync" and .. the machine locked up, hard.
> 
> Rebooted, activated bfq, typed sync..sync hangs. Luckily this time
> a second terminal was still alive, so I could capture a trace for
> your enjoyment:
> 
> Feb  6 14:28:17 ragnarok kernel: io scheduler bfq registered
> Feb  6 14:28:20 ragnarok kernel: BUG: unable to handle kernel NULL pointer 
> dereference at 0030
> Feb  6 14:28:20 ragnarok kernel: IP: bfq_put_queue+0x10b/0x130 [bfq]
> Feb  6 14:28:20 ragnarok kernel: PGD 0 P4D 0 
> Feb  6 14:28:20 ragnarok kernel: Oops:  [#1] SMP PTI
> Feb  6 14:28:20 ragnarok kernel: Modules linked in: bfq lz4 lz4_compress 
> lz4_decompress nfs lockd grace sunrpc autofs4 sch_fq_codel it87 hwmon_vid 
> x86_pkg_temp_thermal snd_hda_codec_realtek coretemp radeon crc32_pclmul 
> snd_hda_codec_generic crc32c_intel pcbc snd_hda_codec_hdmi i2c_algo_bit 
> aesni_intel drm_kms_helper aes_x86_64 uvcvideo syscopyarea crypto_simd 
> snd_hda_intel sysfillrect cryptd snd_usb_audio sysimgblt videobuf2_vmalloc 
> glue_helper fb_sys_fops snd_hda_codec snd_hwdep videobuf2_memops ttm 
> videobuf2_v4l2 snd_usbmidi_lib videobuf2_core snd_rawmidi snd_hda_core drm 
> snd_seq_device videodev snd_pcm i2c_i801 usbhid snd_timer i2c_core snd 
> backlight soundcore r8169 parport_pc mii parport
> Feb  6 14:28:20 ragnarok kernel: CPU: 0 PID: 4 Comm: kworker/0:0H Not tainted 
> 4.14.18 #1
> Feb  6 14:28:20 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. 
> P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
> Feb  6 14:28:20 ragnarok kernel: Workqueue: kblockd blk_mq_requeue_work
> Feb  6 14:28:20 ragnarok kernel: task: 88060395a1c0 task.stack: 
> c9044000
> Feb  6 14:28:20 ragnarok kernel: RIP: 0010:bfq_put_queue+0x10b/0x130 [bfq]
> Feb  6 14:28:20 ragnarok kernel: RSP: 0018:c9047ca0 EFLAGS: 00010286
> Feb  6 14:28:20 ragnarok kernel: RAX: 0008 RBX: 8806023db690 
> RCX: 
> Feb  6 14:28:20 ragnarok kernel: RDX:  RSI: 880601bb39b0 
> RDI: 880601a56400
> Feb  6 14:28:20 ragnarok kernel: RBP: 01bb3980 R08: 0053 
> R09: 8806023db690
> Feb  6 14:28:20 ragnarok kernel: R10: 1dd0f11e R11: 080a011b 
> R12: 880601a56400
> Feb  6 14:28:20 ragnarok kernel: R13: 8806023dbed0 R14: 0053 
> R15: 
> Feb  6 14:28:20 ragnarok kernel: FS:  () 
> GS:88061f40() knlGS:
> Feb  6 14:28:20 ragnarok kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> Feb  6 14:28:20 ragnarok kernel: CR2: 0030 CR3: 0200a002 
> CR4: 000606f0
> Feb  6 14:28:20 ragnarok kernel: Call Trace:
> Feb  6 14:28:20 ragnarok kernel:  bfq_finish_requeue_request+0x4b/0x370 [bfq]
> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_requeue_request+0x57/0x130
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_dispatch_rq_list+0x1b3/0x510
> Feb  6 14:28:20 ragnarok kernel:  ? __bfq_bfqd_reset_in_service+0x20/0x70 
> [bfq]
> Feb  6 14:28:20 ragnarok kernel:  ? bfq_bfqq_expire+0x212/0x740 [bfq]
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_sched_dispatch_requests+0xf0/0x170
> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_run_hw_queue+0x4e/0x90
> Feb  6 14:28:20 ragnarok kernel:  __blk_mq_delay_run_hw_queue+0x73/0x80
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queue+0x53/0x150
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queues+0x3a/0x50
> Feb  6 14:28:20 ragnarok kernel:  blk_mq_requeue_work+0x104/0x110
> Feb  6 14:28:20 ragnarok kernel:  process_one_work+0x1d4/0x3d0
> Feb  6 14:28:20 ragnarok kernel:  worker_thread+0x2b/0x3c0
> Feb  6 14:28:20 ragnarok kernel:  ? process_one_work+0x3d0/0x3d0
> Feb  6 14:28:20 ragnarok kernel:  kthread+0x117/0x130
> Feb  6 14:28:20 ragnarok kernel:  ? kthread_create_on_node+0x40/0x40
> Feb  6 14:28:20 ragnarok kernel:  ret_from_fork+0x1f/0x30
> Feb  6 14:28:20 ragnarok kernel: Code: c1 e8 06 83 e0 01 48 83 f8 01 45 19 f6 
> e8 ce 3a 00 00 41 83 e6 ee 48 89 c7 41 83 c6 53 e8 9e 3a 00 00 49 89 d9 45 89 
> f0 44 89 f9 <48> 8b 70 28 48 c7 c2 d8 00 25 a0 55 4c 89 ef e8 11 ba ea e0 8b 
> Feb  6 14:28:20 ragnarok kernel: RIP: bfq_put_queue+0x10b/0x130 [bfq] RSP: 
> c9047ca0
> Feb  6 14:28:20 ragnarok kernel: CR2: 0030
> Feb  6 14:28:20 ragnarok kernel: ---[ end trace 8b782ace30a4e7d8 ]---
> 

Same request: please
gdb <kernel build dir>/block/bfq-iosched.o
list *(bfq_finish_requeue_request+0x4b)
list *(bfq_put_queue+0x10b)

Thanks,
Paolo


> Yes, this is 4.14.x but with most of block/4.16, rock-solid otherwise (in 
> daily use).
> Looks like there is something wrong with this patch after 

RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 6, 2018 6:02 PM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Tuesday, February 6, 2018 1:35 PM
> > > To: Kashyap Desai
> > > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org;
> > > Christoph Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun
> > > Easi; Omar
> > Sandoval;
> > > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> > Peter
> > > Rivera; Paolo Bonzini; Laurence Oberman
> > > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags &
> > > introduce force_blk_mq
> > >
> > > Hi Kashyap,
> > >
> > > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > > We still have more than one reply queue ending up completion
> > > > > > one
> > CPU.
> > > > >
> > > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that
> > > > > means smp_affinity_enable has to be set as 1, but seems it is
> > > > > the default
> > > > setting.
> > > > >
> > > > > Please see kernel/irq/affinity.c, especially
> > > > > irq_calc_affinity_vectors()
> > > > which
> > > > > figures out an optimal number of vectors, and the computation is
> > > > > based
> > > > on
> > > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > > mapped to some of reply queues, these queues won't be active(no
> > > > > request submitted
> > > > to
> > > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > > sure
> > > > that
> > > > > more than one irq vector won't be handled by one same CPU, and
> > > > > the irq vector spread is done in irq_create_affinity_masks().
> > > > >
> > > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver
> > > > > > via module parameter to simulate the issue. We need more
> > > > > > number of Online CPU than reply-queue.
> > > > >
> > > > > IMO, you don't need to simulate the issue,
> > > > > pci_alloc_irq_vectors(
> > > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > > returned
> > > > irq
> > > > > vector number, num_possible_cpus()/num_online_cpus() and each
> > > > > irq vector's affinity assignment.
> > > > >
> > > > > > We may see completion redirected to original CPU because of
> > > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep
> > > > > > one CPU busy in local ISR routine.
> > > > >
> > > > > Could you dump each irq vector's affinity assignment of your
> > > > > megaraid in
> > > > your
> > > > > test?
> > > >
> > > > To quickly reproduce, I restricted to single MSI-x vector on
> > > > megaraid_sas driver.  System has total 16 online CPUs.
> > >
> > > I suggest you don't do the restriction of single MSI-x vector, and
> > > just
> > use the
> > > device supported number of msi-x vector.
> >
> > Hi Ming,  CPU lock up is seen even though it is not single msi-x
vector.
> > Actual scenario need some specific topology and server for overnight
test.
> > Issue can be seen on servers which has more than 16 logical CPUs and
> > Thunderbolt series MR controller which supports at max 16 MSIx
vectors.
> > >
> > > >
> > > > Output of affinity hints.
> > > > kernel version:
> > > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018
> > > > x86_64
> > > > x86_64
> > > > x86_64 GNU/Linux
> > > > PCI name is 83:00.0, dump its irq affinity:
> > > > irq 105, cpu list 0-3,8-11
> > >
> > > In this case, which CPU is selected for handling the interrupt is
> > decided by
> > > interrupt controller, and it is easy to cause CPU overload if
> > > interrupt
> > controller
> > > always selects one same CPU to handle the irq.
> > >
> > > >
> > > > Affinity mask is created properly, but only CPU-0 is overloaded
> > > > with interrupt processing.
> > > >
> > > > # numactl --hardware
> > > > available: 2 nodes (0-1)
> > > > node 0 cpus: 0 1 2 3 8 9 10 11
> > > > node 0 size: 47861 MB
> > > > node 0 free: 46516 MB
> > > > node 1 cpus: 4 5 6 7 12 13 14 15
> > > > node 1 size: 64491 MB
> > > > node 1 free: 62805 MB
> > > > node distances:
> > > > node   0   1
> > > >   0:  10  21
> > > >   1:  21  10
> > > >
> > > > Output of system activities (sar).  (gnice is 100% and it is
> > > > consumed in megaraid_sas ISR routine.)
> > > >
> > > >
> > > > 12:44:40 PM CPU   %usr  %nice  %sys   %iowait  %steal  %irq  %soft  %guest  %gnice  %idle
> > > > 12:44:41 PM all   6.03  0.00   29.98  0.00     0.00

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Mike Galbraith
On Tue, 2018-02-06 at 13:43 +0100, Holger Hoffstätte wrote:
> 
> A much more interesting question to me is why there is kyber in the middle. :)

Yeah, given per sysfs I have zero devices using kyber.

-Mike


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Oleksandr Natalenko

Hi.

06.02.2018 14:46, Mike Galbraith wrote:
>> Sorry for the noise, but just to make it clear, are we talking about
>> "deadline" or "mq-deadline" now?
>
> mq-deadline.


Okay, I've spent a little bit more time stressing the VM with BFQ +
this patch enabled, and managed to get it to crash relatively easily just
by traversing all the files in /usr and reading them. Three stack traces
from different boots are below:


===
[   33.147909] BUG: unable to handle kernel NULL pointer dereference at  
 (null)

[   33.156874] IP: blk_flush_complete_seq+0x20a/0x300
[   33.160400] PGD 0 P4D 0
[   33.162682] Oops: 0002 [#1] PREEMPT SMP PTI
[   33.166091] Modules linked in: crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc nls_iso8859_1 nls_cp437 aesni_intel vfat 
aes_x86_64 crypto_simd fat glue_helper cryptd bochs_drm ttm iTCO_wdt 
drm_kms_helper iTCO_vendor_support ppdev psmouse drm input_leds 
led_class syscopyarea intel_agp parport_pc joydev parport evdev rtc_cmos 
mousedev intel_gtt mac_hid i2c_i801 sysfillrect lpc_ich sysimgblt 
agpgart fb_sys_fops qemu_fw_cfg sch_fq_codel ip_tables x_tables 
hid_generic usbhid hid xfs libcrc32c crc32c_generic dm_mod raid10 md_mod 
sr_mod cdrom sd_mod uhci_hcd ahci libahci xhci_pci serio_raw atkbd 
libata ehci_pci virtio_net libps2 xhci_hcd ehci_hcd crc32c_intel 
scsi_mod virtio_pci usbcore virtio_ring usb_common virtio i8042 serio

[   33.226238] CPU: 0 PID: 0 Comm: BFS/0 Not tainted 4.15.0-pf2 #1
[   33.232823] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
0.0.0 02/06/2015

[   33.245062] RIP: 0010:blk_flush_complete_seq+0x20a/0x300
[   33.247781] RSP: 0018:88001f803ed0 EFLAGS: 00010086
[   33.250184] RAX: 8800199b5e00 RBX: 8800199b5de0 RCX: 
8800197f72a8
[   33.253220] RDX:  RSI: 8800199b5e10 RDI: 
8800197f7200
[   33.257429] RBP: 8800194090a0 R08: 8800199b5df0 R09: 

[   33.260756] R10: 88001df2b600 R11: 88001f803ce8 R12: 

[   33.268840] R13: 8800199b5e30 R14: 0046 R15: 
8800199b5de0
[   33.272722] FS:  () GS:88001f80() 
knlGS:

[   33.276798] CS:  0010 DS:  ES:  CR0: 80050033
[   33.278981] CR2:  CR3: 04008003 CR4: 
003606f0
[   33.282078] DR0:  DR1:  DR2: 

[   33.285566] DR3:  DR6: fffe0ff0 DR7: 
0400

[   33.289322] Call Trace:
[   33.291094]  <IRQ>
[   33.294208]  mq_flush_data_end_io+0xb3/0xf0
[   33.301226]  scsi_end_request+0x90/0x1e0 [scsi_mod]
[   33.303660]  scsi_io_completion+0x237/0x650 [scsi_mod]
[   33.306833]  flush_smp_call_function_queue+0x7c/0xf0
[   33.311369]  smp_call_function_single_interrupt+0x32/0xf0
[   33.314245]  call_function_single_interrupt+0xa2/0xb0
[   33.316755]  </IRQ>
[   33.318365] RIP: 0010:native_safe_halt+0x2/0x10
[   33.321988] RSP: 0018:82003ea0 EFLAGS: 0202 ORIG_RAX: 
ff04
[   33.324995] RAX:  RBX:  RCX: 

[   33.334532] RDX:  RSI: 81e43850 RDI: 
81e43a9d
[   33.336643] RBP: 82128080 R08: 000e7441b6ce R09: 
88001a4c2c00
[   33.339282] R10:  R11: 88001bceb7f0 R12: 
81ea5c92
[   33.344597] R13:  R14:  R15: 
1e07cd69

[   33.346526]  default_idle+0x18/0x130
[   33.347929]  do_idle+0x167/0x1d0
[   33.349201]  cpu_startup_entry+0x6f/0x80
[   33.350553]  start_kernel+0x4c2/0x4e2
[   33.351822]  secondary_startup_64+0xa5/0xb0
[   33.354470] Code: 39 d0 0f 84 8f 00 00 00 48 8b 97 b0 00 00 00 49 c1 
e0 04 45 31 e4 48 8b b7 a8 00 00 00 49 01 d8 48 8d 8f a8 00 00 00 48 89 
56 08 <48> 89 32 49 8b 50 18 49 89 48 18 48 89 87 a8 00 00 00 48 89 97
[   33.364997] RIP: blk_flush_complete_seq+0x20a/0x300 RSP: 
88001f803ed0

[   33.366759] CR2: 
[   33.367921] ---[ end trace 5179c1a9cbc4eaa2 ]---
[   33.369413] Kernel panic - not syncing: Fatal exception in interrupt
[   33.372089] Kernel Offset: disabled
[   33.374413] ---[ end Kernel panic - not syncing: Fatal exception in 
interrupt

===

===
[   30.937747] BUG: unable to handle kernel NULL pointer dereference at 
0028

[   30.940984] IP: bfq_put_queue+0x10b/0x130
[   30.941609] PGD 0 P4D 0
[   30.941976] Oops:  [#1] PREEMPT SMP PTI
[   30.942560] Modules linked in: crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel nls_iso8859_1 bochs_drm pcbc nls_cp437 vfat ttm 
aesni_intel fat aes_x86_64 crypto_simd glue_helper cryptd drm_kms_helper 
iTCO_wdt drm ppdev iTCO_vendor_support input_leds psmouse led_class 
intel_agp parport_pc joydev mousedev intel_gtt syscopyarea parport 
rtc_cmos evdev sysfillrect sysimgblt agpgart fb_sys_fops i2c_i801 
mac_hid qemu_fw_cfg lpc_ich sch_fq_codel ip_tables x_tables xfs 
libcrc32c crc32c_generic dm_mod raid10 md_mod hid_generic usbhid hid 
sr_mod 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Mike Galbraith
On Tue, 2018-02-06 at 13:26 +0100, Paolo Valente wrote:
> 
> ok, right in the middle of bfq this time ... Was this the first OOPS in your 
> kernel log?

Yeah.


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Mike Galbraith
On Tue, 2018-02-06 at 13:16 +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> 06.02.2018 12:57, Mike Galbraith wrote:
> > Not me.  Box seems to be fairly sure that it is bfq.  Twice again box
> > went belly up on me in fairly short order with bfq, but seemed fine
> > with deadline.  I'm currently running deadline again, and box again
> > seems solid, thought I won't say _is_ solid until it's been happily
> > trundling along with deadline for a quite a bit longer.
> 
> Sorry for the noise, but just to make it clear, are we talking about 
> "deadline" or "mq-deadline" now?

mq-deadline.


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Holger Hoffstätte

The plot thickens!

Just as I was about to post that I didn't have any problems - because
I didn't have any - I decided to do a second test, activated bfq on my
workstation, on a hunch typed "sync" and .. the machine locked up, hard.

Rebooted, activated bfq, typed sync..sync hangs. Luckily this time
a second terminal was still alive, so I could capture a trace for
your enjoyment:

Feb  6 14:28:17 ragnarok kernel: io scheduler bfq registered
Feb  6 14:28:20 ragnarok kernel: BUG: unable to handle kernel NULL pointer 
dereference at 0030
Feb  6 14:28:20 ragnarok kernel: IP: bfq_put_queue+0x10b/0x130 [bfq]
Feb  6 14:28:20 ragnarok kernel: PGD 0 P4D 0 
Feb  6 14:28:20 ragnarok kernel: Oops:  [#1] SMP PTI
Feb  6 14:28:20 ragnarok kernel: Modules linked in: bfq lz4 lz4_compress 
lz4_decompress nfs lockd grace sunrpc autofs4 sch_fq_codel it87 hwmon_vid 
x86_pkg_temp_thermal snd_hda_codec_realtek coretemp radeon crc32_pclmul 
snd_hda_codec_generic crc32c_intel pcbc snd_hda_codec_hdmi i2c_algo_bit 
aesni_intel drm_kms_helper aes_x86_64 uvcvideo syscopyarea crypto_simd 
snd_hda_intel sysfillrect cryptd snd_usb_audio sysimgblt videobuf2_vmalloc 
glue_helper fb_sys_fops snd_hda_codec snd_hwdep videobuf2_memops ttm 
videobuf2_v4l2 snd_usbmidi_lib videobuf2_core snd_rawmidi snd_hda_core drm 
snd_seq_device videodev snd_pcm i2c_i801 usbhid snd_timer i2c_core snd 
backlight soundcore r8169 parport_pc mii parport
Feb  6 14:28:20 ragnarok kernel: CPU: 0 PID: 4 Comm: kworker/0:0H Not tainted 
4.14.18 #1
Feb  6 14:28:20 ragnarok kernel: Hardware name: Gigabyte Technology Co., Ltd. 
P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011
Feb  6 14:28:20 ragnarok kernel: Workqueue: kblockd blk_mq_requeue_work
Feb  6 14:28:20 ragnarok kernel: task: 88060395a1c0 task.stack: 
c9044000
Feb  6 14:28:20 ragnarok kernel: RIP: 0010:bfq_put_queue+0x10b/0x130 [bfq]
Feb  6 14:28:20 ragnarok kernel: RSP: 0018:c9047ca0 EFLAGS: 00010286
Feb  6 14:28:20 ragnarok kernel: RAX: 0008 RBX: 8806023db690 
RCX: 
Feb  6 14:28:20 ragnarok kernel: RDX:  RSI: 880601bb39b0 
RDI: 880601a56400
Feb  6 14:28:20 ragnarok kernel: RBP: 01bb3980 R08: 0053 
R09: 8806023db690
Feb  6 14:28:20 ragnarok kernel: R10: 1dd0f11e R11: 080a011b 
R12: 880601a56400
Feb  6 14:28:20 ragnarok kernel: R13: 8806023dbed0 R14: 0053 
R15: 
Feb  6 14:28:20 ragnarok kernel: FS:  () 
GS:88061f40() knlGS:
Feb  6 14:28:20 ragnarok kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Feb  6 14:28:20 ragnarok kernel: CR2: 0030 CR3: 0200a002 
CR4: 000606f0
Feb  6 14:28:20 ragnarok kernel: Call Trace:
Feb  6 14:28:20 ragnarok kernel:  bfq_finish_requeue_request+0x4b/0x370 [bfq]
Feb  6 14:28:20 ragnarok kernel:  __blk_mq_requeue_request+0x57/0x130
Feb  6 14:28:20 ragnarok kernel:  blk_mq_dispatch_rq_list+0x1b3/0x510
Feb  6 14:28:20 ragnarok kernel:  ? __bfq_bfqd_reset_in_service+0x20/0x70 [bfq]
Feb  6 14:28:20 ragnarok kernel:  ? bfq_bfqq_expire+0x212/0x740 [bfq]
Feb  6 14:28:20 ragnarok kernel:  blk_mq_sched_dispatch_requests+0xf0/0x170
Feb  6 14:28:20 ragnarok kernel:  __blk_mq_run_hw_queue+0x4e/0x90
Feb  6 14:28:20 ragnarok kernel:  __blk_mq_delay_run_hw_queue+0x73/0x80
Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queue+0x53/0x150
Feb  6 14:28:20 ragnarok kernel:  blk_mq_run_hw_queues+0x3a/0x50
Feb  6 14:28:20 ragnarok kernel:  blk_mq_requeue_work+0x104/0x110
Feb  6 14:28:20 ragnarok kernel:  process_one_work+0x1d4/0x3d0
Feb  6 14:28:20 ragnarok kernel:  worker_thread+0x2b/0x3c0
Feb  6 14:28:20 ragnarok kernel:  ? process_one_work+0x3d0/0x3d0
Feb  6 14:28:20 ragnarok kernel:  kthread+0x117/0x130
Feb  6 14:28:20 ragnarok kernel:  ? kthread_create_on_node+0x40/0x40
Feb  6 14:28:20 ragnarok kernel:  ret_from_fork+0x1f/0x30
Feb  6 14:28:20 ragnarok kernel: Code: c1 e8 06 83 e0 01 48 83 f8 01 45 19 f6 
e8 ce 3a 00 00 41 83 e6 ee 48 89 c7 41 83 c6 53 e8 9e 3a 00 00 49 89 d9 45 89 
f0 44 89 f9 <48> 8b 70 28 48 c7 c2 d8 00 25 a0 55 4c 89 ef e8 11 ba ea e0 8b 
Feb  6 14:28:20 ragnarok kernel: RIP: bfq_put_queue+0x10b/0x130 [bfq] RSP: 
c9047ca0
Feb  6 14:28:20 ragnarok kernel: CR2: 0030
Feb  6 14:28:20 ragnarok kernel: ---[ end trace 8b782ace30a4e7d8 ]---

Yes, this is 4.14.x but with most of block/4.16, rock-solid otherwise (in daily 
use).
Looks like there is something wrong with this patch after all..

cheers
Holger



Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Roman Penyaev
Hi Sagi,

On Mon, Feb 5, 2018 at 1:16 PM, Sagi Grimberg  wrote:
> Hi Roman and the team,
>
> On 02/02/2018 04:08 PM, Roman Pen wrote:
>>
>> This series introduces IBNBD/IBTRS modules.
>>
>> IBTRS (InfiniBand Transport) is a reliable high speed transport library
>> which allows for establishing connection between client and server
>> machines via RDMA.
>
>
> So its not strictly infiniband correct?

This is RDMA.  Original IB prefix is a bit confusing, that's true.

>> It is optimized to transfer (read/write) IO blocks
>> in the sense that it follows the BIO semantics of providing the
>> possibility to either write data from a scatter-gather list to the
>> remote side or to request ("read") data transfer from the remote side
>> into a given set of buffers.
>>
>> IBTRS is multipath capable and provides I/O fail-over and load-balancing
>> functionality.
>
>
> Couple of questions on your multipath implementation?
> 1. What was your main objective over dm-multipath?

No objections, mpath is a part of the transport ibtrs library.

> 2. What was the consideration of this implementation over
> creating a stand-alone bio based device node to reinject the
> bio to the original block device?

ibnbd and ibtrs are separate; on fail-over or load-balancing
we work with IO requests inside the library.

>> IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
>> (client and server) that allow for remote access of a block device on
>> the server over IBTRS protocol. After being mapped, the remote block
>> devices can be accessed on the client side as local block devices.
>> Internally IBNBD uses IBTRS as an RDMA transport library.
>>
>> Why?
>>
>> - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
>>   thus internal protocol is simple and consists of several request
>>  types only without awareness of underlying hardware devices.
>
>
> Can you explain how the protocol is developed for thin-p? What are the
> essence of how its suited for it?

Here I wanted to emphasize that we do not support any HW commands,
like nvme does; thus the internal protocol consists of only a few commands.
So answering your question "how the protocol is developed for thin-p",
I would put it the other way around: "the protocol does nothing to support
a real device, because all we need is to map thin-p volumes".  It is just
simpler.

>> - IBTRS was developed as an independent RDMA transport library, which
>>   supports fail-over and load-balancing policies using multipath, thus
>>  it can be used for any other IO needs rather than only for block
>>  device.
>
>
> What do you mean by "any other IO"?

I mean other IO producers, not only ibnbd, since this is just a transport
library.

>
>> - IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
>>   https://www.spinics.net/lists/linux-rdma/msg48799.html
>>   (I retested on latest 4.14 kernel - there is no any significant
>>   difference, thus I post the old link).
>
>
> That is interesting to learn.
>
> Reading your reference brings a couple of questions though,
> - Its unclear to me how ibnbd performs reads without performing memory
>   registration. Is it using the global dma rkey?

Yes, global rkey.

WRITE: writes from client
READ: writes from server
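
To illustrate what the global rkey means here: with no per-IO memory
registration, the peer's buffer is addressed directly in the work request.
A rough sketch, with all names hypothetical and not taken from the IBTRS
code:

    struct ib_sge sge = {
        .addr   = dma_addr,              /* from ib_dma_map_single() */
        .length = len,
        .lkey   = pd->local_dma_lkey,
    };
    struct ib_rdma_wr wr = {
        .wr = {
            .opcode     = IB_WR_RDMA_WRITE,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IB_SEND_SIGNALED,
        },
        .remote_addr = peer_buf_addr,    /* advertised by the peer */
        .rkey        = peer_global_rkey, /* e.g. pd->unsafe_global_rkey */
    };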

> - Its unclear to me how there is a difference in noreg in writes,
>   because for small writes nvme-rdma never register memory (it uses
>   inline data).

No support for that.

> - Looks like with nvme-rdma you max out your iops at 1.6 MIOPs, that
>   seems considerably low against other reports. Can you try and explain
>   what was the bottleneck? This can be a potential bug and I (and the
>   rest of the community is interesting in knowing more details).

Sure, I can try.  BTW, what are other reports and numbers?

> - srp/scst comparison is really not fair having it in legacy request
>   mode. Can you please repeat it and report a bug to either linux-rdma
>   or to the scst mailing list?

Yep, I can retest with mq.

> - Your latency measurements are surprisingly high for a null target
>   device (even for low end nvme device actually) regardless of the
>   transport implementation.

Hm, network configuration?  These are results on machines dedicated
to our team for testing in one of our datacenters. Nothing special
in configuration.

> For example:
> - QD=1 read latency is 648.95 for ibnbd (I assume usecs right?) which is
>   fairly high. on nvme-rdma its 1058 us, which means over 1 millisecond
>   and even 1.254 ms for srp. Last time I tested nvme-rdma read QD=1
>   latency I got ~14 us. So something does not add up here. If this is
>   not some configuration issue, then we have serious bugs to handle..
>
> - QD=16 the read latencies are > 10ms for null devices?! I'm having
>   troubles understanding how you were able to get such high latencies
>   (> 100 ms for QD>=100)

What does QD stand for? Queue depth?  This is not a queue depth, this 

Re: [PATCH 09/24] ibtrs: server: main functionality

2018-02-06 Thread Roman Penyaev
On Mon, Feb 5, 2018 at 12:29 PM, Sagi Grimberg  wrote:
> Hi Roman,
>
> Some comments below.
>
>
> On 02/02/2018 04:08 PM, Roman Pen wrote:
>>
>> This is main functionality of ibtrs-server module, which accepts
>> set of RDMA connections (so called IBTRS session), creates/destroys
>> sysfs entries associated with IBTRS session and notifies upper layer
>> (user of IBTRS API) about RDMA requests or link events.
>>
>> Signed-off-by: Roman Pen 
>> Signed-off-by: Danil Kipnis 
>> Cc: Jack Wang 
>> ---
>>   drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1811
>> ++
>>   1 file changed, 1811 insertions(+)
>>
>> diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>> b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>> new file mode 100644
>> index ..0d1fc08bd821
>> --- /dev/null
>> +++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
>> @@ -0,0 +1,1811 @@
>> +/*
>> + * InfiniBand Transport Layer
>> + *
>> + * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
>> + * Authors: Fabian Holler 
>> + *  Jack Wang 
>> + *  Kleber Souza 
>> + *  Danil Kipnis 
>> + *  Roman Penyaev 
>> + *  Milind Dumbare 
>> + *
>> + * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
>> + * Authors: Danil Kipnis 
>> + *  Roman Penyaev 
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License
>> + * as published by the Free Software Foundation; either version 2
>> + * of the License, or (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +
>> +#undef pr_fmt
>> +#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
>> +
>> +#include <linux/module.h>
>> +#include <linux/mempool.h>
>> +
>> +#include "ibtrs-srv.h"
>> +#include "ibtrs-log.h"
>> +
>> +MODULE_AUTHOR("ib...@profitbricks.com");
>> +MODULE_DESCRIPTION("IBTRS Server");
>> +MODULE_VERSION(IBTRS_VER_STRING);
>> +MODULE_LICENSE("GPL");
>> +
>> +#define DEFAULT_MAX_IO_SIZE_KB 128
>> +#define DEFAULT_MAX_IO_SIZE (DEFAULT_MAX_IO_SIZE_KB * 1024)
>> +#define MAX_REQ_SIZE PAGE_SIZE
>> +#define MAX_SG_COUNT ((MAX_REQ_SIZE - sizeof(struct ibtrs_msg_rdma_read))
>> \
>> + / sizeof(struct ibtrs_sg_desc))
>> +
>> +static int max_io_size = DEFAULT_MAX_IO_SIZE;
>> +static int rcv_buf_size = DEFAULT_MAX_IO_SIZE + MAX_REQ_SIZE;
>> +
>> +static int max_io_size_set(const char *val, const struct kernel_param
>> *kp)
>> +{
>> +   int err, ival;
>> +
>> +   err = kstrtoint(val, 0, &ival);
>> +   if (err)
>> +   return err;
>> +
>> +   if (ival < 4096 || ival + MAX_REQ_SIZE > (4096 * 1024) ||
>> +   (ival + MAX_REQ_SIZE) % 512 != 0) {
>> +   pr_err("Invalid max io size value %d, has to be"
>> +  " > %d, < %d\n", ival, 4096, 4194304);
>> +   return -EINVAL;
>> +   }
>> +
>> +   max_io_size = ival;
>> +   rcv_buf_size = max_io_size + MAX_REQ_SIZE;
>> +   pr_info("max io size changed to %d\n", ival);
>> +
>> +   return 0;
>> +}
>> +
>> +static const struct kernel_param_ops max_io_size_ops = {
>> +   .set= max_io_size_set,
>> +   .get= param_get_int,
>> +};
>> +module_param_cb(max_io_size, &max_io_size_ops, &max_io_size, 0444);
>> +MODULE_PARM_DESC(max_io_size,
>> +"Max size for each IO request, when change the unit is in
>> byte"
>> +" (default: " __stringify(DEFAULT_MAX_IO_SIZE_KB) "KB)");
>> +
>> +#define DEFAULT_SESS_QUEUE_DEPTH 512
>> +static int sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
>> +module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
>> +MODULE_PARM_DESC(sess_queue_depth,
>> +"Number of buffers for pending I/O requests to allocate"
>> +" per session. Maximum: "
>> __stringify(MAX_SESS_QUEUE_DEPTH)
>> +" (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
>> +
>> +/* We guarantee to serve 10 paths at least */
>> +#define CHUNK_POOL_SIZE (DEFAULT_SESS_QUEUE_DEPTH * 10)
>> +static mempool_t *chunk_pool;
>> +
>> +static int retry_count = 7;
>> +
>> +static int retry_count_set(const char *val, const struct kernel_param
>> *kp)
>> +{
>> +   int err, 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Holger Hoffstätte
On 02/06/18 13:26, Paolo Valente wrote:
(..)
> As Oleksandr asked too, is it deadline or mq-deadline?

You can use "deadline" as an alias as long as blk-mq is active.
This doesn't work when mq-deadline is built as a module, but that
doesn't seem to be the problem here.
 
>> [  484.179292] BUG: unable to handle kernel paging request at 
>> a0817000
>> [  484.179436] IP: __trace_note_message+0x1f/0xd0
>> [  484.179576] PGD 1e0c067 P4D 1e0c067 PUD 1e0d063 PMD 3faff2067 PTE 0
>> [  484.179719] Oops:  [#1] SMP PTI
>> [  484.179861] Dumping ftrace buffer:
>> [  484.180011](ftrace buffer empty)
>> [  484.180138] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
>> af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
>> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
>> iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
>> nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) 
>> ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) 
>> ip6_tables(E) x_tables(E) nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) 
>> x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_hda_codec_hdmi(E) 
>> coretemp(E) kvm_intel(E) snd_hda_codec_realtek(E) kvm(E) 
>> snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) sr_mod(E) 
>> snd_hwdep(E) cdrom(E) joydev(E) snd_hda_core(E) snd_pcm(E) snd_timer(E) 
>> irqbypass(E) snd(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) 
>> r8169(E)
>> [  484.180740]  iTCO_wdt(E) ghash_clmulni_intel(E) mii(E) 
>> iTCO_vendor_support(E) pcbc(E) aesni_intel(E) soundcore(E) aes_x86_64(E) 
>> shpchp(E) crypto_simd(E) lpc_ich(E) glue_helper(E) i2c_i801(E) mei_me(E) 
>> mfd_core(E) mei(E) cryptd(E) intel_smartconnect(E) pcspkr(E) fan(E) 
>> thermal(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
>> hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) 
>> hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) 
>> syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) 
>> xhci_pci(E) ehci_pci(E) libahci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) libata(E) 
>> drm(E) usbcore(E) video(E) button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) 
>> virtio_mmio(E) virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) 
>> mbcache(E) jbd2(E) loop(E) sg(E) dm_multipath(E)
>> [  484.181421]  dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
>> scsi_mod(E) efivarfs(E) autofs4(E)
>> [  484.181583] CPU: 3 PID: 500 Comm: kworker/3:1H Tainted: GE
>> 4.15.0.ge237f98-master #609
>> [  484.181746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
>> 09/23/2013
>> [  484.181910] Workqueue: kblockd blk_mq_requeue_work
>> [  484.182076] RIP: 0010:__trace_note_message+0x1f/0xd0
>> [  484.182250] RSP: 0018:8803f45bfc20 EFLAGS: 00010282
>> [  484.182436] RAX:  RBX: a0817000 RCX: 
>> 8803
>> [  484.182622] RDX: 81bf514d RSI:  RDI: 
>> a0817000
>> [  484.182810] RBP: 8803f45bfc80 R08: 0041 R09: 
>> 8803f69cc5d0
>> [  484.182998] R10: 8803f80b47d0 R11: 1000 R12: 
>> 8803f45e8000
>> [  484.183185] R13: 000d R14:  R15: 
>> 8803fba112c0
>> [  484.183372] FS:  () GS:88041ecc() 
>> knlGS:
>> [  484.183561] CS:  0010 DS:  ES:  CR0: 80050033
>> [  484.183747] CR2: a0817000 CR3: 01e0a006 CR4: 
>> 001606e0
>> [  484.183934] Call Trace:
>> [  484.184122]  bfq_put_queue+0xd3/0xe0
>> [  484.184305]  bfq_finish_requeue_request+0x72/0x350
>> [  484.184493]  __blk_mq_requeue_request+0x8f/0x120
>> [  484.184678]  blk_mq_dispatch_rq_list+0x342/0x550
>> [  484.184866]  ? kyber_dispatch_request+0xd0/0xd0
>> [  484.185053]  blk_mq_sched_dispatch_requests+0xf7/0x180
>> [  484.185238]  __blk_mq_run_hw_queue+0x58/0xd0
>> [  484.185429]  __blk_mq_delay_run_hw_queue+0x99/0xa0
>> [  484.185614]  blk_mq_run_hw_queue+0x54/0xf0
>> [  484.185805]  blk_mq_run_hw_queues+0x4b/0x60
>> [  484.185994]  blk_mq_requeue_work+0x13a/0x150
>> [  484.186192]  process_one_work+0x147/0x350
>> [  484.186383]  worker_thread+0x47/0x3e0
>> [  484.186572]  kthread+0xf8/0x130
>> [  484.186760]  ? rescuer_thread+0x360/0x360
>> [  484.186948]  ? kthread_stop+0x120/0x120
>> [  484.187137]  ret_from_fork+0x35/0x40
>> [  484.187321] Code: ff 48 89 44 24 10 e9 58 fd ff ff 90 55 48 89 e5 41 55 
>> 41 54 53 48 89 fb 48 83 ec 48 48 89 4c 24 30 4c 89 44 24 38 4c 89 4c 24 40 
>> <83> 3f 02 0f 85 87 00 00 00 f6 43 21 04 75 0b 48 83 c4 48 5b 41 
>> [  484.187525] RIP: __trace_note_message+0x1f/0xd0 RSP: 8803f45bfc20
>> [  484.187727] CR2: a0817000
> 
> ok, right in the middle of bfq this time ... Was this the first OOPS in your 
> kernel log?

A much more interesting question to me is why there is kyber in the middle. :)

Holger


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 13:26, Paolo Valente 
>  ha scritto:
> 
> 
> 
>> Il giorno 06 feb 2018, alle ore 12:57, Mike Galbraith  ha 
>> scritto:
>> 
>> On Tue, 2018-02-06 at 10:38 +0100, Paolo Valente wrote:
>>> 
>>> Hi Mike,
>>> as you can imagine, I didn't get any failure in my pre-submission
>>> tests on this patch.  In addition, it is not that easy to link this
>>> patch, which just adds some internal bfq housekeeping in case of a
>>> requeue, with a corruption of external lists for general I/O
>>> management.
>>> 
>>> In this respect, as Oleksandr comments point out, by switching from
>>> cfq to bfq, you switch between much more than two schedulers.  Anyway,
>>> who knows ...
>> 
>> Not me.  Box seems to be fairly sure that it is bfq.
> 
> Yeah, sorry for the too short comment: what I meant is that cfq (and
> deadline) are in legacy blk, while bfq is in blk-mq.  So, to use bfq,
> you must also switch from legacy-blk I/O stack to blk-mq I/O stack.
> 
> 
>> Twice again box
>> went belly up on me in fairly short order with bfq, but seemed fine
>> with deadline.  I'm currently running deadline again, and box again
>> seems solid, though I won't say it _is_ solid until it's been happily
>> trundling along with deadline for quite a bit longer.
>> 
> 
> As Oleksandr asked too, is it deadline or mq-deadline?
> 
>> I was ssh'd in during the last episode, got this out.  I should be
>> getting crash dumps, but seems kdump is only working intermittently
>> atm.  I did get one earlier, but 3 of 4 times not.  Hohum.
>> 
>> [  484.179292] BUG: unable to handle kernel paging request at 
>> a0817000
>> [  484.179436] IP: __trace_note_message+0x1f/0xd0
>> [  484.179576] PGD 1e0c067 P4D 1e0c067 PUD 1e0d063 PMD 3faff2067 PTE 0
>> [  484.179719] Oops:  [#1] SMP PTI
>> [  484.179861] Dumping ftrace buffer:
>> [  484.180011](ftrace buffer empty)
>> [  484.180138] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
>> af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
>> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
>> iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
>> nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) 
>> ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) 
>> ip6_tables(E) x_tables(E) nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) 
>> x86_pkg_temp_thermal(E) intel_powerclamp(E) snd_hda_codec_hdmi(E) 
>> coretemp(E) kvm_intel(E) snd_hda_codec_realtek(E) kvm(E) 
>> snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) sr_mod(E) 
>> snd_hwdep(E) cdrom(E) joydev(E) snd_hda_core(E) snd_pcm(E) snd_timer(E) 
>> irqbypass(E) snd(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) 
>> r8169(E)
>> [  484.180740]  iTCO_wdt(E) ghash_clmulni_intel(E) mii(E) 
>> iTCO_vendor_support(E) pcbc(E) aesni_intel(E) soundcore(E) aes_x86_64(E) 
>> shpchp(E) crypto_simd(E) lpc_ich(E) glue_helper(E) i2c_i801(E) mei_me(E) 
>> mfd_core(E) mei(E) cryptd(E) intel_smartconnect(E) pcspkr(E) fan(E) 
>> thermal(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
>> hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) 
>> hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) 
>> syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) 
>> xhci_pci(E) ehci_pci(E) libahci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) libata(E) 
>> drm(E) usbcore(E) video(E) button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) 
>> virtio_mmio(E) virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) 
>> mbcache(E) jbd2(E) loop(E) sg(E) dm_multipath(E)
>> [  484.181421]  dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
>> scsi_mod(E) efivarfs(E) autofs4(E)
>> [  484.181583] CPU: 3 PID: 500 Comm: kworker/3:1H Tainted: GE
>> 4.15.0.ge237f98-master #609
>> [  484.181746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
>> 09/23/2013
>> [  484.181910] Workqueue: kblockd blk_mq_requeue_work
>> [  484.182076] RIP: 0010:__trace_note_message+0x1f/0xd0
>> [  484.182250] RSP: 0018:8803f45bfc20 EFLAGS: 00010282
>> [  484.182436] RAX:  RBX: a0817000 RCX: 
>> 8803
>> [  484.182622] RDX: 81bf514d RSI:  RDI: 
>> a0817000
>> [  484.182810] RBP: 8803f45bfc80 R08: 0041 R09: 
>> 8803f69cc5d0
>> [  484.182998] R10: 8803f80b47d0 R11: 1000 R12: 
>> 8803f45e8000
>> [  484.183185] R13: 000d R14:  R15: 
>> 8803fba112c0
>> [  484.183372] FS:  () GS:88041ecc() 
>> knlGS:
>> [  484.183561] CS:  0010 DS:  ES:  CR0: 80050033
>> [  484.183747] CR2: a0817000 CR3: 01e0a006 CR4: 
>> 001606e0
>> [  484.183934] Call Trace:
>> [  484.184122]  bfq_put_queue+0xd3/0xe0
>> [  484.184305]  bfq_finish_requeue_request+0x72/0x350
>> [  

Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Ming Lei
Hi Kashyap,

On Tue, Feb 06, 2018 at 04:59:51PM +0530, Kashyap Desai wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Tuesday, February 6, 2018 1:35 PM
> > To: Kashyap Desai
> > Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> > Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
> Sandoval;
> > Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
> Peter
> > Rivera; Paolo Bonzini; Laurence Oberman
> > Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> > force_blk_mq
> >
> > Hi Kashyap,
> >
> > On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > > We still have more than one reply queue ending up completion one
> CPU.
> > > >
> > > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > > > smp_affinity_enable has to be set as 1, but seems it is the default
> > > setting.
> > > >
> > > > Please see kernel/irq/affinity.c, especially
> > > > irq_calc_affinity_vectors()
> > > which
> > > > figures out an optimal number of vectors, and the computation is
> > > > based
> > > on
> > > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > > mapped to some of the reply queues, these queues won't be active (no
> > > > request submitted
> > > to
> > > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > > sure
> > > that
> > > > more than one irq vector won't be handled by one same CPU, and the
> > > > irq vector spread is done in irq_create_affinity_masks().
> > > >
> > > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > > > module parameter to simulate the issue. We need more number of
> > > > > Online CPU than reply-queue.
> > > >
> > > > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > > returned
> > > irq
> > > > vector number, num_possible_cpus()/num_online_cpus() and each irq
> > > > vector's affinity assignment.
> > > >
> > > > > We may see completion redirected to original CPU because of
> > > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one
> > > > > CPU busy in local ISR routine.
> > > >
> > > > Could you dump each irq vector's affinity assignment of your
> > > > megaraid in
> > > your
> > > > test?
> > >
> > > To quickly reproduce, I restricted to single MSI-x vector on
> > > megaraid_sas driver.  System has total 16 online CPUs.
> >
> > I suggest you don't do the restriction of single MSI-x vector, and just
> use the
> > device supported number of msi-x vector.
> 
> Hi Ming,  CPU lockup is seen even when it is not a single MSI-x vector.
> The actual scenario needs a specific topology and server for an overnight test.
> The issue can be seen on servers which have more than 16 logical CPUs and a
> Thunderbolt series MR controller which supports at most 16 MSI-x vectors.
> >
> > >
> > > Output of affinity hints.
> > > kernel version:
> > > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64
> > > x86_64
> > > x86_64 GNU/Linux
> > > PCI name is 83:00.0, dump its irq affinity:
> > > irq 105, cpu list 0-3,8-11
> >
> > In this case, which CPU is selected for handling the interrupt is
> decided by
> > interrupt controller, and it is easy to cause CPU overload if interrupt
> controller
> > always selects one same CPU to handle the irq.
> >
> > >
> > > Affinity mask is created properly, but only CPU-0 is overloaded with
> > > interrupt processing.
> > >
> > > # numactl --hardware
> > > available: 2 nodes (0-1)
> > > node 0 cpus: 0 1 2 3 8 9 10 11
> > > node 0 size: 47861 MB
> > > node 0 free: 46516 MB
> > > node 1 cpus: 4 5 6 7 12 13 14 15
> > > node 1 size: 64491 MB
> > > node 1 free: 62805 MB
> > > node distances:
> > > node   0   1
> > >   0:  10  21
> > >   1:  21  10
> > >
> > > Output of  system activities (sar).  (gnice is 100% and it is consumed
> > > in megaraid_sas ISR routine.)
> > >
> > >
> > > 12:44:40 PM CPU %usr %nice  %sys %iowait %steal %irq %soft %guest  %gnice %idle
> > > 12:44:41 PM all 6.03  0.00 29.98    0.00   0.00 0.00  0.00   0.00    0.00 63.99
> > > 12:44:41 PM   0 0.00  0.00  0.00    0.00   0.00 0.00  0.00   0.00  100.00  0.00
> > >
> > >
> > > In my test, I used rq_affinity is set to 2. (QUEUE_FLAG_SAME_FORCE). I
> > > also used " host_tagset" V2 patch set for megaraid_sas.
> > >
> > > Using the RFC requested in
> > > "https://marc.info/?l=linux-scsi=151601833418346=2 " the lockup is
> > > avoided (you can notice that gnice is shifted to softirq). Even though
> > > it is 100% consumed, there is always an exit from the completion loop
> > > due to the irqpoll_weight passed to irq_poll_init().
> > >
> > > Average:CPU  %usr %nice  %sys   %iowait%steal
> > > %irq %soft%guest%gnice 

Re: [PATCH 07/24] ibtrs: client: sysfs interface functions

2018-02-06 Thread Roman Penyaev
On Mon, Feb 5, 2018 at 12:20 PM, Sagi Grimberg  wrote:
> Hi Roman,
>
>
>> This is the sysfs interface to IBTRS sessions on client side:
>>
>>/sys/kernel/ibtrs_client//
>>  *** IBTRS session created by ibtrs_clt_open() API call
>>  |
>>  |- max_reconnect_attempts
>>  |  *** number of reconnect attempts for session
>>  |
>>  |- add_path
>>  |  *** adds another connection path into IBTRS session
>>  |
>>  |- paths//
>> *** established paths to server in a session
>> |
>> |- disconnect
>> |  *** disconnect path
>> |
>> |- reconnect
>> |  *** reconnect path
>> |
>> |- remove_path
>> |  *** remove current path
>> |
>> |- state
>> |  *** retrieve current path state
>> |
>> |- stats/
>>*** current path statistics
>>|
>>   |- cpu_migration
>>   |- rdma
>>   |- rdma_lat
>>   |- reconnects
>>   |- reset_all
>>   |- sg_entries
>>   |- wc_completions
>>
>> Signed-off-by: Roman Pen 
>> Signed-off-by: Danil Kipnis 
>> Cc: Jack Wang 
>
>
> I think stats usually belong in debugfs.

I will change that.
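
For reference, a minimal sketch of what that could look like (the names
here are made up; ibtrs_dbg_root would be a module-level dentry created
at load time):

	/* on session creation */
	sess->dbg_dir = debugfs_create_dir(sess->sessname, ibtrs_dbg_root);
	debugfs_create_file("rdma", 0444, sess->dbg_dir, sess,
			    &ibtrs_rdma_stats_fops);
	debugfs_create_file("reconnects", 0444, sess->dbg_dir, sess,
			    &ibtrs_reconnects_fops);

	/* on session destruction */
	debugfs_remove_recursive(sess->dbg_dir);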

--
Roman


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> Il giorno 06 feb 2018, alle ore 12:57, Mike Galbraith  ha 
> scritto:
> 
> On Tue, 2018-02-06 at 10:38 +0100, Paolo Valente wrote:
>> 
>> Hi Mike,
>> as you can imagine, I didn't get any failure in my pre-submission
>> tests on this patch.  In addition, it is not that easy to link this
>> patch, which just adds some internal bfq housekeeping in case of a
>> requeue, with a corruption of external lists for general I/O
>> management.
>> 
>> In this respect, as Oleksandr comments point out, by switching from
>> cfq to bfq, you switch between much more than two schedulers.  Anyway,
>> who knows ...
> 
> Not me.  Box seems to be fairly sure that it is bfq.

Yeah, sorry for the too short comment: what I meant is that cfq (and
deadline) are in legacy blk, while bfq is in blk-mq.  So, to use bfq,
you must also switch from legacy-blk I/O stack to blk-mq I/O stack.


>  Twice again box
> went belly up on me in fairly short order with bfq, but seemed fine
> with deadline.  I'm currently running deadline again, and box again
> seems solid, though I won't say it _is_ solid until it's been happily
> trundling along with deadline for quite a bit longer.
> 

As Oleksandr asked too, is it deadline or mq-deadline?

> I was ssh'd in during the last episode, got this out.  I should be
> getting crash dumps, but seems kdump is only working intermittently
> atm.  I did get one earlier, but 3 of 4 times not.  Hohum.
> 
> [  484.179292] BUG: unable to handle kernel paging request at a0817000
> [  484.179436] IP: __trace_note_message+0x1f/0xd0
> [  484.179576] PGD 1e0c067 P4D 1e0c067 PUD 1e0d063 PMD 3faff2067 PTE 0
> [  484.179719] Oops:  [#1] SMP PTI
> [  484.179861] Dumping ftrace buffer:
> [  484.180011](ftrace buffer empty)
> [  484.180138] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
> af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
> iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
> nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) 
> xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) 
> nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) x86_pkg_temp_thermal(E) 
> intel_powerclamp(E) snd_hda_codec_hdmi(E) coretemp(E) kvm_intel(E) 
> snd_hda_codec_realtek(E) kvm(E) snd_hda_codec_generic(E) snd_hda_intel(E) 
> snd_hda_codec(E) sr_mod(E) snd_hwdep(E) cdrom(E) joydev(E) snd_hda_core(E) 
> snd_pcm(E) snd_timer(E) irqbypass(E) snd(E) crct10dif_pclmul(E) 
> crc32_pclmul(E) crc32c_intel(E) r8169(E)
> [  484.180740]  iTCO_wdt(E) ghash_clmulni_intel(E) mii(E) 
> iTCO_vendor_support(E) pcbc(E) aesni_intel(E) soundcore(E) aes_x86_64(E) 
> shpchp(E) crypto_simd(E) lpc_ich(E) glue_helper(E) i2c_i801(E) mei_me(E) 
> mfd_core(E) mei(E) cryptd(E) intel_smartconnect(E) pcspkr(E) fan(E) 
> thermal(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
> hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) 
> usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) 
> sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ehci_pci(E) 
> libahci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) libata(E) drm(E) usbcore(E) 
> video(E) button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) 
> virtio_pci(E) virtio_ring(E) virtio(E) ext4(E) crc16(E) mbcache(E) jbd2(E) 
> loop(E) sg(E) dm_multipath(E)
> [  484.181421]  dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
> scsi_mod(E) efivarfs(E) autofs4(E)
> [  484.181583] CPU: 3 PID: 500 Comm: kworker/3:1H Tainted: GE
> 4.15.0.ge237f98-master #609
> [  484.181746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
> 09/23/2013
> [  484.181910] Workqueue: kblockd blk_mq_requeue_work
> [  484.182076] RIP: 0010:__trace_note_message+0x1f/0xd0
> [  484.182250] RSP: 0018:8803f45bfc20 EFLAGS: 00010282
> [  484.182436] RAX:  RBX: a0817000 RCX: 
> 8803
> [  484.182622] RDX: 81bf514d RSI:  RDI: 
> a0817000
> [  484.182810] RBP: 8803f45bfc80 R08: 0041 R09: 
> 8803f69cc5d0
> [  484.182998] R10: 8803f80b47d0 R11: 1000 R12: 
> 8803f45e8000
> [  484.183185] R13: 000d R14:  R15: 
> 8803fba112c0
> [  484.183372] FS:  () GS:88041ecc() 
> knlGS:
> [  484.183561] CS:  0010 DS:  ES:  CR0: 80050033
> [  484.183747] CR2: a0817000 CR3: 01e0a006 CR4: 
> 001606e0
> [  484.183934] Call Trace:
> [  484.184122]  bfq_put_queue+0xd3/0xe0
> [  484.184305]  bfq_finish_requeue_request+0x72/0x350
> [  484.184493]  __blk_mq_requeue_request+0x8f/0x120
> [  484.184678]  blk_mq_dispatch_rq_list+0x342/0x550
> [  484.184866]  ? kyber_dispatch_request+0xd0/0xd0
> [  484.185053]  blk_mq_sched_dispatch_requests+0xf7/0x180
> [  

Re: [PATCH 04/24] ibtrs: client: private header with client structs and functions

2018-02-06 Thread Roman Penyaev
Hi Sagi,

On Mon, Feb 5, 2018 at 11:59 AM, Sagi Grimberg  wrote:
> Hi Roman,
>
>
>> +struct ibtrs_clt_io_req {
>> +   struct list_headlist;
>> +   struct ibtrs_iu *iu;
>> +   struct scatterlist  *sglist; /* list holding user data */
>> +   unsigned intsg_cnt;
>> +   unsigned intsg_size;
>> +   unsigned intdata_len;
>> +   unsigned intusr_len;
>> +   void*priv;
>> +   boolin_use;
>> +   struct ibtrs_clt_con*con;
>> +   union {
>> +   struct ib_pool_fmr  **fmr_list;
>> +   struct ibtrs_fr_desc**fr_list;
>> +   };
>
>
> We are pretty much stuck with fmrs for legacy devices, it has
> no future support plans, please don't add new dependencies
> on it. Its already hard enough to get rid of it.

Got it, we have a plan to get rid of fmr.  But as I remember from our
internal tests, fr is slower.  The question is: why can that be,
in your experience?  I will retest, but it is still interesting
to know.

>> +   void*map_page;
>> +   struct ibtrs_tag*tag;
>
>
> Can I ask why do you need another tag that is not the request
> tag?

I already responded to this once; the summary is the following:

1. Indeed mq supports tag sharing, but not globally across hw queues,
so for us that means tags->nr_hw_queues = 1, which kills
performance.

2. We need tag sharing in the transport library, which should not
be tightly coupled with the block device.
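
To illustrate point 1 with a sketch (the structure and field names here
are made up, not the actual IBNBD code): with blk-mq as it is today,
tags are allocated per hw queue, so a transport-wide tag space forces a
single hw queue:

	struct blk_mq_tag_set *set = &clt->tag_set;	/* hypothetical */
	int err;

	set->ops = &ibnbd_mq_ops;			/* hypothetical */
	set->nr_hw_queues = 1;	/* the only way to share one tag space */
	set->queue_depth = sess_queue_depth;
	set->numa_node = NUMA_NO_NODE;
	set->flags = BLK_MQ_F_SHOULD_MERGE;
	err = blk_mq_alloc_tag_set(set);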


>> +   u16 nmdesc;
>> +   enum dma_data_direction dir;
>> +   ibtrs_conf_fn   *conf;
>> +   unsigned long   start_time;
>> +};
>> +
>
>
>> +static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
>> +{
>> +   if (unlikely(!c))
>> +   return NULL;
>> +
>> +   return container_of(c, struct ibtrs_clt_con, c);
>> +}
>> +
>> +static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
>> +{
>> +   if (unlikely(!s))
>> +   return NULL;
>> +
>> +   return container_of(s, struct ibtrs_clt_sess, s);
>> +}
>
>
> Seems a bit awkward that container_of wrappers check pointer validity...

That can be fixed; frankly, I don't remember any code path that
implicitly relies on the returned NULL: session and connection are
always expected to be valid pointers.
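
I.e. the wrappers would become plain container_of() calls, assuming
callers always pass valid pointers (sketch):

	static inline struct ibtrs_clt_con *to_clt_con(struct ibtrs_con *c)
	{
		return container_of(c, struct ibtrs_clt_con, c);
	}

	static inline struct ibtrs_clt_sess *to_clt_sess(struct ibtrs_sess *s)
	{
		return container_of(s, struct ibtrs_clt_sess, s);
	}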

>> +/**
>> + * list_next_or_null_rr - get next list element in round-robin fashion.
>> + * @pos: entry, starting cursor.
>> + * @head:head of the list to examine. This list must have at least
>> one
>> + *   element, namely @pos.
>> + * @member:  name of the list_head structure within typeof(*pos).
>> + *
>> + * Important to understand that @pos is a list entry, which can be
>> already
>> + * removed using list_del_rcu(), so if @head has become empty NULL will
>> be
>> + * returned. Otherwise next element is returned in round-robin fashion.
>> + */
>> +#define list_next_or_null_rcu_rr(pos, head, member) ({ \
>> +   typeof(pos) next = NULL;\
>> +   \
>> +   if (!list_empty(head))  \
>> +   next = (pos)->member.next != (head) ?   \
>> +   list_entry_rcu((pos)->member.next,  \
>> +  typeof(*pos), member) :  \
>> +   list_entry_rcu((pos)->member.next->next,\
>> +  typeof(*pos), member);   \
>> +   next;   \
>> +})
>
>
> Why is this local to your driver?

Yeah, of course I can try to extend list.h
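
For context, this is how the macro is meant to be used for round-robin
path selection under RCU (a sketch; the field names are hypothetical):

	rcu_read_lock();
	path = rcu_dereference(clt->last_used_path);
	/* next path in round-robin order, or NULL if the list became empty */
	path = list_next_or_null_rcu_rr(path, &clt->paths_list, entry);
	if (path)
		err = ibtrs_clt_do_io(path, req);	/* hypothetical */
	rcu_read_unlock();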

>> +
>> +/* See ibtrs-log.h */
>> +#define TYPES_TO_SESSNAME(obj) \
>> +   LIST(CASE(obj, struct ibtrs_clt_sess *, s.sessname),\
>> +CASE(obj, struct ibtrs_clt *, sessname))
>> +
>> +#define TAG_SIZE(clt) (sizeof(struct ibtrs_tag) + (clt)->pdu_sz)
>> +#define GET_TAG(clt, idx) ((clt)->tags + TAG_SIZE(clt) * idx)
>
>
> Still don't understand why this is even needed..

--
Roman


[PATCH 4/5] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-02-06 Thread Ming Lei
84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
may cause irq vector assigned to all offline CPUs, and this kind of
assignment may cause much less irq vectors mapped to online CPUs, and
performance may get hurt.

For example, in a 8 cores system, 0~3 online, 4~8 offline/not present,
see 'lscpu':

[ming@box]$lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):4
On-line CPU(s) list:   0-3
Thread(s) per core:1
Core(s) per socket:2
Socket(s): 2
NUMA node(s):  2
...
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s):
...

For example, one device has 4 queues:

1) before 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0
irq 40, cpu list 1
irq 41, cpu list 2
irq 42, cpu list 3

2) after 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

3) after applying this patch against V4.15+:
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

This patch tries to spread irq vectors among online CPUs as far as
possible, using a 2-stage spread.

The above assignment 3) isn't the optimal result from a NUMA point of view,
but it leaves more irq vectors with an online CPU mapped; given that in
reality one CPU should be enough to handle one irq vector, it is better
to do it this way.

Cc: Thomas Gleixner 
Cc: Christoph Hellwig 
Reported-by: Laurence Oberman 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 35 +--
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 9801aecf8763..6755ed77d017 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -106,6 +106,9 @@ int irq_build_affinity_masks(const struct irq_affinity 
*affd,
nodemask_t nodemsk = NODE_MASK_NONE;
int n, nodes, cpus_per_vec, extra_vecs, done = 0;
 
+   if (!cpumask_weight(cpu_mask))
+   return 0;
+
	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
/*
@@ -175,9 +178,9 @@ struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
-   int curvec;
+   int curvec, vecs_offline, vecs_online;
struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_cpumask;
+   cpumask_var_t nmsk, cpu_mask, *node_to_cpumask;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -193,9 +196,12 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (!masks)
goto out;
 
+   if (!alloc_cpumask_var(&cpu_mask, GFP_KERNEL))
+   goto out;
+
node_to_cpumask = alloc_node_to_cpumask();
if (!node_to_cpumask)
-   goto out;
+   goto out_free_cpu_mask;
 
/* Fill out vectors at the beginning that don't need affinity */
for (curvec = 0; curvec < affd->pre_vectors; curvec++)
@@ -204,15 +210,32 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
-   curvec += irq_build_affinity_masks(affd, curvec, affv,
-  node_to_cpumask,
-  cpu_possible_mask, nmsk, masks);
+   /* spread on online CPUs starting from the vector of affd->pre_vectors 
*/
+   vecs_online = irq_build_affinity_masks(affd, curvec, affv,
+  node_to_cpumask,
+  cpu_online_mask, nmsk, masks);
+
+   /* spread on offline CPUs starting from the next vector to be handled */
+   if (vecs_online >= affv)
+   curvec = affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + vecs_online;
+   cpumask_andnot(cpu_mask, cpu_possible_mask, cpu_online_mask);
+   vecs_offline = irq_build_affinity_masks(affd, curvec, affv,
+   node_to_cpumask,
+   cpu_mask, nmsk, masks);
put_online_cpus();
 
/* Fill out vectors at the end that don't need affinity */
+   if (vecs_online + vecs_offline >= affv)
+   curvec = affv + affd->pre_vectors;
+   else
+   curvec = affd->pre_vectors + vecs_online + vecs_offline;
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, 

[PATCH 3/5] genirq/affinity: support to do irq vectors spread starting from any vector

2018-02-06 Thread Ming Lei
Two parameters (start_vec, affv) are introduced to irq_build_affinity_masks(),
so that this helper can build the affinity of each irq vector starting from
the irq vector 'start_vec', and handle at most 'affv' vectors.

This is required to do the 2-stage irq vector spread among all
possible CPUs.

Cc: Thomas Gleixner 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6af3f6727f63..9801aecf8763 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -94,17 +94,17 @@ static int get_nodes_in_cpumask(const cpumask_var_t 
*node_to_cpumask,
return nodes;
 }
 
-int irq_build_affinity_masks(int nvecs, const struct irq_affinity *affd,
+int irq_build_affinity_masks(const struct irq_affinity *affd,
+const int start_vec, const int affv,
 const cpumask_var_t *node_to_cpumask,
 const struct cpumask *cpu_mask,
 struct cpumask *nmsk,
 struct cpumask *masks)
 {
-   int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
-   int curvec = affd->pre_vectors;
+   int curvec = start_vec;
nodemask_t nodemsk = NODE_MASK_NONE;
-   int n, nodes, cpus_per_vec, extra_vecs;
+   int n, nodes, cpus_per_vec, extra_vecs, done = 0;
 
	nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_mask, &nodemsk);
 
@@ -116,8 +116,10 @@ int irq_build_affinity_masks(int nvecs, const struct 
irq_affinity *affd,
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
 node_to_cpumask[n]);
-   if (++curvec == last_affv)
+   if (++done == affv)
break;
+   if (++curvec == last_affv)
+   curvec = affd->pre_vectors;
}
goto out;
}
@@ -150,13 +152,16 @@ int irq_build_affinity_masks(int nvecs, const struct 
irq_affinity *affd,
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}
 
-   if (curvec >= last_affv)
+   done += v;
+   if (done >= affv)
break;
+   if (curvec >= last_affv)
+   curvec = affd->pre_vectors;
--nodes;
}
 
 out:
-   return curvec - affd->pre_vectors;
+   return done;
 }
 
 /**
@@ -169,6 +174,7 @@ int irq_build_affinity_masks(int nvecs, const struct 
irq_affinity *affd,
 struct cpumask *
 irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
 {
+   int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int curvec;
struct cpumask *masks;
cpumask_var_t nmsk, *node_to_cpumask;
@@ -198,7 +204,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Stabilize the cpumasks */
get_online_cpus();
build_node_to_cpumask(node_to_cpumask);
-   curvec += irq_build_affinity_masks(nvecs, affd, node_to_cpumask,
+   curvec += irq_build_affinity_masks(affd, curvec, affv,
+  node_to_cpumask,
   cpu_possible_mask, nmsk, masks);
put_online_cpus();
 
-- 
2.9.5



[PATCH 1/5] genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask

2018-02-06 Thread Ming Lei
The following patches will introduce a two-stage irq spread for improving
the irq spread on all possible CPUs.

No functional change.

Cc: Thomas Gleixner 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 kernel/irq/affinity.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index a37a3b4b6342..4b1c4763212d 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -39,7 +39,7 @@ static void irq_spread_init_one(struct cpumask *irqmsk, 
struct cpumask *nmsk,
}
 }
 
-static cpumask_var_t *alloc_node_to_possible_cpumask(void)
+static cpumask_var_t *alloc_node_to_cpumask(void)
 {
cpumask_var_t *masks;
int node;
@@ -62,7 +62,7 @@ static cpumask_var_t *alloc_node_to_possible_cpumask(void)
return NULL;
 }
 
-static void free_node_to_possible_cpumask(cpumask_var_t *masks)
+static void free_node_to_cpumask(cpumask_var_t *masks)
 {
int node;
 
@@ -71,7 +71,7 @@ static void free_node_to_possible_cpumask(cpumask_var_t 
*masks)
kfree(masks);
 }
 
-static void build_node_to_possible_cpumask(cpumask_var_t *masks)
+static void build_node_to_cpumask(cpumask_var_t *masks)
 {
int cpu;
 
@@ -79,14 +79,14 @@ static void build_node_to_possible_cpumask(cpumask_var_t 
*masks)
cpumask_set_cpu(cpu, masks[cpu_to_node(cpu)]);
 }
 
-static int get_nodes_in_cpumask(cpumask_var_t *node_to_possible_cpumask,
+static int get_nodes_in_cpumask(cpumask_var_t *node_to_cpumask,
const struct cpumask *mask, nodemask_t *nodemsk)
 {
int n, nodes = 0;
 
/* Calculate the number of nodes in the supplied affinity mask */
for_each_node(n) {
-   if (cpumask_intersects(mask, node_to_possible_cpumask[n])) {
+   if (cpumask_intersects(mask, node_to_cpumask[n])) {
node_set(n, *nodemsk);
nodes++;
}
@@ -109,7 +109,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
struct cpumask *masks;
-   cpumask_var_t nmsk, *node_to_possible_cpumask;
+   cpumask_var_t nmsk, *node_to_cpumask;
 
/*
 * If there aren't any vectors left after applying the pre/post
@@ -125,8 +125,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (!masks)
goto out;
 
-   node_to_possible_cpumask = alloc_node_to_possible_cpumask();
-   if (!node_to_possible_cpumask)
+   node_to_cpumask = alloc_node_to_cpumask();
+   if (!node_to_cpumask)
goto out;
 
/* Fill out vectors at the beginning that don't need affinity */
@@ -135,8 +135,8 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
 
/* Stabilize the cpumasks */
get_online_cpus();
-   build_node_to_possible_cpumask(node_to_possible_cpumask);
-   nodes = get_nodes_in_cpumask(node_to_possible_cpumask, 
cpu_possible_mask,
+   build_node_to_cpumask(node_to_cpumask);
+   nodes = get_nodes_in_cpumask(node_to_cpumask, cpu_possible_mask,
 &nodemsk);
 
/*
@@ -146,7 +146,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
if (affv <= nodes) {
for_each_node_mask(n, nodemsk) {
cpumask_copy(masks + curvec,
-node_to_possible_cpumask[n]);
+node_to_cpumask[n]);
if (++curvec == last_affv)
break;
}
@@ -160,7 +160,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
vecs_per_node = (affv - (curvec - affd->pre_vectors)) / nodes;
 
/* Get the cpus on this node which are in the mask */
-   cpumask_and(nmsk, cpu_possible_mask, 
node_to_possible_cpumask[n]);
+   cpumask_and(nmsk, cpu_possible_mask, node_to_cpumask[n]);
 
/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
@@ -192,7 +192,7 @@ irq_create_affinity_masks(int nvecs, const struct 
irq_affinity *affd)
/* Fill out vectors at the end that don't need affinity */
for (; curvec < nvecs; curvec++)
cpumask_copy(masks + curvec, irq_default_affinity);
-   free_node_to_possible_cpumask(node_to_possible_cpumask);
+   free_node_to_cpumask(node_to_cpumask);
 out:
free_cpumask_var(nmsk);
return masks;
-- 
2.9.5



[PATCH 0/5] genirq/affinity: irq vector spread among online CPUs as far as possible

2018-02-06 Thread Ming Lei
Hi,

This patchset tries to spread irq vectors among online CPUs as far as
possible, so that we can avoid allocating too few irq vectors with online
CPUs mapped.

For example, in an 8-core system where 4 cpu cores (4~7) are offline/not
present, on a device with 4 queues:

1) before this patchset
irq 39, cpu list 0-2
irq 40, cpu list 3-4,6
irq 41, cpu list 5
irq 42, cpu list 7

2) after this patchset
irq 39, cpu list 0,4
irq 40, cpu list 1,6
irq 41, cpu list 2,5
irq 42, cpu list 3,7

Without this patchset, only two vectors (39, 40) can be active, but there
can be 4 active irq vectors after applying this patchset.

One disadvantage is that CPUs from different NUMA nodes can be mapped to
the same irq vector. Given that generally one CPU should be enough to
handle one irq vector, it shouldn't be a big deal. Especially since,
otherwise, more vectors would have to be allocated to avoid hurting
performance with the current assignment.
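
To make the 2-stage idea concrete, here is a small standalone userspace
model of the spread (it ignores the NUMA grouping the real code does, so
the exact vector/CPU pairing differs slightly from the example above):

#include <stdio.h>

#define NR_CPUS 8
#define NR_VECS 4

/* toy topology: CPUs 0-3 online, 4-7 possible but offline */
static const int online[NR_CPUS] = { 1, 1, 1, 1, 0, 0, 0, 0 };

int main(void)
{
	int vec_of_cpu[NR_CPUS], v = 0, cpu;

	/* stage 1: spread all vectors across online CPUs first */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (online[cpu])
			vec_of_cpu[cpu] = v++ % NR_VECS;
	/* stage 2: spread them again across the remaining offline CPUs */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (!online[cpu])
			vec_of_cpu[cpu] = v++ % NR_VECS;

	for (v = 0; v < NR_VECS; v++) {
		printf("irq %d, cpu list", 39 + v);
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (vec_of_cpu[cpu] == v)
				printf(" %d", cpu);
		printf("\n");
	}
	return 0;
}

Every vector ends up with at least one online CPU in its mask, which is
the property the patchset is after.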

Thanks
Ming

Ming Lei (5):
  genirq/affinity: rename *node_to_possible_cpumask as *node_to_cpumask
  genirq/affinity: move actual irq vector spread into one helper
  genirq/affinity: support to do irq vectors spread starting from any
vector
  genirq/affinity: irq vector spread among online CPUs as far as
possible
  nvme: pci: pass max vectors as num_possible_cpus() to
pci_alloc_irq_vectors

 drivers/nvme/host/pci.c |   2 +-
 kernel/irq/affinity.c   | 145 +++-
 2 files changed, 95 insertions(+), 52 deletions(-)

-- 
2.9.5



Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Oleksandr Natalenko

Hi.

06.02.2018 12:57, Mike Galbraith wrote:

Not me.  Box seems to be fairly sure that it is bfq.  Twice again box
went belly up on me in fairly short order with bfq, but seemed fine
with deadline.  I'm currently running deadline again, and box again
seems solid, though I won't say it _is_ solid until it's been happily
trundling along with deadline for quite a bit longer.


Sorry for the noise, but just to make it clear, are we talking about 
"deadline" or "mq-deadline" now?


Re: [PATCH 03/24] ibtrs: core: lib functions shared between client and server modules

2018-02-06 Thread Roman Penyaev
Hi Sagi,

On Mon, Feb 5, 2018 at 11:52 AM, Sagi Grimberg  wrote:
> Hi Roman,
>
> Here are some comments below.
>
>> +int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
>> +{
>> +   struct ib_recv_wr wr, *bad_wr;
>> +
>> +   wr.next    = NULL;
>> +   wr.wr_cqe  = cqe;
>> +   wr.sg_list = NULL;
>> +   wr.num_sge = 0;
>> +
>> +   return ib_post_recv(con->qp, &wr, &bad_wr);
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
>
>
> What is this designed to do?

Each IO completion (response from server to client) is an immediate
message with no data inside.  Using the IMM field we are able to find
the IO in the inflight table and complete it, with an error code if any.
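
Roughly like this (a sketch; the actual bit layout in ibtrs may differ):

	/* pack a response into the 32-bit immediate: the low bits index
	 * the inflight table, the high bits carry the status/error code */
	static inline u32 ibtrs_resp_to_imm(u16 msg_id, u16 errno)
	{
		return ((u32)errno << 16) | msg_id;
	}

	static inline void ibtrs_imm_to_resp(u32 imm, u16 *msg_id, u16 *errno)
	{
		*msg_id = imm & 0xffff;
		*errno  = imm >> 16;
	}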

>> +int ibtrs_iu_post_rdma_write_imm(struct ibtrs_con *con, struct ibtrs_iu
>> *iu,
>> +struct ib_sge *sge, unsigned int num_sge,
>> +u32 rkey, u64 rdma_addr, u32 imm_data,
>> +enum ib_send_flags flags)
>> +{
>> +   struct ib_send_wr *bad_wr;
>> +   struct ib_rdma_wr wr;
>> +   int i;
>> +
>> +   wr.wr.next        = NULL;
>> +   wr.wr.wr_cqe      = &iu->cqe;
>> +   wr.wr.sg_list     = sge;
>> +   wr.wr.num_sge     = num_sge;
>> +   wr.rkey           = rkey;
>> +   wr.remote_addr    = rdma_addr;
>> +   wr.wr.opcode      = IB_WR_RDMA_WRITE_WITH_IMM;
>> +   wr.wr.ex.imm_data = cpu_to_be32(imm_data);
>> +   wr.wr.send_flags  = flags;
>> +
>> +   /*
>> +* If one of the sges has 0 size, the operation will fail with an
>> +* length error
>> +*/
>> +   for (i = 0; i < num_sge; i++)
>> +   if (WARN_ON(sge[i].length == 0))
>> +   return -EINVAL;
>> +
>> +   return ib_post_send(con->qp, &wr.wr, &bad_wr);
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_iu_post_rdma_write_imm);
>> +
>> +int ibtrs_post_rdma_write_imm_empty(struct ibtrs_con *con, struct ib_cqe
>> *cqe,
>> +   u32 imm_data, enum ib_send_flags
>> flags)
>> +{
>> +   struct ib_send_wr wr, *bad_wr;
>> +
>> +   memset(&wr, 0, sizeof(wr));
>> +   wr.wr_cqe   = cqe;
>> +   wr.send_flags   = flags;
>> +   wr.opcode   = IB_WR_RDMA_WRITE_WITH_IMM;
>> +   wr.ex.imm_data  = cpu_to_be32(imm_data);
>> +
>> +   return ib_post_send(con->qp, &wr, &bad_wr);
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_post_rdma_write_imm_empty);
>
>
> Christoph did a great job adding a generic rdma rw API, please
> reuse it, if you rely on needed functionality that does not exist
> there, please enhance it instead of open-coding a new rdma engine
> library.

Good to know, thanks.

>> +static int ibtrs_ib_dev_init(struct ibtrs_ib_dev *d, struct ib_device
>> *dev)
>> +{
>> +   int err;
>> +
>> +   d->pd = ib_alloc_pd(dev, IB_PD_UNSAFE_GLOBAL_RKEY);
>> +   if (IS_ERR(d->pd))
>> +   return PTR_ERR(d->pd);
>> +   d->dev = dev;
>> +   d->lkey = d->pd->local_dma_lkey;
>> +   d->rkey = d->pd->unsafe_global_rkey;
>> +
>> +   err = ibtrs_query_device(d);
>> +   if (unlikely(err))
>> +   ib_dealloc_pd(d->pd);
>> +
>> +   return err;
>> +}
>
>
> I must say that this makes me frustrated.. We stopped doing these
> sort of things long time ago. No way we can even consider accepting
> the unsafe use of the global rkey exposing the entire memory space for
> remote access permissions.
>
> Sorry for being blunt, but this protocol design which makes a concious
> decision to expose unconditionally is broken by definition.

I suppose we can also afford the same trick that nvme does: provide a
register_always module argument, can't we?  That would also be interesting
for measuring the performance difference.

When I did nvme testing with register_always=true/false I saw the
difference.  Would be nice to measure ibtrs with register_always=true
once ibtrs supports that.
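
For reference, the nvme-rdma knob looks roughly like this:

	static bool register_always = true;
	module_param(register_always, bool, 0444);
	MODULE_PARM_DESC(register_always,
		 "Use memory registration even for contiguous memory regions");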

>> +struct ibtrs_ib_dev *ibtrs_ib_dev_find_get(struct rdma_cm_id *cm_id)
>> +{
>> +   struct ibtrs_ib_dev *dev;
>> +   int err;
>> +
>> +   mutex_lock(&dev_list_mutex);
>> +   list_for_each_entry(dev, &dev_list, entry) {
>> +   if (dev->dev->node_guid == cm_id->device->node_guid &&
>> +   kref_get_unless_zero(&dev->ref))
>> +   goto out_unlock;
>> +   }
>> +   dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>> +   if (unlikely(!dev))
>> +   goto out_err;
>> +
>> +   kref_init(&dev->ref);
>> +   err = ibtrs_ib_dev_init(dev, cm_id->device);
>> +   if (unlikely(err))
>> +   goto out_free;
>> +   list_add(&dev->entry, &dev_list);
>> +out_unlock:
>> +   mutex_unlock(&dev_list_mutex);
>> +
>> +   return dev;
>> +
>> +out_free:
>> +   kfree(dev);
>> +out_err:
>> +   mutex_unlock(&dev_list_mutex);
>> +
>> +   return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(ibtrs_ib_dev_find_get);
>
>
> Is it time to make this a common helper in rdma_cm?

True, can be also the patch for nvme.

>> +static 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Mike Galbraith
On Tue, 2018-02-06 at 10:38 +0100, Paolo Valente wrote:
> 
> Hi Mike,
> as you can imagine, I didn't get any failure in my pre-submission
> tests on this patch.  In addition, it is not that easy to link this
> patch, which just adds some internal bfq housekeeping in case of a
> requeue, with a corruption of external lists for general I/O
> management.
> 
> In this respect, as Oleksandr comments point out, by switching from
> cfq to bfq, you switch between much more than two schedulers.  Anyway,
> who knows ...

Not me.  Box seems to be fairly sure that it is bfq.  Twice again box
went belly up on me in fairly short order with bfq, but seemed fine
with deadline.  I'm currently running deadline again, and box again
seems solid, though I won't say it _is_ solid until it's been happily
trundling along with deadline for quite a bit longer.

I was ssh'd in during the last episode, got this out.  I should be
getting crash dumps, but seems kdump is only working intermittently
atm.  I did get one earlier, but 3 of 4 times not.  Hohum.

[  484.179292] BUG: unable to handle kernel paging request at a0817000
[  484.179436] IP: __trace_note_message+0x1f/0xd0
[  484.179576] PGD 1e0c067 P4D 1e0c067 PUD 1e0d063 PMD 3faff2067 PTE 0
[  484.179719] Oops:  [#1] SMP PTI
[  484.179861] Dumping ftrace buffer:
[  484.180011](ftrace buffer empty)
[  484.180138] Modules linked in: fuse(E) ebtable_filter(E) ebtables(E) 
af_packet(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) 
nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ipt_REJECT(E) xt_tcpudp(E) 
iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) 
nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) 
xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) 
nls_iso8859_1(E) nls_cp437(E) intel_rapl(E) x86_pkg_temp_thermal(E) 
intel_powerclamp(E) snd_hda_codec_hdmi(E) coretemp(E) kvm_intel(E) 
snd_hda_codec_realtek(E) kvm(E) snd_hda_codec_generic(E) snd_hda_intel(E) 
snd_hda_codec(E) sr_mod(E) snd_hwdep(E) cdrom(E) joydev(E) snd_hda_core(E) 
snd_pcm(E) snd_timer(E) irqbypass(E) snd(E) crct10dif_pclmul(E) crc32_pclmul(E) 
crc32c_intel(E) r8169(E)
[  484.180740]  iTCO_wdt(E) ghash_clmulni_intel(E) mii(E) 
iTCO_vendor_support(E) pcbc(E) aesni_intel(E) soundcore(E) aes_x86_64(E) 
shpchp(E) crypto_simd(E) lpc_ich(E) glue_helper(E) i2c_i801(E) mei_me(E) 
mfd_core(E) mei(E) cryptd(E) intel_smartconnect(E) pcspkr(E) fan(E) thermal(E) 
nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) 
hid_logitech_hidpp(E) hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) 
usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) 
sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) xhci_pci(E) ehci_pci(E) 
libahci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) libata(E) drm(E) usbcore(E) video(E) 
button(E) sd_mod(E) vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_pci(E) 
virtio_ring(E) virtio(E) ext4(E) crc16(E) mbcache(E) jbd2(E) loop(E) sg(E) 
dm_multipath(E)
[  484.181421]  dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) 
scsi_mod(E) efivarfs(E) autofs4(E)
[  484.181583] CPU: 3 PID: 500 Comm: kworker/3:1H Tainted: GE
4.15.0.ge237f98-master #609
[  484.181746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[  484.181910] Workqueue: kblockd blk_mq_requeue_work
[  484.182076] RIP: 0010:__trace_note_message+0x1f/0xd0
[  484.182250] RSP: 0018:8803f45bfc20 EFLAGS: 00010282
[  484.182436] RAX:  RBX: a0817000 RCX: 8803
[  484.182622] RDX: 81bf514d RSI:  RDI: a0817000
[  484.182810] RBP: 8803f45bfc80 R08: 0041 R09: 8803f69cc5d0
[  484.182998] R10: 8803f80b47d0 R11: 1000 R12: 8803f45e8000
[  484.183185] R13: 000d R14:  R15: 8803fba112c0
[  484.183372] FS:  () GS:88041ecc() 
knlGS:
[  484.183561] CS:  0010 DS:  ES:  CR0: 80050033
[  484.183747] CR2: a0817000 CR3: 01e0a006 CR4: 001606e0
[  484.183934] Call Trace:
[  484.184122]  bfq_put_queue+0xd3/0xe0
[  484.184305]  bfq_finish_requeue_request+0x72/0x350
[  484.184493]  __blk_mq_requeue_request+0x8f/0x120
[  484.184678]  blk_mq_dispatch_rq_list+0x342/0x550
[  484.184866]  ? kyber_dispatch_request+0xd0/0xd0
[  484.185053]  blk_mq_sched_dispatch_requests+0xf7/0x180
[  484.185238]  __blk_mq_run_hw_queue+0x58/0xd0
[  484.185429]  __blk_mq_delay_run_hw_queue+0x99/0xa0
[  484.185614]  blk_mq_run_hw_queue+0x54/0xf0
[  484.185805]  blk_mq_run_hw_queues+0x4b/0x60
[  484.185994]  blk_mq_requeue_work+0x13a/0x150
[  484.186192]  process_one_work+0x147/0x350
[  484.186383]  worker_thread+0x47/0x3e0
[  484.186572]  kthread+0xf8/0x130
[  484.186760]  ? rescuer_thread+0x360/0x360
[  484.186948]  ? kthread_stop+0x120/0x120
[  484.187137]  ret_from_fork+0x35/0x40
[  484.187321] Code: ff 48 89 44 24 10 

[PATCH V2] lightnvm: pblk: add padding distribution sysfs attribute

2018-02-06 Thread hans . ml . holmberg
From: Hans Holmberg 

When pblk receives a sync, all data up to that point in the write buffer
must be committed to persistent storage, and as flash memory comes with a
minimal write size there is a significant cost involved both in terms
of time for completing the sync and in terms of write amplification
(sectors padded to fill up to the minimal write size).

In order to get a better understanding of the costs involved for syncs,
add a sysfs attribute to pblk: padded_dist, showing a normalized
distribution of sectors padded. In order to facilitate measurements of
specific workloads during the lifetime of the pblk instance, the
distribution can be reset by writing 0 to the attribute.

Do this by introducing counters for each possible padding:
{0..(minimal write size - 1)} and calculating the normalized distribution
when showing the attribute.
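
The normalization on the show path then amounts to something like the
following sketch (the function name and output format are made up, and
the cast assumes the total fits in 32 bits; the real sysfs code in the
patch differs in detail):

	static ssize_t pblk_show_padding_dist(struct pblk *pblk, char *page)
	{
		int i, buckets = pblk->min_write_pgs - 1;
		ssize_t sz = 0;
		u64 total = 0;

		for (i = 0; i < buckets; i++)
			total += atomic64_read(&pblk->pad_dist[i]);

		for (i = 0; i < buckets; i++) {
			u64 share = atomic64_read(&pblk->pad_dist[i]) * 100;

			if (total)
				sector_div(share, (u32)total);
			sz += snprintf(page + sz, PAGE_SIZE - sz,
				       "%d:%llu%% ", i + 1, share);
		}
		return sz + snprintf(page + sz, PAGE_SIZE - sz, "\n");
	}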

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
Rearranged total_buckets statement in pblk_sysfs_get_padding_dist
Signed-off-by: Matias Bjørling 
---

Changes since V1:

* Picked up Matias' rearrangement of the total_buckets statement
* Fixed build problems reported by kbuild on i386 by using sector_div
  instead of / when calculating the padding distribution and turning
  nr_flush into atomic64_t (which makes more sense anyway)

 drivers/lightnvm/pblk-init.c  | 16 +++-
 drivers/lightnvm/pblk-rb.c| 17 +
 drivers/lightnvm/pblk-sysfs.c | 86 ++-
 drivers/lightnvm/pblk.h   |  6 ++-
 4 files changed, 112 insertions(+), 13 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7eedc5d..bf9bc31 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk)
 {
pblk_luns_free(pblk);
pblk_lines_free(pblk);
+   kfree(pblk->pad_dist);
pblk_line_meta_free(pblk);
pblk_core_free(pblk);
pblk_l2p_free(pblk);
@@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk->pad_rst_wa = 0;
pblk->gc_rst_wa = 0;
 
+   atomic64_set(&pblk->nr_flush, 0);
+   pblk->nr_flush_rst = 0;
+
 #ifdef CONFIG_NVM_DEBUG
 	atomic_long_set(&pblk->inflight_writes, 0);
 	atomic_long_set(&pblk->padded_writes, 0);
 	atomic_long_set(&pblk->padded_wb, 0);
-   atomic_long_set(&pblk->nr_flush, 0);
 	atomic_long_set(&pblk->req_writes, 0);
 	atomic_long_set(&pblk->sub_writes, 0);
 	atomic_long_set(&pblk->sync_writes, 0);
@@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
goto fail_free_luns;
}
 
+   pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) * sizeof(atomic64_t),
+GFP_KERNEL);
+   if (!pblk->pad_dist) {
+   ret = -ENOMEM;
+   goto fail_free_line_meta;
+   }
+
ret = pblk_core_init(pblk);
if (ret) {
pr_err("pblk: could not initialize core\n");
-   goto fail_free_line_meta;
+   goto fail_free_pad_dist;
}
 
ret = pblk_l2p_init(pblk);
@@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev, struct 
gendisk *tdisk,
pblk_l2p_free(pblk);
 fail_free_core:
pblk_core_free(pblk);
+fail_free_pad_dist:
+   kfree(pblk->pad_dist);
 fail_free_line_meta:
pblk_line_meta_free(pblk);
 fail_free_luns:
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 7044b55..8b14340 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb, 
unsigned int nr_entries,
if (bio->bi_opf & REQ_PREFLUSH) {
struct pblk *pblk = container_of(rb, struct pblk, rwb);
 
-#ifdef CONFIG_NVM_DEBUG
-   atomic_long_inc(&pblk->nr_flush);
-#endif
+   atomic64_inc(&pblk->nr_flush);
 	if (pblk_rb_flush_point_set(&pblk->rwb, bio, mem))
*io_ret = NVM_IO_OK;
}
@@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb, 
struct nvm_rq *rqd,
pr_err("pblk: could not pad page in write bio\n");
return NVM_IO_ERR;
}
-   }
 
-   atomic64_add(pad, &((struct pblk *)
-   (container_of(rb, struct pblk, rwb)))->pad_wa);
+   if (pad < pblk->min_write_pgs)
+   atomic64_inc(&pblk->pad_dist[pad - 1]);
+   else
+   pr_warn("pblk: padding more than min. sectors\n");
+
+   atomic64_add(pad, &pblk->pad_wa);
+   }
 
 #ifdef CONFIG_NVM_DEBUG
-   atomic_long_add(pad, &((struct pblk *)
-   (container_of(rb, struct pblk, rwb)))->padded_writes);
+   atomic_long_add(pad, &pblk->padded_writes);
 #endif
 

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Roman Penyaev
On Mon, Feb 5, 2018 at 6:20 PM, Bart Van Assche  wrote:
> On Mon, 2018-02-05 at 18:16 +0100, Roman Penyaev wrote:
>> Everything (fio jobs, setup, etc) is given in the same link:
>>
>> https://www.spinics.net/lists/linux-rdma/msg48799.html
>>
>> at the bottom you will find links on google docs with many pages
>> and archived fio jobs and scripts. (I do not remember exactly,
>> one year passed, but there should be everything).
>>
>> Regarding smaller iodepth_batch_submit - that decreases performance.
>> Once I played with that, even introduced new iodepth_batch_complete_max
>> option for fio, but then I decided to stop and simply chose this
>> configuration, which provides me fastest results.
>
> Hello Roman,
>
> That's weird. For which protocols did reducing iodepth_batch_submit lead
> to lower performance: all the tested protocols or only some of them?

Hi Bart,

It seems that it does not depend on the protocol (when I tested, it was
true for both nvme and ibnbd).  It depends on the load.  Under high load
(one or a few fio jobs dedicated to each cpu, and we have 64 cpus) it
turns out to be faster to wait for completions of the whole queue of that
particular block dev, instead of switching from kernel to userspace for
each completed IO.

But I can assure you that the performance difference is very minor: it
exists, but it does not change the whole picture of what you see in this
google sheet. So what I tried to achieve was to squeeze out everything I
could, nothing more.

--
Roman


Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Or Gerlitz
On Tue, Feb 6, 2018 at 11:25 AM, Sagi Grimberg  wrote:
> Hey Or,
>
>> not any more, Andy and Tal made the mlx5 AM code a kernel library
>> which is called DIM
>>
>> f4e5f0e MAINTAINERS: add entry for Dynamic Interrupt Moderation
>> 6a8788f bnxt_en: add support for software dynamic interrupt moderation
>> 8115b75 net/dim: use struct net_dim_sample as arg to net_dim
>> 4c4dbb4 net/mlx5e: Move dynamic interrupt coalescing code to include/linux
>> 9a31742 net/mlx5e: Change Mellanox references in DIM code
>> b9c872f net/mlx5e: Move generic functions to new file
>> f5e7f67 net/mlx5e: Move AM logic enums
>> 138968e net/mlx5e: Remove rq references in mlx5e_rx_am
>> f58ee09 net/mlx5e: Move interrupt moderation forward declarations
>> 98dd1ed net/mlx5e: Move interrupt moderation structs to new file
>>
>> can you make use of that? cc-ing them in case you have questions/comments
>
>
> Thanks a lot for the pointer, I'm not following net-next regularly.
> This seems to be a copy of the mlx5 code.

Just to make it clear, the code was moved to a library and not copied... so
there's a single instance of the code upstream; let's try to find a way for
your use case to take advantage of it, possibly with some modifications.
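
For reference, the consumer side of the library is small; bnxt_en wires it
up at the end of its NAPI poll roughly like this (field names taken from
the bnxt completion-ring context):

	struct net_dim_sample dim_sample;

	net_dim_sample(cpr->event_ctr, cpr->rx_packets, cpr->rx_bytes,
		       &dim_sample);
	net_dim(&cpr->dim, dim_sample);
	/* net_dim() may schedule work that rewrites the ring's interrupt
	 * coalescing parameters based on the measured rates */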


RE: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Kashyap Desai
> -Original Message-
> From: Ming Lei [mailto:ming@redhat.com]
> Sent: Tuesday, February 6, 2018 1:35 PM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@vger.kernel.org; Christoph
> Hellwig; Mike Snitzer; linux-s...@vger.kernel.org; Arun Easi; Omar
Sandoval;
> Martin K . Petersen; James Bottomley; Christoph Hellwig; Don Brace;
Peter
> Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > > We still have more than one reply queue ending up completion one
CPU.
> > >
> > > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > > smp_affinity_enable has to be set as 1, but seems it is the default
> > setting.
> > >
> > > Please see kernel/irq/affinity.c, especially
> > > irq_calc_affinity_vectors()
> > which
> > > figures out an optimal number of vectors, and the computation is
> > > based
> > on
> > > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are
> > > mapped to some of the reply queues, these queues won't be active (no
> > > request submitted
> > to
> > > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes
> > > sure
> > that
> > > more than one irq vector won't be handled by one same CPU, and the
> > > irq vector spread is done in irq_create_affinity_masks().
> > >
> > > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > > module parameter to simulate the issue. We need more number of
> > > > Online CPU than reply-queue.
> > >
> > > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > > PCI_IRQ_AFFINITY) will handle that for you. You can dump the
> > > returned
> > irq
> > > vector number, num_possible_cpus()/num_online_cpus() and each irq
> > > vector's affinity assignment.
> > >
> > > > We may see completion redirected to original CPU because of
> > > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one
> > > > CPU busy in local ISR routine.
> > >
> > > Could you dump each irq vector's affinity assignment of your
> > > megaraid in
> > your
> > > test?
> >
> > To quickly reproduce, I restricted to single MSI-x vector on
> > megaraid_sas driver.  System has total 16 online CPUs.
>
> I suggest you don't restrict to a single MSI-x vector; just use the
> device's supported number of MSI-x vectors.

Hi Ming, CPU lockup is seen even when it is not a single MSI-x vector.
The actual scenario needs a specific topology and server for an overnight
test. The issue can be seen on servers which have more than 16 logical CPUs
and a Thunderbolt-series MR controller, which supports at most 16 MSI-x
vectors.
>
> >
> > Output of affinity hints.
> > kernel version:
> > Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64
> > x86_64
> > x86_64 GNU/Linux
> > PCI name is 83:00.0, dump its irq affinity:
> > irq 105, cpu list 0-3,8-11
>
> In this case, which CPU is selected for handling the interrupt is
> decided by the interrupt controller, and it is easy to cause CPU
> overload if the interrupt controller always selects the same CPU to
> handle the irq.
>
> >
> > Affinity mask is created properly, but only CPU-0 is overloaded with
> > interrupt processing.
> >
> > # numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 8 9 10 11
> > node 0 size: 47861 MB
> > node 0 free: 46516 MB
> > node 1 cpus: 4 5 6 7 12 13 14 15
> > node 1 size: 64491 MB
> > node 1 free: 62805 MB
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > Output of  system activities (sar).  (gnice is 100% and it is consumed
> > in megaraid_sas ISR routine.)
> >
> >
> > 12:44:40 PM  CPU   %usr  %nice   %sys  %iowait  %steal  %irq   %soft  %guest  %gnice  %idle
> > 12:44:41 PM  all   6.03   0.00  29.98     0.00    0.00  0.00    0.00    0.00    0.00  63.99
> > 12:44:41 PM    0   0.00   0.00   0.00     0.00    0.00  0.00    0.00    0.00  100.00   0.00
> >
> >
> > In my test, rq_affinity was set to 2 (QUEUE_FLAG_SAME_FORCE). I also
> > used the "host_tagset" V2 patch set for megaraid_sas.
> >
> > Using RFC requested in -
> > "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is
> > avoided (you can notice that gnice is shifted to softirq. Even though
> > it is 100% consumed, there is always an exit from the completion loop
> > due to irqpoll_weight @irq_poll_init().)
> >
> > Average:     CPU   %usr  %nice   %sys  %iowait  %steal  %irq   %soft  %guest  %gnice  %idle
> > Average:     all   4.25   0.00  21.61     0.00    0.00  0.00    6.61    0.00    0.00  67.54
> > Average:       0   0.00   0.00   0.00     0.00    0.00  0.00  100.00    0.00    0.00   0.00
> >
> >
> > Hope this clarifies. We need a different fix to avoid lockups. Can we
> > consider using the irq poll interface if #CPU is more than reply
> > queues/MSI-x vectors?

Re: [PATCH 4/5] lightnvm: pblk: add padding distribution sysfs attribute

2018-02-06 Thread Hans Holmberg
On Tue, Feb 6, 2018 at 11:50 AM, Matias Bjørling  wrote:
> On 02/06/2018 10:27 AM, Hans Holmberg wrote:
>>
>> Hi Matias,
>>
>> On Wed, Jan 31, 2018 at 9:44 AM, Matias Bjørling  wrote:
>>>
>>> On 01/31/2018 03:06 AM, Javier González wrote:


 From: Hans Holmberg 

 When pblk receives a sync, all data up to that point in the write buffer
must be committed to persistent storage, and as flash memory comes with a
minimal write size there is a significant cost involved both in terms
of time for completing the sync and in terms of write amplification:
padded sectors for filling up to the minimal write size.

 In order to get a better understanding of the costs involved for syncs,
add a sysfs attribute to pblk: padded_dist, showing a normalized
 distribution of sectors padded. In order to facilitate measurements of
 specific workloads during the lifetime of the pblk instance, the
 distribution can be reset by writing 0 to the attribute.

 Do this by introducing counters for each possible padding:
 {0..(minimal write size - 1)} and calculate the normalized distribution
 when showing the attribute.

 Signed-off-by: Hans Holmberg 
 Signed-off-by: Javier González 
 ---
drivers/lightnvm/pblk-init.c  | 16 --
drivers/lightnvm/pblk-rb.c| 15 +-
drivers/lightnvm/pblk-sysfs.c | 68
 +++
drivers/lightnvm/pblk.h   |  6 +++-
4 files changed, 95 insertions(+), 10 deletions(-)

 diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
 index 7eedc5daebc8..a5e3510c0f38 100644
 --- a/drivers/lightnvm/pblk-init.c
 +++ b/drivers/lightnvm/pblk-init.c
 @@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk)
{
  pblk_luns_free(pblk);
  pblk_lines_free(pblk);
 +   kfree(pblk->pad_dist);
  pblk_line_meta_free(pblk);
  pblk_core_free(pblk);
  pblk_l2p_free(pblk);
 @@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
 struct gendisk *tdisk,
  pblk->pad_rst_wa = 0;
  pblk->gc_rst_wa = 0;
+ atomic_long_set(&pblk->nr_flush, 0);
 +   pblk->nr_flush_rst = 0;
 +
#ifdef CONFIG_NVM_DEBUG
   atomic_long_set(&pblk->inflight_writes, 0);
   atomic_long_set(&pblk->padded_writes, 0);
   atomic_long_set(&pblk->padded_wb, 0);
  -   atomic_long_set(&pblk->nr_flush, 0);
   atomic_long_set(&pblk->req_writes, 0);
   atomic_long_set(&pblk->sub_writes, 0);
   atomic_long_set(&pblk->sync_writes, 0);
 @@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
 struct gendisk *tdisk,
  goto fail_free_luns;
  }
+ pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) *
 sizeof(atomic64_t),
 +GFP_KERNEL);
 +   if (!pblk->pad_dist) {
 +   ret = -ENOMEM;
 +   goto fail_free_line_meta;
 +   }
 +
  ret = pblk_core_init(pblk);
  if (ret) {
  pr_err("pblk: could not initialize core\n");
 -   goto fail_free_line_meta;
 +   goto fail_free_pad_dist;
  }
  ret = pblk_l2p_init(pblk);
 @@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
 struct gendisk *tdisk,
  pblk_l2p_free(pblk);
fail_free_core:
  pblk_core_free(pblk);
 +fail_free_pad_dist:
 +   kfree(pblk->pad_dist);
fail_free_line_meta:
  pblk_line_meta_free(pblk);
fail_free_luns:
 diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
 index 7044b5599cc4..264372d7b537 100644
 --- a/drivers/lightnvm/pblk-rb.c
 +++ b/drivers/lightnvm/pblk-rb.c
 @@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb
 *rb,
 unsigned int nr_entries,
  if (bio->bi_opf & REQ_PREFLUSH) {
  struct pblk *pblk = container_of(rb, struct pblk, rwb);
-#ifdef CONFIG_NVM_DEBUG
  atomic_long_inc(&pblk->nr_flush);
 -#endif
  if (pblk_rb_flush_point_set(&pblk->rwb, bio, mem))
  *io_ret = NVM_IO_OK;
  }
 @@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb
 *rb,
 struct nvm_rq *rqd,
  pr_err("pblk: could not pad page in write
 bio\n");
  return NVM_IO_ERR;
  }
 +
 +   if (pad < pblk->min_write_pgs)
 +   atomic64_inc(&pblk->pad_dist[pad - 1]);
 +   else
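(The rest of the diff is cut off above; per the commit message, the show
side normalizes the per-padding counters when the attribute is read. A
simplified sketch of that computation follows, where my_padded_dist_show()
is a stand-in rather than pblk's actual function, and the zero-pad bucket
is assumed to be derived from the flush count rather than stored:)

    #include <linux/atomic.h>
    #include <linux/kernel.h>
    #include <linux/mm.h>

    /* buckets[i] counts writes padded with exactly (i + 1) sectors;
     * nr_flush counts all flush events, padded or not. */
    static ssize_t my_padded_dist_show(atomic64_t *buckets, int nr_buckets,
                                       long nr_flush, char *page)
    {
            long long total = nr_flush, padded = 0, cnt;
            ssize_t len = 0;
            int i;

            for (i = 0; i < nr_buckets; i++)
                    padded += atomic64_read(&buckets[i]);

            if (!total)
                    return scnprintf(page, PAGE_SIZE, "no samples\n");

            /* bucket 0 (no padding needed) is derived, not counted */
            len += scnprintf(page + len, PAGE_SIZE - len, "0:%lld%%\n",
                             (total - padded) * 100 / total);
            for (i = 0; i < nr_buckets; i++) {
                    cnt = atomic64_read(&buckets[i]);
                    len += scnprintf(page + len, PAGE_SIZE - len,
                                     "%d:%lld%%\n", i + 1,
                                     cnt * 100 / total);
            }
            return len;
    }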
 

Re: [PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to backing device

2018-02-06 Thread Coly Li
On 06/02/2018 6:50 PM, Nix wrote:
> On 6 Feb 2018, Coly Li verbalised:
>> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
>> index 7917b3820dd5..263164490833 100644
>> --- a/drivers/md/bcache/bcache.h
>> +++ b/drivers/md/bcache/bcache.h
>> @@ -287,6 +287,12 @@ struct io {
>>  sector_tlast;
>>  };
>>  
>> +enum stop_on_faliure {
> 
> failure?
> 
>> +BCH_CACHED_DEV_STOP_ATUO = 0,
> 
> AUTO?
> 
>> +enum stop_on_faliurestop_when_cache_set_failed;
> 
> (... and all uses would need changing too.)
> 
Thanks for the errata, I just sent a single update for this patch.

Coly Li


[PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to backing device

2018-02-06 Thread Coly Li
When there are too many I/O errors on cache device, current bcache code
will retire the whole cache set, and detach all bcache devices. But the
detached bcache devices are not stopped, which is problematic when bcache
is in writeback mode.

If the retired cache set has dirty data of backing devices, continue
writing to bcache device will write to backing device directly. If the
LBA of write request has a dirty version cached on cache device, next time
when the cache device is re-registered and backing device re-attached to
it again, the stale dirty data on cache device will be written to backing
device, and overwrite the latest directly written data. This situation
causes real data corruption.

But we cannot simply stop all attached bcache devices when the cache set is
broken or disconnected. For example, use bcache to accelerate performance
of an email service. In such a workload, if the cache device is broken but
no dirty data is lost, keeping the bcache device alive and permitting the
email service to continue accessing user data might be a better solution
for the cache device failure.

Nix  points out the issue and provides the above example
to explain why it might be necessary to not stop bcache device for broken
cache device. Pavel Goran  provides a brilliant
suggestion to provide "always" and "auto" options to per-cached device
sysfs file stop_when_cache_set_failed. If cache set is retiring and the
backing device has no dirty data on cache, it should be safe to keep the
bcache device alive. In this case, if stop_when_cache_set_failed is set to
"auto", the device failure handling code will not stop this bcache device
and will permit applications to access the backing device through the now
unattached bcache device.
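In code terms, the policy described above boils down to something like the
following sketch; should_stop_bcache_dev() is a hypothetical helper for
illustration, dc->has_dirty is bcache's existing dirtiness flag, and the
enum values come from the patch below:

    /* Decide whether a bcache device must be stopped when its cache
     * set retires due to too many I/O errors. */
    static bool should_stop_bcache_dev(struct cached_dev *dc)
    {
            if (dc->stop_when_cache_set_failed == BCH_CACHED_DEV_STOP_ALWAYS)
                    return true;

            /*
             * BCH_CACHED_DEV_STOP_AUTO: stopping is only needed when the
             * retiring cache set still holds dirty data for this device;
             * otherwise it is safe to keep the bcache device alive.
             */
            return atomic_read(&dc->has_dirty) != 0;
    }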

Changelog:
v3: fix typos pointed out by Nix.
v2: change option values of stop_when_cache_set_failed from 1/0 to
"auto"/"always".
v1: initial version, stop_when_cache_set_failed can be 0 (not stop) or 1
(always stop).

Signed-off-by: Coly Li 
Cc: Nix 
Cc: Pavel Goran 
Cc: Michael Lyle 
Cc: Junhui Tang 
Cc: Hannes Reinecke 
---
 drivers/md/bcache/bcache.h |  9 +
 drivers/md/bcache/super.c  | 82 --
 drivers/md/bcache/sysfs.c  | 17 ++
 3 files changed, 98 insertions(+), 10 deletions(-)

diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index 7917b3820dd5..263164490833 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -287,6 +287,12 @@ struct io {
sector_tlast;
 };
 
+enum stop_on_failure {
+   BCH_CACHED_DEV_STOP_AUTO = 0,
+   BCH_CACHED_DEV_STOP_ALWAYS,
+   BCH_CACHED_DEV_STOP_MODE_MAX,
+};
+
 struct cached_dev {
struct list_headlist;
struct bcache_devicedisk;
@@ -379,6 +385,8 @@ struct cached_dev {
unsignedwriteback_rate_i_term_inverse;
unsignedwriteback_rate_p_term_inverse;
unsignedwriteback_rate_minimum;
+
+   enum stop_on_failurestop_when_cache_set_failed;
 };
 
 enum alloc_reserve {
@@ -924,6 +932,7 @@ void bch_write_bdev_super(struct cached_dev *, struct 
closure *);
 
 extern struct workqueue_struct *bcache_wq;
 extern const char * const bch_cache_modes[];
+extern const char * const bch_stop_on_failure_modes[];
 extern struct mutex bch_register_lock;
 extern struct list_head bch_cache_sets;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index f8b0d1196c12..e335433bdfb7 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -47,6 +47,14 @@ const char * const bch_cache_modes[] = {
NULL
 };
 
+/* Default is -1; we skip past it for stop_when_cache_set_failed */
+const char * const bch_stop_on_failure_modes[] = {
+   "default",
+   "auto",
+   "always",
+   NULL
+};
+
 static struct kobject *bcache_kobj;
 struct mutex bch_register_lock;
 LIST_HEAD(bch_cache_sets);
@@ -1187,6 +1195,9 @@ static int cached_dev_init(struct cached_dev *dc, 
unsigned block_size)
max(dc->disk.disk->queue->backing_dev_info->ra_pages,
q->backing_dev_info->ra_pages);
 
+   /* default to auto */
+   dc->stop_when_cache_set_failed = BCH_CACHED_DEV_STOP_AUTO;
+
bch_cached_dev_request_init(dc);
bch_cached_dev_writeback_init(dc);
return 0;
@@ -1463,25 +1474,76 @@ static void cache_set_flush(struct closure *cl)
closure_return(cl);
 }
 
+/*
+ * This function is only called when CACHE_SET_IO_DISABLE is set, which means
+ * cache set is unregistering due to too many I/O errors. In this condition,
+ * the bcache device might be stopped, it depends on stop_when_cache_set_failed
+ * value and whether the broken cache has dirty data:
+ *
+ * dc->stop_when_cache_set_faileddc->has_dirty   stop bcache device
+ *  

Re: [PATCH 4/5] lightnvm: pblk: add padding distribution sysfs attribute

2018-02-06 Thread Matias Bjørling

On 02/06/2018 10:27 AM, Hans Holmberg wrote:

Hi Matias,

On Wed, Jan 31, 2018 at 9:44 AM, Matias Bjørling  wrote:

On 01/31/2018 03:06 AM, Javier González wrote:


From: Hans Holmberg 

When pblk receives a sync, all data up to that point in the write buffer
must be committed to persistent storage, and as flash memory comes with a
minimal write size there is a significant cost involved both in terms
of time for completing the sync and in terms of write amplification:
padded sectors for filling up to the minimal write size.

In order to get a better understanding of the costs involved for syncs,
add a sysfs attribute to pblk: padded_dist, showing a normalized
distribution of sectors padded. In order to facilitate measurements of
specific workloads during the lifetime of the pblk instance, the
distribution can be reset by writing 0 to the attribute.

Do this by introducing counters for each possible padding:
{0..(minimal write size - 1)} and calculate the normalized distribution
when showing the attribute.

Signed-off-by: Hans Holmberg 
Signed-off-by: Javier González 
---
   drivers/lightnvm/pblk-init.c  | 16 --
   drivers/lightnvm/pblk-rb.c| 15 +-
   drivers/lightnvm/pblk-sysfs.c | 68
+++
   drivers/lightnvm/pblk.h   |  6 +++-
   4 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
index 7eedc5daebc8..a5e3510c0f38 100644
--- a/drivers/lightnvm/pblk-init.c
+++ b/drivers/lightnvm/pblk-init.c
@@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk)
   {
 pblk_luns_free(pblk);
 pblk_lines_free(pblk);
+   kfree(pblk->pad_dist);
 pblk_line_meta_free(pblk);
 pblk_core_free(pblk);
 pblk_l2p_free(pblk);
@@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
struct gendisk *tdisk,
 pblk->pad_rst_wa = 0;
 pblk->gc_rst_wa = 0;
   + atomic_long_set(&pblk->nr_flush, 0);
+   pblk->nr_flush_rst = 0;
+
   #ifdef CONFIG_NVM_DEBUG
  atomic_long_set(&pblk->inflight_writes, 0);
  atomic_long_set(&pblk->padded_writes, 0);
  atomic_long_set(&pblk->padded_wb, 0);
-   atomic_long_set(&pblk->nr_flush, 0);
  atomic_long_set(&pblk->req_writes, 0);
  atomic_long_set(&pblk->sub_writes, 0);
  atomic_long_set(&pblk->sync_writes, 0);
@@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
struct gendisk *tdisk,
 goto fail_free_luns;
 }
   + pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) *
sizeof(atomic64_t),
+GFP_KERNEL);
+   if (!pblk->pad_dist) {
+   ret = -ENOMEM;
+   goto fail_free_line_meta;
+   }
+
 ret = pblk_core_init(pblk);
 if (ret) {
 pr_err("pblk: could not initialize core\n");
-   goto fail_free_line_meta;
+   goto fail_free_pad_dist;
 }
 ret = pblk_l2p_init(pblk);
@@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
struct gendisk *tdisk,
 pblk_l2p_free(pblk);
   fail_free_core:
 pblk_core_free(pblk);
+fail_free_pad_dist:
+   kfree(pblk->pad_dist);
   fail_free_line_meta:
 pblk_line_meta_free(pblk);
   fail_free_luns:
diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
index 7044b5599cc4..264372d7b537 100644
--- a/drivers/lightnvm/pblk-rb.c
+++ b/drivers/lightnvm/pblk-rb.c
@@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb,
unsigned int nr_entries,
 if (bio->bi_opf & REQ_PREFLUSH) {
 struct pblk *pblk = container_of(rb, struct pblk, rwb);
   -#ifdef CONFIG_NVM_DEBUG
  atomic_long_inc(&pblk->nr_flush);
-#endif
  if (pblk_rb_flush_point_set(&pblk->rwb, bio, mem))
 *io_ret = NVM_IO_OK;
 }
@@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb,
struct nvm_rq *rqd,
 pr_err("pblk: could not pad page in write bio\n");
 return NVM_IO_ERR;
 }
+
+   if (pad < pblk->min_write_pgs)
+   atomic64_inc(&pblk->pad_dist[pad - 1]);
+   else
+   pr_warn("pblk: padding more than min. sectors\n");



Curious, when would this happen? Would it be an implementation error or
something a user did wrong?


This would be an implementation error - and this check is just a
safeguard to make sure we won't go out of bounds when updating the
statistics.



Ah, does it make sense to have such defensive programming, when the 
value is never persisted? A 64-bit integer is quite large, and if it is 
only used for the workloads of a single instance, it would practically 
never trigger, and if it did, would anyone care?






Looks good to me. Let me know if you want to send a new patch, or if 

Re: [PATCH v5 06/10] bcache: add stop_when_cache_set_failed option to backing device

2018-02-06 Thread Nix
On 6 Feb 2018, Coly Li verbalised:
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index 7917b3820dd5..263164490833 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -287,6 +287,12 @@ struct io {
>   sector_tlast;
>  };
>  
> +enum stop_on_faliure {

failure?

> + BCH_CACHED_DEV_STOP_ATUO = 0,

AUTO?

> + enum stop_on_faliurestop_when_cache_set_failed;

(... and all uses would need changing too.)

-- 
NULL && (void)


Re: [PATCH 2/2] block, char_dev: Use correct format specifier for unsigned ints

2018-02-06 Thread Greg KH
On Mon, Feb 05, 2018 at 06:25:27PM -0800, Srivatsa S. Bhat wrote:
> From: Srivatsa S. Bhat 
> 
> register_blkdev() and __register_chrdev_region() treat the major
> number as an unsigned int. So print it the same way to avoid
> absurd error statements such as:
> "... major requested (-1) is greater than the maximum (511) ..."
> (and also fix off-by-one bugs in the error prints).
> 
> While at it, also update the comment describing register_blkdev().
> 
> Signed-off-by: Srivatsa S. Bhat 
> ---
> 
>  block/genhd.c |   19 +++
>  fs/char_dev.c |4 ++--
>  2 files changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 88a53c1..21a18e5 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -308,19 +308,22 @@ void blkdev_show(struct seq_file *seqf, off_t offset)
>  /**
>   * register_blkdev - register a new block device
>   *
> - * @major: the requested major device number [1..255]. If @major = 0, try to
> - * allocate any unused major number.
> + * @major: the requested major device number [1..BLKDEV_MAJOR_MAX-1]. If
> + * @major = 0, try to allocate any unused major number.
>   * @name: the name of the new block device as a zero terminated string
>   *
>   * The @name must be unique within the system.
>   *
>   * The return value depends on the @major input parameter:
>   *
> - *  - if a major device number was requested in range [1..255] then the
> - *function returns zero on success, or a negative error code
> + *  - if a major device number was requested in range [1..BLKDEV_MAJOR_MAX-1]
> + *then the function returns zero on success, or a negative error code
>   *  - if any unused major number was requested with @major = 0 parameter
>   *then the return value is the allocated major number in range
> - *[1..255] or a negative error code otherwise
> + *[1..BLKDEV_MAJOR_MAX-1] or a negative error code otherwise
> + *
> + * See Documentation/admin-guide/devices.txt for the list of allocated
> + * major numbers.
>   */
>  int register_blkdev(unsigned int major, const char *name)
>  {
> @@ -347,8 +350,8 @@ int register_blkdev(unsigned int major, const char *name)
>   }
>  
>   if (major >= BLKDEV_MAJOR_MAX) {
> - pr_err("register_blkdev: major requested (%d) is greater than 
> the maximum (%d) for %s\n",
> -major, BLKDEV_MAJOR_MAX, name);
> + pr_err("register_blkdev: major requested (%u) is greater than 
> the maximum (%u) for %s\n",
> +major, BLKDEV_MAJOR_MAX-1, name);
>  
>   ret = -EINVAL;
>   goto out;
> @@ -375,7 +378,7 @@ int register_blkdev(unsigned int major, const char *name)
>   ret = -EBUSY;
>  
>   if (ret < 0) {
> - printk("register_blkdev: cannot get major %d for %s\n",
> + printk("register_blkdev: cannot get major %u for %s\n",
>  major, name);
>   kfree(p);
>   }
> diff --git a/fs/char_dev.c b/fs/char_dev.c
> index 33c9385..a279c58 100644
> --- a/fs/char_dev.c
> +++ b/fs/char_dev.c
> @@ -121,8 +121,8 @@ __register_chrdev_region(unsigned int major, unsigned int 
> baseminor,
>   }
>  
>   if (major >= CHRDEV_MAJOR_MAX) {
> - pr_err("CHRDEV \"%s\" major requested (%d) is greater than the 
> maximum (%d)\n",
> -name, major, CHRDEV_MAJOR_MAX);
> + pr_err("CHRDEV \"%s\" major requested (%u) is greater than the 
> maximum (%u)\n",
> +name, major, CHRDEV_MAJOR_MAX-1);
>   ret = -EINVAL;
>   goto out;
>   }

Thanks for both of these patches, if Al doesn't grab them, I will after
4.16-rc1 comes out.

greg k-h
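For reference, a minimal sketch of the two calling conventions the updated
register_blkdev() comment describes; mydrv is a placeholder name:

    #include <linux/module.h>
    #include <linux/fs.h>
    #include <linux/genhd.h>

    static int mydrv_major;

    static int __init mydrv_init(void)
    {
            /* @major == 0: any unused major is allocated and returned;
             * a fixed major in [1..BLKDEV_MAJOR_MAX-1] returns 0 instead. */
            mydrv_major = register_blkdev(0, "mydrv");
            if (mydrv_major < 0)
                    return mydrv_major;     /* negative errno on failure */
            return 0;
    }

    static void __exit mydrv_exit(void)
    {
            unregister_blkdev(mydrv_major, "mydrv");
    }

    module_init(mydrv_init);
    module_exit(mydrv_exit);
    MODULE_LICENSE("GPL");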


Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-06 Thread Ming Lei
On Tue, Feb 06, 2018 at 09:39:26AM +0100, Hannes Reinecke wrote:
> On 02/05/2018 04:20 PM, Ming Lei wrote:
> > This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, meantime maps
> > each reply queue to blk_mq's hw queue, then .queuecommand can always
> > choose the hw queue as the reply queue. And if no any online CPU is
> > mapped to one hw queue, request can't be submitted to this hw queue
> > at all, finally the irq affinity issue is solved.
> > 
> > Cc: Hannes Reinecke 
> > Cc: Arun Easi 
> > Cc: Omar Sandoval ,
> > Cc: "Martin K. Petersen" ,
> > Cc: James Bottomley ,
> > Cc: Christoph Hellwig ,
> > Cc: Don Brace 
> > Cc: Kashyap Desai 
> > Cc: Peter Rivera 
> > Cc: Paolo Bonzini 
> > Cc: Mike Snitzer 
> > Tested-by: Laurence Oberman 
> > Signed-off-by: Ming Lei 
> > ---
> >  drivers/scsi/hpsa.c | 51 
> > ++-
> >  1 file changed, 34 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > index 443eabf63a9f..e517a4c74a28 100644
> > --- a/drivers/scsi/hpsa.c
> > +++ b/drivers/scsi/hpsa.c
> > @@ -51,6 +51,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <linux/blk-mq-pci.h>
> >  #include 
> >  #include 
> >  #include "hpsa_cmd.h"
> > @@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
> >  #define HPSA_NRESERVED_CMDS(HPSA_CMDS_RESERVED_FOR_DRIVER +\
> >  HPSA_MAX_CONCURRENT_PASSTHRUS)
> >  
> > +static int hpsa_map_queues(struct Scsi_Host *shost)
> > +{
> > +struct ctlr_info *h = shost_to_hba(shost);
> > +
> > +return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
> > +}
> > +
> >  static struct scsi_host_template hpsa_driver_template = {
> > .module = THIS_MODULE,
> > .name   = HPSA,
> > @@ -974,10 +982,13 @@ static struct scsi_host_template hpsa_driver_template 
> > = {
> >  #ifdef CONFIG_COMPAT
> > .compat_ioctl   = hpsa_compat_ioctl,
> >  #endif
> > +   .map_queues = hpsa_map_queues,
> > .sdev_attrs = hpsa_sdev_attrs,
> > .shost_attrs = hpsa_shost_attrs,
> > .max_sectors = 1024,
> > .no_write_same = 1,
> > +   .force_blk_mq = 1,
> > +   .host_tagset = 1,
> >  };
> >  
> >  static inline u32 next_command(struct ctlr_info *h, u8 q)
> > @@ -1045,11 +1056,7 @@ static void set_performant_mode(struct ctlr_info *h, 
> > struct CommandList *c,
> > c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
> > if (unlikely(!h->msix_vectors))
> > return;
> > -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> > -   c->Header.ReplyQueue =
> > -   raw_smp_processor_id() % h->nreply_queues;
> > -   else
> > -   c->Header.ReplyQueue = reply_queue % h->nreply_queues;
> > +   c->Header.ReplyQueue = reply_queue;
> > }
> >  }
> >  
> > @@ -1063,10 +1070,7 @@ static void set_ioaccel1_performant_mode(struct 
> > ctlr_info *h,
> >  * Tell the controller to post the reply to the queue for this
> >  * processor.  This seems to give the best I/O throughput.
> >  */
> > -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> > -   cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
> > -   else
> > -   cp->ReplyQueue = reply_queue % h->nreply_queues;
> > +   cp->ReplyQueue = reply_queue;
> > /*
> >  * Set the bits in the address sent down to include:
> >  *  - performant mode bit (bit 0)
> > @@ -1087,10 +1091,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
> > ctlr_info *h,
> > /* Tell the controller to post the reply to the queue for this
> >  * processor.  This seems to give the best I/O throughput.
> >  */
> > -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> > -   cp->reply_queue = smp_processor_id() % h->nreply_queues;
> > -   else
> > -   cp->reply_queue = reply_queue % h->nreply_queues;
> > +   cp->reply_queue = reply_queue;
> > /* Set the bits in the address sent down to include:
> >  *  - performant mode bit not used in ioaccel mode 2
> >  *  - pull count (bits 0-3)
> > @@ -1109,10 +1110,7 @@ static void set_ioaccel2_performant_mode(struct 
> > ctlr_info *h,
> >  * Tell the controller to post the reply to the queue for this
> >  * processor.  This seems to give the best I/O throughput.
> >  */
> > -   if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> > -   cp->reply_queue = smp_processor_id() % h->nreply_queues;
> > -   else
> > -   cp->reply_queue = reply_queue % h->nreply_queues;
> > +   cp->reply_queue = reply_queue;
> > /*
> >  * Set the 

Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Tal Gilboa

On 2/6/2018 11:34 AM, Sagi Grimberg wrote:

Hi Tal,


I think Tal has idea/s on how the existing library can be changed to
support more modes/models

What I was thinking is allowing DIM algorithm to disregard data which 
is 0. Currently if bytes == 0 we return "SAME" immediately. We can 
change it to simply move to the packets check (which may be renamed to 
"completions"). This way you could use DIM while only optimizing to 
(P1) high packet rate and (P2) low interrupt rate.


That was exactly where I started from. But unfortunately it did not work
well :(

From my experiments, the moderation was all over the place, failing to
converge. At least for the workloads that I've tested, it was more
successful to have a stricter step policy and to pull towards latency
if we are consistently catching a single completion per event.

I'm not an expert here at all, but at this point, based on my attempts
so far, I'm not convinced the current net_dim scheme could work.
I do believe we can make it work. I see your addition of the cpe part to 
stats compare. Might not be a bad idea for networking devices. Overall, 
it seems to me like this would be a private case of the general DIM 
optimization, since it doesn't need to account for aggregation, for 
instance, which breaks the "more packets == more data" ratio.


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-06 Thread Danil Kipnis
On Mon, Feb 5, 2018 at 7:38 PM, Bart Van Assche  wrote:
> On 02/05/18 08:40, Danil Kipnis wrote:
>>
>> It just occurred to me, that we could easily extend the interface in
>> such a way that each client (i.e. each session) would have on server
>> side her own directory with the devices it can access. I.e. instead of
>> just "dev_search_path" per server, any client would be able to only
>> access devices under dev_search_path/session_name. (session name
>> must already be generated by each client in a unique way). This way
>> one could have an explicit control over which devices can be accessed
>> by which clients. Do you think that would do it?
>
>
> Hello Danil,
>
> That sounds interesting to me. However, I think that approach requires to
> configure client access completely before the kernel target side module is
> loaded. It does not allow to configure permissions dynamically after the
> kernel target module has been loaded. Additionally, I don't see how to
> support attributes per (initiator, block device) pair with that approach.
> LIO e.g. supports the
> /sys/kernel/config/target/srpt/*/*/acls/*/lun_*/write_protect attribute. You
> may want to implement similar functionality if you want to convince more
> users to use IBNBD.
>
> Thanks,
>
> Bart.

Hello Bart,

the configuration (which devices can be accessed by a particular
client) can also happen after the kernel target module is loaded. The
directory in dev_search_path is a module parameter and is fixed. It
contains for example "/ibnbd_devices/". But a particular client X
would be able to only access the devices located in the subdirectory
"/ibnbd_devices/client_x/". (The sessionname here is client_x) One can
add or remove the devices from that directory (those are just symlinks
to /dev/xxx) at any time - before or after the server module is
loaded. But you are right, we need something additional in order to be
able to specify which devices a client can access writable and which
readonly. May be another subdirectories "wr" and "ro" for each client:
those under /ibnbd_devices/client_x/ro/ can only be read by client_x
and those in /ibnbd_devices/client_x/wr/ can also be written to?
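To illustrate the proposed layout (all names are examples only):

    /ibnbd_devices/
        client_x/
            ro/
                vol0 -> /dev/sda1        (read-only for client_x)
            wr/
                vol1 -> /dev/nvme0n1     (read-write for client_x)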

Thanks,

Danil.


Re: [PATCH 3/4] lightnvm: add 2.0 geometry identification

2018-02-06 Thread Matias Bjørling

On 02/05/2018 07:04 PM, Randy Dunlap wrote:

On 02/05/2018 04:15 AM, Matias Bjørling wrote:

Implement the geometry data structures for 2.0 and enable a drive
to be identified as one, including exposing the appropriate 2.0
sysfs entries.

Signed-off-by: Matias Bjørling 
---
  drivers/lightnvm/core.c  |   2 +-
  drivers/nvme/host/lightnvm.c | 334 +--
  include/linux/lightnvm.h |  11 +-
  3 files changed, 295 insertions(+), 52 deletions(-)

diff --git a/drivers/lightnvm/core.c b/drivers/lightnvm/core.c
index c72863b36439..250e74dfa120 100644
--- a/drivers/lightnvm/core.c
+++ b/drivers/lightnvm/core.c
@@ -934,7 +934,7 @@ static int nvm_init(struct nvm_dev *dev)
pr_debug("nvm: ver:%x nvm_vendor:%x\n",
dev->identity.ver_id, dev->identity.vmnt);
  
-	if (dev->identity.ver_id != 1) {

+   if (dev->identity.ver_id != 1 && dev->identity.ver_id != 2) {
pr_err("nvm: device not supported by kernel.");
goto err;
}


Hi,
The pr_err() above could be a bit more informative to the user. E.g.,
pr_err("nvm: device ver_id %d not supported by kernel.",
dev->identity.ver_id);

BTW, isn't that line missing a '\n'?



Good point! Thanks. I'll add it in.


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Paolo Valente


> On 6 Feb 2018, at 08:56, Mike Galbraith wrote:
> 
> On Tue, 2018-02-06 at 08:44 +0100, Oleksandr Natalenko wrote:
>> Hi, Paolo.
>> 
>> I can confirm that this patch fixes cfdisk hang for me. I've also tried 
>> to trigger the issue Mike has encountered, but with no luck (maybe, I 
>> wasn't insistent enough, just was doing dd on usb-storage device in the 
>> VM).
> 
> I was doing kbuilds, and it blew up on me twice.  Switching back to cfq
> seemed to confirm it was indeed the patch causing trouble, but that's
> by no means a certainty.
> 

Hi Mike,
as you can imagine, I didn't get any failure in my pre-submission
tests on this patch.  In addition, it is not that easy to link this
patch, which just adds some internal bfq housekeeping in case of a
requeue, with a corruption of external lists for general I/O
management.

In this respect, as Oleksandr's comments point out, by switching from
cfq to bfq you switch much more than just the scheduler (cfq runs on the
legacy block path, bfq on blk-mq).  Anyway, who knows ...

Maybe this report will ring a bell with some block-layer expert, or,
hopefully, next feedback will help puzzle this out.

Thanks,
Paolo


>   -Mike



Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Sagi Grimberg

Hi Tal,


I think Tal has idea/s on how the existing library can be changed to
support more modes/models

What I was thinking is allowing DIM algorithm to disregard data which is 
0. Currently if bytes == 0 we return "SAME" immediately. We can change 
it to simply move to the packets check (which may be renamed to 
"completions"). This way you could use DIM while only optimizing to (P1) 
high packet rate and (P2) low interrupt rate.


That was exactly where I started from. But unfortunately it did not work
well :(

From my experiments, the moderation was all over the place, failing to
converge. At least for the workloads that I've tested, it was more
successful to have a stricter step policy and to pull towards latency
if we are consistently catching a single completion per event.

I'm not an expert here at all, but at this point, based on my attempts
so far, I'm not convinced the current net_dim scheme could work.


Re: [PATCH 4/5] lightnvm: pblk: add padding distribution sysfs attribute

2018-02-06 Thread Hans Holmberg
Hi Matias,

On Wed, Jan 31, 2018 at 9:44 AM, Matias Bjørling  wrote:
> On 01/31/2018 03:06 AM, Javier González wrote:
>>
>> From: Hans Holmberg 
>>
>> When pblk receives a sync, all data up to that point in the write buffer
>> must be committed to persistent storage, and as flash memory comes with a
>> minimal write size there is a significant cost involved both in terms
>> of time for completing the sync and in terms of write amplification:
>> padded sectors for filling up to the minimal write size.
>>
>> In order to get a better understanding of the costs involved for syncs,
>> add a sysfs attribute to pblk: padded_dist, showing a normalized
>> distribution of sectors padded. In order to facilitate measurements of
>> specific workloads during the lifetime of the pblk instance, the
>> distribution can be reset by writing 0 to the attribute.
>>
>> Do this by introducing counters for each possible padding:
>> {0..(minimal write size - 1)} and calculate the normalized distribution
>> when showing the attribute.
>>
>> Signed-off-by: Hans Holmberg 
>> Signed-off-by: Javier González 
>> ---
>>   drivers/lightnvm/pblk-init.c  | 16 --
>>   drivers/lightnvm/pblk-rb.c| 15 +-
>>   drivers/lightnvm/pblk-sysfs.c | 68
>> +++
>>   drivers/lightnvm/pblk.h   |  6 +++-
>>   4 files changed, 95 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/lightnvm/pblk-init.c b/drivers/lightnvm/pblk-init.c
>> index 7eedc5daebc8..a5e3510c0f38 100644
>> --- a/drivers/lightnvm/pblk-init.c
>> +++ b/drivers/lightnvm/pblk-init.c
>> @@ -921,6 +921,7 @@ static void pblk_free(struct pblk *pblk)
>>   {
>> pblk_luns_free(pblk);
>> pblk_lines_free(pblk);
>> +   kfree(pblk->pad_dist);
>> pblk_line_meta_free(pblk);
>> pblk_core_free(pblk);
>> pblk_l2p_free(pblk);
>> @@ -998,11 +999,13 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
>> struct gendisk *tdisk,
>> pblk->pad_rst_wa = 0;
>> pblk->gc_rst_wa = 0;
>>   + atomic_long_set(&pblk->nr_flush, 0);
>> +   pblk->nr_flush_rst = 0;
>> +
>>   #ifdef CONFIG_NVM_DEBUG
>> atomic_long_set(&pblk->inflight_writes, 0);
>> atomic_long_set(&pblk->padded_writes, 0);
>> atomic_long_set(&pblk->padded_wb, 0);
>> -   atomic_long_set(&pblk->nr_flush, 0);
>> atomic_long_set(&pblk->req_writes, 0);
>> atomic_long_set(&pblk->sub_writes, 0);
>> atomic_long_set(&pblk->sync_writes, 0);
>> @@ -1034,10 +1037,17 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
>> struct gendisk *tdisk,
>> goto fail_free_luns;
>> }
>>   + pblk->pad_dist = kzalloc((pblk->min_write_pgs - 1) *
>> sizeof(atomic64_t),
>> +GFP_KERNEL);
>> +   if (!pblk->pad_dist) {
>> +   ret = -ENOMEM;
>> +   goto fail_free_line_meta;
>> +   }
>> +
>> ret = pblk_core_init(pblk);
>> if (ret) {
>> pr_err("pblk: could not initialize core\n");
>> -   goto fail_free_line_meta;
>> +   goto fail_free_pad_dist;
>> }
>> ret = pblk_l2p_init(pblk);
>> @@ -1097,6 +1107,8 @@ static void *pblk_init(struct nvm_tgt_dev *dev,
>> struct gendisk *tdisk,
>> pblk_l2p_free(pblk);
>>   fail_free_core:
>> pblk_core_free(pblk);
>> +fail_free_pad_dist:
>> +   kfree(pblk->pad_dist);
>>   fail_free_line_meta:
>> pblk_line_meta_free(pblk);
>>   fail_free_luns:
>> diff --git a/drivers/lightnvm/pblk-rb.c b/drivers/lightnvm/pblk-rb.c
>> index 7044b5599cc4..264372d7b537 100644
>> --- a/drivers/lightnvm/pblk-rb.c
>> +++ b/drivers/lightnvm/pblk-rb.c
>> @@ -437,9 +437,7 @@ static int pblk_rb_may_write_flush(struct pblk_rb *rb,
>> unsigned int nr_entries,
>> if (bio->bi_opf & REQ_PREFLUSH) {
>> struct pblk *pblk = container_of(rb, struct pblk, rwb);
>>   -#ifdef CONFIG_NVM_DEBUG
>> atomic_long_inc(&pblk->nr_flush);
>> -#endif
>> if (pblk_rb_flush_point_set(&pblk->rwb, bio, mem))
>> *io_ret = NVM_IO_OK;
>> }
>> @@ -620,14 +618,17 @@ unsigned int pblk_rb_read_to_bio(struct pblk_rb *rb,
>> struct nvm_rq *rqd,
>> pr_err("pblk: could not pad page in write bio\n");
>> return NVM_IO_ERR;
>> }
>> +
>> +   if (pad < pblk->min_write_pgs)
>> +   atomic64_inc(&pblk->pad_dist[pad - 1]);
>> +   else
>> +   pr_warn("pblk: padding more than min. sectors\n");
>
>
> Curious, when would this happen? Would it be an implementation error or
> something a user did wrong?

This would be an implementation error - and this check is just a
safeguard to make sure we won't go out of bounds when updating the
statistics.

>
>
>> +
>> +   atomic64_add(pad, &pblk->pad_wa);
>> }
>>   - 

Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Sagi Grimberg

Hey Or,


not any more, Andy and Tal made the mlx5 AM code a kernel library
which is called DIM

f4e5f0e MAINTAINERS: add entry for Dynamic Interrupt Moderation
6a8788f bnxt_en: add support for software dynamic interrupt moderation
8115b75 net/dim: use struct net_dim_sample as arg to net_dim
4c4dbb4 net/mlx5e: Move dynamic interrupt coalescing code to include/linux
9a31742 net/mlx5e: Change Mellanox references in DIM code
b9c872f net/mlx5e: Move generic functions to new file
f5e7f67 net/mlx5e: Move AM logic enums
138968e net/mlx5e: Remove rq references in mlx5e_rx_am
f58ee09 net/mlx5e: Move interrupt moderation forward declarations
98dd1ed net/mlx5e: Move interrupt moderation structs to new file

can you make use of that? cc-ing them in case you have questions/comments


Thanks a lot for the pointer, I'm not following net-next regularly.
This seems to be a copy of the mlx5 code.

So as mentioned in the cover letter, the implementation was inspired
from the mlx5 driver as it had some of the abstraction properties I was
aiming to achieve.

What I didn't mention was that I started by using the mlx5
implementation as-is (slightly modified). Unfortunately, in the
workloads that I've tested, I was not able to get the AM level to
converge. So I went in a slightly different direction, which gave me
better results (although still not perfect and only lightly tested).

This led me to believe that the strategy mlx5 offers might not suit
storage I/O workloads (although I do acknowledge that I could be proven
wrong here).

For example, a design choice I took was that the moderation scheme would
be more aggressive towards latency at the expense of throughput
optimization, since high-end storage devices are often expected to meet
strict latency requirements. Does that make sense for network devices?
I don't know.

Another difference is that unlike net devices, storage I/O transactions
usually look like request/response where the data transfer is offloaded
by the controller/hba (net devices moderate on raw RX datagrams).
Moreover, it might not be trivial to track the bytes/sec from a
completion at all (might be embedded somewhere in the consumer context).

So overall, at this point I'm a bit skeptical that it will make sense
to commonise the two implementations. I'd like to hear some more input
if this is something that the community is interested in first. If this
is something people are interested in, we can look if it makes sense to
reuse net_dim.h or not.


Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Tal Gilboa

On 2/6/2018 10:54 AM, Or Gerlitz wrote:

On Tue, Feb 6, 2018 at 12:03 AM, Sagi Grimberg  wrote:


The main reason why this implementation is different than the common
networking devices implementation (and kept separate) is that, in my
mind at least, network devices are different animals than other I/O
devices in the sense that:


Oh, I see now that you do refer to the netdev library; I was confused
by the "In the networking stack, each device driver implements adaptive
IRQ moderation on its own" comment.


(a) network devices rely heavily on byte count of raw ethernet frames
for adaptive moderation, while in storage or I/O, the byte count is
often a result of a submission/completion transaction and sometimes
even known only to the application on top of the infrastructure (like
in the rdma case).



(b) Performance characteristics and expectations in representative
workloads.
(c) network devices collect all sorts of stats for different
functionalities (where adaptive moderation is only one use-case) and
I'm not sure at all that a subset of stats could easily migrate to a
different context.


I think Tal has idea/s on how the existing library can be changed to
support more modes/models

What I was thinking is allowing DIM algorithm to disregard data which is 
0. Currently if bytes == 0 we return "SAME" immediately. We can change 
it to simply move to the packets check (which may be renamed to 
"completions"). This way you could use DIM while only optimizing to (P1) 
high packet rate and (P2) low interrupt rate.
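A sketch of that proposal, written against a simplified stand-in for the
library's statistics comparison; the my_* names are placeholders rather
than net_dim's actual internals, and the roughly 10% hysteresis mirrors
the library's behavior:

    enum my_dim_verdict { MY_DIM_WORSE, MY_DIM_SAME, MY_DIM_BETTER };

    struct my_dim_stats {
            u32 cpms;       /* completions (nee packets) per msec */
            u32 bpms;       /* bytes per msec */
    };

    static enum my_dim_verdict my_dim_cmp(u32 curr, u32 prev)
    {
            /* treat anything within ~10% as noise */
            if (curr > prev + prev / 10)
                    return MY_DIM_BETTER;
            if (curr + curr / 10 < prev)
                    return MY_DIM_WORSE;
            return MY_DIM_SAME;
    }

    static enum my_dim_verdict my_dim_stats_compare(struct my_dim_stats *curr,
                                                    struct my_dim_stats *prev)
    {
            /*
             * Proposed change: bytes == 0 no longer short-circuits to
             * SAME; fall through and judge on the completion rate alone,
             * so pure request/response consumers still drive the algorithm.
             */
            if (curr->bpms && prev->bpms)
                    return my_dim_cmp(curr->bpms, prev->bpms);

            return my_dim_cmp(curr->cpms, prev->cpms);
    }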


Re: [PATCH rfc 0/5] generic adaptive IRQ moderation library for I/O devices

2018-02-06 Thread Or Gerlitz
On Tue, Feb 6, 2018 at 12:03 AM, Sagi Grimberg  wrote:

> The main reason why this implementation is different than the common 
> networking devices
> implementation (and kept separate) is that in my mind at least, network 
> devices are different
> animals than other I/O devices in the sense that:

Oh, I see now that you do refer to the netdev library; I was confused
by the "In the networking stack, each device driver implements adaptive
IRQ moderation on its own" comment.

> (a) network devices rely heavily on byte count of raw ethernet frames
> for adaptive moderation, while in storage or I/O, the byte count is
> often a result of a submission/completion transaction and sometimes
> even known only to the application on top of the infrastructure (like
> in the rdma case).

> (b) Performance characteristics and expectations in representative
> workloads.
> (c) network devices collect all sorts of stats for different
> functionalities (where adaptive moderation is only one use-case) and
> I'm not sure at all that a subset of stats could easily migrate to a
> different context.

I think Tal has idea/s on how the existing library can be changed to
support more modes/models


Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Mike Galbraith
On Tue, 2018-02-06 at 09:37 +0100, Oleksandr Natalenko wrote:
> Hi.
> 
> 06.02.2018 08:56, Mike Galbraith wrote:
> > I was doing kbuilds, and it blew up on me twice.  Switching back to cfq
> > seemed to confirm it was indeed the patch causing trouble, but that's
> > by no means a certainty.
> 
> Just to note, I was using v4.15.1, not the latest git HEAD. Are you able 
> to reproduce it on the stable kernel?

I didn't even try to wedge it into 4.15.1, tested it as posted.

>  Also, assuming this issue might be 
> unrelated to the BFQ itself, did you manage to reproduce the trouble 
> with another blk-mq scheduler (not CFQ, but mq-deadline/Kyber)?

I can give another scheduler a go this afternoon.

-Mike


Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-06 Thread Hannes Reinecke
On 02/05/2018 04:20 PM, Ming Lei wrote:
> This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, meantime maps
> each reply queue to blk_mq's hw queue, then .queuecommand can always
> choose the hw queue as the reply queue. And if no any online CPU is
> mapped to one hw queue, request can't be submitted to this hw queue
> at all, finally the irq affinity issue is solved.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hpsa.c | 51 ++-
>  1 file changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 443eabf63a9f..e517a4c74a28 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/blk-mq-pci.h>
>  #include 
>  #include 
>  #include "hpsa_cmd.h"
> @@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
>  #define HPSA_NRESERVED_CMDS  (HPSA_CMDS_RESERVED_FOR_DRIVER +\
>HPSA_MAX_CONCURRENT_PASSTHRUS)
>  
> +static int hpsa_map_queues(struct Scsi_Host *shost)
> +{
> +struct ctlr_info *h = shost_to_hba(shost);
> +
> +return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
> +}
> +
>  static struct scsi_host_template hpsa_driver_template = {
>   .module = THIS_MODULE,
>   .name   = HPSA,
> @@ -974,10 +982,13 @@ static struct scsi_host_template hpsa_driver_template = 
> {
>  #ifdef CONFIG_COMPAT
>   .compat_ioctl   = hpsa_compat_ioctl,
>  #endif
> + .map_queues = hpsa_map_queues,
>   .sdev_attrs = hpsa_sdev_attrs,
>   .shost_attrs = hpsa_shost_attrs,
>   .max_sectors = 1024,
>   .no_write_same = 1,
> + .force_blk_mq = 1,
> + .host_tagset = 1,
>  };
>  
>  static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1045,11 +1056,7 @@ static void set_performant_mode(struct ctlr_info *h, 
> struct CommandList *c,
>   c->busaddr |= 1 | (h->blockFetchTable[c->Header.SGList] << 1);
>   if (unlikely(!h->msix_vectors))
>   return;
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - c->Header.ReplyQueue =
> - raw_smp_processor_id() % h->nreply_queues;
> - else
> - c->Header.ReplyQueue = reply_queue % h->nreply_queues;
> + c->Header.ReplyQueue = reply_queue;
>   }
>  }
>  
> @@ -1063,10 +1070,7 @@ static void set_ioaccel1_performant_mode(struct 
> ctlr_info *h,
>* Tell the controller to post the reply to the queue for this
>* processor.  This seems to give the best I/O throughput.
>*/
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->ReplyQueue = smp_processor_id() % h->nreply_queues;
> - else
> - cp->ReplyQueue = reply_queue % h->nreply_queues;
> + cp->ReplyQueue = reply_queue;
>   /*
>* Set the bits in the address sent down to include:
>*  - performant mode bit (bit 0)
> @@ -1087,10 +1091,7 @@ static void set_ioaccel2_tmf_performant_mode(struct 
> ctlr_info *h,
>   /* Tell the controller to post the reply to the queue for this
>* processor.  This seems to give the best I/O throughput.
>*/
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h->nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /* Set the bits in the address sent down to include:
>*  - performant mode bit not used in ioaccel mode 2
>*  - pull count (bits 0-3)
> @@ -1109,10 +1110,7 @@ static void set_ioaccel2_performant_mode(struct 
> ctlr_info *h,
>* Tell the controller to post the reply to the queue for this
>* processor.  This seems to give the best I/O throughput.
>*/
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h->nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /*
>* Set the bits in the address sent down to include:
>*  - performant mode bit not used in ioaccel mode 2
> @@ -1152,11 +1150,27 @@ static void 
> dial_up_lockup_detection_on_fw_flash_complete(struct 

Re: [PATCH BUGFIX 1/1] block, bfq: add requeue-request hook

2018-02-06 Thread Oleksandr Natalenko

Hi.

06.02.2018 08:56, Mike Galbraith wrote:

I was doing kbuilds, and it blew up on me twice.  Switching back to cfq
seemed to confirm it was indeed the patch causing trouble, but that's
by no means a certainty.


Just to note, I was using v4.15.1, not the latest git HEAD. Are you able 
to reproduce it on the stable kernel? Also, assuming this issue might be 
unrelated to the BFQ itself, did you manage to reproduce the trouble 
with another blk-mq scheduler (not CFQ, but mq-deadline/Kyber)?


Regards,
  Oleksandr


Re: [PATCH V2 7/8] scsi: hpsa: call hpsa_hba_inquiry() after adding host

2018-02-06 Thread Hannes Reinecke
On 02/05/2018 04:20 PM, Ming Lei wrote:
> So that we can decide the default reply queue by the map created
> during adding host.
> 
> Cc: Hannes Reinecke 
> Cc: Arun Easi 
> Cc: Omar Sandoval ,
> Cc: "Martin K. Petersen" ,
> Cc: James Bottomley ,
> Cc: Christoph Hellwig ,
> Cc: Don Brace 
> Cc: Kashyap Desai 
> Cc: Peter Rivera 
> Cc: Paolo Bonzini 
> Cc: Mike Snitzer 
> Tested-by: Laurence Oberman 
> Signed-off-by: Ming Lei 
> ---
>  drivers/scsi/hpsa.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 287e5eb0723f..443eabf63a9f 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -8691,12 +8691,9 @@ static int hpsa_init_one(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   /* Disable discovery polling.*/
>   h->discovery_polling = 0;
>  
> -
>   /* Turn the interrupts on so we can service requests */
>   h->access.set_intr_mask(h, HPSA_INTR_ON);
>  
> - hpsa_hba_inquiry(h);
> -
>   h->lastlogicals = kzalloc(sizeof(*(h->lastlogicals)), GFP_KERNEL);
>   if (!h->lastlogicals)
>   dev_info(&h->pdev->dev,
> @@ -8707,6 +8704,8 @@ static int hpsa_init_one(struct pci_dev *pdev, const 
> struct pci_device_id *ent)
>   if (rc)
>   goto clean7; /* perf, sg, cmd, irq, shost, pci, lu, aer/h */
>  
> + hpsa_hba_inquiry(h);
> +
>   /* Monitor the controller for firmware lockups */
>   h->heartbeat_sample_interval = HEARTBEAT_SAMPLE_INTERVAL;
>   INIT_DELAYED_WORK(&h->monitor_ctlr_work, hpsa_monitor_ctlr_worker);
> 
Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-06 Thread Ming Lei
Hello chenxiang,

On Tue, Feb 06, 2018 at 10:18:19AM +0800, chenxiang (M) wrote:
> On 2018/2/5 23:20, Ming Lei wrote:
> > This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, meantime maps
> > each reply queue to blk_mq's hw queue, then .queuecommand can always
> > choose the hw queue as the reply queue. And if no any online CPU is
> > mapped to one hw queue, request can't be submitted to this hw queue
> > at all, finally the irq affinity issue is solved.
> > 
> > Cc: Hannes Reinecke 
> > Cc: Arun Easi 
> > Cc: Omar Sandoval ,
> > Cc: "Martin K. Petersen" ,
> > Cc: James Bottomley ,
> > Cc: Christoph Hellwig ,
> > Cc: Don Brace 
> > Cc: Kashyap Desai 
> > Cc: Peter Rivera 
> > Cc: Paolo Bonzini 
> > Cc: Mike Snitzer 
> > Tested-by: Laurence Oberman 
> > Signed-off-by: Ming Lei 
> > ---
> >   drivers/scsi/hpsa.c | 51 
> > ++-
> >   1 file changed, 34 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> > index 443eabf63a9f..e517a4c74a28 100644
> > --- a/drivers/scsi/hpsa.c
> > +++ b/drivers/scsi/hpsa.c
> > @@ -51,6 +51,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include <linux/blk-mq-pci.h>
> >   #include 
> >   #include 
> >   #include "hpsa_cmd.h"
> > @@ -956,6 +957,13 @@ static struct device_attribute *hpsa_shost_attrs[] = {
> >   #define HPSA_NRESERVED_CMDS   (HPSA_CMDS_RESERVED_FOR_DRIVER +\
> >  HPSA_MAX_CONCURRENT_PASSTHRUS)
> > +static int hpsa_map_queues(struct Scsi_Host *shost)
> > +{
> > +struct ctlr_info *h = shost_to_hba(shost);
> > +
> > +return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
> > +}
> > +
> 
> Hi Lei Ming,
> It is okay to use blk_mq_pci_map_queues() to solve the automatic irq
> affinity issue when the first interrupt vector for queues is 0.
> But if the first interrupt vector for queues is not 0, it seems we
> couldn't use blk_mq_pci_map_queues() directly; blk_mq_virtio_map_queues(),
> for example, implements an interface of its own. Is it possible to
> provide a general interface for those situations?

I guess it isn't necessary to do that; as you can see, .map_queues has been
introduced to 'scsi_host_template' for dealing with driver-specific irq
vector differences. For virtio-pci, for example, 'irq_affinity' is needed
for excluding the 'pre_vectors' which serve as the virtio config vector.

But that belongs to another topic, implementing a generic .map_queues
interface, and seems unrelated to this patch, since the usage of
blk_mq_pci_map_queues() in this patch is correct.

Thanks,
Ming
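To make the difference concrete, a sketch of both .map_queues flavors; the
mydrv_* names are placeholders. blk_mq_virtio_map_queues() takes a
first_vec offset so a driver can skip non-queue vectors (virtio_scsi, for
instance, passes 2 to skip its control and event vectors):

    #include <linux/blk-mq-pci.h>
    #include <linux/blk-mq-virtio.h>

    struct mydrv_hba {
            struct blk_mq_tag_set tag_set;
            struct pci_dev *pdev;           /* PCI flavor */
            struct virtio_device *vdev;     /* virtio flavor */
    };

    /* PCI HBA: queue vectors start at 0, map straight through. */
    static int mydrv_pci_map_queues(struct mydrv_hba *h)
    {
            return blk_mq_pci_map_queues(&h->tag_set, h->pdev);
    }

    /* virtio: skip the leading config/control vectors via first_vec. */
    static int mydrv_virtio_map_queues(struct mydrv_hba *h, int first_vec)
    {
            return blk_mq_virtio_map_queues(&h->tag_set, h->vdev, first_vec);
    }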


Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-06 Thread Ming Lei
Hi Kashyap,

On Tue, Feb 06, 2018 at 11:33:50AM +0530, Kashyap Desai wrote:
> > > We still have more than one reply queue ending up completing on one CPU.
> >
> > pci_alloc_irq_vectors(PCI_IRQ_AFFINITY) has to be used, that means
> > smp_affinity_enable has to be set as 1, but seems it is the default
> setting.
> >
> > Please see kernel/irq/affinity.c, especially irq_calc_affinity_vectors()
> which
> > figures out an optimal number of vectors, and the computation is based
> on
> > cpumask_weight(cpu_possible_mask) now. If all offline CPUs are mapped to
> > some of reply queues, these queues won't be active(no request submitted
> to
> > these queues). The mechanism of PCI_IRQ_AFFINITY basically makes sure
> that
> > more than one irq vector won't be handled by one same CPU, and the irq
> > vector spread is done in irq_create_affinity_masks().
> >
> > > Try to reduce MSI-x vector of megaraid_sas or mpt3sas driver via
> > > module parameter to simulate the issue. We need more number of Online
> > > CPU than reply-queue.
> >
> > IMO, you don't need to simulate the issue, pci_alloc_irq_vectors(
> > PCI_IRQ_AFFINITY) will handle that for you. You can dump the returned
> irq
> > vector number, num_possible_cpus()/num_online_cpus() and each irq
> > vector's affinity assignment.
> >
> > > We may see completion redirected to original CPU because of
> > > "QUEUE_FLAG_SAME_FORCE", but ISR of low level driver can keep one CPU
> > > busy in local ISR routine.
> >
> > Could you dump each irq vector's affinity assignment of your megaraid in
> your
> > test?
> 
> To quickly reproduce, I restricted to single MSI-x vector on megaraid_sas
> driver.  System has total 16 online CPUs.

I suggest you don't restrict to a single MSI-x vector; just use the
device's supported number of MSI-x vectors.

> 
> Output of affinity hints.
> kernel version:
> Linux rhel7.3 4.15.0-rc1+ #2 SMP Mon Feb 5 12:13:34 EST 2018 x86_64 x86_64
> x86_64 GNU/Linux
> PCI name is 83:00.0, dump its irq affinity:
> irq 105, cpu list 0-3,8-11

In this case, which CPU is selected for handling the interrupt is
decided by the interrupt controller, and it is easy to cause CPU overload
if the interrupt controller always selects the same CPU to handle the irq.

> 
> Affinity mask is created properly, but only CPU-0 is overloaded with
> interrupt processing.
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 8 9 10 11
> node 0 size: 47861 MB
> node 0 free: 46516 MB
> node 1 cpus: 4 5 6 7 12 13 14 15
> node 1 size: 64491 MB
> node 1 free: 62805 MB
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
> 
> Output of  system activities (sar).  (gnice is 100% and it is consumed in
> megaraid_sas ISR routine.)
> 
> 
> 12:44:40 PM  CPU   %usr  %nice   %sys  %iowait  %steal  %irq   %soft  %guest  %gnice  %idle
> 12:44:41 PM  all   6.03   0.00  29.98     0.00    0.00  0.00    0.00    0.00    0.00  63.99
> 12:44:41 PM    0   0.00   0.00   0.00     0.00    0.00  0.00    0.00    0.00  100.00   0.00
> 
> 
> In my test, rq_affinity was set to 2 (QUEUE_FLAG_SAME_FORCE). I also
> used the "host_tagset" V2 patch set for megaraid_sas.
> 
> Using RFC requested in -
> "https://marc.info/?l=linux-scsi&m=151601833418346&w=2 " lockup is avoided
> (you can notice that gnice is shifted to softirq. Even though it is 100%
> consumed, there is always an exit from the completion loop due to
> irqpoll_weight @irq_poll_init().)
> 
> Average:     CPU   %usr  %nice   %sys  %iowait  %steal  %irq   %soft  %guest  %gnice  %idle
> Average:     all   4.25   0.00  21.61     0.00    0.00  0.00    6.61    0.00    0.00  67.54
> Average:       0   0.00   0.00   0.00     0.00    0.00  0.00  100.00    0.00    0.00   0.00
> 
> 
> Hope this clarifies. We need a different fix to avoid lockups. Can we
> consider using the irq poll interface if #CPU is more than reply
> queues/MSI-x vectors?

Please use the device's supported number of MSI-x vectors, and see if the
issue still occurs. If it does, you can use irq poll too, which isn't
contradictory to the blk-mq approach taken by this patchset.

Hope this clarifies too, :-)


Thanks, 
Ming
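For completeness, the irq poll pattern referred to in this thread follows
the include/linux/irq_poll.h API roughly as sketched below; the mydrv_*
pieces are placeholders for the driver's hardware-specific parts:

    #include <linux/interrupt.h>
    #include <linux/irq_poll.h>

    #define MYDRV_IRQPOLL_WEIGHT    64      /* budget per poll invocation */

    struct mydrv_queue {
            struct irq_poll iopoll;
            /* ... hw completion ring state ... */
    };

    /* hw-specific placeholders */
    static void mydrv_disable_queue_intr(struct mydrv_queue *q);
    static void mydrv_enable_queue_intr(struct mydrv_queue *q);
    static int mydrv_process_completions(struct mydrv_queue *q, int budget);

    /* Hard irq handler: mask the queue irq and defer work to softirq. */
    static irqreturn_t mydrv_isr(int irq, void *data)
    {
            struct mydrv_queue *q = data;

            mydrv_disable_queue_intr(q);
            irq_poll_sched(&q->iopoll);
            return IRQ_HANDLED;
    }

    /* Softirq poll: handle at most 'budget' completions per call, so one
     * busy queue can never monopolize a CPU; re-enable the irq when done. */
    static int mydrv_irqpoll(struct irq_poll *iop, int budget)
    {
            struct mydrv_queue *q =
                    container_of(iop, struct mydrv_queue, iopoll);
            int done = mydrv_process_completions(q, budget);

            if (done < budget) {
                    irq_poll_complete(iop);
                    mydrv_enable_queue_intr(q);
            }
            return done;
    }

    /* At init time:
     *      irq_poll_init(&q->iopoll, MYDRV_IRQPOLL_WEIGHT, mydrv_irqpoll);
     */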