[PATCH 4/5] scsi: introduce force_blk_mq

2018-02-02 Thread Ming Lei
From the SCSI driver's point of view, it is a bit troublesome to support
both blk-mq and non-blk-mq at the same time, especially when drivers need
to support multiple hw queues.

This patch introduces 'force_blk_mq' to scsi_host_template so that drivers
can provide blk-mq-only support and avoid the trouble of supporting both
paths.
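
As a rough user-space sketch of the decision this patch adds to
scsi_host_alloc(): the struct and function names below are simplified,
hypothetical stand-ins rather than the kernel definitions, and only the
boolean expression is taken from the diff further down.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for struct scsi_host_template (hypothetical). */
struct host_template_model {
	unsigned force_blk_mq:1;	/* driver supports blk-mq only */
};

/*
 * Mirrors the assignment added to scsi_host_alloc():
 *   shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 * module_default plays the role of the scsi_use_blk_mq module parameter.
 */
static bool host_uses_blk_mq(bool module_default,
			     const struct host_template_model *hostt)
{
	return module_default || !!hostt->force_blk_mq;
}

/* A blk-mq-only driver gets blk-mq even when the module default is off. */
static void force_blk_mq_demo(void)
{
	const struct host_template_model legacy = { .force_blk_mq = 0 };
	const struct host_template_model mq_only = { .force_blk_mq = 1 };

	assert(!host_uses_blk_mq(false, &legacy));
	assert(host_uses_blk_mq(true, &legacy));
	assert(host_uses_blk_mq(false, &mq_only));
	assert(host_uses_blk_mq(true, &mq_only));
}
```

The only case that changes is the third one: a host whose template sets
force_blk_mq no longer falls back to the legacy path.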

This patch may clean up drivers a lot by providing blk-mq-only support; in
particular, it becomes easier to convert multiple reply queues into blk_mq's
MQ for the following purposes:

1) use blk_mq's multiple hw queues to deal with irq vectors whose affinity
consists entirely of offline CPUs[1]:

[1] https://marc.info/?l=linux-kernel&m=151748144730409&w=2

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible
CPUs") has been merged into v4.16-rc, some irq vectors can easily end up with
only offline CPUs in their affinity mask; this can't be avoided even if the
allocation is improved.

So all these drivers have to avoid asking the HBA to complete requests on a
reply queue that has no online CPUs assigned.

This issue can be solved generically and easily via blk_mq (scsi_mq) multiple
hw queues, by mapping each reply queue to an hctx.

2) some drivers[2] require requests to be completed on the submission CPU to
avoid hard/soft lockups, which is easily done with blk_mq, so there is no need
to reinvent the wheel to solve the problem.

[2] https://marc.info/?t=15160185141&r=1&w=2

Solving the above issues for the non-MQ path may not be easy, or may introduce
unnecessary work, especially since we plan to enable SCSI_MQ soon, as
discussed recently[3]:

[3] https://marc.info/?l=linux-scsi&m=151727684915589&w=2

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 drivers/scsi/hosts.c | 1 +
 include/scsi/scsi_host.h | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index fe3a0da3ec97..c75cebd7911d 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -471,6 +471,7 @@ struct Scsi_Host *scsi_host_alloc(struct scsi_host_template *sht, int privsize)
shost->dma_boundary = 0x;
 
shost->use_blk_mq = scsi_use_blk_mq;
+   shost->use_blk_mq = scsi_use_blk_mq || !!shost->hostt->force_blk_mq;
 
device_initialize(&shost->shost_gendev);
dev_set_name(&shost->shost_gendev, "host%d", shost->host_no);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index a8b7bf879ced..4118760e5c32 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -452,6 +452,9 @@ struct scsi_host_template {
/* True if the controller does not support WRITE SAME */
unsigned no_write_same:1;
 
+   /* tell scsi core we support blk-mq only */
+   unsigned force_blk_mq:1;
+
/*
 * Countdown for host blocking with no commands outstanding.
 */
-- 
2.9.5



[PATCH 2/5] blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS

2018-02-02 Thread Ming Lei
Quite a few HBAs (such as HPSA, megaraid, mpt3sas, ...) support multiple
reply queues, but their tags are often HBA-wide.

These HBAs have switched to using pci_alloc_irq_vectors(PCI_IRQ_AFFINITY)
for automatic affinity assignment.

Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible
CPUs") has been merged into v4.16-rc, some irq vectors can easily end up with
only offline CPUs in their affinity mask; this can't be avoided even if the
allocation is improved.

So all these drivers have to avoid asking the HBA to complete requests on a
reply queue that has no online CPUs assigned, and HPSA has been broken
with v4.15+:

https://marc.info/?l=linux-kernel&m=151748144730409&w=2

This issue can be solved generically and easily via blk_mq (scsi_mq) multiple
hw queues, by mapping each reply queue to an hctx, but one tricky thing is
that the tags are HBA-wide (instead of hw-queue-wide).

This patch is based on the following patch from Hannes:

https://marc.info/?l=linux-block&m=149132580511346&w=2

One big difference from Hannes's patch is that this one only makes the tags
sbitmap and the active_queues data structure HBA-wide; everything else, such
as request, hctx, tags, ..., keeps its NUMA locality.

The following patch adds global tags support to null_blk and provides
performance data; no obvious performance loss is observed when the total
hw queue depth is the same.
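
To illustrate the idea, here is a small user-space model, not the kernel
code: each hw queue keeps its own tags structure, but its bitmap pointer
refers either to per-queue storage or to one set-wide bitmap. All names are
hypothetical simplifications of struct blk_mq_tags, and the plain bit
operations are not atomic the way sbitmap is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_DEPTH 4	/* HBA-wide queue depth in this sketch */

/* Stand-in for struct sbitmap_queue: one bit per tag. */
struct tag_bitmap_model {
	unsigned long bits;
};

/* Stand-in for struct blk_mq_tags once the bitmap is reached via a pointer. */
struct tags_model {
	struct tag_bitmap_model *bitmap_tags;		/* own or shared bitmap */
	struct tag_bitmap_model own_bitmap_tags;	/* per-hctx storage */
	bool global_tags;
};

/* Point the hctx at the set-wide bitmap when one is supplied. */
static void tags_model_init(struct tags_model *tags,
			    struct tag_bitmap_model *global)
{
	tags->own_bitmap_tags.bits = 0;
	tags->global_tags = global != NULL;
	tags->bitmap_tags = global ? global : &tags->own_bitmap_tags;
}

/* Returns a free tag in [0, MODEL_DEPTH), or -1 if none is left. */
static int tags_model_get(struct tags_model *tags)
{
	for (int tag = 0; tag < MODEL_DEPTH; tag++) {
		if (!(tags->bitmap_tags->bits & (1UL << tag))) {
			tags->bitmap_tags->bits |= 1UL << tag;
			return tag;
		}
	}
	return -1;
}

/* Two hw queues draw from one HBA-wide tag space: tags never collide. */
static void global_tags_demo(void)
{
	struct tag_bitmap_model hba_wide = { 0 };
	struct tags_model hctx0, hctx1;

	tags_model_init(&hctx0, &hba_wide);
	tags_model_init(&hctx1, &hba_wide);

	assert(tags_model_get(&hctx0) == 0);
	assert(tags_model_get(&hctx1) == 1);	/* not 0: bitmap is shared */
	assert(tags_model_get(&hctx0) == 2);
	assert(tags_model_get(&hctx1) == 3);
	assert(tags_model_get(&hctx0) == -1);	/* HBA-wide depth exhausted */
}
```

Passing NULL to tags_model_init() gives the per-queue behaviour instead,
where both queues would hand out tag 0 independently.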

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq-tag.c | 23 ++-
 block/blk-mq-tag.h |  5 -
 block/blk-mq.c | 29 -
 block/blk-mq.h |  3 ++-
 include/linux/blk-mq.h |  2 ++
 7 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 0dfafa4b655a..0f0fafe03f5d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -206,6 +206,7 @@ static const char *const hctx_flag_name[] = {
HCTX_FLAG_NAME(SHOULD_MERGE),
HCTX_FLAG_NAME(TAG_SHARED),
HCTX_FLAG_NAME(SG_MERGE),
+   HCTX_FLAG_NAME(GLOBAL_TAGS),
HCTX_FLAG_NAME(BLOCKING),
HCTX_FLAG_NAME(NO_SCHED),
 };
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index 55c0a745b427..191d4bc95f0e 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -495,7 +495,7 @@ static int blk_mq_sched_alloc_tags(struct request_queue *q,
int ret;
 
hctx->sched_tags = blk_mq_alloc_rq_map(set, hctx_idx, q->nr_requests,
-  set->reserved_tags);
+  set->reserved_tags, false);
if (!hctx->sched_tags)
return -ENOMEM;
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 571797dc36cb..66377d09eaeb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -379,9 +379,11 @@ static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
return NULL;
 }
 
-struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
+struct blk_mq_tags *blk_mq_init_tags(struct blk_mq_tag_set *set,
+unsigned int total_tags,
 unsigned int reserved_tags,
-int node, int alloc_policy)
+int node, int alloc_policy,
+bool global_tag)
 {
struct blk_mq_tags *tags;
 
@@ -397,6 +399,14 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
tags->nr_tags = total_tags;
tags->nr_reserved_tags = reserved_tags;
 
+   WARN_ON(global_tag && !set->global_tags);
+   if (global_tag && set->global_tags) {
+   tags->bitmap_tags = set->global_tags->bitmap_tags;
+   tags->breserved_tags = set->global_tags->breserved_tags;
+   tags->active_queues = set->global_tags->active_queues;
+   tags->global_tags = true;
+   return tags;
+   }
tags->bitmap_tags = &tags->__bitmap_tags;
tags->breserved_tags = &tags->__breserved_tags;
tags->active_queues = &tags->__active_queues;
@@ -406,8 +416,10 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
 {
-   sbitmap_queue_free(tags->bitmap_tags);
-   sbitmap_queue_free(tags->breserved_tags);
+   if (!tags->global_tags) {
+   sbitmap_queue_free(tags->bitmap_tags);
+   sbitmap_queue_free(tags->breserved_tags);
+   }
kfree(tags);
 }
 
@@ -441,7 +453,8 @@ int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
if (tdepth > 16 * BLKDEV_MAX_RQ)
return -EINVAL;
 
-   new = blk_mq_alloc

[PATCH 5/5] scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

2018-02-02 Thread Ming Lei
Now that 84676c1f21e8ff5 ("genirq/affinity: assign vectors to all possible
CPUs") has been merged into v4.16-rc, some irq vectors can easily end up with
only offline CPUs in their affinity mask; this can't be avoided even if the
allocation is improved.

For example, on an 8-core VM where CPUs 4-7 are not present/offline and
virtio-scsi has 4 queues, the assigned irq affinity can look like this:

irq 36, cpu list 0-7
irq 37, cpu list 0-7
irq 38, cpu list 0-7
irq 39, cpu list 0-1
irq 40, cpu list 4,6
irq 41, cpu list 2-3
irq 42, cpu list 5,7

An IO hang is then triggered in the non-SCSI_MQ case.

Given that storage IO always follows a C/S (client/server) model, there is no
such issue with SCSI_MQ (blk-mq), because no IO can be submitted to a hw queue
that has no online CPUs.
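
The following user-space sketch replays the affinity table above. It assumes
that irq 39-42 belong to the four request queues and that the legacy path
picks a queue the way the removed virtscsi_pick_vq() does (CPU number wrapped
into the queue count). The names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL	8
#define NR_QUEUES_MODEL	4

/* CPU -> hw queue map taken from the irq 39..42 affinity lists above. */
static const int cpu_to_queue[NR_CPUS_MODEL] = {
	0, 0,	/* cpus 0-1 -> queue 0 (irq 39) */
	2, 2,	/* cpus 2-3 -> queue 2 (irq 41) */
	1, 3,	/* cpu 4 -> queue 1 (irq 40), cpu 5 -> queue 3 (irq 42) */
	1, 3,	/* cpu 6 -> queue 1, cpu 7 -> queue 3 */
};

/* CPUs 4-7 are not present/offline in the example. */
static bool cpu_online_model(int cpu)
{
	return cpu >= 0 && cpu < 4;
}

/* blk-mq style: the queue follows the submitting CPU's mapping. */
static int pick_queue_mq(int submitting_cpu)
{
	return cpu_to_queue[submitting_cpu];
}

/* Legacy style: CPU number wrapped into the number of queues. */
static int pick_queue_legacy(int submitting_cpu)
{
	int queue_num = submitting_cpu;

	while (queue_num >= NR_QUEUES_MODEL)
		queue_num -= NR_QUEUES_MODEL;
	return queue_num;
}

/* True if at least one online CPU is in this queue's affinity mask. */
static bool queue_has_online_cpu(int queue)
{
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu_to_queue[cpu] == queue && cpu_online_model(cpu))
			return true;
	return false;
}

static void affinity_demo(void)
{
	/* Queues 1 and 3 only have offline CPUs in their affinity mask. */
	assert(queue_has_online_cpu(0) && queue_has_online_cpu(2));
	assert(!queue_has_online_cpu(1) && !queue_has_online_cpu(3));

	/* blk-mq: online CPUs never submit to queue 1 or 3. */
	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu_online_model(cpu))
			assert(queue_has_online_cpu(pick_queue_mq(cpu)));

	/* Legacy: online CPU 1 lands on queue 1, whose irq has no online CPU. */
	assert(!queue_has_online_cpu(pick_queue_legacy(1)));
}
```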

Fix this issue by forcing the use of blk_mq.

BTW, I have been using virtio-scsi (scsi_mq) for several years, and it has
been quite stable, so this shouldn't add extra risk.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Paolo Bonzini 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 drivers/scsi/virtio_scsi.c | 59 +++---
 1 file changed, 3 insertions(+), 56 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 7c28e8d4955a..54e3a0f6844c 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -91,9 +91,6 @@ struct virtio_scsi_vq {
 struct virtio_scsi_target_state {
seqcount_t tgt_seq;
 
-   /* Count of outstanding requests. */
-   atomic_t reqs;
-
/* Currently active virtqueue for requests sent to this target. */
struct virtio_scsi_vq *req_vq;
 };
@@ -152,8 +149,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
struct virtio_scsi_cmd *cmd = buf;
struct scsi_cmnd *sc = cmd->sc;
struct virtio_scsi_cmd_resp *resp = &cmd->resp.cmd;
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
dev_dbg(&sc->device->sdev_gendev,
"cmd %p response %u status %#02x sense_len %u\n",
@@ -210,8 +205,6 @@ static void virtscsi_complete_cmd(struct virtio_scsi *vscsi, void *buf)
}
 
sc->scsi_done(sc);
-
-   atomic_dec(&tgt->reqs);
 }
 
 static void virtscsi_vq_done(struct virtio_scsi *vscsi,
@@ -580,10 +573,7 @@ static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
 
-   atomic_inc(&tgt->reqs);
return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
 }
 
@@ -596,55 +586,11 @@ static struct virtio_scsi_vq *virtscsi_pick_vq_mq(struct virtio_scsi *vscsi,
return &vscsi->req_vqs[hwq];
 }
 
-static struct virtio_scsi_vq *virtscsi_pick_vq(struct virtio_scsi *vscsi,
-  struct virtio_scsi_target_state 
*tgt)
-{
-   struct virtio_scsi_vq *vq;
-   unsigned long flags;
-   u32 queue_num;
-
-   local_irq_save(flags);
-   if (atomic_inc_return(&tgt->reqs) > 1) {
-   unsigned long seq;
-
-   do {
-   seq = read_seqcount_begin(&tgt->tgt_seq);
-   vq = tgt->req_vq;
-   } while (read_seqcount_retry(&tgt->tgt_seq, seq));
-   } else {
-   /* no writes can be concurrent because of atomic_t */
-   write_seqcount_begin(&tgt->tgt_seq);
-
-   /* keep previous req_vq if a reader just arrived */
-   if (unlikely(atomic_read(&tgt->reqs) > 1)) {
-   vq = tgt->req_vq;
-   goto unlock;
-   }
-
-   queue_num = smp_processor_id();
-   while (unlikely(queue_num >= vscsi->num_queues))
-   queue_num -= vscsi->num_queues;
-   tgt->req_vq = vq = &vscsi->req_vqs[queue_num];
- unlock:
-   write_seqcount_end(&tgt->tgt_seq);
-   }
-   local_irq_restore(flags);
-
-   return vq;
-}
-
 static int virtscsi_queuecommand_multi(struct Scsi_Host *sh,
   struct scsi_cmnd *sc)
 {
struct virtio_scsi *vscsi = shost_priv(sh);
-   struct virtio_scsi_target_state *tgt =
-   scsi_target(sc->device)->hostdata;
-   struct virtio_scsi_vq *req_vq;
-
-   if (shost_use_blk_mq(sh))
-   req_vq = virtscsi_pick_vq_mq(vscsi, sc);
-   else
-   req_vq = virtscsi_pick_vq(vscsi, tgt);
+   struct virtio_scsi_vq *req_vq = virtscsi_pick_vq_mq(vscsi, sc);
 
return virtscsi_queuecommand(vscsi, r

[PATCH 3/5] block: null_blk: introduce module parameter of 'g_global_tags'

2018-02-02 Thread Ming Lei
This patch introduces the 'g_global_tags' parameter so that we can
test this feature easily with null_blk.

No obvious performance drop is seen with global_tags when the total hw
depth is kept the same:

1) no 'global_tags', each hw queue depth is 1, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=0 submit_queues=4 hw_queue_depth=1

2) 'global_tags', global hw queue depth is 4, and 4 hw queues
modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 global_tags=1 submit_queues=4 hw_queue_depth=4

3) fio test run in the above two settings:
   fio --bs=4k --size=512G --rw=randread --norandommap --direct=1 \
       --ioengine=libaio --iodepth=4 --runtime=$RUNTIME --group_reporting=1 \
       --name=nullb0 --filename=/dev/nullb0 --name=nullb1 --filename=/dev/nullb1 \
       --name=nullb2 --filename=/dev/nullb2 --name=nullb3 --filename=/dev/nullb3

1M IOPS can be reached in both of the above tests, which were run in one VM.

Cc: Hannes Reinecke 
Cc: Arun Easi 
Cc: Omar Sandoval
Cc: "Martin K. Petersen"
Cc: James Bottomley
Cc: Christoph Hellwig
Cc: Don Brace 
Cc: Kashyap Desai 
Cc: Peter Rivera 
Cc: Laurence Oberman 
Cc: Mike Snitzer 
Signed-off-by: Ming Lei 
---
 drivers/block/null_blk.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 287a09611c0f..ad0834efad42 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -163,6 +163,10 @@ static int g_submit_queues = 1;
 module_param_named(submit_queues, g_submit_queues, int, S_IRUGO);
 MODULE_PARM_DESC(submit_queues, "Number of submission queues");
 
+static int g_global_tags = 0;
+module_param_named(global_tags, g_global_tags, int, S_IRUGO);
+MODULE_PARM_DESC(global_tags, "All submission queues share one tags");
+
 static int g_home_node = NUMA_NO_NODE;
 module_param_named(home_node, g_home_node, int, S_IRUGO);
 MODULE_PARM_DESC(home_node, "Home node for the device");
@@ -1622,6 +1626,8 @@ static int null_init_tag_set(struct nullb *nullb, struct blk_mq_tag_set *set)
set->flags = BLK_MQ_F_SHOULD_MERGE;
if (g_no_sched)
set->flags |= BLK_MQ_F_NO_SCHED;
+   if (g_global_tags)
+   set->flags |= BLK_MQ_F_GLOBAL_TAGS;
set->driver_data = NULL;
 
if ((nullb && nullb->dev->blocking) || g_blocking)
-- 
2.9.5



[PATCH 1/5] blk-mq: tags: define several fields of tags as pointer

2018-02-02 Thread Ming Lei
This patch changes tags->breserved_tags, tags->bitmap_tags and
tags->active_queues into pointers, in preparation for supporting global tags.

No functional change.

Cc: Laurence Oberman 
Cc: Mike Snitzer 
Cc: Christoph Hellwig 
Signed-off-by: Ming Lei 
---
 block/bfq-iosched.c|  4 ++--
 block/blk-mq-debugfs.c | 10 +-
 block/blk-mq-tag.c | 48 ++--
 block/blk-mq-tag.h | 10 +++---
 block/blk-mq.c |  2 +-
 block/kyber-iosched.c  |  2 +-
 6 files changed, 42 insertions(+), 34 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 47e6ec7427c4..1e1211814a57 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -534,9 +534,9 @@ static void bfq_limit_depth(unsigned int op, struct blk_mq_alloc_data *data)
WARN_ON_ONCE(1);
return;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
} else
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
 
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 21cbc1f071c6..0dfafa4b655a 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -433,14 +433,14 @@ static void blk_mq_debugfs_tags_show(struct seq_file *m,
seq_printf(m, "nr_tags=%u\n", tags->nr_tags);
seq_printf(m, "nr_reserved_tags=%u\n", tags->nr_reserved_tags);
seq_printf(m, "active_queues=%d\n",
-  atomic_read(&tags->active_queues));
+  atomic_read(tags->active_queues));
 
seq_puts(m, "\nbitmap_tags:\n");
-   sbitmap_queue_show(&tags->bitmap_tags, m);
+   sbitmap_queue_show(tags->bitmap_tags, m);
 
if (tags->nr_reserved_tags) {
seq_puts(m, "\nbreserved_tags:\n");
-   sbitmap_queue_show(&tags->breserved_tags, m);
+   sbitmap_queue_show(tags->breserved_tags, m);
}
 }
 
@@ -471,7 +471,7 @@ static int hctx_tags_bitmap_show(void *data, struct seq_file *m)
if (res)
goto out;
if (hctx->tags)
-   sbitmap_bitmap_show(&hctx->tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(&hctx->tags->bitmap_tags->sb, m);
mutex_unlock(&q->sysfs_lock);
 
 out:
@@ -505,7 +505,7 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
if (res)
goto out;
if (hctx->sched_tags)
-   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags.sb, m);
+   sbitmap_bitmap_show(&hctx->sched_tags->bitmap_tags->sb, m);
mutex_unlock(&q->sysfs_lock);
 
 out:
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 336dde07b230..571797dc36cb 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -18,7 +18,7 @@ bool blk_mq_has_free_tags(struct blk_mq_tags *tags)
if (!tags)
return true;
 
-   return sbitmap_any_bit_clear(&tags->bitmap_tags.sb);
+   return sbitmap_any_bit_clear(&tags->bitmap_tags->sb);
 }
 
 /*
@@ -28,7 +28,7 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
 {
if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state) &&
!test_and_set_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-   atomic_inc(&hctx->tags->active_queues);
+   atomic_inc(hctx->tags->active_queues);
 
return true;
 }
@@ -38,9 +38,9 @@ bool __blk_mq_tag_busy(struct blk_mq_hw_ctx *hctx)
  */
 void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool include_reserve)
 {
-   sbitmap_queue_wake_all(&tags->bitmap_tags);
+   sbitmap_queue_wake_all(tags->bitmap_tags);
if (include_reserve)
-   sbitmap_queue_wake_all(&tags->breserved_tags);
+   sbitmap_queue_wake_all(tags->breserved_tags);
 }
 
 /*
@@ -54,7 +54,7 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
if (!test_and_clear_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
return;
 
-   atomic_dec(&tags->active_queues);
+   atomic_dec(tags->active_queues);
 
blk_mq_tag_wakeup_all(tags, false);
 }
@@ -79,7 +79,7 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
if (bt->sb.depth == 1)
return true;
 
-   users = atomic_read(&hctx->tags->active_queues);
+   users = atomic_read(hctx->tags->active_queues);
if (!users)
return true;
 
@@ -117,10 +117,10 @@ unsigned int blk_mq_get_tag(struct blk_mq_alloc_data *data)
WARN_ON_ONCE(1);
return BLK_MQ_TAG_FAIL;
}
-   bt = &tags->breserved_tags;
+   bt = tags->breserved_tags;
tag_offset = 0;
} else {
-   bt = &tags->bitmap_tags;
+   bt = tags->bitmap_tags;
tag_offset = tags->nr_reserved_tags;
}
 
@@ -1

[PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce force_blk_mq

2018-02-02 Thread Ming Lei
Hi All,

This patchset adds support for global tags, work originally started by Hannes:

https://marc.info/?l=linux-block&m=149132580511346&w=2

It also introduces 'force_blk_mq' in 'struct scsi_host_template', so that
drivers can avoid supporting two IO paths (legacy and blk-mq), especially
since recent discussion mentioned that SCSI_MQ will be enabled by default soon.

https://marc.info/?l=linux-scsi&m=151727684915589&w=2

With the above two changes, it should be easier to convert SCSI drivers'
reply queues into blk-mq's hctx, and then the automatic irq affinity issue can
be solved easily; please see the detailed description in the commit logs.

Drivers may also need to complete requests on the submission CPU
to avoid hard/soft lockups, which can be done easily with blk_mq
too.

https://marc.info/?t=15160185141&r=1&w=2

The final patch uses the newly introduced 'force_blk_mq' to fix virtio_scsi
so that the IO hang in the legacy IO path is avoided. This issue is
fairly generic: at least HPSA and virtio-scsi are found broken with v4.15+.

Thanks
Ming

Ming Lei (5):
  blk-mq: tags: define several fields of tags as pointer
  blk-mq: introduce BLK_MQ_F_GLOBAL_TAGS
  block: null_blk: introduce module parameter of 'g_global_tags'
  scsi: introduce force_blk_mq
  scsi: virtio_scsi: fix IO hang by irq vector automatic affinity

 block/bfq-iosched.c|  4 +--
 block/blk-mq-debugfs.c | 11 
 block/blk-mq-sched.c   |  2 +-
 block/blk-mq-tag.c | 67 +-
 block/blk-mq-tag.h | 15 ---
 block/blk-mq.c | 31 -
 block/blk-mq.h |  3 ++-
 block/kyber-iosched.c  |  2 +-
 drivers/block/null_blk.c   |  6 +
 drivers/scsi/hosts.c   |  1 +
 drivers/scsi/virtio_scsi.c | 59 +++-
 include/linux/blk-mq.h |  2 ++
 include/scsi/scsi_host.h   |  3 +++
 13 files changed, 105 insertions(+), 101 deletions(-)

-- 
2.9.5



Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-02 Thread Joseph Qi
Hi Bart,

On 18/2/3 00:21, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 09:02 +0800, Joseph Qi wrote:
>> We triggered this race when using single queue. I'm not sure if it
>> exists in multi-queue.
> 
> Regarding the races between modifying the queue_lock pointer and the code that
> uses that pointer, I think the following construct in blk_cleanup_queue() is
> sufficient to avoid races between the queue_lock pointer assignment and the 
> code
> that executes concurrently with blk_cleanup_queue():
> 
>   spin_lock_irq(lock);
>   if (q->queue_lock != &q->__queue_lock)
>   q->queue_lock = &q->__queue_lock;
>   spin_unlock_irq(lock);
> 
IMO, the race still exists.

blk_cleanup_queue                         blkcg_print_blkgs
  spin_lock_irq(lock) (1)                   spin_lock_irq(blkg->q->queue_lock) (2,5)
    q->queue_lock = &q->__queue_lock (3)
  spin_unlock_irq(lock) (4)
                                            spin_unlock_irq(blkg->q->queue_lock) (6)

(1) take driver lock;
(2) busy loop for driver lock;
(3) override driver lock with internal lock;
(4) unlock driver lock; 
(5) can take driver lock now;
(6) but unlock internal lock.

If we first copy blkg->q->queue_lock into a local variable, as
blk_cleanup_queue does, that does fix the lock/unlock mismatch. But since
blk_cleanup_queue has by then switched the queue lock to the internal lock,
I'm afraid we can no longer use the driver lock in blkcg_print_blkgs.
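
A deterministic, single-threaded replay of steps (1)-(6) may make the
mismatch easier to see. This is only a model: the "locks" are counters, the
names are hypothetical, and the interleaving is fixed by hand rather than
produced by real concurrency.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock: counts lock/unlock calls instead of blocking. */
struct lock_model {
	int held;
};

struct queue_model {
	struct lock_model *queue_lock;		/* driver lock until cleanup */
	struct lock_model internal_lock;	/* stands in for __queue_lock */
};

static void model_lock(struct lock_model *l)   { l->held++; }
static void model_unlock(struct lock_model *l) { l->held--; }

/*
 * Replays steps (1)-(6) in order. With cache_pointer == false the reader
 * re-reads q->queue_lock at unlock time, as in the scenario above.
 */
static void replay_race(struct queue_model *q, bool cache_pointer)
{
	struct lock_model *cleanup_lock = q->queue_lock;
	struct lock_model *reader_lock = q->queue_lock;	/* read at (2) */

	model_lock(cleanup_lock);		/* (1) */
						/* (2) reader spins */
	q->queue_lock = &q->internal_lock;	/* (3) */
	model_unlock(cleanup_lock);		/* (4) */
	model_lock(reader_lock);		/* (5) */
	model_unlock(cache_pointer ? reader_lock : q->queue_lock); /* (6) */
}

static void queue_lock_race_demo(void)
{
	struct lock_model driver = { 0 };
	struct queue_model q = { .queue_lock = &driver };

	replay_race(&q, false);
	assert(driver.held == 1);		/* driver lock never released */
	assert(q.internal_lock.held == -1);	/* unlock of a lock not held */

	driver.held = 0;
	q.internal_lock.held = 0;
	q.queue_lock = &driver;
	replay_race(&q, true);
	assert(driver.held == 0 && q.internal_lock.held == 0);
}
```

Caching the pointer balances lock and unlock, but the reader then still
holds the driver lock after the queue has switched to the internal one,
which is the remaining concern.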

Thanks,
Joseph

> In other words, I think that this patch series should be sufficient to address
> all races between .queue_lock assignments and the code that uses that pointer.
> 
> Thanks,
> 
> Bart.
> 


Re: [LSF/MM TOPIC] get_user_pages() and filesystems

2018-02-02 Thread Liu Bo
Hi Jan,

On Thu, Jan 25, 2018 at 12:57:27PM +0100, Jan Kara wrote:
> Hello,
> 
> this is about a problem I have identified last month and for which I still
> don't have good solution. Some discussion of the problem happened here [1]
> where also technical details are posted but culprit of the problem is
> relatively simple: Lots of places in kernel (fs code, writeback logic,
> stable-pages framework for DIF/DIX) assume that file pages in page cache
> can be modified either via write(2), truncate(2), fallocate(2) or similar
> code paths explicitely manipulating with file space or via a writeable
> mapping into page tables. In particular we assume that if we block all the
> above paths by taking proper locks, block page faults, and unmap (/ map
> read-only) the page, it cannot be modified. But this assumption is violated
> by get_user_pages() users (such as direct IO or RDMA drivers - and we've
> got reports from such users of weird things happening).
> 
> The problem with GUP users is that they acquire page reference (at that
> point page is writeably mapped into page tables) and some time in future
> (which can be quite far in case of RDMA) page contents gets modified and
> page marked dirty.

I have a question here: when you say 'page contents gets modified', do
you mean that GUP users modify the page content?

I have another story about GUP users that use direct IO: qemu sometimes
doesn't work well with btrfs when checksums are enabled, and reports
checksum failures when the guest OS doesn't use stable pages. In that case
it is not the GUP user but the original file/mapping that may be changing
the page content in flight.

So it looks like we have problems either way.

Thanks,

-liubo

> 
> The question is how to properly solve this problem. One obvious way is to
> indicate page has a GUP reference and block its unmapping / remapping RO
> until that is dropped. But this has a technical problem (how to find space
> in struct page for such tracking) and a design problem (blocking e.g.
> writeback for hours because some RDMA app used a file mapping as a buffer
> simply is not acceptable). There are also various modifications to this
> solution like refuse to use file pages for RDMA
> (get_user_pages_longterm()) and block waiting for users like direct IO, or
> require that RDMA users provide a way to revoke access from GUPed pages.
> 
> Another obvious solution is to try to remove the assumption from all those
> places - i.e., use bounce buffers for DIF/DIX, make sure filesystems are
> prepared for dirty pages suddenly appearing in files and handle that as
> good as they can. They really need to sensibly handle only a case when
> underlying storage is already allocated / reserved, in all other cases I
> believe they are fine in just discarding the data. This would be very
> tedious but I believe it could be done. But overall long-term maintenance
> burden of this solution just doesn't seem worth it to me.
> 
> Another possible solution might be that GUP users (at least the long term
> ones) won't get references directly to page cache pages but only to some
> bounce pages (something like cow on private file mappings) and data would
> be just copied to the page cache pages at set_page_dirty_lock() time (we
> would probably have to move these users to a completely new API to keep our
> sanity). This would have userspace visible impacts (data won't be visible
> in the file until GUP user is done with it) but maybe it would be
> acceptable. Also how to keep association to the original pagecache page
> (and how it should be handled when underlying file just goes away) is
> unclear.
> 
> So clever ideas are needed and possibly some input from FS / MM / RDMA
> folks about what might be acceptable.
> 
>   Honza
> 
> [1] https://www.spinics.net/lists/linux-xfs/msg14468.html
> 
> -- 
> Jan Kara 
> SUSE Labs, CR



Re: [LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-02-02 Thread Mike Snitzer
On Fri, Feb 02 2018 at 11:08am -0500,
Mike Snitzer  wrote:
> 
> But if the bioset enhancements are implemented properly then the kernels
> N biosets shouldn't need to be in doubt.  They'll all just adapt to have
> N backing mempools (N being for N conflicting front_pad requirements).

This should've read:

"But if the bioset enhancements are implemented properly then the ability
of the kernel's N biosets to provide adequate front_pad shouldn't need to be
in doubt.  They'll all just adapt to have M backing mempools (for M
conflicting front_pad requirements)."

What this implies is that there would need to be a way for the bioset
code to maintain a global graph of all biosets in the system.  And when
a device comes along with a unique bioset front_pad requirement that
isn't already met by an existing mempool, the device's driver (DM in
my case) would call into a bioset interface that would add a new backing
mempool, one that accounts for the front_pad increase, to each bioset in
the system.
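
For readers less familiar with front_pad, here is a minimal user-space
sketch of the layout it refers to: extra bytes are allocated directly in
front of each bio so a driver can keep per-bio data there. All names are
hypothetical, and the sketch models neither mempools nor the proposed
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for struct bio and struct bio_set (hypothetical). */
struct bio_model { int sector; };
struct bioset_model { size_t front_pad; };

/* Driver-private data that lives in the front pad. */
struct per_bio_data_model { int target_id; };

/* Allocate front_pad bytes of driver space directly before the bio. */
static struct bio_model *bio_model_alloc(const struct bioset_model *bs)
{
	char *p = calloc(1, bs->front_pad + sizeof(struct bio_model));

	return p ? (struct bio_model *)(p + bs->front_pad) : NULL;
}

static void bio_model_free(const struct bioset_model *bs,
			   struct bio_model *bio)
{
	free((char *)bio - bs->front_pad);
}

/* The driver finds its data by stepping back from the bio pointer. */
static struct per_bio_data_model *bio_model_pdu(struct bio_model *bio)
{
	return (struct per_bio_data_model *)
		((char *)bio - sizeof(struct per_bio_data_model));
}

/* A bioset can only serve a device whose requirement fits its front_pad. */
static int bioset_model_fits(const struct bioset_model *bs, size_t needed)
{
	return bs->front_pad >= needed;
}

static void front_pad_demo(void)
{
	struct bioset_model bs = {
		.front_pad = sizeof(struct per_bio_data_model),
	};
	struct bio_model *bio = bio_model_alloc(&bs);

	assert(bio);
	bio_model_pdu(bio)->target_id = 7;
	bio->sector = 42;
	assert(bio_model_pdu(bio)->target_id == 7);
	assert(bioset_model_fits(&bs, sizeof(struct per_bio_data_model)));
	assert(!bioset_model_fits(&bs, 64));	/* larger requirement: no fit */
	bio_model_free(&bs, bio);
}
```

A bioset whose front_pad is too small for a device is the case that would
require the additional backing mempool described above.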

Not liking that (DM) device creation would potentially spawn a new
mempool within each existing bioset.  It could/would easily result in
many of those mempools going completely unused.

In addition: how would a bio_alloc_bioset() call _know_ the bio was for
use on a specific block device?  The entire beauty of the existing
bio_set code, especially for upper layers like filesystems, is it _is_
device agnostic.

So all this could be the worst idea ever.. not sure.  I've deferred
judging it one way or the other because the details are shaky at best.

And I still need to look closer at all the existing code.

Mike


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> o Simple configuration of IBNBD:
>- Server side is completely passive: volumes do not need to be
>  explicitly exported.

That sounds like a security hole? I think the ability to configure whether or
not an initiator is allowed to log in is essential and also which volumes an
initiator has access to.

>- Only IB port GID and device path needed on client side to map
>  a block device.

I think IP addressing is preferred over GID addressing in RoCE networks.
Additionally, have you noticed that GUID configuration support has been added
to the upstream ib_srpt driver? Using GIDs has a very important disadvantage,
namely that at least in IB networks the prefix will change if the subnet
manager is reconfigured. Additionally, in IB networks it may happen that the
target driver is loaded and configured before the GID has been assigned to
all RDMA ports.

Thanks,

Bart.

Re: [PATCH 05/24] ibtrs: client: main functionality

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> +static inline struct ibtrs_tag *
> +__ibtrs_get_tag(struct ibtrs_clt *clt, enum ibtrs_clt_con_type con_type)
> +{
> + size_t max_depth = clt->queue_depth;
> + struct ibtrs_tag *tag;
> + int cpu, bit;
> +
> + cpu = get_cpu();
> + do {
> + bit = find_first_zero_bit(clt->tags_map, max_depth);
> + if (unlikely(bit >= max_depth)) {
> + put_cpu();
> + return NULL;
> + }
> +
> + } while (unlikely(test_and_set_bit_lock(bit, clt->tags_map)));
> + put_cpu();
> +
> + tag = GET_TAG(clt, bit);
> + WARN_ON(tag->mem_id != bit);
> + tag->cpu_id = cpu;
> + tag->con_type = con_type;
> +
> + return tag;
> +}
> +
> +static inline void __ibtrs_put_tag(struct ibtrs_clt *clt,
> +struct ibtrs_tag *tag)
> +{
> + clear_bit_unlock(tag->mem_id, clt->tags_map);
> +}
> +
> +struct ibtrs_tag *ibtrs_clt_get_tag(struct ibtrs_clt *clt,
> + enum ibtrs_clt_con_type con_type,
> + int can_wait)
> +{
> + struct ibtrs_tag *tag;
> + DEFINE_WAIT(wait);
> +
> + tag = __ibtrs_get_tag(clt, con_type);
> + if (likely(tag) || !can_wait)
> + return tag;
> +
> + do {
> + prepare_to_wait(&clt->tags_wait, &wait, TASK_UNINTERRUPTIBLE);
> + tag = __ibtrs_get_tag(clt, con_type);
> + if (likely(tag))
> + break;
> +
> + io_schedule();
> + } while (1);
> +
> + finish_wait(&clt->tags_wait, &wait);
> +
> + return tag;
> +}
> +EXPORT_SYMBOL(ibtrs_clt_get_tag);
> +
> +void ibtrs_clt_put_tag(struct ibtrs_clt *clt, struct ibtrs_tag *tag)
> +{
> + if (WARN_ON(!test_bit(tag->mem_id, clt->tags_map)))
> + return;
> +
> + __ibtrs_put_tag(clt, tag);
> +
> + /*
> +  * Putting a tag is a barrier, so we will observe
> +  * new entry in the wait list, no worries.
> +  */
> + if (waitqueue_active(&clt->tags_wait))
> + wake_up(&clt->tags_wait);
> +}
> +EXPORT_SYMBOL(ibtrs_clt_put_tag);

Do these functions have any advantage over the code in lib/sbitmap.c? If not,
please call the sbitmap functions instead of adding an additional tag allocator.

Thanks,

Bart.

Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Doug Ledford
On Fri, 2018-02-02 at 16:07 +, Bart Van Assche wrote:
> On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> > Since the first version the following was changed:
> > 
> >- Load-balancing and IO fail-over using multipath features were added.
> >- Major parts of the code were rewritten, simplified and overall code
> >  size was reduced by a quarter.
> 
> That is interesting to know, but what happened to the feedback that Sagi and
> I provided on v1? Has that feedback been addressed? See also
> https://www.spinics.net/lists/linux-rdma/msg47819.html and
> https://www.spinics.net/lists/linux-rdma/msg47879.html.
> 
> Regarding multipath support: there are already two multipath implementations
> upstream (dm-mpath and the multipath implementation in the NVMe initiator).
> I'm not sure we want a third multipath implementation in the Linux kernel.

There's more than that.  There was also md-multipath, and smc-r includes
another version of multipath, plus I assume we support mptcp as well.

But, to be fair, the different multipaths in this list serve different
purposes and I'm not sure they could all be generalized out and served
by a single multipath code.  Although, fortunately, md-multipath is
deprecated, so no need to worry about it, and it is only dm-multipath
and nvme multipath that deal directly with block devices and assume
block semantics.  If I read the cover letter right (and I haven't dug
into the code to confirm this), the ibtrs multipath has much more in
common with smc-r multipath, where it doesn't really assume a block
layer device sits on top of it, it's more of a pure network multipath,
which the implementation of smc-r is and mptcp would be too.  I would
like to see a core RDMA multipath implementation soon that would
abstract out some of these multipath tasks, at least across RDMA links,
and that didn't have the current limitations (smc-r only supports RoCE
links, and it sounds like ibtrs only supports IB like links, but maybe
I'm wrong there, I haven't looked at the patches yet).

-- 
Doug Ledford 
GPG KeyID: B826A3330E572FDD
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 09:02 +0800, Joseph Qi wrote:
> We triggered this race when using single queue. I'm not sure if it
> exists in multi-queue.

Regarding the races between modifying the queue_lock pointer and the code that
uses that pointer, I think the following construct in blk_cleanup_queue() is
sufficient to avoid races between the queue_lock pointer assignment and the code
that executes concurrently with blk_cleanup_queue():

	spin_lock_irq(lock);
	if (q->queue_lock != &q->__queue_lock)
		q->queue_lock = &q->__queue_lock;
	spin_unlock_irq(lock);

In other words, I think that this patch series should be sufficient to address
all races between .queue_lock assignments and the code that uses that pointer.
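
The construct can be tried out in userspace. The sketch below is only an illustration of the pattern, assuming a pthread mutex in place of the spinlock; struct queue, queue_init and queue_cleanup are hypothetical names, and internal_lock stands in for __queue_lock. It shows the single point made above, that the pointer is switched while the lock it used to point to is held, and it does not settle whether that is enough for every concurrent user.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct queue {
	pthread_mutex_t *queue_lock;	/* lock currently in use */
	pthread_mutex_t internal_lock;	/* stands in for __queue_lock */
};

/* A NULL driver_lock means the queue uses its own internal lock. */
static void queue_init(struct queue *q, pthread_mutex_t *driver_lock)
{
	assert(q != NULL);
	pthread_mutex_init(&q->internal_lock, NULL);
	q->queue_lock = driver_lock ? driver_lock : &q->internal_lock;
}

/* Returns true if the pointer had to be switched to the internal lock. */
static bool queue_cleanup(struct queue *q)
{
	pthread_mutex_t *lock = q->queue_lock;	/* lock in use on entry */
	bool switched = false;

	pthread_mutex_lock(lock);
	if (q->queue_lock != &q->internal_lock) {
		q->queue_lock = &q->internal_lock;
		switched = true;
	}
	pthread_mutex_unlock(lock);	/* unlock what was locked, not the new pointer */
	return switched;
}
```

Note that the unlock goes through the saved local pointer, exactly as in the quoted kernel snippet, because q->queue_lock no longer names the lock that was taken.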

Thanks,

Bart.

Re: [LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-02-02 Thread Mike Snitzer
On Fri, Feb 02 2018 at  1:19am -0500,
NeilBrown  wrote:

> On Mon, Jan 29 2018, Mike Snitzer wrote:
> 
> > I'd like to enable bio-based DM to _not_ need to clone bios.  But to do
> > so each bio-based DM target's required per-bio-data would need to be
> > provided by upper layer biosets (as opposed to the bioset DM currently
> > creates).
> >
> > So my thinking is that all system-level biosets (e.g. fs_bio_set,
> > blkdev_dio_pool) would redirect to a device specific variant bioset IFF
> > the underlying device advertises the need for a specific per-bio-data
> > payload to be provided.
> >
> > I know this _could_ become a rathole but I'd like to avoid reverting DM
> > back to the days of having to worry about managing mempools for the
> > purpose of per-io allocations.  I've grown spoiled by the performance
> > and elegance that comes with having the bio and per-bio-data allocated
> > from the same bioset.
> >
> > Thoughts?
> 
> md/raid0 remaps each bio and passes it directly down to one of several
> devices.
> I think your scheme would mean that it would need to clone each bio to
> make sure it is from the correctly sized pool.

Not sure why the md/raid0 device would need to do anything (not unless
it wanted to take advantage of this).

The model, in my head, would be that this is _not_ intended for an
arbitrary stacked MD or DM device.  So the underlying device(s) for
a DM or MD device, that wants to leverage upper layer bio_set provided
front_pad, would be a real device (e.g. NVMe, SCSI, whatever).

But if a mix of underlying drivers were used (each with unique
per-bio-data, aka front_pad, requirements) then the blk_stack_limits()
interface _could_ build that information up.  And while that pretty much
gets us all we'd need to support md/raid0 on arbitrary stacked DM or MD
volumes, that isn't what I'm interested in.  But supporting an arbitrary
stacked device could be done as follow-on work.

I'm specifically looking to optimize DM's new DM_TYPE_NVME_BIO_BASED
device (see commits 22c11858e, 978e51ba3, cd02538445, etc) so that bios
are just remapped to the underlying NVMe device without cloning.  DM
core now takes care to validate that a DM_TYPE_NVME_BIO_BASED DM device
(e.g. a DM multipath device -- via "queue_mode nvme") is _only_ stacked
directly on top of native NVMe devices.

> I suspect it could be made to work though.
> 
> 1/ have a way for the driver receiving a bio to discover how much
>frontpad was allocated.

Yes

> 2/ require drivers to accept bios with any size of frontpad, but a
>fast-path is taken if it is already big enough.

Yes and no, the driver really should be able to trust that the block
layer is sending it bios with adequate front_pad (if the device
registered its front_pad requirements).

That said, making it optional (via "hint") is likely safer from the
standpoint that it is less cut-throat given we'd be depending on the
various biosets to respect our wishes.  I just worry that having the
code for falling back to cloning, if the front_pad isn't adequate, will
defeat much of the benefit of optimizing what is intended to be a faster
fast path.

But if the bioset enhancements are implemented properly then the kernel's
N biosets shouldn't need to be in doubt.  They'll all just adapt to have
N backing mempools (N being for N conflicting front_pad requirements).

> 3/ allow a block device to advertise its preferred frontpad.

s/preferred/required/ in my mental model.

> 4/ make sure your config-change-notification mechanism can communicate
>changes to this number.

You're referring to the notifier chain idea about restacking
queue_limits?  Yes, that'd be needed once that exists.

> 5/ gather statistics on what percentage of bios have a too-small
>frontpad.
>
> Then start modifying places that allocate bios to use the hint,
> and when benchmarks show the percentage is high - use it to encourage
> other people to allocate better bios.

This shouldn't be needed if we were to go the route where bio_sets
dynamically select the appropriate mempool based on the front_pad
requirements advertised by the underlying device ("3/" above).
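
To make the front_pad idea concrete, here is a small userspace sketch, assuming plain calloc() in place of a mempool. All names (fake_bio, fake_bio_alloc, fake_bio_data, fake_bio_fits) are hypothetical, and recording the pad size inside the bio is an assumption made only to illustrate points 1/ and 2/ above; it is not the kernel's bio_set implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bio; only what the sketch needs. */
struct fake_bio {
	size_t front_pad;	/* bytes reserved directly in front of this struct */
};

/* Round the pad up so the struct placed behind it stays suitably aligned. */
static size_t pad_round_up(size_t n)
{
	size_t a = _Alignof(max_align_t);

	return (n + a - 1) / a * a;
}

/* One allocation holds [per-bio-data][struct fake_bio]. */
static struct fake_bio *fake_bio_alloc(size_t front_pad)
{
	size_t pad = pad_round_up(front_pad);
	char *mem = calloc(1, pad + sizeof(struct fake_bio));
	struct fake_bio *bio;

	if (!mem)
		return NULL;
	bio = (void *)(mem + pad);
	bio->front_pad = pad;
	return bio;
}

/* Start of the per-bio-data area that sits in front of the bio. */
static void *fake_bio_data(struct fake_bio *bio)
{
	assert(bio != NULL);
	return (char *)bio - bio->front_pad;
}

/* Fast-path check: can a driver that needs `need` bytes use this bio as is? */
static bool fake_bio_fits(const struct fake_bio *bio, size_t need)
{
	return bio->front_pad >= need;
}

static void fake_bio_free(struct fake_bio *bio)
{
	if (bio)
		free((char *)bio - bio->front_pad);
}
```

A driver that finds fake_bio_fits() false would have to fall back to cloning, which is the slow path the discussion above wants to avoid.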

Thanks,
Mike


Re: [PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:08 +0100, Roman Pen wrote:
> Since the first version the following was changed:
> 
>- Load-balancing and IO fail-over using multipath features were added.
>- Major parts of the code were rewritten, simplified and overall code
>  size was reduced by a quarter.

That is interesting to know, but what happened to the feedback that Sagi and
I provided on v1? Has that feedback been addressed? See also
https://www.spinics.net/lists/linux-rdma/msg47819.html and
https://www.spinics.net/lists/linux-rdma/msg47879.html.

Regarding multipath support: there are already two multipath implementations
upstream (dm-mpath and the multipath implementation in the NVMe initiator).
I'm not sure we want a third multipath implementation in the Linux kernel.

Thanks,

Bart.

Re: [PATCH 23/24] ibnbd: a bit of documentation

2018-02-02 Thread Bart Van Assche
On Fri, 2018-02-02 at 15:09 +0100, Roman Pen wrote:
> +Entries under /sys/kernel/ibnbd_client/
> +===
> [ ... ]

You will need Greg KH's permission to add new entries directly under
/sys/kernel. Since I think that it is unlikely that he will give that
permission, have you considered adding the new entries under
/sys/class/block for the client and /sys/kernel/configfs/ibnbd for the
target, similar to what the NVMeOF drivers do today?

Bart.

Re: [PATCH 16/24] ibnbd: client: main functionality

2018-02-02 Thread Jens Axboe
On 2/2/18 7:08 AM, Roman Pen wrote:
> This is main functionality of ibnbd-client module, which provides
> interface to map remote device as local block device /dev/ibnbd
> and feeds IBTRS with IO requests.

Kill the legacy IO path for this, the driver should only support
blk-mq. Hence kill off your BLK_RQ part, that will eliminate
the dual path you have too.

-- 
Jens Axboe



Re: [PATCH] block: skd: fix incorrect linux/slab_def.h inclusion

2018-02-02 Thread Jens Axboe
On 2/2/18 8:03 AM, Arnd Bergmann wrote:
> skd includes slab_def.h to get access to the slab cache object size.
> However, including this header breaks when we use SLUB or SLOB instead of
> the SLAB allocator, since the structure layout is completely different,
> as shown by this warning when we build this driver in one of the invalid
> configurations with link-time optimizations enabled:
> 
> include/linux/slab.h:715:0: error: type of 'kmem_cache_size' does not match original declaration [-Werror=lto-type-mismatch]
>  unsigned int kmem_cache_size(struct kmem_cache *s);
> 
> mm/slab_common.c:77:14: note: 'kmem_cache_size' was previously declared here
>  unsigned int kmem_cache_size(struct kmem_cache *s)
>   ^
> mm/slab_common.c:77:14: note: code may be misoptimized unless -fno-strict-aliasing is used
> include/linux/slab.h:147:0: error: type of 'kmem_cache_destroy' does not match original declaration [-Werror=lto-type-mismatch]
>  void kmem_cache_destroy(struct kmem_cache *);
> 
> mm/slab_common.c:858:6: note: 'kmem_cache_destroy' was previously declared here
>  void kmem_cache_destroy(struct kmem_cache *s)
>   ^
> mm/slab_common.c:858:6: note: code may be misoptimized unless -fno-strict-aliasing is used
> include/linux/slab.h:140:0: error: type of 'kmem_cache_create' does not match original declaration [-Werror=lto-type-mismatch]
>  struct kmem_cache *kmem_cache_create(const char *name, size_t size,
> 
> mm/slab_common.c:534:1: note: 'kmem_cache_create' was previously declared here
>  kmem_cache_create(const char *name, size_t size, size_t align,
>  ^
> 
> This removes the header inclusion and instead uses the kmem_cache_size()
> interface to get the size in a reliable way.

Thanks Arnd, applied.

-- 
Jens Axboe



[PATCH] block: skd: fix incorrect linux/slab_def.h inclusion

2018-02-02 Thread Arnd Bergmann
skd includes slab_def.h to get access to the slab cache object size.
However, including this header breaks when we use SLUB or SLOB instead of
the SLAB allocator, since the structure layout is completely different,
as shown by this warning when we build this driver in one of the invalid
configurations with link-time optimizations enabled:

include/linux/slab.h:715:0: error: type of 'kmem_cache_size' does not match original declaration [-Werror=lto-type-mismatch]
 unsigned int kmem_cache_size(struct kmem_cache *s);

mm/slab_common.c:77:14: note: 'kmem_cache_size' was previously declared here
 unsigned int kmem_cache_size(struct kmem_cache *s)
  ^
mm/slab_common.c:77:14: note: code may be misoptimized unless -fno-strict-aliasing is used
include/linux/slab.h:147:0: error: type of 'kmem_cache_destroy' does not match original declaration [-Werror=lto-type-mismatch]
 void kmem_cache_destroy(struct kmem_cache *);

mm/slab_common.c:858:6: note: 'kmem_cache_destroy' was previously declared here
 void kmem_cache_destroy(struct kmem_cache *s)
  ^
mm/slab_common.c:858:6: note: code may be misoptimized unless -fno-strict-aliasing is used
include/linux/slab.h:140:0: error: type of 'kmem_cache_create' does not match original declaration [-Werror=lto-type-mismatch]
 struct kmem_cache *kmem_cache_create(const char *name, size_t size,

mm/slab_common.c:534:1: note: 'kmem_cache_create' was previously declared here
 kmem_cache_create(const char *name, size_t size, size_t align,
 ^

This removes the header inclusion and instead uses the kmem_cache_size()
interface to get the size in a reliable way.

Signed-off-by: Arnd Bergmann 
---
 drivers/block/skd_main.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index de0d08133c7e..e41935ab41ef 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -32,7 +32,6 @@
 #include 
 #include 
 #include 
-#include <linux/slab_def.h>
 #include 
 #include 
 #include 
@@ -2603,7 +2602,8 @@ static void *skd_alloc_dma(struct skd_device *skdev, struct kmem_cache *s,
 	buf = kmem_cache_alloc(s, gfp);
 	if (!buf)
 		return NULL;
-	*dma_handle = dma_map_single(dev, buf, s->size, dir);
+	*dma_handle = dma_map_single(dev, buf,
+				     kmem_cache_size(s), dir);
 	if (dma_mapping_error(dev, *dma_handle)) {
 		kmem_cache_free(s, buf);
 		buf = NULL;
@@ -2618,7 +2618,8 @@ static void skd_free_dma(struct skd_device *skdev, struct kmem_cache *s,
 	if (!vaddr)
 		return;
 
-	dma_unmap_single(&skdev->pdev->dev, dma_handle, s->size, dir);
+	dma_unmap_single(&skdev->pdev->dev, dma_handle,
+			 kmem_cache_size(s), dir);
 	kmem_cache_free(s, vaddr);
 }
 
-- 
2.9.0
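
The idea behind the patch above, reading a size through an accessor rather than through a struct member whose layout depends on the allocator, can be sketched in userspace as follows. This is an illustration only: obj_cache and its functions are hypothetical names, and the private layout shown here is invented for the example and is not that of struct kmem_cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Public interface: callers only ever see an opaque pointer. */
struct obj_cache;
struct obj_cache *obj_cache_create(size_t object_size);
size_t obj_cache_size(const struct obj_cache *c);
void obj_cache_destroy(struct obj_cache *c);

/*
 * Private layout. Another build of the allocator could reorder or rename
 * these fields, which is why callers must not read them directly.
 */
struct obj_cache {
	size_t alloc_size;	/* padded size, an implementation detail */
	size_t object_size;	/* size the caller asked for */
};

struct obj_cache *obj_cache_create(size_t object_size)
{
	struct obj_cache *c = malloc(sizeof(*c));

	if (!c)
		return NULL;
	c->object_size = object_size;
	c->alloc_size = (object_size + 7) & ~(size_t)7;	/* round up to 8 */
	return c;
}

/* The one supported way to learn the object size. */
size_t obj_cache_size(const struct obj_cache *c)
{
	assert(c != NULL);
	return c->object_size;
}

void obj_cache_destroy(struct obj_cache *c)
{
	free(c);
}
```

A caller written against obj_cache_size() keeps working when the private struct changes, whereas one that reads a member directly is tied to a single layout.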



Re: [PATCH v2 2/2] block: Fix a race between the throttling code and request queue initialization

2018-02-02 Thread Jens Axboe
On 2/1/18 6:02 PM, Joseph Qi wrote:
> Hi Bart,
> 
> On 18/2/2 00:16, Bart Van Assche wrote:
>> On Thu, 2018-02-01 at 09:53 +0800, Joseph Qi wrote:
>>> I'm afraid the risk may also exist in blk_cleanup_queue, which will
>>> set queue_lock to the default internal lock.
>>>
>>> 	spin_lock_irq(lock);
>>> 	if (q->queue_lock != &q->__queue_lock)
>>> 		q->queue_lock = &q->__queue_lock;
>>> 	spin_unlock_irq(lock);
>>>
>>> I'm thinking of reading blkg->q->queue_lock into a local variable first,
>>> but this would mean still using the driver lock even after queue_lock has
>>> already been set to the default internal lock.
>>
>> Hello Joseph,
>>
>> I think the race between the queue_lock assignment in blk_cleanup_queue()
>> and the use of that pointer by cgroup attributes could be solved by
>> removing the visibility of these attributes from blk_cleanup_queue() instead
>> of __blk_release_queue(). However, last time I proposed to move code from
>> __blk_release_queue() into blk_cleanup_queue() I received the feedback
>> from some kernel developers that they didn't like this.
>>
>> Is the block driver that triggered the race on the q->queue_lock assignment
>> using legacy (single queue) or multiqueue (blk-mq) mode? If that driver is
>> using legacy mode, are you aware that there are plans to remove legacy mode
>> from the upstream kernel? And if your driver is using multiqueue mode, how
>> about the following change instead of the two patches in this patch series:
>>
> We triggered this race when using single queue. I'm not sure if it
> exists in multi-queue.
> Do you mean upstream won't fix bugs any more in single queue?

No, we'll still fix bugs in the legacy path, we just won't introduce
any new features or accept any new drivers that use that path.
Ultimately that path will go away once there are no more users,
but until then it is maintained.

-- 
Jens Axboe



Re: [PATCH v2] buffer: Avoid setting buffer bits that are already set

2018-02-02 Thread Jens Axboe
On 2/2/18 1:07 AM, kemi wrote:
> Hi, Jens
>   Could you help to merge this patch to your tree? Thanks

Yes, I'll queue it up, thanks.

-- 
Jens Axboe



[PATCH 15/24] ibnbd: client: private header with client structs and functions

2018-02-02 Thread Roman Pen
This header describes the main structs and functions used by the
ibnbd-client module, mainly for managing IBNBD sessions and mapped block
devices, and for creating and destroying sysfs entries.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-clt.h | 193 
 1 file changed, 193 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-clt.h b/drivers/block/ibnbd/ibnbd-clt.h
new file mode 100644
index ..b3d72b2962dd
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt.h
@@ -0,0 +1,193 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBNBD_CLT_H
+#define IBNBD_CLT_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+#define BMAX_SEGMENTS 31
+#define RECONNECT_DELAY 30
+#define MAX_RECONNECTS -1
+
+enum ibnbd_clt_dev_state {
+   DEV_STATE_INIT,
+   DEV_STATE_MAPPED,
+   DEV_STATE_MAPPED_DISCONNECTED,
+   DEV_STATE_UNMAPPED,
+};
+
+enum ibnbd_queue_mode {
+   BLK_MQ,
+   BLK_RQ
+};
+
+struct ibnbd_iu_comp {
+   wait_queue_head_t wait;
+   int errno;
+};
+
+struct ibnbd_iu {
+	union {
+		struct request *rq; /* for block io */
+		void *buf; /* for user messages */
+	};
+	struct ibtrs_tag	*tag;
+	union {
+		/* use to send msg associated with a dev */
+		struct ibnbd_clt_dev *dev;
+		/* use to send msg associated with a sess */
+		struct ibnbd_clt_session *sess;
+	};
+	blk_status_t		status;
+	struct scatterlist	sglist[BMAX_SEGMENTS];
+	struct work_struct	work;
+	int			errno;
+	struct ibnbd_iu_comp	*comp;
+};
+
+struct ibnbd_cpu_qlist {
+	struct list_head	requeue_list;
+	spinlock_t		requeue_lock;
+	unsigned int		cpu;
+};
+
+struct ibnbd_clt_session {
+	struct list_head	list;
+	struct ibtrs_clt	*ibtrs;
+	wait_queue_head_t	ibtrs_waitq;
+	bool			ibtrs_ready;
+	struct ibnbd_cpu_qlist	__percpu
+				*cpu_queues;
+	DECLARE_BITMAP(cpu_queues_bm, NR_CPUS);
+	int __percpu		*cpu_rr; /* per-cpu var for CPU round-robin */
+	atomic_t		busy;
+	int			queue_depth;
+	u32			max_io_size;
+	struct blk_mq_tag_set	tag_set;
+	struct mutex		lock; /* protects state and devs_list */
+	struct list_head	devs_list; /* list of struct ibnbd_clt_dev */
+	refcount_t		refcount;
+	char			sessname[NAME_MAX];
+	u8			ver; /* protocol version */
+};
+
+/**
+ * Submission queues.
+ */
+struct ibnbd_queue {
+	struct list_head	requeue_list;
+	unsigned long		in_list;
+	struct ibnbd_clt_dev	*dev;
+	struct blk_mq_hw_ctx	*hctx;
+};
+
+struct ibnbd_clt_dev {
+	struct ibnbd_clt_session	*sess;
+	struct request_queue	*queue;
+	struct ibnbd_queue	*hw_queues;
+	struct delayed_work	rq_delay_work;
+	u32			device_id;
+	/* local Idr index - used to track minor number allocations. */
+	u32			clt_device_id;
+	struct mutex		lock;
+	enum ibnbd_clt_dev_state	dev_state;
+	enum ibnbd_queue_mode	queue_mode;
+	enum ibnbd_io_mode	io_mode; /* user requested */
+	enum ibnbd_io_mode	remote_io_mode; /* server really used */
+	char			pathname[NAME_MAX];
+	enum ibnbd_access_mode	access_mode;
+	bool			read_only;
+	bool			rotational;
+	u32			max_hw_sectors;
+   u32   

[PATCH 17/24] ibnbd: client: sysfs interface functions

2018-02-02 Thread Roman Pen
This is the sysfs interface to IBNBD block devices on client side:

  /sys/kernel/ibnbd_client/
    |- map_device
    |  *** maps remote device
    |
    |- devices/
       *** all mapped devices

  /sys/block/ibnbd/ibnbd_client/
    |- unmap_device
    |  *** unmaps device
    |
    |- state
    |  *** device state
    |
    |- session
    |  *** session name
    |
    |- mapping_path
       *** path of the dev that was mapped on server

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-clt-sysfs.c | 723 ++
 1 file changed, 723 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-clt-sysfs.c 
b/drivers/block/ibnbd/ibnbd-clt-sysfs.c
new file mode 100644
index ..2770b5c81c23
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt-sysfs.c
@@ -0,0 +1,723 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ibnbd-clt.h"
+
+static struct kobject *ibnbd_kobject;
+static struct kobject *ibnbd_devices_kobject;
+
+enum {
+	IBNBD_OPT_ERR		= 0,
+	IBNBD_OPT_PATH		= 1 << 0,
+	IBNBD_OPT_DEV_PATH	= 1 << 1,
+	IBNBD_OPT_ACCESS_MODE	= 1 << 3,
+	IBNBD_OPT_INPUT_MODE	= 1 << 4,
+	IBNBD_OPT_IO_MODE	= 1 << 5,
+	IBNBD_OPT_SESSNAME	= 1 << 6,
+};
+
+static unsigned int ibnbd_opt_mandatory[] = {
+   IBNBD_OPT_PATH,
+   IBNBD_OPT_DEV_PATH,
+   IBNBD_OPT_SESSNAME,
+};
+
+static const match_table_t ibnbd_opt_tokens = {
+	{ IBNBD_OPT_PATH,		"path=%s"		},
+	{ IBNBD_OPT_DEV_PATH,		"device_path=%s"	},
+	{ IBNBD_OPT_ACCESS_MODE,	"access_mode=%s"	},
+	{ IBNBD_OPT_INPUT_MODE,		"input_mode=%s"		},
+	{ IBNBD_OPT_IO_MODE,		"io_mode=%s"		},
+	{ IBNBD_OPT_SESSNAME,		"sessname=%s"		},
+	{ IBNBD_OPT_ERR,		NULL			},
+};
+
+/* remove new line from string */
+static void strip(char *s)
+{
+	char *p = s;
+
+	while (*s != '\0') {
+		if (*s != '\n')
+			*p++ = *s++;
+		else
+			++s;
+	}
+	*p = '\0';
+}
+
+static int ibnbd_clt_parse_map_options(const char *buf,
+				       char *sessname,
+				       struct ibtrs_addr *paths,
+				       size_t *path_cnt,
+				       size_t max_path_cnt,
+				       char *pathname,
+				       enum ibnbd_access_mode *access_mode,
+				       enum ibnbd_queue_mode *queue_mode,
+				       enum ibnbd_io_mode *io_mode)
+{
+	char *options, *sep_opt;
+	char *p;
+	substring_t args[MAX_OPT_ARGS];
+	int opt_mask = 0;
+	int token;
+	int ret = -EINVAL;
+	int i;
+	int p_cnt = 0;
+
+	options = kstrdup(buf, GFP_KERNEL);
+	if (!options)
+		return -ENOMEM;
+
+	options = strstrip(options);
+	strip(options);
+	sep_opt = options;
+	while ((p = strsep(&sep_opt, " ")) != NULL) {
+		if (!*p)
+			continue;
+
+		token = match_token(p, ibnbd_opt_tokens, args);
+		opt_mask |= token;
+
+		switch (token) {
+		case IBNBD_OPT_SESSNAME:
+			p = match_strdup(args);
+			if (!p) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (strlen(p) > NAME_MAX) {
+   pr_e

[PATCH 24/24] MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

2018-02-02 Thread Roman Pen
Signed-off-by: Roman Pen 
Cc: Danil Kipnis 
Cc: Jack Wang 
---
 MAINTAINERS | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 18994806e441..fad9c2529f8a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6714,6 +6714,20 @@ IBM ServeRAID RAID DRIVER
 S: Orphan
 F: drivers/scsi/ips.*
 
+IBNBD BLOCK DRIVERS
+M: IBNBD/IBTRS Storage Team 
+L: linux-block@vger.kernel.org
+S: Maintained
+T: git git://github.com/profitbricks/ibnbd.git
+F: drivers/block/ibnbd/
+
+IBTRS TRANSPORT DRIVERS
+M: IBNBD/IBTRS Storage Team 
+L: linux-r...@vger.kernel.org
+S: Maintained
+T: git git://github.com/profitbricks/ibnbd.git
+F: drivers/infiniband/ulp/ibtrs/
+
 ICH LPC AND GPIO DRIVER
 M: Peter Tyser 
 S: Maintained
-- 
2.13.1



[PATCH 14/24] ibnbd: private headers with IBNBD protocol structs and helpers

2018-02-02 Thread Roman Pen
These are common private headers with IBNBD protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-log.h   |  71 
 drivers/block/ibnbd/ibnbd-proto.h | 360 ++
 2 files changed, 431 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-log.h b/drivers/block/ibnbd/ibnbd-log.h
new file mode 100644
index ..489343a61171
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-log.h
@@ -0,0 +1,71 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBNBD_LOG_H
+#define IBNBD_LOG_H
+
+#include "ibnbd-clt.h"
+#include "ibnbd-srv.h"
+
+#define ibnbd_diskname(dev) ({ \
+	struct gendisk *gd = ((struct ibnbd_clt_dev *)dev)->gd; \
+	gd ? gd->disk_name : ""; \
+})
+
+void unknown_type(void);
+
+#define ibnbd_log(fn, dev, fmt, ...) ({ \
+	__builtin_choose_expr( \
+		__builtin_types_compatible_p( \
+			typeof(dev), struct ibnbd_clt_dev *), \
+		fn("<%s@%s> %s: " fmt, (dev)->pathname, \
+		   (dev)->sess->sessname, ibnbd_diskname(dev), \
+		   ##__VA_ARGS__), \
+		__builtin_choose_expr( \
+			__builtin_types_compatible_p(typeof(dev), \
+				struct ibnbd_srv_sess_dev *), \
+			fn("<%s@%s>: " fmt, (dev)->pathname, \
+			   (dev)->sess->sessname, ##__VA_ARGS__), \
+			unknown_type())); \
+})
+
+#define ibnbd_err(dev, fmt, ...)   \
+   ibnbd_log(pr_err, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_err_rl(dev, fmt, ...)\
+   ibnbd_log(pr_err_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_wrn(dev, fmt, ...)   \
+   ibnbd_log(pr_warn, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_wrn_rl(dev, fmt, ...) \
+   ibnbd_log(pr_warn_ratelimited, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_info(dev, fmt, ...) \
+   ibnbd_log(pr_info, dev, fmt, ##__VA_ARGS__)
+#define ibnbd_info_rl(dev, fmt, ...) \
+   ibnbd_log(pr_info_ratelimited, dev, fmt, ##__VA_ARGS__)
+
+#endif /* IBNBD_LOG_H */
diff --git a/drivers/block/ibnbd/ibnbd-proto.h 
b/drivers/block/ibnbd/ibnbd-proto.h
new file mode 100644
index ..c809705a2322
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-proto.h
@@ -0,0 +1,360 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBNBD_PROTO_H
+#define IBNBD_PROTO_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 

[PATCH 18/24] ibnbd: server: private header with server structs and functions

2018-02-02 Thread Roman Pen
This header describes the main structs and functions used by the
ibnbd-server module, namely structs for managing sessions from different
clients and mapped (opened) devices.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv.h | 100 
 1 file changed, 100 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv.h b/drivers/block/ibnbd/ibnbd-srv.h
new file mode 100644
index ..191a1650bc1d
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.h
@@ -0,0 +1,100 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBNBD_SRV_H
+#define IBNBD_SRV_H
+
+#include 
+#include 
+#include 
+
+#include "ibtrs.h"
+#include "ibnbd-proto.h"
+#include "ibnbd-log.h"
+
+struct ibnbd_srv_session {
+	/* Entry inside global sess_list */
+	struct list_head	list;
+	struct ibtrs_srv	*ibtrs;
+	char			sessname[NAME_MAX];
+	int			queue_depth;
+	struct bio_set		*sess_bio_set;
+
+	rwlock_t		index_lock cacheline_aligned;
+	struct idr		index_idr;
+	/* List of struct ibnbd_srv_sess_dev */
+	struct list_head	sess_dev_list;
+	struct mutex		lock;
+	u8			ver;
+};
+
+struct ibnbd_srv_dev {
+	/* Entry inside global dev_list */
+	struct list_head	list;
+	struct kobject		dev_kobj;
+	struct kobject		dev_sessions_kobj;
+	struct kref		kref;
+	char			id[NAME_MAX];
+	/* List of ibnbd_srv_sess_dev structs */
+	struct list_head	sess_dev_list;
+	struct mutex		lock;
+	int			open_write_cnt;
+	enum ibnbd_io_mode	mode;
+};
+
+/* Structure which binds N devices and N sessions */
+struct ibnbd_srv_sess_dev {
+	/* Entry inside ibnbd_srv_dev struct */
+	struct list_head	dev_list;
+	/* Entry inside ibnbd_srv_session struct */
+	struct list_head	sess_list;
+	struct ibnbd_dev	*ibnbd_dev;
+	struct ibnbd_srv_session	*sess;
+	struct ibnbd_srv_dev	*dev;
+	struct kobject		kobj;
+	struct completion	*sysfs_release_compl;
+	u32			device_id;
+	fmode_t			open_flags;
+	struct kref		kref;
+	struct completion	*destroy_comp;
+	char			pathname[NAME_MAX];
+};
+
+/* ibnbd-srv-sysfs.c */
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+  struct block_device *bdev,
+  const char *dir_name);
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev);
+int ibnbd_srv_create_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+void ibnbd_srv_destroy_dev_session_sysfs(struct ibnbd_srv_sess_dev *sess_dev);
+int ibnbd_srv_create_sysfs_files(void);
+void ibnbd_srv_destroy_sysfs_files(void);
+
+#endif /* IBNBD_SRV_H */
-- 
2.13.1



[PATCH 19/24] ibnbd: server: main functionality

2018-02-02 Thread Roman Pen
This is the main functionality of the ibnbd-server module, which handles
IBTRS events and IBNBD protocol requests, like map (open) or unmap (close)
device.  The server side is also responsible for processing incoming IBTRS
IO requests and forwarding them to locally mapped devices.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv.c | 901 
 1 file changed, 901 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv.c b/drivers/block/ibnbd/ibnbd-srv.c
new file mode 100644
index ..a32d22ab67a3
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv.c
@@ -0,0 +1,901 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibnbd-srv.h"
+#include "ibnbd-srv-dev.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_DESCRIPTION("InfiniBand Network Block Device Server");
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_DEV_SEARCH_PATH "/"
+
+static char dev_search_path[PATH_MAX] = DEFAULT_DEV_SEARCH_PATH;
+
+static int dev_search_path_set(const char *val, const struct kernel_param *kp)
+{
+   char *dup;
+
+   if (strlen(val) >= sizeof(dev_search_path))
+       return -EINVAL;
+
+   dup = kstrdup(val, GFP_KERNEL);
+   if (!dup)
+       return -ENOMEM;
+
+   /* Strip one trailing newline; an empty string has nothing to strip */
+   if (dup[0] && dup[strlen(dup) - 1] == '\n')
+       dup[strlen(dup) - 1] = '\0';
+
+   strlcpy(dev_search_path, dup, sizeof(dev_search_path));
+
+   kfree(dup);
+   pr_info("dev_search_path changed to '%s'\n", dev_search_path);
+
+   return 0;
+}
+
+static struct kparam_string dev_search_path_kparam_str = {
+   .maxlen = sizeof(dev_search_path),
+   .string = dev_search_path
+};
+
+static const struct kernel_param_ops dev_search_path_ops = {
+   .set = dev_search_path_set,
+   .get = param_get_string,
+};
+
+module_param_cb(dev_search_path, &dev_search_path_ops,
+   &dev_search_path_kparam_str, 0444);
+MODULE_PARM_DESC(dev_search_path, "Sets the device_search_path."
+" When a device is mapped this path is prepended to the"
+" device_path from the map_device operation."
+" (default: " DEFAULT_DEV_SEARCH_PATH ")");
+
+static int def_io_mode = IBNBD_BLOCKIO;
+module_param(def_io_mode, int, 0444);
+MODULE_PARM_DESC(def_io_mode, "By default, export devices in"
+" blockio(" __stringify(_IBNBD_BLOCKIO) ") or"
+" fileio(" __stringify(_IBNBD_FILEIO) ") mode."
+" (default: " __stringify(_IBNBD_BLOCKIO) " (blockio))");
+
+static DEFINE_MUTEX(sess_lock);
+static DEFINE_SPINLOCK(dev_lock);
+
+static LIST_HEAD(sess_list);
+static LIST_HEAD(dev_list);
+
+struct ibnbd_io_private {
+   struct ibtrs_srv_op *id;
+   struct ibnbd_srv_sess_dev   *sess_dev;
+};
+
+static void ibnbd_sess_dev_release(struct kref *kref)
+{
+   struct ibnbd_srv_sess_dev *sess_dev;
+
+   sess_dev = container_of(kref, struct ibnbd_srv_sess_dev, kref);
+   complete(sess_dev->destroy_comp);
+}
+
+static inline void ibnbd_put_sess_dev(struct ibnbd_srv_sess_dev *sess_dev)
+{
+   kref_put(&sess_dev->kref, ibnbd_sess_dev_release);
+}
+
+static void ibnbd_endio(void *priv, int error)
+{
+   struct ibnbd_io_private *ibnbd_priv = priv;
+   struct ibnbd_srv_sess_dev *sess_dev = ibnbd_priv->sess_dev;
+
+   ibnbd_put_sess_dev(sess_dev);
+
+   ibtrs_srv_resp_rdma(ibnbd_priv->id, error);
+
+   kfree(priv);
+}
+
+static struct ibnbd_srv_sess_dev *
+ibnbd_get_sess_dev(int dev_id, struct ibnbd_srv_session *srv_sess)
+{
+   struct ibnbd_srv_sess_dev *sess_dev;
+   int ret = 0;
+
+   read_lock(&srv_sess->index_lock);
+   sess_dev = idr_find(&srv_sess->index_idr, dev_id);
+   if (likely(sess_dev))
+   ret = kref_get_unless_zero(&sess_dev->kref);
+   read_unlock(&srv_se

[PATCH 16/24] ibnbd: client: main functionality

2018-02-02 Thread Roman Pen
This is the main functionality of the ibnbd-client module, which provides
an interface to map a remote device as a local block device /dev/ibnbd
and feeds IBTRS with IO requests.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-clt.c | 1959 +++
 1 file changed, 1959 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-clt.c b/drivers/block/ibnbd/ibnbd-clt.c
new file mode 100644
index ..b5bc71414778
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-clt.c
@@ -0,0 +1,1959 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ibnbd-clt.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_DESCRIPTION("InfiniBand Network Block Device Client");
+MODULE_VERSION(IBNBD_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static int ibnbd_client_major;
+static DEFINE_IDA(index_ida);
+static DEFINE_MUTEX(ida_lock);
+static DEFINE_MUTEX(sess_lock);
+static LIST_HEAD(sess_list);
+
+static bool softirq_enable;
+module_param(softirq_enable, bool, 0444);
+MODULE_PARM_DESC(softirq_enable, "finish request in softirq_fn."
+" (default: 0)");
+/*
+ * Maximum number of partitions an instance can have.
+ * 6 bits = 64 minors = 63 partitions (one minor is used for the device itself)
+ */
+#define IBNBD_PART_BITS 6
+#define KERNEL_SECTOR_SIZE  512
+
+static inline bool ibnbd_clt_get_sess(struct ibnbd_clt_session *sess)
+{
+   return refcount_inc_not_zero(&sess->refcount);
+}
+
+static void free_sess(struct ibnbd_clt_session *sess);
+
+static void ibnbd_clt_put_sess(struct ibnbd_clt_session *sess)
+{
+   might_sleep();
+
+   if (refcount_dec_and_test(&sess->refcount))
+   free_sess(sess);
+}
+
+static inline bool ibnbd_clt_dev_is_mapped(struct ibnbd_clt_dev *dev)
+{
+   return dev->dev_state == DEV_STATE_MAPPED;
+}
+
+static void ibnbd_clt_put_dev(struct ibnbd_clt_dev *dev)
+{
+   might_sleep();
+
+   if (refcount_dec_and_test(&dev->refcount)) {
+   mutex_lock(&ida_lock);
+   ida_simple_remove(&index_ida, dev->clt_device_id);
+   mutex_unlock(&ida_lock);
+   kfree(dev->hw_queues);
+   ibnbd_clt_put_sess(dev->sess);
+   kfree(dev);
+   }
+}
+
+static inline bool ibnbd_clt_get_dev(struct ibnbd_clt_dev *dev)
+{
+   return refcount_inc_not_zero(&dev->refcount);
+}
+
+static void ibnbd_clt_set_dev_attr(struct ibnbd_clt_dev *dev,
+  const struct ibnbd_msg_open_rsp *rsp)
+{
+   struct ibnbd_clt_session *sess = dev->sess;
+
+   dev->device_id  = le32_to_cpu(rsp->device_id);
+   dev->nsectors   = le64_to_cpu(rsp->nsectors);
+   dev->logical_block_size = le16_to_cpu(rsp->logical_block_size);
+   dev->physical_block_size = le16_to_cpu(rsp->physical_block_size);
+   dev->max_write_same_sectors = le32_to_cpu(rsp->max_write_same_sectors);
+   dev->max_discard_sectors = le32_to_cpu(rsp->max_discard_sectors);
+   dev->discard_granularity = le32_to_cpu(rsp->discard_granularity);
+   dev->discard_alignment  = le32_to_cpu(rsp->discard_alignment);
+   dev->secure_discard = le16_to_cpu(rsp->secure_discard);
+   dev->rotational = rsp->rotational;
+   dev->remote_io_mode = rsp->io_mode;
+
+   dev->max_hw_sectors = sess->max_io_size / dev->logical_block_size;
+   dev->max_segments = BMAX_SEGMENTS;
+
+   if (dev->remote_io_mode == IBNBD_BLOCKIO) {
+   dev->max_hw_sectors = min_t(u32, dev->max_hw_sectors,
+   le32_to_cpu(rsp->max_hw_sectors));
+   dev->max_segments = min_t(u16, dev->max_segments,
+ le16_to_cpu(rsp->max_seg

[PATCH 20/24] ibnbd: server: functionality for IO submission to file or block dev

2018-02-02 Thread Roman Pen
This provides helper functions for IO submission to file or block dev.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv-dev.c | 410 
 drivers/block/ibnbd/ibnbd-srv-dev.h | 149 +
 2 files changed, 559 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv-dev.c b/drivers/block/ibnbd/ibnbd-srv-dev.c
new file mode 100644
index ..a5894849b9d5
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-dev.c
@@ -0,0 +1,410 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibnbd-srv-dev.h"
+#include "ibnbd-log.h"
+
+#define IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS 0
+
+struct ibnbd_dev_file_io_work {
+   struct ibnbd_dev *dev;
+   void *priv;
+
+   sector_t sector;
+   void *data;
+   size_t  len;
+   size_t  bi_size;
+   enum ibnbd_io_flags flags;
+
+   struct work_struct  work;
+};
+
+struct ibnbd_dev_blk_io {
+   struct ibnbd_dev *dev;
+   void *priv;
+};
+
+static struct workqueue_struct *fileio_wq;
+
+int ibnbd_dev_init(void)
+{
+   fileio_wq = alloc_workqueue("%s", WQ_UNBOUND,
+   IBNBD_DEV_MAX_FILEIO_ACTIVE_WORKERS,
+   "ibnbd_server_fileio_wq");
+   if (!fileio_wq)
+   return -ENOMEM;
+
+   return 0;
+}
+
+void ibnbd_dev_destroy(void)
+{
+   destroy_workqueue(fileio_wq);
+}
+
+static inline struct block_device *ibnbd_dev_open_bdev(const char *path,
+  fmode_t flags)
+{
+   return blkdev_get_by_path(path, flags, THIS_MODULE);
+}
+
+static int ibnbd_dev_blk_open(struct ibnbd_dev *dev, const char *path,
+ fmode_t flags)
+{
+   dev->bdev = ibnbd_dev_open_bdev(path, flags);
+   return PTR_ERR_OR_ZERO(dev->bdev);
+}
+
+static int ibnbd_dev_vfs_open(struct ibnbd_dev *dev, const char *path,
+ fmode_t flags)
+{
+   int oflags = O_DSYNC; /* enable write-through */
+
+   if (flags & FMODE_WRITE)
+   oflags |= O_RDWR;
+   else if (flags & FMODE_READ)
+   oflags |= O_RDONLY;
+   else
+   return -EINVAL;
+
+   dev->file = filp_open(path, oflags, 0);
+   return PTR_ERR_OR_ZERO(dev->file);
+}
+
+struct ibnbd_dev *ibnbd_dev_open(const char *path, fmode_t flags,
+enum ibnbd_io_mode mode, struct bio_set *bs,
+ibnbd_dev_io_fn io_cb)
+{
+   struct ibnbd_dev *dev;
+   int ret;
+
+   dev = kzalloc(sizeof(*dev), GFP_KERNEL);
+   if (!dev)
+   return ERR_PTR(-ENOMEM);
+
+   if (mode == IBNBD_BLOCKIO) {
+   dev->blk_open_flags = flags;
+   ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+   if (ret)
+   goto err;
+   } else if (mode == IBNBD_FILEIO) {
+   dev->blk_open_flags = FMODE_READ;
+   ret = ibnbd_dev_blk_open(dev, path, dev->blk_open_flags);
+   if (ret)
+   goto err;
+
+   ret = ibnbd_dev_vfs_open(dev, path, flags);
+   if (ret)
+   goto blk_put;
+   }
+
+   dev->blk_open_flags = flags;
+   dev->mode   = mode;
+   dev->io_cb  = io_cb;
+   bdevname(dev->bdev, dev->name);
+   dev->ibd_bio_set = bs;
+
+   return dev;
+
+blk_put:
+   blkdev_put(dev->bdev, dev->blk_open_flags);
+err:
+   kfree(dev);
+   return ERR_PTR(ret);
+}
+
+void ibnbd_dev_close(struct ibnbd_dev *dev)
+{
+   flush_workqueue(fileio_wq);
+   blkdev_put(dev->bdev, dev->blk_open_flags);
+   

[PATCH 21/24] ibnbd: server: sysfs interface functions

2018-02-02 Thread Roman Pen
This is the sysfs interface to IBNBD mapped devices on server side:

  /sys/kernel/ibnbd_server/devices//
    |- block_dev
    |  *** link pointing to the corresponding block device sysfs entry
    |
    |- sessions//
       *** sessions directory
       |
       |- read_only
       |  *** whether the device is mapped as read-only
       |
       |- mapping_path
          *** relative device path provided by the client during mapping
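
As a rough illustration of the layout above, a userspace tool could build
the path to a per-session attribute as sketched below.  The helper name
ibnbd_srv_attr_path() and the directory names passed to it are hypothetical
and only serve the example; the actual directory names are chosen by the
server at mapping time.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Build "/sys/kernel/ibnbd_server/devices/<dev>/sessions/<sess>/<attr>".
 * Returns 0 on success, -1 if the path does not fit into buf.
 * Hypothetical helper: not part of IBNBD.
 */
static int ibnbd_srv_attr_path(char *buf, size_t size, const char *dev,
			       const char *sess, const char *attr)
{
	int n = snprintf(buf, size,
			 "/sys/kernel/ibnbd_server/devices/%s/sessions/%s/%s",
			 dev, sess, attr);

	/* snprintf() reports the length it needed; detect truncation. */
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With a device directory "ram0" and a session directory "clt-srv" (both
made-up names), the helper yields the path of the read_only or mapping_path
file, which can then be read with ordinary file I/O.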

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/ibnbd-srv-sysfs.c | 264 ++
 1 file changed, 264 insertions(+)

diff --git a/drivers/block/ibnbd/ibnbd-srv-sysfs.c b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
new file mode 100644
index ..a0efd6a2accb
--- /dev/null
+++ b/drivers/block/ibnbd/ibnbd-srv-sysfs.c
@@ -0,0 +1,264 @@
+/*
+ * InfiniBand Network Block Driver
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ibnbd-srv.h"
+
+static struct kobject *ibnbd_srv_kobj;
+static struct kobject *ibnbd_srv_devices_kobj;
+
+static struct attribute *ibnbd_srv_default_dev_attrs[] = {
+   NULL,
+};
+
+static struct attribute_group ibnbd_srv_default_dev_attr_group = {
+   .attrs = ibnbd_srv_default_dev_attrs,
+};
+
+static ssize_t ibnbd_srv_attr_show(struct kobject *kobj, struct attribute *attr,
+  char *page)
+{
+   struct kobj_attribute *kattr;
+   int ret = -EIO;
+
+   kattr = container_of(attr, struct kobj_attribute, attr);
+   if (kattr->show)
+   ret = kattr->show(kobj, kattr, page);
+   return ret;
+}
+
+static ssize_t ibnbd_srv_attr_store(struct kobject *kobj,
+   struct attribute *attr,
+   const char *page, size_t length)
+{
+   struct kobj_attribute *kattr;
+   int ret = -EIO;
+
+   kattr = container_of(attr, struct kobj_attribute, attr);
+   if (kattr->store)
+   ret = kattr->store(kobj, kattr, page, length);
+   return ret;
+}
+
+static const struct sysfs_ops ibnbd_srv_sysfs_ops = {
+   .show   = ibnbd_srv_attr_show,
+   .store  = ibnbd_srv_attr_store,
+};
+
+static struct kobj_type ibnbd_srv_dev_ktype = {
+   .sysfs_ops  = &ibnbd_srv_sysfs_ops,
+};
+
+static struct kobj_type ibnbd_srv_dev_sessions_ktype = {
+   .sysfs_ops  = &ibnbd_srv_sysfs_ops,
+};
+
+int ibnbd_srv_create_dev_sysfs(struct ibnbd_srv_dev *dev,
+  struct block_device *bdev,
+  const char *dir_name)
+{
+   struct kobject *bdev_kobj;
+   int ret;
+
+   ret = kobject_init_and_add(&dev->dev_kobj, &ibnbd_srv_dev_ktype,
+  ibnbd_srv_devices_kobj, dir_name);
+   if (ret)
+   return ret;
+
+   ret = kobject_init_and_add(&dev->dev_sessions_kobj,
+  &ibnbd_srv_dev_sessions_ktype,
+  &dev->dev_kobj, "sessions");
+   if (ret)
+   goto err;
+
+   ret = sysfs_create_group(&dev->dev_kobj,
+&ibnbd_srv_default_dev_attr_group);
+   if (ret)
+   goto err2;
+
+   bdev_kobj = &disk_to_dev(bdev->bd_disk)->kobj;
+   ret = sysfs_create_link(&dev->dev_kobj, bdev_kobj, "block_dev");
+   if (ret)
+   goto err3;
+
+   return 0;
+
+err3:
+   sysfs_remove_group(&dev->dev_kobj,
+  &ibnbd_srv_default_dev_attr_group);
+err2:
+   kobject_del(&dev->dev_sessions_kobj);
+   kobject_put(&dev->dev_sessions_kobj);
+err:
+   kobject_del(&dev->dev_kobj);
+   kobject_put(&dev->dev_kobj);
+   return ret;
+}
+
+void ibnbd_srv_destroy_dev_sysfs(struct ibnbd_srv_dev *dev)
+{
+   sysfs_remove_link(&dev->dev_kobj, "block_dev");

[PATCH 22/24] ibnbd: include client and server modules into kernel compilation

2018-02-02 Thread Roman Pen
Add IBNBD Makefile, Kconfig and also corresponding lines into upper
block layer files.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/Kconfig|  2 ++
 drivers/block/Makefile   |  1 +
 drivers/block/ibnbd/Kconfig  | 22 ++
 drivers/block/ibnbd/Makefile | 13 +
 4 files changed, 38 insertions(+)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 40579d0cb3d1..483aae5d391e 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -477,4 +477,6 @@ config BLK_DEV_RSXX
  To compile this driver as a module, choose M here: the
  module will be called rsxx.
 
+source "drivers/block/ibnbd/Kconfig"
+
 endif # BLK_DEV
diff --git a/drivers/block/Makefile b/drivers/block/Makefile
index dc061158b403..65346a1d0b1a 100644
--- a/drivers/block/Makefile
+++ b/drivers/block/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_BLK_DEV_PCIESSD_MTIP32XX)+= mtip32xx/
 obj-$(CONFIG_BLK_DEV_RSXX) += rsxx/
 obj-$(CONFIG_BLK_DEV_NULL_BLK) += null_blk.o
 obj-$(CONFIG_ZRAM) += zram/
+obj-$(CONFIG_BLK_DEV_IBNBD)+= ibnbd/
 
 skd-y  := skd_main.o
 swim_mod-y := swim.o swim_asm.o
diff --git a/drivers/block/ibnbd/Kconfig b/drivers/block/ibnbd/Kconfig
new file mode 100644
index ..c5cc7d111c7a
--- /dev/null
+++ b/drivers/block/ibnbd/Kconfig
@@ -0,0 +1,22 @@
+config BLK_DEV_IBNBD
+   bool
+
+config BLK_DEV_IBNBD_CLIENT
+   tristate "Network block device driver on top of IBTRS transport"
+   depends on INFINIBAND_IBTRS_CLIENT
+   select BLK_DEV_IBNBD
+   help
+ IBNBD client allows for mapping of remote block devices over
+ IBTRS protocol from a target system where IBNBD server is running.
+
+ If unsure, say N.
+
+config BLK_DEV_IBNBD_SERVER
+   tristate "Network block device over RDMA Infiniband server support"
+   depends on INFINIBAND_IBTRS_SERVER
+   select BLK_DEV_IBNBD
+   help
+ IBNBD server allows for exporting local block devices to a remote client
+ over IBTRS protocol.
+
+ If unsure, say N.
diff --git a/drivers/block/ibnbd/Makefile b/drivers/block/ibnbd/Makefile
new file mode 100644
index ..5f20e72e0633
--- /dev/null
+++ b/drivers/block/ibnbd/Makefile
@@ -0,0 +1,13 @@
+ccflags-y := -Idrivers/infiniband/ulp/ibtrs
+
+ibnbd-client-y := ibnbd-clt.o \
+ ibnbd-clt-sysfs.o
+
+ibnbd-server-y := ibnbd-srv.o \
+ ibnbd-srv-dev.o \
+ ibnbd-srv-sysfs.o
+
+obj-$(CONFIG_BLK_DEV_IBNBD_CLIENT) += ibnbd-client.o
+obj-$(CONFIG_BLK_DEV_IBNBD_SERVER) += ibnbd-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1



[PATCH 23/24] ibnbd: a bit of documentation

2018-02-02 Thread Roman Pen
README with description of major sysfs entries.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/block/ibnbd/README | 272 +
 1 file changed, 272 insertions(+)

diff --git a/drivers/block/ibnbd/README b/drivers/block/ibnbd/README
new file mode 100644
index ..e0feb39fad14
--- /dev/null
+++ b/drivers/block/ibnbd/README
@@ -0,0 +1,272 @@
+***************************************
+Infiniband Network Block Device (IBNBD)
+***************************************
+
+Introduction
+------------
+
+IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
+(client and server) that allow for remote access of a block device on
+the server over IBTRS protocol using the RDMA (InfiniBand, RoCE, iWarp)
+transport. After being mapped, the remote block devices can be accessed
+on the client side as local block devices.
+
+I/O is transferred between client and server by the IBTRS transport
+modules. The administration of IBNBD and IBTRS modules is done via
+sysfs entries.
+
+Requirements
+------------
+
+  IBTRS kernel modules
+
+Quick Start
+-----------
+
+Server side:
+  # modprobe ibnbd_server
+
+Client side:
+  # modprobe ibnbd_client
+  # echo "sessname=blya path=ip:10.50.100.66 device_path=/dev/ram0" > \
+/sys/kernel/ibnbd_client/map_device
+
+  Where "sessname=" is a session name, a string to identify the session
+  on client and on server sides; "path=" is a destination IP address or
+  a pair of source and destination IPs separated by a comma.  Multiple
+  "path=" options can be specified in order to use multipath  (see IBTRS
+  description for details); "device_path=" is the block device to be
+  mapped from the server side. After the session to the server machine is
+  established, the mapped device will appear on the client side under
+  /dev/ibnbd.
+
+
+======================
+Client Sysfs Interface
+======================
+
+All sysfs files that are not read-only provide the usage information on read:
+
+Example:
+  # cat /sys/kernel/ibnbd_client/map_device
+
+  > Usage: echo "sessname= path=<[srcaddr,]dstaddr>
+  > [path=<[srcaddr,]dstaddr>] device_path=
+  > [access_mode=] [input_mode=]
+  > [io_mode=]" > map_device
+  >
+  > addr ::= [ ip: | ip: | gid: ]
+
+Entries under /sys/kernel/ibnbd_client/
+=======================================
+
+map_device (RW)
+---------------
+
+Expected format is the following:
+
+sessname=
+path=<[srcaddr,]dstaddr> [path=<[srcaddr,]dstaddr> ...]
+device_path=
+[access_mode=]
+[input_mode=]
+[io_mode=]
+
+Where:
+
+sessname: accepts a string not bigger than 256 chars, which identifies
+  a given session on the client and on the server.
+ For example, "clt_hostname-srv_hostname" could be a natural choice.
+
+path: describes a connection between the client and the server by
+ specifying destination and, when required, the source address.
+ The addresses are to be provided in the following format:
+
+ip:
+ip:
+gid:
+
+  for example:
+
+  path=ip:10.0.0.66
+ The single addr is treated as the destination.
+ The connection will be established to this
+ server from any client IP address.
+
+  path=ip:10.0.0.66,ip:10.0.1.66
+ First addr is the source address and the second
+ is the destination.
+
+  If multiple "path=" options are specified, multiple connections
+  will be established and data will be sent according to
+  the selected multipath policy (see IBTRS mp_policy sysfs entry
+  description).
+
+device_path: Path to the block device on the server side. Path is specified
+relative to the directory on server side configured in the
+ 'dev_search_path' module parameter of the ibnbd_server.
+ The ibnbd_server prepends the  received from client
+with  and tries to open the
+/ block device.  On success,
+a /dev/ibnbd device file, a /sys/block/ibnbd_client/ibnbd/
+directory and an entry in /sys/kernel/ibnbd_client/devices will be
+ created.
+
+access_mode: the access_mode parameter specifies if the device is to be
+ mapped as "ro" read-only or "rw" read-write. The server allows
+a device to be exported in rw mode only once. The "migration"
+ access mode has to be specified if a second mapping in read-write
+mode is desired.
+
+ By default "rw" is used.
+
+input_mode: the input_mode parameter specifies the internal I/O
+processing mode of the block device on the client.  Accepts
+"mq" and "rq".
+
+By default "mq" mode is used.
+
+io_mode:  the io_mode parameter specifies if the device on the server
+  will be opened as blo

[PATCH 11/24] ibtrs: server: sysfs interface functions

2018-02-02 Thread Roman Pen
This is the sysfs interface to IBTRS sessions on server side:

  /sys/kernel/ibtrs_server//
    *** IBTRS session accepted from a client peer
    |
    |- paths//
       *** established paths from a client in a session
       |
       |- disconnect
       |  *** disconnect path
       |
       |- hca_name
       |  *** HCA name
       |
       |- hca_port
       |  *** HCA port
       |
       |- stats/
          *** current path statistics
          |
          |- rdma
          |- reset_all
          |- wc_completions
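
The disconnect entry only accepts the value 1: the store handler in this
patch compares the written buffer against "1" with sysfs_streq(), which
treats a single trailing newline as a terminator, so both "echo 1" and
"echo -n 1" are accepted.  A minimal userspace sketch of that comparison
rule follows; the function name is hypothetical and this is an
illustration, not the kernel implementation itself.

```c
#include <stdbool.h>

/*
 * Return true if the two strings are identical, or if they differ only by
 * one trailing newline on either side.  Hypothetical userspace helper that
 * mirrors the behaviour described for sysfs_streq().
 */
static bool streq_trailing_nl(const char *s1, const char *s2)
{
	/* Skip the common prefix. */
	while (*s1 && *s1 == *s2) {
		s1++;
		s2++;
	}

	if (*s1 == *s2)			/* both strings ended together */
		return true;
	if (!*s1 && *s2 == '\n' && !s2[1])	/* s2 has one extra '\n' */
		return true;
	if (*s1 == '\n' && !s1[1] && !*s2)	/* s1 has one extra '\n' */
		return true;
	return false;
}
```

Under this rule "1\n" matches "1", while "0\n", "11" and "1\n\n" do not,
which is why any other write to disconnect is rejected with -EINVAL.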

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c | 278 +
 1 file changed, 278 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
new file mode 100644
index ..ec2c86fe4181
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-sysfs.c
@@ -0,0 +1,278 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+static struct kobject *ibtrs_kobj;
+
+static struct kobj_type ktype = {
+   .sysfs_ops  = &kobj_sysfs_ops,
+};
+
+static ssize_t ibtrs_srv_disconnect_show(struct kobject *kobj,
+struct kobj_attribute *attr,
+char *page)
+{
+   return scnprintf(page, PAGE_SIZE, "Usage: echo 1 > %s\n",
+attr->attr.name);
+}
+
+static ssize_t ibtrs_srv_disconnect_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+   struct ibtrs_srv_sess *sess;
+   char str[MAXHOSTNAMELEN];
+
+   sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+   if (!sysfs_streq(buf, "1")) {
+   ibtrs_err(sess, "%s: invalid value: '%s'\n",
+ attr->attr.name, buf);
+   return -EINVAL;
+   }
+
+   sockaddr_to_str((struct sockaddr *)&sess->s.dst_addr, str, sizeof(str));
+
+   ibtrs_info(sess, "disconnect for path %s requested\n", str);
+   ibtrs_srv_queue_close(sess);
+
+   return count;
+}
+
+static struct kobj_attribute ibtrs_srv_disconnect_attr =
+   __ATTR(disconnect, 0644,
+  ibtrs_srv_disconnect_show, ibtrs_srv_disconnect_store);
+
+static ssize_t ibtrs_srv_hca_port_show(struct kobject *kobj,
+  struct kobj_attribute *attr,
+  char *page)
+{
+   struct ibtrs_srv_sess *sess;
+   struct ibtrs_con *usr_con;
+
+   sess = container_of(kobj, typeof(*sess), kobj);
+   usr_con = sess->s.con[0];
+
+   return scnprintf(page, PAGE_SIZE, "%u\n",
+usr_con->cm_id->port_num);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_port_attr =
+   __ATTR(hca_port, 0444, ibtrs_srv_hca_port_show, NULL);
+
+static ssize_t ibtrs_srv_hca_name_show(struct kobject *kobj,
+  struct kobj_attribute *attr,
+  char *page)
+{
+   struct ibtrs_srv_sess *sess;
+
+   sess = container_of(kobj, struct ibtrs_srv_sess, kobj);
+
+   return scnprintf(page, PAGE_SIZE, "%s\n",
+sess->s.ib_dev->dev->name);
+}
+
+static struct kobj_attribute ibtrs_srv_hca_name_attr =
+   __ATTR(hca_name, 0444, ibtrs_srv_hca_name_show, NULL);
+
+static struct attribute *ibtrs_srv_sess_attrs[] = {
+   &ibtrs_srv_hca_name_attr.attr,
+   &ibtrs_srv_hca_port_attr.attr,
+   &ibtrs_srv_disconnect_attr.attr,
+   NULL,
+};
+
+static struct attribute_group ibtrs_srv_sess_attr_group = {
+   .attrs = ibtrs_srv_sess_attrs,
+};
+
+STAT_ATTR(struct ibtrs_srv_sess, rdma,
+ ibtrs_sr

[PATCH 12/24] ibtrs: include client and server modules into kernel compilation

2018-02-02 Thread Roman Pen
Add IBTRS Makefile, Kconfig and also corresponding lines into upper
layer infiniband/ulp files.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/Kconfig|  1 +
 drivers/infiniband/ulp/Makefile   |  1 +
 drivers/infiniband/ulp/ibtrs/Kconfig  | 20 
 drivers/infiniband/ulp/ibtrs/Makefile | 15 +++
 4 files changed, 37 insertions(+)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index cbf186522016..7adbd0e272c4 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -93,6 +93,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
+source "drivers/infiniband/ulp/ibtrs/Kconfig"
 
 source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index 437813c7b481..1c4f10dc8d49 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -5,3 +5,4 @@ obj-$(CONFIG_INFINIBAND_SRPT)   += srpt/
 obj-$(CONFIG_INFINIBAND_ISER)  += iser/
 obj-$(CONFIG_INFINIBAND_ISERT) += isert/
 obj-$(CONFIG_INFINIBAND_OPA_VNIC)  += opa_vnic/
+obj-$(CONFIG_INFINIBAND_IBTRS) += ibtrs/
diff --git a/drivers/infiniband/ulp/ibtrs/Kconfig b/drivers/infiniband/ulp/ibtrs/Kconfig
new file mode 100644
index ..eaeb8f3f6b4e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Kconfig
@@ -0,0 +1,20 @@
+config INFINIBAND_IBTRS
+   tristate
+   depends on INFINIBAND_ADDR_TRANS
+
+config INFINIBAND_IBTRS_CLIENT
+   tristate "IBTRS client module"
+   depends on INFINIBAND_ADDR_TRANS
+   select INFINIBAND_IBTRS
+   help
+ IBTRS client allows for simplified data transfer and connection
+ establishment over RDMA (InfiniBand, RoCE, iWarp). Uses BIO-like
+ READ/WRITE semantics and provides multipath capabilities.
+
+config INFINIBAND_IBTRS_SERVER
+   tristate "IBTRS server module"
+   depends on INFINIBAND_ADDR_TRANS
+   select INFINIBAND_IBTRS
+   help
+ IBTRS server module processes connection and IO requests received
+ from the IBTRS client module.
diff --git a/drivers/infiniband/ulp/ibtrs/Makefile b/drivers/infiniband/ulp/ibtrs/Makefile
new file mode 100644
index ..e6ea858745ad
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/Makefile
@@ -0,0 +1,15 @@
+ibtrs-client-y := ibtrs-clt.o \
+ ibtrs-clt-stats.o \
+ ibtrs-clt-sysfs.o
+
+ibtrs-server-y := ibtrs-srv.o \
+ ibtrs-srv-stats.o \
+ ibtrs-srv-sysfs.o
+
+ibtrs-core-y := ibtrs.o
+
+obj-$(CONFIG_INFINIBAND_IBTRS)+= ibtrs-core.o
+obj-$(CONFIG_INFINIBAND_IBTRS_CLIENT) += ibtrs-client.o
+obj-$(CONFIG_INFINIBAND_IBTRS_SERVER) += ibtrs-server.o
+
+-include $(src)/compat/compat.mk
-- 
2.13.1



[PATCH 13/24] ibtrs: a bit of documentation

2018-02-02 Thread Roman Pen
README with description of major sysfs entries.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/README | 238 
 1 file changed, 238 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/README b/drivers/infiniband/ulp/ibtrs/README
new file mode 100644
index ..ed506c7e202d
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/README
@@ -0,0 +1,238 @@
+
+InfiniBand Transport (IBTRS)
+
+
+IBTRS (InfiniBand Transport) is a reliable high speed transport library
+which provides support to establish optimal number of connections
+between client and server machines using RDMA (InfiniBand, RoCE, iWarp)
+transport. It is optimized to transfer (read/write) IO blocks.
+
+In its core interface it follows the BIO semantics of providing the
+possibility to either write data from an sg list to the remote side
+or to request ("read") data transfer from the remote side into a given
+sg list.
+
+IBTRS provides I/O fail-over and load-balancing capabilities by using
+multipath I/O (see "add_path" and "mp_policy" configuration entries).
+
+IBTRS is used by the IBNBD (Infiniband Network Block Device) modules.
+
+==
+Client Sysfs Interface
+==
+
+This chapter describes only the most important files of sysfs interface
+on client side.
+
+Entries under /sys/kernel/ibtrs_client/
+===
+
+When a user of IBTRS API creates a new session, a directory entry with
+the name of that session is created.
+
+Entries under /sys/kernel/ibtrs_client//
+==
+
+add_path (RW)
+-
+
+Adds a new path (connection) to an existing session. Expected format is the
+following:
+
+  <[source addr,]destination addr>
+
+  *addr ::= [ ip: | gid: ]
+
+max_reconnect_attempts (RW)
+---
+
+Maximum number of reconnect attempts the client should make before giving up
+after a connection breaks unexpectedly.
+
+mp_policy (RW)
+--
+
+Multipath policy specifies which path should be selected on each IO:
+
+   round-robin (0):
+   select path in per CPU round-robin manner.
+
+   min-inflight (1):
+   select path with minimum inflights.
+
+Entries under /sys/kernel/ibtrs_client//paths/
+
+
+
+Each path belonging to a given session is listed here by its destination
+address. When a new path is added to a session by writing to the "add_path"
+entry, a directory with the corresponding destination address is created.
+
+Entries under /sys/kernel/ibtrs_client//paths//
+
+
+state (R)
+-
+
+Contains "connected" if the session is connected to the peer and fully
+functional.  Otherwise the file contains "disconnected".
+
+reconnect (RW)
+--
+
+Write "1" to the file in order to reconnect the path.
+Operation is blocking and returns 0 if the reconnect was successful.
+
+disconnect (RW)
+---
+
+Write "1" to the file in order to disconnect the path.
+Operation blocks until IBTRS path is disconnected.
+
+remove_path (RW)
+
+
+Write "1" to the file in order to disconnect and remove the path
+from the session.  Operation blocks until the path is disconnected
+and removed from the session.
+
+Entries under /sys/kernel/ibtrs_client//paths//stats/
+==
+
+Write "0" to any file in that directory to reset corresponding statistics.
+
+reset_all (RW)
+--
+
+Read will return usage help, write 0 will clear all the statistics.
+
+sg_entries (RW)
+---
+
+Data to be transferred via RDMA is passed to IBTRS as a scatter-gather
+list. A scatter-gather list can contain multiple entries.
+Scatter-gather lists with fewer entries require less processing power
+and can therefore be transferred faster. The file sg_entries outputs a
+per-CPU distribution table for the number of entries in the
+scatter-gather lists that were passed to the IBTRS API function
+ibtrs_clt_request (READ or WRITE).
+
+cpu_migration (RW)
+--
+
+IBTRS expects that each HCA IRQ is pinned to a separate CPU. If that is
+not the case, an I/O response could be processed on a different CPU
+than the one on which it was originally submitted.  This file shows
+how many interrupts were generated on an unexpected CPU.
+"from:" is the CPU on which the IRQ was expected, but not generated.
+"to:" is the CPU on which the IRQ was generated, but not expected.
+
+reconnects (RW)
+---
+
+Contains 2 unsigned int values, the first one records number of successful
+reconnects in the path lifetime, the second one records number of failed
+reconnects in the path lifetime.
+
+rdma_lat (RW)
+-
+
+Latency distribution of IB

[PATCH 10/24] ibtrs: server: statistics functions

2018-02-02 Thread Roman Pen
This introduces a set of functions used on the server side to account
statistics of RDMA data sent/received.
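
For readers without the kernel tree at hand, the accounting done by
ibtrs_srv_update_rdma_stats() and ibtrs_srv_stats_rdma_to_str() in this
patch can be approximated in userspace with C11 atomics. This is only a
sketch: the type and function names are made up for illustration, the
DIR_READ/DIR_WRITE constants stand in for the kernel's READ/WRITE, and the
trailing inflight counter of the real output format is left out.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-ins for the kernel's READ/WRITE direction indices (assumption). */
enum { DIR_READ = 0, DIR_WRITE = 1 };

struct rdma_dir_stats {
	atomic_long cnt;	/* number of transfers in this direction */
	atomic_long size_total;	/* total bytes in this direction */
};

struct rdma_stats {
	struct rdma_dir_stats dir[2];
};

/* Account one transfer of 'size' bytes in direction 'd'. */
void rdma_stats_update(struct rdma_stats *s, size_t size, int d)
{
	atomic_fetch_add(&s->dir[d].cnt, 1);
	atomic_fetch_add(&s->dir[d].size_total, (long)size);
}

/* Clear all counters, as the reset handlers in the patch do. */
void rdma_stats_reset(struct rdma_stats *s)
{
	int d;

	for (d = 0; d < 2; d++) {
		atomic_store(&s->dir[d].cnt, 0);
		atomic_store(&s->dir[d].size_total, 0);
	}
}

/* Format as "read_cnt read_bytes write_cnt write_bytes\n". */
int rdma_stats_to_str(struct rdma_stats *s, char *buf, size_t len)
{
	return snprintf(buf, len, "%ld %ld %ld %ld\n",
			atomic_load(&s->dir[DIR_READ].cnt),
			atomic_load(&s->dir[DIR_READ].size_total),
			atomic_load(&s->dir[DIR_WRITE].cnt),
			atomic_load(&s->dir[DIR_WRITE].size_total));
}
```

The counters are updated independently, so a reader may observe a count
that is one transfer ahead of the byte total; the kernel code in the patch
has the same property.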

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c | 110 +
 1 file changed, 110 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
new file mode 100644
index ..441b07fdf44a
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv-stats.c
@@ -0,0 +1,110 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-srv.h"
+
+void ibtrs_srv_update_rdma_stats(struct ibtrs_srv_stats *s,
+size_t size, int d)
+{
+   atomic64_inc(&s->rdma_stats.dir[d].cnt);
+   atomic64_add(size, &s->rdma_stats.dir[d].size_total);
+}
+
+void ibtrs_srv_update_wc_stats(struct ibtrs_srv_stats *s)
+{
+   atomic64_inc(&s->wc_comp.calls);
+   atomic64_inc(&s->wc_comp.total_wc_cnt);
+}
+
+int ibtrs_srv_reset_rdma_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+   if (enable) {
+   struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+
+   memset(r, 0, sizeof(*r));
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+ssize_t ibtrs_srv_stats_rdma_to_str(struct ibtrs_srv_stats *stats,
+   char *page, size_t len)
+{
+   struct ibtrs_srv_stats_rdma_stats *r = &stats->rdma_stats;
+   struct ibtrs_srv_sess *sess;
+
+   sess = container_of(stats, typeof(*sess), stats);
+
+   return scnprintf(page, len, "%ld %ld %ld %ld %u\n",
+atomic64_read(&r->dir[READ].cnt),
+atomic64_read(&r->dir[READ].size_total),
+atomic64_read(&r->dir[WRITE].cnt),
+atomic64_read(&r->dir[WRITE].size_total),
+atomic_read(&sess->ids_inflight));
+}
+
+int ibtrs_srv_reset_wc_completion_stats(struct ibtrs_srv_stats *stats,
+   bool enable)
+{
+   if (enable) {
+   memset(&stats->wc_comp, 0, sizeof(stats->wc_comp));
+   return 0;
+   }
+
+   return -EINVAL;
+}
+
+int ibtrs_srv_stats_wc_completion_to_str(struct ibtrs_srv_stats *stats,
+char *buf, size_t len)
+{
+   return scnprintf(buf, len, "%ld %ld\n",
+   atomic64_read(&stats->wc_comp.total_wc_cnt),
+   atomic64_read(&stats->wc_comp.calls));
+}
+
+ssize_t ibtrs_srv_reset_all_help(struct ibtrs_srv_stats *stats,
+char *page, size_t len)
+{
+   return scnprintf(page, len, "echo 1 to reset all statistics\n");
+}
+
+int ibtrs_srv_reset_all_stats(struct ibtrs_srv_stats *stats, bool enable)
+{
+   if (enable) {
+   ibtrs_srv_reset_wc_completion_stats(stats, enable);
+   ibtrs_srv_reset_rdma_stats(stats, enable);
+   return 0;
+   }
+
+   return -EINVAL;
+}
-- 
2.13.1



[PATCH 04/24] ibtrs: client: private header with client structs and functions

2018-02-02 Thread Roman Pen
This header describes main structs and functions used by ibtrs-client
module, mainly for managing IBTRS sessions, creating/destroying sysfs
entries, accounting statistics on client side.
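
As a quick check of the statistics table sizes defined in this header, the
arithmetic behind SG_DISTR_SZ and LOG_LAT_SZ can be worked out with the
macros copied from the patch. BIT() is redefined locally here because this
sketch is meant to build outside the kernel; how the individual buckets are
filled is not shown in this mail and is not assumed.

```c
/* Local stand-in for the kernel's BIT() macro. */
#define BIT(n) (1UL << (n))

/* Copied from ibtrs-clt.h in this patch. */
#define MIN_LOG_SG 2
#define MAX_LOG_SG 5
#define MAX_LIN_SG BIT(MIN_LOG_SG)
#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)

#define MAX_LOG_LAT 16
#define MIN_LOG_LAT 0
#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)

/* 5 - 2 + 4 + 2 = 9 entries in sg_list_distr[] */
_Static_assert(SG_DISTR_SZ == 9, "sg distribution table has 9 slots");
/* 16 - 0 + 2 = 18 entries in rdma_lat_distr[] */
_Static_assert(LOG_LAT_SZ == 18, "latency table has 18 slots");
```

So each per-CPU stats structure carries 9 scatter-gather buckets and 18
latency buckets.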

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.h | 338 +++
 1 file changed, 338 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
new file mode 100644
index ..b57af19ac833
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.h
@@ -0,0 +1,338 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBTRS_CLT_H
+#define IBTRS_CLT_H
+
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_clt_state - Client states.
+ */
+enum ibtrs_clt_state {
+   IBTRS_CLT_CONNECTING,
+   IBTRS_CLT_CONNECTING_ERR,
+   IBTRS_CLT_RECONNECTING,
+   IBTRS_CLT_CONNECTED,
+   IBTRS_CLT_CLOSING,
+   IBTRS_CLT_CLOSED,
+   IBTRS_CLT_DEAD,
+};
+
+static inline const char *ibtrs_clt_state_str(enum ibtrs_clt_state state)
+{
+   switch (state) {
+   case IBTRS_CLT_CONNECTING:
+   return "IBTRS_CLT_CONNECTING";
+   case IBTRS_CLT_CONNECTING_ERR:
+   return "IBTRS_CLT_CONNECTING_ERR";
+   case IBTRS_CLT_RECONNECTING:
+   return "IBTRS_CLT_RECONNECTING";
+   case IBTRS_CLT_CONNECTED:
+   return "IBTRS_CLT_CONNECTED";
+   case IBTRS_CLT_CLOSING:
+   return "IBTRS_CLT_CLOSING";
+   case IBTRS_CLT_CLOSED:
+   return "IBTRS_CLT_CLOSED";
+   case IBTRS_CLT_DEAD:
+   return "IBTRS_CLT_DEAD";
+   default:
+   return "UNKNOWN";
+   }
+}
+
+enum ibtrs_fast_reg {
+   IBTRS_FAST_MEM_NONE,
+   IBTRS_FAST_MEM_FR,
+   IBTRS_FAST_MEM_FMR
+};
+
+enum ibtrs_mp_policy {
+   MP_POLICY_RR,
+   MP_POLICY_MIN_INFLIGHT,
+};
+
+struct ibtrs_clt_stats_reconnects {
+   int successful_cnt;
+   int fail_cnt;
+};
+
+struct ibtrs_clt_stats_wc_comp {
+   u32 cnt;
+   u64 total_cnt;
+};
+
+struct ibtrs_clt_stats_cpu_migr {
+   atomic_t from;
+   int to;
+};
+
+struct ibtrs_clt_stats_rdma {
+   struct {
+   u64 cnt;
+   u64 size_total;
+   } dir[2];
+
+   u64 failover_cnt;
+};
+
+struct ibtrs_clt_stats_rdma_lat {
+   u64 read;
+   u64 write;
+};
+
+#define MIN_LOG_SG 2
+#define MAX_LOG_SG 5
+#define MAX_LIN_SG BIT(MIN_LOG_SG)
+#define SG_DISTR_SZ (MAX_LOG_SG - MIN_LOG_SG + MAX_LIN_SG + 2)
+
+#define MAX_LOG_LAT 16
+#define MIN_LOG_LAT 0
+#define LOG_LAT_SZ (MAX_LOG_LAT - MIN_LOG_LAT + 2)
+
+struct ibtrs_clt_stats_pcpu {
+   struct ibtrs_clt_stats_cpu_migr cpu_migr;
+   struct ibtrs_clt_stats_rdma rdma;
+   u64 sg_list_total;
+   u64 sg_list_distr[SG_DISTR_SZ];
+   struct ibtrs_clt_stats_rdma_lat rdma_lat_distr[LOG_LAT_SZ];
+   struct ibtrs_clt_stats_rdma_lat rdma_lat_max;
+   struct ibtrs_clt_stats_wc_comp  wc_comp;
+};
+
+struct ibtrs_clt_stats {
+   bool enable_rdma_lat;
+   struct ibtrs_clt_stats_pcpu __percpu *pcpu_stats;
+   struct ibtrs_clt_stats_reconnects reconnects;
+   atomic_t inflight;
+};
+
+struct ibtrs_clt_con {
+   struct ibtrs_con c;
+   unsigned cpu;
+   atomic_t io_cnt;
+   struct ibtrs_fr_pool *fr_pool;
+   int cm_err;
+};
+
+struct ibtrs_clt_io_req {
+   struct list_head list;
+   struct ibtrs_iu *iu;
+   struct scatterlist *sglist; /* list holding user data */
+   unsigned int sg_cnt;
+   unsigned int sg_size;
+   unsigned int data_len;
+   unsigned int   

[PATCH 01/24] ibtrs: public interface header to establish RDMA connections

2018-02-02 Thread Roman Pen
Introduce public header which provides set of API functions to
establish RDMA connections from client to server machine using
IBTRS protocol, which manages RDMA connections for each session,
does multipathing and load balancing.

Main functions for client (active) side:

 ibtrs_clt_open() - Creates set of RDMA connections encapsulated
in IBTRS session and returns pointer to IBTRS
session object.
 ibtrs_clt_close() - Closes RDMA connections associated with IBTRS
 session.
 ibtrs_clt_request() - Requests zero-copy RDMA transfer to/from
   server.

Main functions for server (passive) side:

 ibtrs_srv_open() - Starts listening for IBTRS clients on specified
port and invokes IBTRS callbacks for incoming
RDMA requests or link events.
 ibtrs_srv_close() - Closes IBTRS server context.
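
One detail of ibtrs_clt_open() worth spelling out is the
max_reconnect_attempts argument, which the kernel-doc in this header
describes as the number of reconnect attempts before giving up, with 0
meaning disabled and -1 meaning forever. A minimal sketch of that rule
follows; the helper name and the attempts_made counter are hypothetical,
and the real decision is made inside the client module, which is not part
of this mail.

```c
#include <stdbool.h>

/*
 * Decide whether another reconnect attempt is allowed.
 *   max_reconnect_attempts < 0  : retry forever
 *   max_reconnect_attempts == 0 : reconnecting is disabled
 *   max_reconnect_attempts > 0  : allow up to that many attempts
 */
bool should_reconnect(int attempts_made, short max_reconnect_attempts)
{
	if (max_reconnect_attempts < 0)
		return true;
	return attempts_made < max_reconnect_attempts;
}
```

With a limit of 3, the fourth attempt is refused; with -1 every attempt is
allowed.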

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs.h | 331 +++
 1 file changed, 331 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.h b/drivers/infiniband/ulp/ibtrs/ibtrs.h
new file mode 100644
index ..747cdde3d9cf
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.h
@@ -0,0 +1,331 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBTRS_H
+#define IBTRS_H
+
+#include 
+#include 
+
+struct ibtrs_clt;
+struct ibtrs_srv_ctx;
+struct ibtrs_srv;
+struct ibtrs_srv_op;
+
+/*
+ * Here goes IBTRS client API
+ */
+
+/**
+ * enum ibtrs_clt_link_ev - Events about connectivity state of a client
+ * @IBTRS_CLT_LINK_EV_RECONNECTED:  Client was reconnected.
+ * @IBTRS_CLT_LINK_EV_DISCONNECTED: Client was disconnected.
+ */
+enum ibtrs_clt_link_ev {
+   IBTRS_CLT_LINK_EV_RECONNECTED,
+   IBTRS_CLT_LINK_EV_DISCONNECTED,
+};
+
+/**
+ * Source and destination address of a path to be established
+ */
+struct ibtrs_addr {
+   struct sockaddr *src;
+   struct sockaddr *dst;
+};
+
+typedef void (link_clt_ev_fn)(void *priv, enum ibtrs_clt_link_ev ev);
+/**
+ * ibtrs_clt_open() - Open a session to an IBTRS server
+ * @priv:  User supplied private data.
+ * @link_ev:   Event notification for connection state changes
+ * @priv:  user supplied data that was passed to
+ * ibtrs_clt_open()
+ * @ev:Occurred event
+ * @sessname: name of the session
+ * @paths: Paths to be established defined by their src and dst addresses
+ * @path_cnt: Number of elements in the @paths array
+ * @port: port to be used by the IBTRS session
+ * @pdu_sz: Size of extra payload which can be accessed after tag allocation.
+ * @max_inflight_msg: Max. number of parallel inflight messages for the session
+ * @max_segments: Max. number of segments per IO request
+ * @reconnect_delay_sec: time between reconnect tries
+ * @max_reconnect_attempts: Number of times to reconnect on error before giving
+ * up, 0 for disabled, -1 for forever
+ *
+ * Starts session establishment with the ibtrs_server. The function can block
+ * up to ~2000ms until it returns.
+ *
+ * Return a valid pointer on success otherwise PTR_ERR.
+ */
+struct ibtrs_clt *ibtrs_clt_open(void *priv, link_clt_ev_fn *link_ev,
+const char *sessname,
+const struct ibtrs_addr *paths,
+size_t path_cnt, short port,
+size_t pdu_sz, u8 reconnect_delay_sec,
+u16 max_segments,
+s16 max_reconnect_attempts);
+
+/**
+ * ibtrs_clt_close() - Close a session
+ * @sess: Session handler, is freed on return
+ */
+void ibtrs_clt_close(struct ibtrs_clt *sess);
+
+enum {
+   IBTRS_TAG_NOWAIT = 0,
+   IBTRS_TAG_WAIT   = 1,
+};
+
+/**
+ * enum ibtrs_clt_con_type() type of i

[PATCH 08/24] ibtrs: server: private header with server structs and functions

2018-02-02 Thread Roman Pen
This header describes main structs and functions used by ibtrs-server
module, mainly for accepting IBTRS sessions, creating/destroying
sysfs entries, accounting statistics on server side.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.h | 169 +++
 1 file changed, 169 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
new file mode 100644
index ..f54e159eaf2a
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.h
@@ -0,0 +1,169 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBTRS_SRV_H
+#define IBTRS_SRV_H
+
+#include 
+#include "ibtrs-pri.h"
+
+/**
+ * enum ibtrs_srv_state - Server states.
+ */
+enum ibtrs_srv_state {
+   IBTRS_SRV_CONNECTING,
+   IBTRS_SRV_CONNECTED,
+   IBTRS_SRV_CLOSING,
+   IBTRS_SRV_CLOSED,
+};
+
+static inline const char *ibtrs_srv_state_str(enum ibtrs_srv_state state)
+{
+   switch (state) {
+   case IBTRS_SRV_CONNECTING:
+   return "IBTRS_SRV_CONNECTING";
+   case IBTRS_SRV_CONNECTED:
+   return "IBTRS_SRV_CONNECTED";
+   case IBTRS_SRV_CLOSING:
+   return "IBTRS_SRV_CLOSING";
+   case IBTRS_SRV_CLOSED:
+   return "IBTRS_SRV_CLOSED";
+   default:
+   return "UNKNOWN";
+   }
+}
+
+struct ibtrs_stats_wc_comp {
+   atomic64_t  calls;
+   atomic64_t  total_wc_cnt;
+};
+
+struct ibtrs_srv_stats_rdma_stats {
+   struct {
+   atomic64_t  cnt;
+   atomic64_t  size_total;
+   } dir[2];
+};
+
+struct ibtrs_srv_stats {
+   struct ibtrs_srv_stats_rdma_stats rdma_stats;
+   atomic_t apm_cnt;
+   struct ibtrs_stats_wc_comp wc_comp;
+};
+
+struct ibtrs_srv_con {
+   struct ibtrs_con c;
+   atomic_t wr_cnt;
+};
+
+struct ibtrs_srv_op {
+   struct ibtrs_srv_con *con;
+   u32 msg_id;
+   u8 dir;
+   u64 data_dma_addr;
+   struct ibtrs_msg_rdma_read *msg;
+   struct ib_rdma_wr *tx_wr;
+   struct ib_sge *tx_sg;
+};
+
+struct ibtrs_srv_sess {
+   struct ibtrs_sess s;
+   struct ibtrs_srv *srv;
+   struct work_struct close_work;
+   enum ibtrs_srv_state state;
+   spinlock_t state_lock;
+   int cur_cq_vector;
+   struct ibtrs_srv_op **ops_ids;
+   atomic_t ids_inflight;
+   wait_queue_head_t ids_waitq;
+   dma_addr_t *rdma_addr;
+   bool established;
+   unsigned int mem_bits;
+   struct kobject kobj;
+   struct kobject kobj_stats;
+   struct ibtrs_srv_stats stats;
+};
+
+struct ibtrs_srv {
+   struct list_head paths_list;
+   int paths_up;
+   struct mutex paths_ev_mutex;
+   size_t paths_num;
+   struct mutex paths_mutex;
+   uuid_t paths_uuid;
+   refcount_t refcount;
+   struct ibtrs_srv_ctx *ctx;
+   struct list_head ctx_list;
+   void *priv;
+   size_t queue_depth;
+   struct page **chunks;
+   struct kobject kobj;
+   struct kobject kobj_paths;
+};
+
+struct ibtrs_srv_ctx {
+   rdma_ev_fn *rdma_ev;
+   link_ev_fn *link_ev;
+   struct rdma_cm_id *cm_id_ip;
+   struct rdma_cm_id *cm_id_ib;
+   struct mutex srv_mutex;
+   struct list_head srv_list;
+};
+
+/* See ibtrs-log.h */
+#define TYPES_TO_SESSNAME(obj) \
+   LIST(C

[PATCH 02/24] ibtrs: private headers with IBTRS protocol structs and helpers

2018-02-02 Thread Roman Pen
These are common private headers with IBTRS protocol structures,
logging, sysfs and other helper functions, which are used on
both client and server sides.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-log.h |  94 ++
 drivers/infiniband/ulp/ibtrs/ibtrs-pri.h | 494 +++
 2 files changed, 588 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-log.h b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
new file mode 100644
index ..308593785c64
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-log.h
@@ -0,0 +1,94 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#ifndef IBTRS_LOG_H
+#define IBTRS_LOG_H
+
+#define P1 )
+#define P2 ))
+#define P3 )))
+#define P4 ))))
+#define P(N) P ## N
+
+#define CAT(a, ...) PRIMITIVE_CAT(a, __VA_ARGS__)
+#define PRIMITIVE_CAT(a, ...) a ## __VA_ARGS__
+
+#define COUNT_ARGS(...) COUNT_ARGS_(,##__VA_ARGS__,6,5,4,3,2,1,0)
+#define COUNT_ARGS_(z,a,b,c,d,e,f,cnt,...) cnt
+
+#define LIST(...)  \
+   __VA_ARGS__,\
+   ({ unknown_type(); NULL; }) \
+   CAT(P, COUNT_ARGS(__VA_ARGS__)) \
+
+#define EMPTY()
+#define DEFER(id) id EMPTY()
+
+#define _CASE(obj, type, member)   \
+   __builtin_choose_expr(  \
+   __builtin_types_compatible_p(   \
+   typeof(obj), type), \
+   ((type)obj)->member
+#define CASE(o, t, m) DEFER(_CASE)(o,t,m)
+
+/*
+ * Below we define retrieving of sessname from common IBTRS types.
+ * Client or server related types have to be defined by special
+ * TYPES_TO_SESSNAME macro.
+ */
+
+void unknown_type(void);
+
+#ifndef TYPES_TO_SESSNAME
+#define TYPES_TO_SESSNAME(...) ({ unknown_type(); NULL; })
+#endif
+
+#define ibtrs_prefix(obj)  \
+   _CASE(obj, struct ibtrs_con *,  sess->sessname),\
+   _CASE(obj, struct ibtrs_sess *, sessname),  \
+   TYPES_TO_SESSNAME(obj)  \
+   ))
+
+#define ibtrs_log(fn, obj, fmt, ...)   \
+   fn("<%s>: " fmt, ibtrs_prefix(obj), ##__VA_ARGS__)
+
+#define ibtrs_err(obj, fmt, ...)   \
+   ibtrs_log(pr_err, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_err_rl(obj, fmt, ...)\
+   ibtrs_log(pr_err_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn(obj, fmt, ...)   \
+   ibtrs_log(pr_warn, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_wrn_rl(obj, fmt, ...) \
+   ibtrs_log(pr_warn_ratelimited, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info(obj, fmt, ...) \
+   ibtrs_log(pr_info, obj, fmt, ##__VA_ARGS__)
+#define ibtrs_info_rl(obj, fmt, ...) \
+   ibtrs_log(pr_info_ratelimited, obj, fmt, ##__VA_ARGS__)
+
+#endif /* IBTRS_LOG_H */
diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
new file mode 100644
index ..b3b51af8607e
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-pri.h
@@ -0,0 +1,494 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but 

[PATCH 09/24] ibtrs: server: main functionality

2018-02-02 Thread Roman Pen
This is the main functionality of the ibtrs-server module, which accepts
a set of RDMA connections (a so-called IBTRS session), creates/destroys
sysfs entries associated with IBTRS session and notifies upper layer
(user of IBTRS API) about RDMA requests or link events.
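
The max_io_size module parameter added by this patch is validated in
max_io_size_set(). The same three checks can be tried out in userspace
with the sketch below. It assumes a 4 KiB PAGE_SIZE (MAX_REQ_SIZE equals
PAGE_SIZE in the patch), and the function name is made up for
illustration.

```c
#include <stdbool.h>

/* Assumption: 4 KiB pages; in the patch MAX_REQ_SIZE is PAGE_SIZE. */
#define ASSUMED_PAGE_SIZE 4096
#define MAX_REQ_SIZE ASSUMED_PAGE_SIZE

/*
 * Mirror of the checks in max_io_size_set():
 *   - at least 4096 bytes,
 *   - payload plus request header fits into 4 MiB,
 *   - payload plus request header is a multiple of 512.
 */
bool max_io_size_valid(int ival)
{
	/* Widen before adding so a huge ival cannot overflow int. */
	long total = (long)ival + MAX_REQ_SIZE;

	if (ival < 4096)
		return false;
	if (total > 4096L * 1024)
		return false;
	return total % 512 == 0;
}
```

Under this assumption the default of 128 KB passes, 5000 is rejected
because 5000 + 4096 is not a multiple of 512, and the largest accepted
value is 4 MiB minus one page.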

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-srv.c | 1811 ++
 1 file changed, 1811 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
new file mode 100644
index ..0d1fc08bd821
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-srv.c
@@ -0,0 +1,1811 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibtrs-srv.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Server");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+#define DEFAULT_MAX_IO_SIZE_KB 128
+#define DEFAULT_MAX_IO_SIZE (DEFAULT_MAX_IO_SIZE_KB * 1024)
+#define MAX_REQ_SIZE PAGE_SIZE
+#define MAX_SG_COUNT ((MAX_REQ_SIZE - sizeof(struct ibtrs_msg_rdma_read)) \
+ / sizeof(struct ibtrs_sg_desc))
+
+static int max_io_size = DEFAULT_MAX_IO_SIZE;
+static int rcv_buf_size = DEFAULT_MAX_IO_SIZE + MAX_REQ_SIZE;
+
+static int max_io_size_set(const char *val, const struct kernel_param *kp)
+{
+   int err, ival;
+
+   err = kstrtoint(val, 0, &ival);
+   if (err)
+   return err;
+
+   if (ival < 4096 || ival + MAX_REQ_SIZE > (4096 * 1024) ||
+   (ival + MAX_REQ_SIZE) % 512 != 0) {
+   pr_err("Invalid max io size value %d, has to be"
+  " > %d, < %d\n", ival, 4096, 4194304);
+   return -EINVAL;
+   }
+
+   max_io_size = ival;
+   rcv_buf_size = max_io_size + MAX_REQ_SIZE;
+   pr_info("max io size changed to %d\n", ival);
+
+   return 0;
+}
+
+static const struct kernel_param_ops max_io_size_ops = {
+   .set = max_io_size_set,
+   .get = param_get_int,
+};
+module_param_cb(max_io_size, &max_io_size_ops, &max_io_size, 0444);
+MODULE_PARM_DESC(max_io_size,
+"Max size for each IO request; when changed, the unit is bytes"
+" (default: " __stringify(DEFAULT_MAX_IO_SIZE_KB) "KB)");
+
+#define DEFAULT_SESS_QUEUE_DEPTH 512
+static int sess_queue_depth = DEFAULT_SESS_QUEUE_DEPTH;
+module_param_named(sess_queue_depth, sess_queue_depth, int, 0444);
+MODULE_PARM_DESC(sess_queue_depth,
+"Number of buffers for pending I/O requests to allocate"
+" per session. Maximum: " __stringify(MAX_SESS_QUEUE_DEPTH)
+" (default: " __stringify(DEFAULT_SESS_QUEUE_DEPTH) ")");
+
+/* We guarantee to serve 10 paths at least */
+#define CHUNK_POOL_SIZE (DEFAULT_SESS_QUEUE_DEPTH * 10)
+static mempool_t *chunk_pool;
+
+static int retry_count = 7;
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+   int err, ival;
+
+   err = kstrtoint(val, 0, &ival);
+   if (err)
+   return err;
+
+   if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT) {
+   pr_err("Invalid retry count value %d, has to be"
+  " > %d, < %d\n", ival, MIN_RTR_CNT, MAX_RTR_CNT);
+   return -EINVAL;
+   }
+
+   retry_count = ival;
+   pr_info("QP retry count changed to %d\n", ival);
+
+   return 0;
+}
+
+static const struct kernel_param_ops retry_count_ops = {
+   .set = retry_count_set,
+   .get = param_get_int,
+};
+module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
+
+MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
+" remote side didn't respond with Ack or Nack (default: 7,"
+" min: " __s

[PATCH 00/24] InfiniBand Transport (IBTRS) and Network Block Device (IBNBD)

2018-02-02 Thread Roman Pen
This series introduces IBNBD/IBTRS modules.

IBTRS (InfiniBand Transport) is a reliable high speed transport library
which allows for establishing connection between client and server
machines via RDMA. It is optimized to transfer (read/write) IO blocks
in the sense that it follows the BIO semantics of providing the
possibility to either write data from a scatter-gather list to the
remote side or to request ("read") data transfer from the remote side
into a given set of buffers.

IBTRS is multipath capable and provides I/O fail-over and load-balancing
functionality.
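
To make the fail-over and load-balancing behaviour more concrete, here is
a small userspace sketch of the two multipath policies named in the IBTRS
README (round-robin and min-inflight). The enum values follow
enum ibtrs_mp_policy from the client header; everything else (struct
layout, function name, a single shared round-robin cursor instead of a
per-CPU one) is a simplification made up for illustration and not the
actual client code.

```c
#include <stddef.h>

enum mp_policy {
	MP_POLICY_RR = 0,		/* round-robin */
	MP_POLICY_MIN_INFLIGHT = 1,	/* fewest inflight requests */
};

/* Hypothetical per-path state. */
struct path {
	int connected;	/* non-zero if the path is usable */
	int inflight;	/* requests currently in flight on this path */
};

/* Hypothetical session: an array of paths plus a round-robin cursor. */
struct session {
	struct path *paths;
	size_t path_cnt;
	size_t rr_next;
};

/*
 * Pick a path index for the next IO, or -1 if no path is connected.
 * Disconnected paths are skipped, which is what gives fail-over.
 */
int select_path(struct session *s, enum mp_policy policy)
{
	size_t i;
	int best = -1;

	if (policy == MP_POLICY_RR) {
		for (i = 0; i < s->path_cnt; i++) {
			size_t idx = (s->rr_next + i) % s->path_cnt;

			if (s->paths[idx].connected) {
				s->rr_next = (idx + 1) % s->path_cnt;
				return (int)idx;
			}
		}
		return -1;
	}

	/* MP_POLICY_MIN_INFLIGHT: scan for the least loaded usable path. */
	for (i = 0; i < s->path_cnt; i++) {
		if (!s->paths[i].connected)
			continue;
		if (best < 0 || s->paths[i].inflight < s->paths[best].inflight)
			best = (int)i;
	}
	return best;
}
```

With three paths where the middle one is down, round-robin alternates
between the first and the last path, and min-inflight picks whichever of
the two has fewer requests outstanding.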

IBNBD (InfiniBand Network Block Device) is a pair of kernel modules
(client and server) that allow for remote access of a block device on
the server over IBTRS protocol. After being mapped, the remote block
devices can be accessed on the client side as local block devices.
Internally IBNBD uses IBTRS as an RDMA transport library.

Why?

   - IBNBD/IBTRS is developed in order to map thin provisioned volumes,
 thus internal protocol is simple and consists of several request
 types only without awareness of underlying hardware devices.
   - IBTRS was developed as an independent RDMA transport library, which
 supports fail-over and load-balancing policies using multipath, thus
 it can be used for other IO needs as well, not only for block
 devices.
   - IBNBD/IBTRS is faster than NVME over RDMA.  Old comparison results:
 https://www.spinics.net/lists/linux-rdma/msg48799.html
 (I retested on the latest 4.14 kernel - there is no significant
  difference, thus I post the old link).

Key features of IBTRS transport library and IBNBD block device:

o High throughput and low latency due to:
   - Only two RDMA messages per IO.
   - IMM InfiniBand messages on responses to reduce round trip latency.
   - Simplified memory management: memory allocation happens once on
 server side when IBTRS session is established.

o IO fail-over and load-balancing by using multipath.

o Simple configuration of IBNBD:
   - Server side is completely passive: volumes do not need to be
 explicitly exported.
   - Only IB port GID and device path needed on client side to map
 a block device.
   - A device is remapped automatically, e.g. after a storage reboot.

This series is a second attempt; the first variant was published [1] and
presented at Vault in 2017 [2].

Since the first version the following was changed:

   - Load-balancing and IO fail-over using multipath features were added.
   - Major parts of the code were rewritten, simplified and overall code
 size was reduced by a quarter.

Commits for kernel can be found here:
   https://github.com/profitbricks/ibnbd/commits/linux-4.15-rc8

The out-of-tree modules are here:
   https://github.com/profitbricks/ibnbd/

[1] https://lwn.net/Articles/718181/
[2] http://events.linuxfoundation.org/sites/events/files/slides/IBNBD-Vault-2017.pdf

Roman Pen (24):
  ibtrs: public interface header to establish RDMA connections
  ibtrs: private headers with IBTRS protocol structs and helpers
  ibtrs: core: lib functions shared between client and server modules
  ibtrs: client: private header with client structs and functions
  ibtrs: client: main functionality
  ibtrs: client: statistics functions
  ibtrs: client: sysfs interface functions
  ibtrs: server: private header with server structs and functions
  ibtrs: server: main functionality
  ibtrs: server: statistics functions
  ibtrs: server: sysfs interface functions
  ibtrs: include client and server modules into kernel compilation
  ibtrs: a bit of documentation
  ibnbd: private headers with IBNBD protocol structs and helpers
  ibnbd: client: private header with client structs and functions
  ibnbd: client: main functionality
  ibnbd: client: sysfs interface functions
  ibnbd: server: private header with server structs and functions
  ibnbd: server: main functionality
  ibnbd: server: functionality for IO submission to file or block dev
  ibnbd: server: sysfs interface functions
  ibnbd: include client and server modules into kernel compilation
  ibnbd: a bit of documentation
  MAINTAINERS: Add maintainer for IBNBD/IBTRS modules

 MAINTAINERS|   14 +
 drivers/block/Kconfig  |2 +
 drivers/block/Makefile |1 +
 drivers/block/ibnbd/Kconfig|   22 +
 drivers/block/ibnbd/Makefile   |   13 +
 drivers/block/ibnbd/README |  272 ++
 drivers/block/ibnbd/ibnbd-clt-sysfs.c  |  723 +
 drivers/block/ibnbd/ibnbd-clt.c| 1959 +
 drivers/block/ibnbd/ibnbd-clt.h|  193 ++
 drivers/block/ibnbd/ibnbd-log.h|   71 +
 drivers/block/ibnbd/ibnbd-proto.h  |  360 +++
 drivers/block/ibnbd/ibnbd-srv-dev.c|  410 +++
 drivers/block/ibnbd/ibnbd-srv-dev.h|  149 +
 drivers/block/ibnbd/ibnbd-srv-sysfs.c  |  

[PATCH 03/24] ibtrs: core: lib functions shared between client and server modules

2018-02-02 Thread Roman Pen
This is a set of library functions, packaged as the ibtrs-core module
and used by the client and server modules.

These functions mainly wrap IB and RDMA calls and provide a slightly
higher-level abstraction for implementing the IBTRS protocol on the
client and server sides.

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs.c | 582 +++
 1 file changed, 582 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs.c b/drivers/infiniband/ulp/ibtrs/ibtrs.c
new file mode 100644
index ..007380506959
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs.c
@@ -0,0 +1,582 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibtrs-pri.h"
+#include "ibtrs-log.h"
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Core");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static LIST_HEAD(device_list);
+static DEFINE_MUTEX(device_list_mutex);
+
+struct ibtrs_iu *ibtrs_iu_alloc(u32 tag, size_t size, gfp_t gfp_mask,
+   struct ib_device *dma_dev,
+   enum dma_data_direction direction,
+   void (*done)(struct ib_cq *cq,
+struct ib_wc *wc))
+{
+   struct ibtrs_iu *iu;
+
+   iu = kmalloc(sizeof(*iu), gfp_mask);
+   if (unlikely(!iu))
+   return NULL;
+
+   iu->buf = kzalloc(size, gfp_mask);
+   if (unlikely(!iu->buf))
+   goto err1;
+
+   iu->dma_addr = ib_dma_map_single(dma_dev, iu->buf, size, direction);
+   if (unlikely(ib_dma_mapping_error(dma_dev, iu->dma_addr)))
+   goto err2;
+
+   iu->cqe.done  = done;
+   iu->size  = size;
+   iu->direction = direction;
+   iu->tag   = tag;
+
+   return iu;
+
+err2:
+   kfree(iu->buf);
+err1:
+   kfree(iu);
+
+   return NULL;
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_alloc);
+
+void ibtrs_iu_free(struct ibtrs_iu *iu, enum dma_data_direction dir,
+  struct ib_device *ibdev)
+{
+   if (!iu)
+   return;
+
+   ib_dma_unmap_single(ibdev, iu->dma_addr, iu->size, dir);
+   kfree(iu->buf);
+   kfree(iu);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_free);
+
+int ibtrs_iu_post_recv(struct ibtrs_con *con, struct ibtrs_iu *iu)
+{
+   struct ibtrs_sess *sess = con->sess;
+   struct ib_recv_wr wr, *bad_wr;
+   struct ib_sge list;
+
+   list.addr   = iu->dma_addr;
+   list.length = iu->size;
+   list.lkey   = sess->ib_dev->lkey;
+
+   if (WARN_ON(list.length == 0)) {
+   ibtrs_wrn(con, "Posting receive work request failed,"
+ " sg list is empty\n");
+   return -EINVAL;
+   }
+
+   wr.next= NULL;
+   wr.wr_cqe  = &iu->cqe;
+   wr.sg_list = &list;
+   wr.num_sge = 1;
+
+   return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_iu_post_recv);
+
+int ibtrs_post_recv_empty(struct ibtrs_con *con, struct ib_cqe *cqe)
+{
+   struct ib_recv_wr wr, *bad_wr;
+
+   wr.next= NULL;
+   wr.wr_cqe  = cqe;
+   wr.sg_list = NULL;
+   wr.num_sge = 0;
+
+   return ib_post_recv(con->qp, &wr, &bad_wr);
+}
+EXPORT_SYMBOL_GPL(ibtrs_post_recv_empty);
+
+int ibtrs_iu_post_send(struct ibtrs_con *con, struct ibtrs_iu *iu, size_t size)
+{
+   struct ibtrs_sess *sess = con->sess;
+   struct ib_send_wr wr, *bad_wr;
+   struct ib_sge list;
+
+   if ((WARN_ON(size == 0)))
+   return -EINVAL;
+
+   list.addr   = iu->dma_addr;
+   list.length = size;
+   list.lkey   = sess->ib_dev->lkey;
+
+   memset(&wr, 0, sizeof(wr));
+   wr.next   = NULL;
+   wr.wr_cqe = &iu->cqe;
+   wr.sg_list= &list;
+   wr.num_sge= 1;
+   wr.opcode 

[PATCH 06/24] ibtrs: client: statistics functions

2018-02-02 Thread Roman Pen
This introduces a set of functions used on the client side to account
statistics for RDMA data sent/received, the number of IOs in flight,
latency, CPU migrations, etc.  Almost all statistics are collected
using percpu variables.
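
To illustrate the latency accounting, here is a minimal userspace sketch
of power-of-two latency bucketing, modelled on ibtrs_clt_ms_to_id() in
this patch. The values of MIN_LOG_LAT and LOG_LAT_SZ are assumptions
(the real ones are defined elsewhere in the series), ilog2_ul() is a
stand-in for the kernel's ilog2(), and struct lat_hist is hypothetical;
the patch keeps one such histogram per CPU rather than a single one.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. */
#define MIN_LOG_LAT 0
#define LOG_LAT_SZ  8

/* floor(log2(v)) for v > 0; userspace stand-in for the kernel's ilog2(). */
static int ilog2_ul(unsigned long v)
{
	int r = -1;

	assert(v != 0);
	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* Map a latency in milliseconds to a histogram bucket index. */
static int lat_ms_to_bucket(unsigned long ms)
{
	int id = ms ? ilog2_ul(ms) - MIN_LOG_LAT + 1 : 0;

	/* Clamp to [0, LOG_LAT_SZ - 1], as clamp() does in the patch. */
	if (id < 0)
		id = 0;
	if (id > LOG_LAT_SZ - 1)
		id = LOG_LAT_SZ - 1;
	return id;
}

/* Hypothetical single histogram; the patch uses per-CPU instances. */
struct lat_hist {
	unsigned long bucket[LOG_LAT_SZ];
	unsigned long max_ms;
};

/* Count one completed IO and track the largest latency seen. */
static void lat_hist_update(struct lat_hist *h, unsigned long ms)
{
	h->bucket[lat_ms_to_bucket(ms)]++;
	if (h->max_ms < ms)
		h->max_ms = ms;
}
```

With these assumed constants, a 5 ms completion lands in bucket 3
(4..7 ms), and anything of 64 ms or more is folded into the last bucket.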

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c | 455 +
 1 file changed, 455 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
new file mode 100644
index ..af2ed05d2900
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-stats.c
@@ -0,0 +1,455 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-clt.h"
+
+static inline int ibtrs_clt_ms_to_id(unsigned long ms)
+{
+   int id = ms ? ilog2(ms) - MIN_LOG_LAT + 1 : 0;
+
+   return clamp(id, 0, LOG_LAT_SZ - 1);
+}
+
+void ibtrs_clt_update_rdma_lat(struct ibtrs_clt_stats *stats, bool read,
+  unsigned long ms)
+{
+   struct ibtrs_clt_stats_pcpu *s;
+   int id;
+
+   id = ibtrs_clt_ms_to_id(ms);
+   s = this_cpu_ptr(stats->pcpu_stats);
+   if (read) {
+   s->rdma_lat_distr[id].read++;
+   if (s->rdma_lat_max.read < ms)
+   s->rdma_lat_max.read = ms;
+   } else {
+   s->rdma_lat_distr[id].write++;
+   if (s->rdma_lat_max.write < ms)
+   s->rdma_lat_max.write = ms;
+   }
+}
+
+void ibtrs_clt_decrease_inflight(struct ibtrs_clt_stats *stats)
+{
+   atomic_dec(&stats->inflight);
+}
+
+void ibtrs_clt_update_wc_stats(struct ibtrs_clt_con *con)
+{
+   struct ibtrs_clt_sess *sess = to_clt_sess(con->c.sess);
+   struct ibtrs_clt_stats *stats = &sess->stats;
+   struct ibtrs_clt_stats_pcpu *s;
+   int cpu;
+
+   cpu = raw_smp_processor_id();
+   s = this_cpu_ptr(stats->pcpu_stats);
+   s->wc_comp.cnt++;
+   s->wc_comp.total_cnt++;
+   if (unlikely(con->cpu != cpu)) {
+   s->cpu_migr.to++;
+
+   /* Careful here, override s pointer */
+   s = per_cpu_ptr(stats->pcpu_stats, con->cpu);
+   atomic_inc(&s->cpu_migr.from);
+   }
+}
+
+void ibtrs_clt_inc_failover_cnt(struct ibtrs_clt_stats *stats)
+{
+   struct ibtrs_clt_stats_pcpu *s;
+
+   s = this_cpu_ptr(stats->pcpu_stats);
+   s->rdma.failover_cnt++;
+}
+
+static inline u32 ibtrs_clt_stats_get_avg_wc_cnt(struct ibtrs_clt_stats *stats)
+{
+   u32 cnt = 0;
+   u64 sum = 0;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   struct ibtrs_clt_stats_pcpu *s;
+
+   s = per_cpu_ptr(stats->pcpu_stats, cpu);
+   sum += s->wc_comp.total_cnt;
+   cnt += s->wc_comp.cnt;
+   }
+
+   return cnt ? sum / cnt : 0;
+}
+
+int ibtrs_clt_stats_wc_completion_to_str(struct ibtrs_clt_stats *stats,
+char *buf, size_t len)
+{
+   return scnprintf(buf, len, "%u\n",
+ibtrs_clt_stats_get_avg_wc_cnt(stats));
+}
+
+ssize_t ibtrs_clt_stats_rdma_lat_distr_to_str(struct ibtrs_clt_stats *stats,
+ char *page, size_t len)
+{
+   struct ibtrs_clt_stats_rdma_lat res[LOG_LAT_SZ];
+   struct ibtrs_clt_stats_rdma_lat max;
+   struct ibtrs_clt_stats_pcpu *s;
+
+   ssize_t cnt = 0;
+   int i, cpu;
+
+   max.write = 0;
+   max.read = 0;
+   for_each_possible_cpu(cpu) {
+   s = per_cpu_ptr(stats->pcpu_stats, cpu);
+
+   if (max.write < s->rdma_lat_max.write)
+   max.write = s->rdma_lat_max.write;
+   if (max.read < s->rdma_lat_max.read)
+   max.read = s->rdma_lat_max.read;
+   }
+   for (i = 0; i < ARRAY_SIZE(res); i++) {
+

[PATCH 05/24] ibtrs: client: main functionality

2018-02-02 Thread Roman Pen
This is the main functionality of the ibtrs-client module, which manages
a set of RDMA connections for each IBTRS session and does multipathing,
load balancing and failover of RDMA requests.
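
As a rough illustration of path selection with failover, here is a
userspace sketch of the two policies named in this series (round-robin
and min-inflight). It is not the patch's implementation: struct path,
struct path_set and pick_path() are hypothetical, and only the policy
names MP_POLICY_RR and MP_POLICY_MIN_INFLIGHT are taken from the client
code. Failover is modelled simply by skipping paths that are not
connected.

```c
#include <assert.h>
#include <stddef.h>

enum mp_policy { MP_POLICY_RR, MP_POLICY_MIN_INFLIGHT };

/* Hypothetical per-path state. */
struct path {
	int connected;          /* non-zero if the path can carry IO */
	unsigned int inflight;  /* IOs currently outstanding on this path */
};

struct path_set {
	struct path *paths;
	size_t nr;
	size_t next;            /* round-robin cursor */
};

/* Return the index of the chosen path, or -1 if no path is connected. */
static int pick_path(struct path_set *ps, enum mp_policy policy)
{
	int best = -1;
	size_t i;

	if (policy == MP_POLICY_RR) {
		/* Start at the cursor and skip disconnected paths. */
		for (i = 0; i < ps->nr; i++) {
			size_t idx = (ps->next + i) % ps->nr;

			if (ps->paths[idx].connected) {
				ps->next = (idx + 1) % ps->nr;
				return (int)idx;
			}
		}
		return -1;
	}

	/* MP_POLICY_MIN_INFLIGHT: connected path with the fewest IOs. */
	for (i = 0; i < ps->nr; i++) {
		if (!ps->paths[i].connected)
			continue;
		if (best < 0 || ps->paths[i].inflight < ps->paths[best].inflight)
			best = (int)i;
	}
	return best;
}
```

A caller would retry pick_path() after marking a failed path as
disconnected, which is the simplest form of IO fail-over.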

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt.c | 3496 ++
 1 file changed, 3496 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
new file mode 100644
index ..aa0a17f2a78c
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt.c
@@ -0,0 +1,3496 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include 
+#include 
+
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+#define RECONNECT_SEED 8
+#define MAX_SEGMENTS 31
+
+#define IBTRS_CONNECT_TIMEOUT_MS 5000
+
+MODULE_AUTHOR("ib...@profitbricks.com");
+MODULE_DESCRIPTION("IBTRS Client");
+MODULE_VERSION(IBTRS_VER_STRING);
+MODULE_LICENSE("GPL");
+
+static bool use_fr;
+module_param(use_fr, bool, 0444);
+MODULE_PARM_DESC(use_fr, "use FRWR mode for memory registration if possible."
+" (default: 0)");
+
+static ushort nr_cons_per_session;
+module_param(nr_cons_per_session, ushort, 0444);
+MODULE_PARM_DESC(nr_cons_per_session, "Number of connections per session."
+" (default: nr_cpu_ids)");
+
+static int retry_count = 7;
+
+static int retry_count_set(const char *val, const struct kernel_param *kp)
+{
+   int err, ival;
+
+   err = kstrtoint(val, 0, &ival);
+   if (err)
+   return err;
+
+   if (ival < MIN_RTR_CNT || ival > MAX_RTR_CNT)
+   return -EINVAL;
+
+   retry_count = ival;
+
+   return 0;
+}
+
+static const struct kernel_param_ops retry_count_ops = {
+   .set= retry_count_set,
+   .get= param_get_int,
+};
+module_param_cb(retry_count, &retry_count_ops, &retry_count, 0644);
+
+MODULE_PARM_DESC(retry_count, "Number of times to send the message if the"
+" remote side didn't respond with Ack or Nack (default: 7,"
+" min: " __stringify(MIN_RTR_CNT) ", max: "
+__stringify(MAX_RTR_CNT) ")");
+
+static int fmr_sg_cnt = 4;
+module_param_named(fmr_sg_cnt, fmr_sg_cnt, int, 0644);
+MODULE_PARM_DESC(fmr_sg_cnt, "when sg_cnt is bigger than fmr_sg_cnt, enable"
+" FMR (default: 4)");
+
+static struct workqueue_struct *ibtrs_wq;
+
+static void ibtrs_rdma_error_recovery(struct ibtrs_clt_con *con);
+static void ibtrs_clt_rdma_done(struct ib_cq *cq, struct ib_wc *wc);
+
+static inline void ibtrs_clt_state_lock(void)
+{
+   rcu_read_lock();
+}
+
+static inline void ibtrs_clt_state_unlock(void)
+{
+   rcu_read_unlock();
+}
+
+#define cmpxchg_min(var, new) ({   \
+   typeof(var) old;\
+   \
+   do {\
+   old = var;  \
+   new = (!old ? new : min_t(typeof(var), old, new));  \
+   } while (cmpxchg(&var, old, new) != old);   \
+})
+
+static void ibtrs_clt_set_min_queue_depth(struct ibtrs_clt *clt, size_t new)
+{
+   /* Can be updated from different sessions (paths), so cmpxchg */
+
+   cmpxchg_min(clt->queue_depth, new);
+}
+
+static void ibtrs_clt_set_min_io_size(struct ibtrs_clt *clt, size_t new)
+{
+   /* Can be updated from different sessions (paths), so cmpxchg */
+
+   cmpxchg_min(clt->max_io_size, new);
+}
+
+bool ibtrs_clt_sess_is_connected(const struct ibtrs_clt_sess *sess)
+{
+   return sess->state == IBTRS_CLT_CONNECTED;
+}
+
+static inline bool ibtrs_clt_is_connected(const struct ibtrs_clt *clt)
+{
+   struct ibtrs_cl

[PATCH 07/24] ibtrs: client: sysfs interface functions

2018-02-02 Thread Roman Pen
This is the sysfs interface to IBTRS sessions on the client side:

  /sys/kernel/ibtrs_client//
*** IBTRS session created by ibtrs_clt_open() API call
|
|- max_reconnect_attempts
|  *** number of reconnect attempts for session
|
|- add_path
|  *** adds another connection path into IBTRS session
|
|- paths//
   *** established paths to server in a session
   |
   |- disconnect
   |  *** disconnect path
   |
   |- reconnect
   |  *** reconnect path
   |
   |- remove_path
   |  *** remove current path
   |
   |- state
   |  *** retrieve current path state
   |
   |- stats/
  *** current path statistics
  |
  |- cpu_migration
  |- rdma
  |- rdma_lat
  |- reconnects
  |- reset_all
  |- sg_entries
  |- wc_completions

Signed-off-by: Roman Pen 
Signed-off-by: Danil Kipnis 
Cc: Jack Wang 
---
 drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c | 519 +
 1 file changed, 519 insertions(+)

diff --git a/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
new file mode 100644
index ..04949d6d796b
--- /dev/null
+++ b/drivers/infiniband/ulp/ibtrs/ibtrs-clt-sysfs.c
@@ -0,0 +1,519 @@
+/*
+ * InfiniBand Transport Layer
+ *
+ * Copyright (c) 2014 - 2017 ProfitBricks GmbH. All rights reserved.
+ * Authors: Fabian Holler 
+ *  Jack Wang 
+ *  Kleber Souza 
+ *  Danil Kipnis 
+ *  Roman Penyaev 
+ *  Milind Dumbare 
+ *
+ * Copyright (c) 2017 - 2018 ProfitBricks GmbH. All rights reserved.
+ * Authors: Danil Kipnis 
+ *  Roman Penyaev 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see .
+ */
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME " L" __stringify(__LINE__) ": " fmt
+
+#include "ibtrs-pri.h"
+#include "ibtrs-clt.h"
+#include "ibtrs-log.h"
+
+static struct kobject *ibtrs_kobj;
+
+#define MIN_MAX_RECONN_ATT -1
+#define MAX_MAX_RECONN_ATT 
+
+static struct kobj_type ktype = {
+   .sysfs_ops = &kobj_sysfs_ops,
+};
+
+static ssize_t ibtrs_clt_max_reconn_attempts_show(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ char *page)
+{
+   struct ibtrs_clt *clt;
+
+   clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+   return sprintf(page, "%d\n", ibtrs_clt_get_max_reconnect_attempts(clt));
+}
+
+static ssize_t ibtrs_clt_max_reconn_attempts_store(struct kobject *kobj,
+  struct kobj_attribute *attr,
+  const char *buf,
+  size_t count)
+{
+   struct ibtrs_clt *clt;
+   int value;
+   int ret;
+
+   clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+   ret = kstrtoint(buf, 10, &value);
+   if (unlikely(ret)) {
+   ibtrs_err(clt, "%s: failed to convert string '%s' to int\n",
+ attr->attr.name, buf);
+   return ret;
+   }
+   if (unlikely(value > MAX_MAX_RECONN_ATT ||
+value < MIN_MAX_RECONN_ATT)) {
+   ibtrs_err(clt, "%s: invalid range"
+ " (provided: '%s', accepted: min: %d, max: %d)\n",
+ attr->attr.name, buf, MIN_MAX_RECONN_ATT,
+ MAX_MAX_RECONN_ATT);
+   return -EINVAL;
+   }
+   ibtrs_clt_set_max_reconnect_attempts(clt, value);
+
+   return count;
+}
+
+static struct kobj_attribute ibtrs_clt_max_reconnect_attempts_attr =
+   __ATTR(max_reconnect_attempts, 0644,
+  ibtrs_clt_max_reconn_attempts_show,
+  ibtrs_clt_max_reconn_attempts_store);
+
+static ssize_t ibtrs_clt_mp_policy_show(struct kobject *kobj,
+   struct kobj_attribute *attr,
+   char *page)
+{
+   struct ibtrs_clt *clt;
+
+   clt = container_of(kobj, struct ibtrs_clt, kobj);
+
+   switch (clt->mp_policy) {
+   case MP_POLICY_RR:
+   return sprintf(page, "round-robin (RR: %d)\n", clt->mp_policy);
+   case MP_POLICY_MIN_INFLIGHT:
+   return sprintf(page, "min-inflight (MI:

Re: [dm-devel] [LSF/MM TOPIC] block: extend generic biosets to allow per-device frontpad

2018-02-02 Thread NeilBrown
On Mon, Jan 29 2018, Mike Snitzer wrote:

> I'd like to enable bio-based DM to _not_ need to clone bios.  But to do
> so each bio-based DM target's required per-bio-data would need to be
> provided by upper layer biosets (as opposed to the bioset DM currently
> creates).
>
> So my thinking is that all system-level biosets (e.g. fs_bio_set,
> blkdev_dio_pool) would redirect to a device specific variant bioset IFF
> the underlying device advertises the need for a specific per-bio-data
> payload to be provided.
>
> I know this _could_ become a rathole but I'd like to avoid reverting DM
> back to the days of having to worry about managing mempools for the
> purpose of per-io allocations.  I've grown spoiled by the performance
> and elegance that comes with having the bio and per-bio-data allocated
> from the same bioset.
>
> Thoughts?

md/raid0 remaps each bio and passes it directly down to one of several
devices.
I think your scheme would mean that it would need to clone each bio to
make sure it is from the correctly sized pool.

I suspect it could be made to work though.

1/ have a way for the driver receiving a bio to discover how much
   frontpad was allocated.
2/ require drivers to accept bios with any size of frontpad, but a
   fast-path is taken if it is already big enough.
3/ allow a block device to advertise its preferred frontpad.
4/ make sure your config-change-notification mechanism can communicate
   changes to this number.
5/ gather statistics on what percentage of bios have a too-small
   frontpad.

Then start modifying places that allocate bios to use the hint,
and when benchmarks show the percentage is high - use it to encourage
other people to allocate better bios.
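
A minimal userspace sketch of points 2/, 3/ and 5/ might look like the
following. All names here (struct fake_bio, struct fake_dev,
dev_accept_bio(), dev_too_small_percent()) are hypothetical and are not
block-layer APIs; the sketch only shows the fast-path check and the
statistic, not the actual cloning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio that records its allocated frontpad. */
struct fake_bio {
	size_t frontpad;
};

/* Hypothetical device that advertises a preferred frontpad (point 3/). */
struct fake_dev {
	size_t preferred_frontpad;
	unsigned long bios_seen;
	unsigned long bios_too_small;
};

/*
 * Point 2/: accept any bio. Return 1 if the frontpad is already big
 * enough (fast path), 0 if the driver would have to clone (slow path).
 */
static int dev_accept_bio(struct fake_dev *dev, const struct fake_bio *bio)
{
	dev->bios_seen++;
	if (bio->frontpad >= dev->preferred_frontpad)
		return 1;
	dev->bios_too_small++;
	return 0;
}

/* Point 5/: percentage of bios that arrived with a too-small frontpad. */
static unsigned int dev_too_small_percent(const struct fake_dev *dev)
{
	if (!dev->bios_seen)
		return 0;
	return (unsigned int)(dev->bios_too_small * 100 / dev->bios_seen);
}
```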

NeilBrown




Re: [dm-devel] [LSF/MM TOPIC] block, dm: restack queue_limits

2018-02-02 Thread NeilBrown
On Mon, Jan 29 2018, Mike Snitzer wrote:

> We currently don't restack the queue_limits if the lowest, or
> intermediate, layer of an IO stack changes.
>
> This is particularly unfortunate in the case of FLUSH/FUA which may
> change if/when a HW controller's BBU fails; whereby requiring the device
> advertise that it has a volatile write cache (WCE=1).
>
> But in the context of DM, really it'd be best if the entire stack of
> devices had their limits restacked if any underlying layer's limits
> change.
>
> In the past, Martin and I discussed that we should "just do it" but
> never did.  Not sure we need a lengthy discussion but figured I'd put it
> out there.

So much "yes"!!
Just a notifier chain would probably do.

I would like the notification to support changing the size of the device
too.
I see this as being two-stage.
1/ I'm going to change the device size to X - are you all OK with that?
2/ Device size is now X.

That allows md and dm to check that filesystems aren't going to get mad
when devices are made smaller, and can adjust (if they want to) when
devices get bigger.
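
The two-stage idea could be sketched in userspace C roughly as follows.
This is only an illustration under stated assumptions: struct
resize_listener, device_resize() and the example filesystem listener are
all hypothetical names, not an existing kernel notifier interface.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long long sector_cnt;

/* Hypothetical listener with one callback per stage. */
struct resize_listener {
	/* Stage 1: return 0 to accept the new size, non-zero to veto. */
	int (*prepare)(void *ctx, sector_cnt new_size);
	/* Stage 2: the size has changed; adjust if desired. */
	void (*commit)(void *ctx, sector_cnt new_size);
	void *ctx;
};

/* Ask every listener first; change the size only if nobody objects. */
static int device_resize(sector_cnt *cur_size, sector_cnt new_size,
			 const struct resize_listener *l, size_t nr)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (l[i].prepare && l[i].prepare(l[i].ctx, new_size))
			return -1;

	*cur_size = new_size;

	for (i = 0; i < nr; i++)
		if (l[i].commit)
			l[i].commit(l[i].ctx, new_size);
	return 0;
}

/* Example listener: a filesystem that refuses to lose used space. */
struct fs_state {
	sector_cnt used;
	sector_cnt seen_size;
};

static int fs_prepare(void *ctx, sector_cnt new_size)
{
	struct fs_state *fs = ctx;

	return new_size < fs->used;
}

static void fs_commit(void *ctx, sector_cnt new_size)
{
	struct fs_state *fs = ctx;

	fs->seen_size = new_size;
}
```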

Thanks,
NeilBrown

>
> Maybe I'll find time, between now and April, to try implementing it.
>
> Thanks,
> Mike




Re: [PATCH v4 00/13] bcache: device failure handling improvement

2018-02-02 Thread Coly Li
On 02/02/2018 5:52 AM, Michael Lyle wrote:
> On 01/27/2018 05:56 PM, Coly Li wrote:
>> Hi maintainers and folks,
> 
> Hi Coly---
> 
>> This patch set tries to improve bcache device failure handling, includes
>> cache device and backing device failures.
> 
> I've skimmed this whole patchset and overall it looks good.  I have
> taken some pieces (2,6) for possible-4.16.  I understand you'll be
> sending out a v5 soon-- I'll perform a more detailed review then.

Hi Mike,

Since you will pick patches 2 and 6 from the v4 patch set, I will not
include them in the v5 patch set, and I will rebase v5 against the latest
bcache-for-next.

Thanks.

Coly Li


Re: [PATCH v2] buffer: Avoid setting buffer bits that are already set

2018-02-02 Thread kemi
Hi, Jens
  Could you help to merge this patch to your tree? Thanks

On 2017-11-03 10:29, kemi wrote:
> 
> 
> On 2017-10-24 09:16, Kemi Wang wrote:
>> It's expensive to set buffer flags that are already set, because that
>> causes a costly cache line transition.
>>
>> A common case is setting the "verified" flag during ext4 writes.
>> This patch checks for the flag being set first.
>>
>> With the AIM7/creat-clo benchmark testing on a 48G ramdisk based-on ext4
>> file system, we see 3.3%(15431->15936) improvement of aim7.jobs-per-min on
>> a 2-sockets broadwell platform.
>>
>> What the benchmark does is: it forks 3000 processes, and each  process do
>> the following:
>> a) open a new file
>> b) close the file
>> c) delete the file
>> until loop=100*1000 times.
>>
>> The original patch is contributed by Andi Kleen.
>>
>> Signed-off-by: Andi Kleen 
>> Signed-off-by: Kemi Wang 
>> Tested-by: Kemi Wang 
>> Reviewed-by: Jens Axboe 
>> ---
> 
> Seems that this patch is still not merged. Anything wrong with that? thanks
> 
>>  include/linux/buffer_head.h | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index c8dae55..211d8f5 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -80,11 +80,14 @@ struct buffer_head {
>>  /*
>>   * macro tricks to expand the set_buffer_foo(), clear_buffer_foo()
>>   * and buffer_foo() functions.
>> + * To avoid reset buffer flags that are already set, because that causes
>> + * a costly cache line transition, check the flag first.
>>   */
>>  #define BUFFER_FNS(bit, name)   \
>>  static __always_inline void set_buffer_##name(struct buffer_head *bh)   \
>>  {   \
>> -set_bit(BH_##bit, &(bh)->b_state);  \
>> +if (!test_bit(BH_##bit, &(bh)->b_state))\
>> +set_bit(BH_##bit, &(bh)->b_state);  \
>>  }   \
>>  static __always_inline void clear_buffer_##name(struct buffer_head *bh) \
>>  {   \
>>
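
For readers who want to try the test-before-set pattern outside the
kernel, here is a minimal userspace sketch using C11 atomics. It is an
analogue of the quoted change, not the kernel code: flag_test() and
flag_set_if_clear() are hypothetical names, and relaxed ordering is an
assumption chosen to mirror an unordered bit operation. The point is that
the plain load lets the cache line stay shared when the bit is already
set, and the read-modify-write happens only when it is actually needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Read-only check: does not require exclusive ownership of the line. */
static inline bool flag_test(atomic_ulong *word, unsigned int bit)
{
	return (atomic_load_explicit(word, memory_order_relaxed) >> bit) & 1UL;
}

/* Issue the atomic read-modify-write only if the bit is still clear. */
static inline void flag_set_if_clear(atomic_ulong *word, unsigned int bit)
{
	if (!flag_test(word, bit))
		atomic_fetch_or_explicit(word, 1UL << bit,
					 memory_order_relaxed);
}
```

Two threads may both see the bit clear and both perform the OR; that is
harmless here because setting an already-set bit is idempotent.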