Re: [PATCH 3/3] NVMe: Convert to blk-mq

2013-10-22 Thread Keith Busch

On Tue, 22 Oct 2013, Matias Bjorling wrote:

On 22-10-2013 18:55, Keith Busch wrote:

On Fri, 18 Oct 2013, Matias Bjørling wrote:

On 10/18/2013 05:13 PM, Keith Busch wrote:

On Fri, 18 Oct 2013, Matias Bjorling wrote:

The nvme driver implements itself as a bio-based driver. This is
primarily because of high lock congestion for high-performance NVM
devices. To remove the congestion within the traditional block layer,
a multi-queue block layer is being implemented.



-result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
-if (result <= 0)
+if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
goto free_cmdid;
-length = result;

-cmnd->rw.command_id = cmdid;
+length = blk_rq_bytes(rq);
+
+cmnd->rw.command_id = rq->tag;


The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here, but
namespaces share submission queues, what's stopping the tags for
commands sent to namespace 1 from clashing with tags for namespace 2?

I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.


You're right. I didn't see the connection. In v3 I'll push struct 
request_queue to nvme_dev and map the queues appropriately. It will also 
fix the command id issues.


Just anticipating a possible issue with the suggestion. Will this
separate the logical block size from the request_queue? Each namespace
can have a different format, so the block size and request_queue can't
be tied together as they currently are for this to work.


If only a couple of different logical block sizes are to be expected
(1-4), we could keep a list of already-initialized request queues and
reuse the one that matches?


The spec allows a namespace to have up to 16 different block formats and
they need not be the same 16 as another namespace on the same device.
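
To make that concrete: the block size is derived from each namespace's
own Identify Namespace data, so it's inherently a per-namespace property
of the request_queue. A sketch along the lines of what the driver
already does today (field names follow the NVMe identify structures;
the helper name is illustrative):

static void nvme_ns_set_block_size(struct nvme_ns *ns,
				   struct nvme_id_ns *id)
{
	/* FLBAS selects one of up to 16 LBA formats for this namespace */
	int lbaf = id->flbas & 0xf;

	/* lbaf[].ds is log2 of the LBA data size for that format */
	ns->lba_shift = id->lbaf[lbaf].ds;
	blk_queue_logical_block_size(ns->queue, 1 << ns->lba_shift);
}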


From a practical standpoint, I don't think devices will support more
than a few formats, but even if you kept it to that many request queues,
you just get back to conflicting command id tags and some wasted memory.
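
One way out is a single tag set owned by the device, with every
namespace's request_queue allocating from it; rq->tag then stays unique
per hardware queue no matter which namespace submitted the request. A
rough sketch, assuming the blk-mq API as it later stabilized in
mainline (nvme_mq_ops and the fields on nvme_dev are illustrative, not
from this patch):

struct nvme_dev {
	struct blk_mq_tag_set tagset;		/* one per controller */
	/* ... */
};

static int nvme_dev_setup_tags(struct nvme_dev *dev)
{
	dev->tagset.ops = &nvme_mq_ops;		/* queue_rq etc. */
	dev->tagset.nr_hw_queues = dev->queue_count;
	dev->tagset.queue_depth = dev->q_depth - 1;
	dev->tagset.cmd_size = sizeof(struct nvme_cmd_info);
	return blk_mq_alloc_tag_set(&dev->tagset);
}

/* Each namespace keeps its own request_queue (and with it its own
 * logical block size), but all of them draw tags from dev->tagset. */
static struct request_queue *nvme_ns_alloc_queue(struct nvme_dev *dev)
{
	return blk_mq_init_queue(&dev->tagset);
}

Requests are then preallocated once per tag set instead of once per
namespace, which also takes care of the wasted memory.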



Axboe, do you know of a better solution?


Re: [PATCH 3/3] NVMe: Convert to blk-mq

2013-10-22 Thread Matias Bjorling

On 22-10-2013 18:55, Keith Busch wrote:

On Fri, 18 Oct 2013, Matias Bjørling wrote:

On 10/18/2013 05:13 PM, Keith Busch wrote:

On Fri, 18 Oct 2013, Matias Bjorling wrote:
The nvme driver implements itself as a bio-based driver. This is
primarily because of high lock congestion for high-performance NVM
devices. To remove the congestion within the traditional block layer,
a multi-queue block layer is being implemented.



-result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
-if (result <= 0)
+if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
goto free_cmdid;
-length = result;

-cmnd->rw.command_id = cmdid;
+length = blk_rq_bytes(rq);
+
+cmnd->rw.command_id = rq->tag;


The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here, but
namespaces share submission queues, what's stopping the tags for
commands sent to namespace 1 from clashing with tags for namespace 2?

I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.


You're right. I didn't see the connection. In v3 I'll push struct 
request_queue to nvme_dev and map the queues appropriately. It will 
also fix the command id issues.


Just anticipating a possible issue with the suggestion. Will this
separate the logical block size from the request_queue? Each namespace
can have a different format, so the block size and request_queue can't
be tied together as they currently are for this to work.


If only a couple of different logical block sizes are to be expected
(1-4), we could keep a list of already-initialized request queues and
reuse the one that matches?


Axboe, do you know of a better solution?


Re: [PATCH 3/3] NVMe: Convert to blk-mq

2013-10-22 Thread Keith Busch

On Fri, 18 Oct 2013, Matias Bjørling wrote:

On 10/18/2013 05:13 PM, Keith Busch wrote:

On Fri, 18 Oct 2013, Matias Bjorling wrote:

The nvme driver implements itself as a bio-based driver. This is
primarily because of high lock congestion for high-performance NVM
devices. To remove the congestion within the traditional block layer,
a multi-queue block layer is being implemented.



-result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
-if (result <= 0)
+if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
goto free_cmdid;
-length = result;

-cmnd->rw.command_id = cmdid;
+length = blk_rq_bytes(rq);
+
+cmnd->rw.command_id = rq->tag;


The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here, but
namespaces share submission queues, what's stopping the tags for
commands sent to namespace 1 from clashing with tags for namespace 2?

I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.


You're right. I didn't see the connection. In v3 I'll push struct 
request_queue to nvme_dev and map the queues appropriately. It will also fix 
the command id issues.


Just anticipating a possible issue with the suggestion. Will this
separate the logical block size from the request_queue? Each namespace
can have a different format, so the block size and request_queue can't
be tied together as they currently are for this to work.


Re: [PATCH 3/3] NVMe: Convert to blk-mq

2013-10-18 Thread Matias Bjørling

On 10/18/2013 05:13 PM, Keith Busch wrote:

On Fri, 18 Oct 2013, Matias Bjorling wrote:

The nvme driver implements itself as a bio-based driver. This is
primarily because of high lock congestion for high-performance NVM
devices. To remove the congestion within the traditional block layer,
a multi-queue block layer is being implemented.

This patch converts the current bio-based approach to work with the
request-based approach found in the multi-queue block layer. This means
that bio responsibility is moved from the driver into the block layer.
In return, the block layer packs request structures and submits them to
the nvme driver according to the features/limits of the nvme hardware.

The patch consists of:
* Initialization of multi-queue data structures
* Conversion of bio function calls into request function calls.
* Separate cmdid paths for admin and normal queues.
* Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
  by blk-mq.
* Uses the blk-mq timeout framework where possible.

Signed-off-by: Matias Bjorling m...@bjorling.me
---
drivers/block/nvme-core.c | 765 +++---
drivers/block/nvme-scsi.c |  39 +--
include/linux/nvme.h  |   7 +-
3 files changed, 385 insertions(+), 426 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index e99a30a..36bf45c 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c


[snip]


-static void nvme_start_io_acct(struct bio *bio)
+static void nvme_start_io_acct(struct request *rq)
 {
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	const int rw = bio_data_dir(bio);
+	struct gendisk *disk = rq->rq_disk;
+	const int rw = rq_data_dir(rq);
 	int cpu = part_stat_lock();
 	part_round_stats(cpu, &disk->part0);
 	part_stat_inc(cpu, &disk->part0, ios[rw]);
-	part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
+	part_stat_add(cpu, &disk->part0, sectors[rw], blk_rq_sectors(rq));
 	part_inc_in_flight(&disk->part0, rw);
 	part_stat_unlock();
 }

-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
+static void nvme_end_io_acct(struct request *rq, unsigned long start_time)
 {
-	struct gendisk *disk = bio->bi_bdev->bd_disk;
-	const int rw = bio_data_dir(bio);
+	struct gendisk *disk = rq->rq_disk;
+	const int rw = rq_data_dir(rq);
 	unsigned long duration = jiffies - start_time;
 	int cpu = part_stat_lock();
 	part_stat_add(cpu, &disk->part0, ticks[rw], duration);
@@ -342,23 +370,26 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
 	part_stat_unlock();
 }


I think you can remove the io accounting, right? These were added here
because the diskstats are not updated in the block layer for bio-based
block drivers.


Yeap, I'll make a patch for the next version that removes them.




@@ -715,32 +606,47 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
dma_dir = DMA_FROM_DEVICE;
}

-result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
-if (result <= 0)
+if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
goto free_cmdid;
-length = result;

-cmnd->rw.command_id = cmdid;
+length = blk_rq_bytes(rq);
+
+cmnd->rw.command_id = rq->tag;


The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here, but
namespaces share submission queues, what's stopping the tags for
commands sent to namespace 1 from clashing with tags for namespace 2?

I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.


You're right. I didn't see the connection. In v3 I'll push struct 
request_queue to nvme_dev and map the queues appropriately. It will also 
fix the command id issues.




Do you know how/if this is planned to work with scsi? Will there be one
blk-mq per LUN or per host controller?


Christoph Hellwig and Nicholas Bellinger are working on scsi-mq.

https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/log/?h=scsi-mq

I think they will map it out per controller. That'll be the most natural.



Re: [PATCH 3/3] NVMe: Convert to blk-mq

2013-10-18 Thread Keith Busch

On Fri, 18 Oct 2013, Matias Bjorling wrote:

The nvme driver implements itself as a bio-based driver. This is
primarily because of high lock congestion for high-performance NVM
devices. To remove the congestion within the traditional block layer,
a multi-queue block layer is being implemented.

This patch converts the current bio-based approach to work with the
request-based approach found in the multi-queue block layer. This means
that bio responsibility is moved from the driver into the block layer.
In return, the block layer packs request structures and submits them to
the nvme driver according to the features/limits of the nvme hardware.

The patch consists of:
* Initialization of multi-queue data structures
* Conversion of bio function calls into request function calls.
* Separate cmdid paths for admin and normal queues.
* Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
  by blk-mq.
* Uses the blk-mq timeout framework where possible.

Signed-off-by: Matias Bjorling m...@bjorling.me
---
drivers/block/nvme-core.c | 765 +++---
drivers/block/nvme-scsi.c |  39 +--
include/linux/nvme.h  |   7 +-
3 files changed, 385 insertions(+), 426 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index e99a30a..36bf45c 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c


[snip]


-static void nvme_start_io_acct(struct bio *bio)
+static void nvme_start_io_acct(struct request *rq)
{
-   struct gendisk *disk = bio->bi_bdev->bd_disk;
-   const int rw = bio_data_dir(bio);
+   struct gendisk *disk = rq->rq_disk;
+   const int rw = rq_data_dir(rq);
int cpu = part_stat_lock();
	part_round_stats(cpu, &disk->part0);
	part_stat_inc(cpu, &disk->part0, ios[rw]);
-	part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
+	part_stat_add(cpu, &disk->part0, sectors[rw], blk_rq_sectors(rq));
	part_inc_in_flight(&disk->part0, rw);
part_stat_unlock();
}

-static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
+static void nvme_end_io_acct(struct request *rq, unsigned long start_time)
{
-   struct gendisk *disk = bio->bi_bdev->bd_disk;
-   const int rw = bio_data_dir(bio);
+   struct gendisk *disk = rq->rq_disk;
+   const int rw = rq_data_dir(rq);
unsigned long duration = jiffies - start_time;
int cpu = part_stat_lock();
	part_stat_add(cpu, &disk->part0, ticks[rw], duration);
@@ -342,23 +370,26 @@ static void nvme_end_io_acct(struct bio *bio, unsigned long start_time)
part_stat_unlock();
}


I think you can remove the io accounting, right? These were added here
because the diskstats are not updated in the block layer for bio-based
block drivers.


@@ -715,32 +606,47 @@ static int nvme_submit_bio_queue(struct nvme_queue *nvmeq, struct nvme_ns *ns,
dma_dir = DMA_FROM_DEVICE;
}

-   result = nvme_map_bio(nvmeq, iod, bio, dma_dir, psegs);
-   if (result <= 0)
+   if (nvme_map_rq(nvmeq, iod, rq, dma_dir))
goto free_cmdid;
-   length = result;

-   cmnd->rw.command_id = cmdid;
+   length = blk_rq_bytes(rq);
+
+   cmnd->rw.command_id = rq->tag;


The command ids have to be unique on a submission queue. Since each
namespace's blk-mq has its own 'tags' used as command ids here, but
namespaces share submission queues, what's stopping the tags for
commands sent to namespace 1 from clashing with tags for namespace 2?

I think this would work better if one blk-mq was created per device
rather than namespace. It would fix the tag problem above and save a
lot of memory potentially wasted on millions of requests allocated that
can't be used.

Do you know how/if this is planned to work with scsi? Will there be one
blk-mq per LUN or per host controller?


[PATCH 3/3] NVMe: Convert to blk-mq

2013-10-18 Thread Matias Bjorling
The nvme driver implements itself as a bio-based driver. This is
primarily because of high lock congestion for high-performance NVM
devices. To remove the congestion within the traditional block layer,
a multi-queue block layer is being implemented.

This patch converts the current bio-based approach to work with the
request-based approach found in the multi-queue block layer. This means
that bio responsibility is moved from the driver into the block layer.
In return, the block layer packs request structures and submits them to
the nvme driver according to the features/limits of the nvme hardware.

The patch consists of:
 * Initialization of multi-queue data structures
 * Conversion of bio function calls into request function calls.
 * Separate cmdid paths for admin and normal queues.
 * Bio splits according to NOT_VIRT_MERGEABLE are assumed to be handled
   by blk-mq.
 * Uses the blk-mq timeout framework where possible.

Signed-off-by: Matias Bjorling m...@bjorling.me
---
 drivers/block/nvme-core.c | 765 +++---
 drivers/block/nvme-scsi.c |  39 +--
 include/linux/nvme.h  |   7 +-
 3 files changed, 385 insertions(+), 426 deletions(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index e99a30a..36bf45c 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -17,9 +17,10 @@
  */
 
 #include <linux/nvme.h>
-#include <linux/bio.h>
+#include <linux/atomic.h>
 #include <linux/bitops.h>
 #include <linux/blkdev.h>
+#include <linux/blk-mq.h>
 #include <linux/delay.h>
 #include <linux/errno.h>
 #include <linux/fs.h>
@@ -60,6 +61,23 @@ static DEFINE_SPINLOCK(dev_list_lock);
 static LIST_HEAD(dev_list);
 static struct task_struct *nvme_thread;
 
+typedef void (*nvme_completion_fn)(struct nvme_dev *, void *,
+   struct nvme_completion *);
+
+struct nvme_cmd_info {
+   nvme_completion_fn fn;
+   void *ctx;
+   unsigned long timeout;
+   struct blk_mq_hw_ctx *hctx;
+};
+
+struct nvme_admin_queue {
+   wait_queue_head_t sq_full;
+   wait_queue_t sq_cong_wait;
+   atomic_t sq_cmdid;
+   struct nvme_cmd_info infos[NVME_ADMIN_Q_DEPTH];
+};
+
 /*
  * An NVM Express queue.  Each device has at least two (one for admin
  * commands and one for I/O commands).
@@ -68,13 +86,11 @@ struct nvme_queue {
struct device *q_dmadev;
struct nvme_dev *dev;
spinlock_t q_lock;
+   struct blk_mq_hw_ctx *hctx;
struct nvme_command *sq_cmds;
volatile struct nvme_completion *cqes;
dma_addr_t sq_dma_addr;
dma_addr_t cq_dma_addr;
-   wait_queue_head_t sq_full;
-   wait_queue_t sq_cong_wait;
-   struct bio_list sq_cong;
u32 __iomem *q_db;
u16 q_depth;
u16 cq_vector;
@@ -84,7 +100,7 @@ struct nvme_queue {
u8 cq_phase;
u8 cqe_seen;
u8 q_suspended;
-   unsigned long cmdid_data[];
+   struct nvme_admin_queue *admin;
 };
 
 /*
@@ -105,65 +121,72 @@ static inline void _nvme_check_size(void)
BUILD_BUG_ON(sizeof(struct nvme_smart_log) != 512);
 }
 
-typedef void (*nvme_completion_fn)(struct nvme_dev *, void *,
-   struct nvme_completion *);
+static inline struct nvme_cmd_info *nvme_get_rq_from_id(struct nvme_queue
+   *nvmeq, unsigned int cmdid)
+{
+   struct blk_mq_hw_ctx *hctx = nvmeq->hctx;
+   struct request *rq = blk_mq_tag_to_rq(hctx, cmdid);
 
-struct nvme_cmd_info {
-   nvme_completion_fn fn;
-   void *ctx;
-   unsigned long timeout;
-};
+   return rq->special;
+}
 
-static struct nvme_cmd_info *nvme_cmd_info(struct nvme_queue *nvmeq)
+static inline struct nvme_cmd_info *nvme_get_cmd_from_rq(struct request *rq)
 {
-   return (void *)&nvmeq->cmdid_data[BITS_TO_LONGS(nvmeq->q_depth)];
+   return rq->special;
+}
+
+static inline struct request *nvme_get_rq_from_cmd(struct nvme_cmd_info *info)
+{
+   return blk_mq_rq_from_pdu(info);
 }
 
 static unsigned nvme_queue_extra(int depth)
 {
-   return DIV_ROUND_UP(depth, 8) + (depth * sizeof(struct nvme_cmd_info));
+   return DIV_ROUND_UP(depth, 8);
 }
 
 /**
- * alloc_cmdid() - Allocate a Command ID
- * @nvmeq: The queue that will be used for this command
+ * configure_cmd_info() - Configure Command Information
+ * @info: The cmd_info that will be used for this command
  * @ctx: A pointer that will be passed to the handler
  * @handler: The function to call on completion
  *
- * Allocate a Command ID for a queue.  The data passed in will
+ * Configure a cmd info for a queue.  The data passed in will
  * be passed to the completion handler.  This is implemented by using
  * the bottom two bits of the ctx pointer to store the handler ID.
  * Passing in a pointer that's not 4-byte aligned will cause a BUG.
  * We can change this if it becomes a problem.
  *
- * May be called with local interrupts disabled and the q_lock held,
- * or with interrupts enabled and no locks held.
  */
-static int alloc_cmdid(struct nvme_queue *nvmeq, void *ctx,
+static inline void configure_cmd_info(struct 
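
For orientation, the heart of the conversion is the queue_rq hook that
replaces the bio-based nvme_make_request/nvme_submit_bio_queue path:
blk-mq hands the driver fully formed requests, and rq->tag takes over
from the cmdid allocator. A minimal sketch, assuming the blk-mq API as
it later stabilized in mainline; the PRP/DMA mapping and error handling
are omitted and the helper names are illustrative, not this patch's
exact code:

static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
				  const struct blk_mq_queue_data *bd)
{
	struct nvme_ns *ns = hctx->queue->queuedata;
	struct nvme_queue *nvmeq = hctx->driver_data;
	struct request *rq = bd->rq;
	struct nvme_command cmnd = { };

	cmnd.rw.opcode = rq_data_dir(rq) ? nvme_cmd_write : nvme_cmd_read;
	cmnd.rw.nsid = cpu_to_le32(ns->ns_id);
	cmnd.rw.command_id = rq->tag;	/* the tag doubles as the command id */
	cmnd.rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(rq)));
	cmnd.rw.length = cpu_to_le16((blk_rq_bytes(rq) >> ns->lba_shift) - 1);

	blk_mq_start_request(rq);	/* arms the blk-mq timeout */
	nvme_submit_cmd(nvmeq, &cmnd);	/* copy into the SQ, ring the doorbell */
	return BLK_STS_OK;
}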
