Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API
Regardless of the API discussion, the nvme-tcp bits look straightforward: Acked-by: Sagi Grimberg
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
request tag can't be zero? I forget... Of course it can. But the reserved tags are before the normal tags, so 0 would be a reserved tag for nvme. Right.
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
It is, but in this situation, the controller is sending a second completion that results in a use-after-free, which makes the transport irrelevant. Unless there is some other flow (which is unclear to me) that causes this, which is a bug that needs to be fixed rather than hidden with a safeguard. The kernel should not crash regardless of any network traffic that is sent to the system. It should not be possible to either intentionally or mistakenly construct packets that will deny service in this way. This is not specific to nvme-tcp. I can build an rdma or pci controller that can trigger the same crash... I saw a similar patch from Hannes implemented at the scsi level, and not in the individual scsi transports.. If scsi wants this too, this could be made generic at the blk-mq level. We just need to make something like blk_mq_tag_to_rq(), but return NULL if the request isn't started. Makes sense... I would also mention that a crash is not even the scariest issue that we can see here, because if the request happened to be reused we are in the silent data corruption realm... If this does happen, I think we have to come up with some way to mitigate it. We're not utilizing the full 16 bits of the command_id, so maybe we can append something like a generation sequence number that can be checked for validity. That's actually a great idea. scsi needs unique tags so it encodes the hwq in the upper 16 bits, giving the actual tag the lower 16 bits, which is more than enough for a single queue. We can do the same with a genctr that will increment on both submission and completion, and we can validate against it. This will be useful for all transports, so maintaining it in nvme_req(rq)->genctr and introducing a helper like: rq = nvme_find_tag(tagset, cqe->command_id) that will filter by genctr and locate the request. Also: nvme_validate_request_gen(rq, cqe->command_id) that would compare against it.
And then a helper to set the command_id like: cmd->common.command_id = nvme_request_command_id(rq) that will both increment the genctr and build a command_id from it. Thoughts?
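The genctr scheme proposed above can be sketched in a few lines of plain C. This is only an illustration of the idea, not the kernel implementation: the 4/12-bit split of the 16-bit command_id, the struct, and the function names are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of the 16-bit command_id: upper 4 bits carry a
 * per-request generation counter, lower 12 bits carry the blk-mq tag.
 * The field widths are illustrative assumptions, not from the spec. */
#define CMDID_GEN_SHIFT 12
#define CMDID_TAG_MASK  ((1u << CMDID_GEN_SHIFT) - 1)

struct fake_req {
	uint8_t genctr;   /* bumped on every submission */
	uint16_t tag;     /* blk-mq tag, stable across reuse */
};

/* Build a command_id from the request, bumping the generation first,
 * in the spirit of the proposed nvme_request_command_id() helper. */
static uint16_t request_command_id(struct fake_req *rq)
{
	rq->genctr++;
	return (uint16_t)((rq->genctr & 0xf) << CMDID_GEN_SHIFT) |
	       (rq->tag & CMDID_TAG_MASK);
}

/* Validate a completion's command_id against the request's current
 * generation, in the spirit of nvme_validate_request_gen(). */
static int command_id_is_current(const struct fake_req *rq, uint16_t command_id)
{
	return (command_id >> CMDID_GEN_SHIFT) == (rq->genctr & 0xf) &&
	       (command_id & CMDID_TAG_MASK) == rq->tag;
}
```

The point of the scheme: a stale or duplicated completion carries the previous generation, so it fails validation even though its tag still maps to a (possibly reused) request.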
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
What we can do, though, is checking the 'state' field in the tcp request, and only allow completions for commands which are in a state allowing for completions. Let's see if I can whip up a patch. That would be great. BTW in the crash dump I am looking at now, it looks like pdu->command_id was zero in nvme_tcp_recv_data(), and blk_mq_tag_to_rq() returned a request struct that had not been used. So I think we do need to check that the tag was actually allocated. request tag can't be zero? I forget...
Re: [PATCH] nvme: Export fast_io_fail_tmo to sysfs
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
It is, but in this situation, the controller is sending a second completion that results in a use-after-free, which makes the transport irrelevant. Unless there is some other flow (which is unclear to me) that causes this, which is a bug that needs to be fixed rather than hidden with a safeguard. The kernel should not crash regardless of any network traffic that is sent to the system. It should not be possible to either intentionally or mistakenly construct packets that will deny service in this way. This is not specific to nvme-tcp. I can build an rdma or pci controller that can trigger the same crash... I saw a similar patch from Hannes implemented at the scsi level, and not in the individual scsi transports.. I would also mention that a crash is not even the scariest issue that we can see here, because if the request happened to be reused we are in the silent data corruption realm...
Re: [PATCH v2] nvme: disallow passthru cmd from targeting a nsid != nsid of the block dev
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] infiniband: Fix a use after free in isert_connect_request
Acked-by: Sagi Grimberg
Re: [PATCH] nvme/rdma: Fix a use after free in nvmet_rdma_write_data_done
In nvmet_rdma_write_data_done, rsp is recovered via wc->wr_cqe and freed by nvmet_rdma_release_rsp(). But after that, pr_info() used the freed chunk's member object and could leak the freed chunk's address via wc->wr_cqe by computing the offset.

Signed-off-by: Lv Yunlong
---
 drivers/nvme/target/rdma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 06b6b742bb21..6c1f3ab7649c 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -802,9 +802,8 @@ static void nvmet_rdma_write_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	nvmet_req_uninit(&rsp->req);
 	nvmet_rdma_release_rsp(rsp);
 	if (wc->status != IB_WC_WR_FLUSH_ERR) {
-		pr_info("RDMA WRITE for CQE 0x%p failed with status %s (%d).\n",
-			wc->wr_cqe, ib_wc_status_msg(wc->status),
-			wc->status);
+		pr_info("RDMA WRITE for CQE failed with status %s (%d).\n",
+			ib_wc_status_msg(wc->status), wc->status);

There is nothing wrong with this reference, wr_cqe is a valid reference and I think Israel added this for some extra information that may be useful to him. Israel? you ok with this removed?

 		nvmet_rdma_error_comp(queue);
 	}
 	return;
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
Hi Sagi, On Fri, Mar 05, 2021 at 11:57:30AM -0800, Sagi Grimberg wrote: Daniel, again, there is nothing specific about this to nvme-tcp, this is a safeguard against a funky controller (or a different bug that is hidden by this). As far as I can tell, the main difference between nvme-tcp and FC/NVMe is that nvme-tcp does not have FW or a big driver which filters out some noise from a misbehaving controller. I haven't really checked the other transports but I wouldn't be surprised if they share the same properties as FC/NVMe. The same can happen in any other transport so I would suggest that if this is a safeguard we want to put in place, we should make it a generic one. i.e. nvme_tag_to_rq() that _all_ transports call consistently. Okay, I'll review all the relevant code and see what could be made more generic and consistent. Though I think nvme-tcp plays in a different league as it is exposed to normal networking traffic and this is a very hostile environment. It is, but in this situation, the controller is sending a second completion that results in a use-after-free, which makes the transport irrelevant. Unless there is some other flow (which is unclear to me) that causes this, which is a bug that needs to be fixed rather than hidden with a safeguard.
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() always returns a request if the tag id is in a valid range [0...max_tags). If the target replies with a tag for which we don't have a request, or it's not started, the host will likely corrupt data or simply crash. Add an additional check whether the request has been started; if not, reset the connection. This additional check will not protect against an invalid tag which maps to a request which has been started. There is nothing we can do about this. Though it will at least protect the host from crashing, which is generally thought to be the right thing to do. Daniel, again, there is nothing specific about this to nvme-tcp, this is a safeguard against a funky controller (or a different bug that is hidden by this). The same can happen in any other transport so I would suggest that if this is a safeguard we want to put in place, we should make it a generic one. i.e. nvme_tag_to_rq() that _all_ transports call consistently.
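The generic guard suggested above ("like blk_mq_tag_to_rq(), but return NULL if the request isn't started") can be modeled in a few lines. This is a toy userspace model to show the two rejection cases; the types and names are invented for the example, not blk-mq API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the suggested generic helper: return NULL when the tag
 * from the wire is out of range, or when the request it maps to was
 * allocated but never started.  Illustrative only, not blk-mq. */
enum req_state { REQ_IDLE, REQ_STARTED };

struct toy_req {
	enum req_state state;
};

struct toy_tagset {
	struct toy_req *rqs;
	unsigned int nr_tags;
};

static struct toy_req *toy_tag_to_started_rq(struct toy_tagset *set,
					     unsigned int tag)
{
	struct toy_req *rq;

	if (tag >= set->nr_tags)        /* bogus tag from the wire */
		return NULL;
	rq = &set->rqs[tag];
	if (rq->state != REQ_STARTED)   /* allocated but never dispatched */
		return NULL;
	return rq;
}
```

As the thread notes, this cannot catch a bogus tag that happens to map to a started request; it only prevents completing a request that was never issued.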
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right? 'during TCP LIF toggles and aggr relocates' testing the host crashes. TBH, I do not really know what is happening or what the test does. Still trying to figure out what's going on. Well, I think we should probably figure out why that is happening first. I was just very surprised how much the code trusts the other side to behave correctly. What does pci/rdma/fc do differently? What does scsi do here? Yes, which is why I don't think this check is very useful.. I actually view that as a valid protection against spoofed frames. Without it it's easy to crash the machine by injecting fake completions with random command ids. In this test scenario it's not even a spoofed frame; maybe just a confused controller. Maybe... I am still not sure how this patch helps here though...
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right? Yes, which is why I don't think this check is very useful.. I actually view that as a valid protection against spoofed frames. Without it it's easy to crash the machine by injecting fake completions with random command ids. And this doesn't help because the command can have been easily reused and started... What is this protecting against? Note that none of the other transports checks that, why should tcp?
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right? Yes, which is why I don't think this check is very useful..
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request?
Re: [PATCH] nvme: convert sysfs sprintf/snprintf family to sysfs_emit
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-multipath: Early exit if no path is available
You can't see exactly where it dies but I followed the assembly to nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head, old) which returns NULL but nvme_next_ns() is returning NULL eventually (list_next_or_null_rcu()). So there is another bug causing nvme_next_ns to behave abnormally. I reviewed the code around head->list and head->current_path, and I found 2 bugs that may cause this: First, I already sent the patch, see: https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ Second, in nvme_ns_remove, list_del_rcu is called before nvme_mpath_clear_current_path. This may cause "old" to be deleted from "head" while "old" is still in use. I'm not sure there's any other consideration here; I will check it and try to fix it. The reason why we first remove from head->list and only then clear current_path is because the other way around there is no way to guarantee that the ns won't be assigned as current_path again (because it is in head->list). ok, I see. nvme_ns_remove fences continued deletion of the ns by synchronizing the srcu such that the current_path clearance is visible for sure. The list will be like this: head->next = ns1; ns1->next = head; old->next = ns1; This may cause an infinite loop in nvme_round_robin_path. for (ns = nvme_next_ns(head, old); ns != old; ns = nvme_next_ns(head, ns)) The ns will always be ns1, and then infinite loop. Who is being removed? I'm not following
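The infinite-loop scenario described above can be reproduced with a tiny circular-list model: "old" has been unlinked, but its next pointer still leads into the list, so the `ns != old` exit condition can never be met. This is a bounded userspace simulation with invented names, not the nvme-multipath code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the scenario in the mail: the remaining list is
 * head <-> ns1, while "old" was unlinked but old->next still points
 * at ns1.  Names are illustrative. */
struct node {
	struct node *next;
};

/* Like nvme_next_ns(): advance one element, skipping the head sentinel. */
static struct node *next_ns(struct node *head, struct node *ns)
{
	ns = ns->next;
	if (ns == head)
		ns = ns->next;
	return ns;
}
```

Starting the traversal from the stale "old" yields ns1 on every step: the loop never reaches "old" again because "old" is no longer on the list.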
Re: [PATCH v2] nvme-multipath: Early exit if no path is available
You can't see exactly where it dies but I followed the assembly to nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head, old) which returns NULL but nvme_next_ns() is returning NULL eventually (list_next_or_null_rcu()). So there is another bug causing nvme_next_ns to behave abnormally. I reviewed the code around head->list and head->current_path, and I found 2 bugs that may cause this: First, I already sent the patch, see: https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ Second, in nvme_ns_remove, list_del_rcu is called before nvme_mpath_clear_current_path. This may cause "old" to be deleted from "head" while "old" is still in use. I'm not sure there's any other consideration here; I will check it and try to fix it. The reason why we first remove from head->list and only then clear current_path is because the other way around there is no way to guarantee that the ns won't be assigned as current_path again (because it is in head->list). nvme_ns_remove fences continued deletion of the ns by synchronizing the srcu such that the current_path clearance is visible for sure.
Re: [PATCH] nvme: reject the ns when the block size is smaller than a sector
But this only catches a physical block size < 512 for NVMe, not any other block device. Please fix it for the general case in blk_queue_physical_block_size(). We actually call that later; it would probably be better to check here.. We had a series for that a short while ago that got lost. What was it called?
Re: [PATCH] nvme: Constify static attribute_group structs
FWIW, Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme: reject the ns when the block size is smaller than a sector
The NVMe spec (1.4a, figure 248) says: "A value smaller than 9 (i.e., 512 bytes) is not supported."

Signed-off-by: Li Feng
---
 drivers/nvme/host/core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f320273fc672..1f02e6e49a05 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2161,6 +2161,12 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_id_ns *id)
 	blk_mq_freeze_queue(ns->disk->queue);
 	ns->lba_shift = id->lbaf[lbaf].ds;
+	if (ns->lba_shift < 9) {
+		pr_warn("%s: bad lba_shift(%d)\n", ns->disk->disk_name, ns->lba_shift);
+		ret = -1;

A meaningful errno would be useful. Also better to use dev_warn.

+		goto out_unfreeze;
+	}
+
 	nvme_set_queue_limits(ns->ctrl, ns->queue);
 	if (ns->head->ids.csi == NVME_CSI_ZNS) {

But this only catches a physical block size < 512 for NVMe, not any other block device. Please fix it for the general case in blk_queue_physical_block_size(). We actually call that later; it would probably be better to check here..
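Folding in the review feedback (meaningful errno rather than `ret = -1`), the validation itself reduces to a one-line check. A minimal userspace sketch of that check, with an invented helper name; the kernel patch would additionally use dev_warn and the unfreeze error path:

```c
#include <assert.h>
#include <errno.h>

/* NVMe 1.4a, figure 248: an LBA data size (ds) smaller than 9,
 * i.e. a block size below 512 bytes, is not supported. */
#define SECTOR_SHIFT 9

/* Hypothetical helper: return 0 if the reported lba_shift is usable,
 * or a meaningful errno (the review's suggestion) if it is too small. */
static int validate_lba_shift(unsigned int lba_shift)
{
	if (lba_shift < SECTOR_SHIFT)
		return -EINVAL;
	return 0;
}
```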
Re: [PATCH] iosched: Add i10 I/O Scheduler
Dear all: Thanks, again, for the very constructive discussions. I am writing back with quite a few updates: 1. We have now included a detailed comparison of the i10 scheduler with Kyber with NVMe-over-TCP (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf). In a nutshell, when operating with NVMe-over-TCP, i10 demonstrates its core tradeoff: higher latency, but also higher throughput. 2. We have now implemented an adaptive version of the i10 I/O scheduler, which uses the number of outstanding requests at the time of batch dispatch (and whether the dispatch was triggered by timeout or not) to adaptively set the batching size. The new results (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf) show that i10-adaptive further improves performance for low loads, while keeping the performance for high loads. IMO, there is much to do on designing improved adaptation algorithms. 3. We have now updated the i10-evaluation document to include results for local storage access. The core take-away here is that i10-adaptive can achieve similar throughput and latency at low loads and at high loads when compared to noop, but still requires more work for lower loads. However, given that the tradeoff exposed by the i10 scheduler is particularly useful for remote storage devices (and as Jens suggested, perhaps for virtualized local storage access), I do agree with Sagi -- I think we should consider including it in the core, since this may be useful for a broad range of new use cases. We have also created a second version of the patch that includes these updates: https://github.com/i10-kernel/upstream-linux/blob/master/0002-iosched-Add-i10-I-O-Scheduler.patch As always, thank you for the constructive discussion and I look forward to working with you on this. Thanks Rachit, Would be good if you can send a formal patch for the adaptive queuing so people can have a look.
One thing that was left on the table is whether this should be a full-blown scheduler or core block infrastructure that would either be set on-demand or by default. I think that Jens and Ming expressed that this is something we should place in the block core, but I'd like to hear more; maybe Ming can elaborate on his ideas for how to do this.
Re: [PATCH] iosched: Add i10 I/O Scheduler
But if you think this has a better home, I'm assuming that the guys will be open to that. Also see the reply from Ming. It's a balancing act - don't want to add extra overhead to the core, but also don't want to carry an extra scheduler if the main change is really just variable dispatch batching. And since we already have a notion of that, seems worthwhile to explore that venue. I agree, The main difference is that this balancing is not driven from device resource pressure, but rather from an assumption of device specific optimization (and also with a specific optimization target), hence a scheduler a user would need to opt in to seemed like a good compromise. But maybe Ming has some good ideas on a different way to add it.. So here's another case - virtualized nvme. The commit overhead is suitably large there that performance suffers quite a bit, similarly to your remote storage case. If we had suitable logic in the core, then we could easily propagate this knowledge when setting up the queue. Then it could happen automatically, without needing a configuration to switch to a specific scheduler. Yes, these use-cases share characteristics. I'm not at all opposed to placing this in the core. I do think that in order to put something like this in the core, the bar needs to be higher such that an optimization target cannot be biased towards a workload (i.e. needs to be adaptive). I'm still not sure how we would build this on top of what we already have as it is really centered around the device being busy (which is not the case for nvme), but I didn't put enough thought into it yet.
Re: [PATCH] iosched: Add i10 I/O Scheduler
But if you think this has a better home, I'm assuming that the guys will be open to that. Also see the reply from Ming. It's a balancing act - don't want to add extra overhead to the core, but also don't want to carry an extra scheduler if the main change is really just variable dispatch batching. And since we already have a notion of that, seems worthwhile to explore that venue. I agree, The main difference is that this balancing is not driven from device resource pressure, but rather from an assumption of device specific optimization (and also with a specific optimization target), hence a scheduler a user would need to opt in to seemed like a good compromise. But maybe Ming has some good ideas on a different way to add it..
Re: [PATCH] iosched: Add i10 I/O Scheduler
I haven't taken a close look at the code yet so far, but one quick note that patches like this should be against the branches for 5.11. In fact, this one doesn't even compile against current -git, as blk_mq_bio_list_merge is now called blk_bio_list_merge. Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase before submitting.. Sorry about that. In any case, I did run this through some quick peak testing as I was curious, and I'm seeing about 20% drop in peak IOPS over none running this. Perf diff: 10.71% -2.44% [kernel.vmlinux] [k] read_tsc 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock You ran this with nvme? or null_blk? I guess neither would benefit from this because if the underlying device will not benefit from batching (at least enough for the extra cost of accounting for it) it will be counterproductive to use this scheduler. This is nvme, actual device. The initial posting could be a bit more explicit on the use case, it says: "For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in terms of IOPS per core over "noop" I/O scheduler." which made me very skeptical, as it sounds like it's raw device claims. You are absolutely right, that needs to be fixed. Does beg the question of why this is a new scheduler then. It's pretty basic stuff, something that could trivially just be added as a side effect of the core (and in fact we have much of it already). Doesn't really seem to warrant a new scheduler at all. There isn't really much in there. Not saying it absolutely warrants a new one, and it could I guess sit in the core, but this attempts to optimize for a specific metric while trading off others, which is exactly what I/O schedulers are for, optimizing for a specific metric. Not sure we want to build something biased towards throughput at the expense of latency into the block core. And, as mentioned, this is not well suited to all device types...
But if you think this has a better home, I'm assuming that the guys will be open to that. Was curious and wanted to look it up, but it doesn't exist. I think this is the right one: https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf We had some back and forth around the naming, hence this was probably omitted. That works, my local results were a bit worse than listed in there though. And what does this mean: "We note that Linux I/O scheduler introduces an additional kernel worker thread at the I/O dispatching stage" It most certainly does not for the common/hot case. Yes I agree, didn't see the local results. Probably some misunderstanding or a typo, I'll let them reply on this.
Re: [PATCH] iosched: Add i10 I/O Scheduler
blk-mq actually has a built-in batching (or sort of) mechanism, which is enabled if the hw queue is busy (hctx->dispatch_busy is > 0). We use EWMA to compute hctx->dispatch_busy, and it is adaptive, even though the implementation is quite coarse. But there should be much space to improve, IMO. You are correct, however nvme-tcp should be getting to dispatch_busy > 0 IIUC. It is reported that this way improves SQ high-end SCSI SSD very much[1], and MMC performance gets improved too[2]. [1] https://lore.kernel.org/linux-block/3cc3e03901dc1a63ef32e03618252...@mail.gmail.com/ [2] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=t...@mail.gmail.com/ Yes, the guys paid attention to the MMC related improvements that you made. The i10 I/O scheduler builds upon recent work on [6]. We have tested the i10 I/O scheduler with nvme-tcp optimizations [2,3] and batching dispatch [4], varying number of cores, varying read/write ratios, and varying request sizes, and with NVMe SSD and RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in terms of IOPS per core over "noop" I/O scheduler. These results are available at [5], and many additional results are presented in [6]. In case of none scheduler, basically the nvme driver won't provide any queue busy feedback, so the built-in batching dispatch simply doesn't work. Exactly. kyber scheduler uses io latency feedback to throttle and build io batches, can you compare i10 with kyber on nvme/nvme-tcp? I assume it should be simple to get, I'll let Rachit/Jaehyun comment. While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target in the i10 I/O scheduler is throughput maximization. Hence there is no latency target nor a need for a global tracking context, so a new scheduler is needed rather than building this functionality into an existing scheduler.
We currently use fixed default values as batching thresholds (e.g., 16 for #requests, 64KB for #bytes, and 50us for timeout). These default values are based on sensitivity tests in [6]. For our future work, we plan to support adaptive batching according to Frankly speaking, hardcoded 16 #requests or 64KB may not work everywhere, and a production environment could be much more complicated than your sensitivity tests. If possible, please start with adaptive batching. That was my feedback as well for sure. But given that this is a scheduler one would opt in to anyway, that won't be a must-have initially. I'm not sure if the guys made progress with this yet, I'll let them comment.
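The fixed-threshold dispatch decision discussed above is simple enough to sketch directly. The thresholds are the defaults quoted in the thread (16 requests, 64KB, 50us); the function and macro names are invented for illustration and are not from the i10 patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Default batching thresholds quoted in the discussion.  The adaptive
 * variant would tune these at runtime instead of hardcoding them. */
#define BATCH_NR_REQS    16
#define BATCH_NR_BYTES   (64 * 1024)
#define BATCH_TIMEOUT_US 50

/* Dispatch the accumulated batch once any one threshold is crossed:
 * enough requests, enough bytes, or the oldest request has waited
 * too long (the timeout bounds added latency at low load). */
static bool batch_should_dispatch(unsigned int nr_reqs,
				  unsigned int nr_bytes,
				  uint64_t age_us)
{
	return nr_reqs >= BATCH_NR_REQS ||
	       nr_bytes >= BATCH_NR_BYTES ||
	       age_us >= BATCH_TIMEOUT_US;
}
```

The timeout term is what makes the scheme usable at all: without it, a lightly loaded queue would hold requests indefinitely waiting for a full batch.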
Re: [PATCH] iosched: Add i10 I/O Scheduler
I haven't taken a close look at the code yet so far, but one quick note that patches like this should be against the branches for 5.11. In fact, this one doesn't even compile against current -git, as blk_mq_bio_list_merge is now called blk_bio_list_merge. Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase before submitting.. Sorry about that. In any case, I did run this through some quick peak testing as I was curious, and I'm seeing about 20% drop in peak IOPS over none running this. Perf diff: 10.71% -2.44% [kernel.vmlinux] [k] read_tsc 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock You ran this with nvme? or null_blk? I guess neither would benefit from this because if the underlying device will not benefit from batching (at least enough for the extra cost of accounting for it) it will be counter productive to use this scheduler. Also: [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf Was curious and wanted to look it up, but it doesn't exist. I think this is the right one: https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf We had some back and forth around the naming, hence this was probably omitted.
Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
There really aren't any rules for this, and it's perfectly legit to complete from process context. Maybe you're a kthread driven driver and that's how you handle completions. The block completion path has always been hard IRQ safe, but possible to call from anywhere. I'm not trying to put restrictions and forbidding completions from a kthread. I'm trying to avoid the pointless softirq dance for no added value. We could: to not break that assumption you just mentioned, provide
|static inline void blk_mq_complete_request_local(struct request *rq)
|{
|	rq->q->mq_ops->complete(rq);
|}
so that completions issued from process context (like those from usb-storage) don't end up waking `ksoftirqd' (running at SCHED_OTHER) to complete the requests but rather perform it right away. The softirq dance makes no sense here. Agreed. But I don't think your above blk_mq_complete_request_local is all that useful either as ->complete is defined by the caller, so we could just do a direct call. Basically we should just return false from blk_mq_complete_request_remote after updating the state when called from process context. Agreed. But given that IIRC we are not supposed to check what state we are called from, we'll need a helper just for updating the state instead and ensure the driver uses the right helper. Now of course we might have process context callers that still want to bounce to the submitting CPU, but in that case we should go directly to a workqueue or similar. This would mean that it may be suboptimal for nvme-tcp to complete requests in softirq context from the network context (determined by NIC steering). Because in this case, this would trigger a workqueue schedule on a per-request basis rather than once per .data_ready call like we do today. Is that correct? Note that today completing commands in softirq context (network determined cpu) basically does IPI + local complete, not IPI + softirq or IPI + workqueue.
Either way doing this properly will probably involve an audit of all drivers, but I think that is worth it. Agree.
Re: [PATCH -next] IB/isert: Fix warning Comparison to bool
Reviewed-by: Sagi Grimberg
Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
Well, usb-storage obviously seems to do it, and the block layer does not prohibit it. Also loop, nvme-tcp and then I stopped looking. Any objections about adding local_bh_disable() around it? To me it seems like the whole IPI plus potentially softirq dance is a little pointless when completing from process context. I agree. Sagi, any opinion on that from the nvme-tcp POV? nvme-tcp should (almost) always complete from the context that matches the rq->mq_ctx->cpu as the thread that processes incoming completions (per hctx) should be affinitized to match it (unless cpus come and go). in which context? Not sure what the question is. But this is probably nr_hw_queues > 1? Yes. So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true in normal operation. That leaves the teardowns+aborts, which aren't very interesting here. The process context invocation is nvme_tcp_complete_timed_out(). Yes. I would note that nvme-tcp does not go to sleep after completing every I/O like how Sebastian indicated usb does. Having said that, today the network stack is calling nvme_tcp_data_ready in napi context (softirq) which in turn triggers the queue thread to handle network rx (and complete the I/O). It's been measured recently that running the rx context directly in softirq will save some latency (possible because the nvme-tcp rx context is non-blocking). So I'd think that patch #2 is unnecessary and just adds overhead for nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed from napi context, nvme-tcp will probably always go to the IPI path. but running it in softirq on the remote CPU would still allow other packets to come in on the remote CPU (which would block BLOCK softirq if NET_RX is already running).
Not sure I understand your comment, if napi triggers on core X and we complete from that, it will trigger IPI to core Y, and there, with patch #2, it will trigger softirq instead of calling ->complete directly, no?
Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
Well, usb-storage obviously seems to do it, and the block layer does not prohibit it. Also loop, nvme-tcp and then I stopped looking. Any objections about adding local_bh_disable() around it? To me it seems like the whole IPI plus potentially softirq dance is a little pointless when completing from process context. I agree. Sagi, any opinion on that from the nvme-tcp POV? nvme-tcp should (almost) always complete from the context that matches the rq->mq_ctx->cpu as the thread that processes incoming completions (per hctx) should be affinitized to match it (unless cpus come and go). So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true in normal operation. That leaves the teardowns+aborts, which aren't very interesting here. I would note that nvme-tcp does not go to sleep after completing every I/O like how sebastian indicated usb does. Having said that, today the network stack is calling nvme_tcp_data_ready in napi context (softirq) which in turn triggers the queue thread to handle network rx (and complete the I/O). It's been measured recently that running the rx context directly in softirq will save some latency (possible because nvme-tcp rx context is non-blocking). So I'd think that patch #2 is unnecessary and just add overhead for nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed from napi context, nvme-tcp will probably always go to the IPI path.
Re: [PATCH 1/2] nvmet: introduce transport layer keep-alive
On 10/27/20 5:15 AM, zhenwei pi wrote: In the zero KATO scenario, if initiator crashes without transport layer disconnection, target side would never reclaim resources. A target could start transport layer keep-alive to detect dead connection for the admin queue. Not sure why we should worry about kato=0, it's really more for debug purposes. I'd rather that we block this option from the host altogether. Signed-off-by: zhenwei pi --- drivers/nvme/target/admin-cmd.c | 2 +- drivers/nvme/target/core.c | 14 +++--- drivers/nvme/target/nvmet.h | 3 ++- 3 files changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index dca34489a1dc..53fbd5c193a1 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -729,7 +729,7 @@ u16 nvmet_set_feat_kato(struct nvmet_req *req) nvmet_stop_keep_alive_timer(req->sq->ctrl); req->sq->ctrl->kato = DIV_ROUND_UP(val32, 1000); - nvmet_start_keep_alive_timer(req->sq->ctrl); + nvmet_start_keep_alive_timer(req->sq->ctrl, req); nvmet_set_result(req, req->sq->ctrl->kato); diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 957b39a82431..451192f7ad6a 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -395,10 +395,18 @@ static void nvmet_keep_alive_timer(struct work_struct *work) nvmet_ctrl_fatal_error(ctrl); } -void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl) +void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl, struct nvmet_req *req) { - if (unlikely(ctrl->kato == 0)) + if (unlikely(ctrl->kato == 0)) { + if (req->ops->keep_alive) + pr_info("ctrl %d starts with transport keep-alive %s\n", ctrl->cntlid, + req->ops->keep_alive(req) ? 
"failed" : "succeed"); + else + pr_info("ctrl %d starts without both NVMeOF and transport keep-alive", + ctrl->cntlid); + return; + } pr_debug("ctrl %d start keep-alive timer for %d secs\n", ctrl->cntlid, ctrl->kato); @@ -1383,7 +1391,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, ctrl->err_counter = 0; spin_lock_init(&ctrl->error_lock); - nvmet_start_keep_alive_timer(ctrl); + nvmet_start_keep_alive_timer(ctrl, req); mutex_lock(&subsys->lock); list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 559a15ccc322..de1785ce9fcd 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -305,6 +305,7 @@ struct nvmet_fabrics_ops { u16 (*install_queue)(struct nvmet_sq *nvme_sq); void (*discovery_chg)(struct nvmet_port *port); u8 (*get_mdts)(const struct nvmet_ctrl *ctrl); + int (*keep_alive)(struct nvmet_req *req); }; #define NVMET_MAX_INLINE_BIOVEC 8 @@ -395,7 +396,7 @@ void nvmet_get_feat_async_event(struct nvmet_req *req); u16 nvmet_set_feat_kato(struct nvmet_req *req); u16 nvmet_set_feat_async_event(struct nvmet_req *req, u32 mask); void nvmet_execute_async_event(struct nvmet_req *req); -void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl); +void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl, struct nvmet_req *req); void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl); u16 nvmet_parse_connect_cmd(struct nvmet_req *req);
Re: [PATCH v3] nvme-rdma: handle nvme completion data length
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-rdma: handle nvme completion data length
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 9e378d0a0c01..2ecadd309f4a 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1767,6 +1767,21 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc) return; } + /* received data length checking */ + if (unlikely(wc->byte_len < len)) { + /* zero bytes message could be ignored */ + if (!wc->byte_len) { + nvme_rdma_post_recv(queue, qe); + return; + } Nothing in the spec defines zero-length messages, hence we cannot support something that is not standard. If your array needs this, please submit a TPAR to the NVMe TWG.
Re: [PATCH v2] nvmet: fix uninitialized work for zero kato
Fixes: Don't run keep alive work with zero kato. "Fixes" tags need to have a git commit id followed by the commit subject. I can't find any commit with that subject, though. Fixes: 0d3b6a8d213a ("nvmet: Disable keep-alive timer when kato is cleared to 0h")
Re: [PATCH] nvmet: fix uninitialized work for zero kato
Hit a warning: WARNING: CPU: 1 PID: 241 at kernel/workqueue.c:1627 __queue_delayed_work+0x6d/0x90 with trace: mod_delayed_work_on+0x59/0x90 nvmet_update_cc+0xee/0x100 [nvmet] nvmet_execute_prop_set+0x72/0x80 [nvmet] nvmet_tcp_try_recv_pdu+0x2f7/0x770 [nvmet_tcp] nvmet_tcp_io_work+0x63f/0xb2d [nvmet_tcp] ... This could be reproduced easily with a keep-alive timeout of 0: nvme connect -t tcp -n NQN -a ADDR -s PORT --keep-alive-tmo=0 The reason: an uninitialized work is started when the initiator connects with zero kato. Although the keep-alive timer is disabled while allocating a ctrl (fixed in 0d3b6a8d213a), ka_work still has a chance to run (called from nvmet_start_ctrl to detect a dead host). This should have a "Fixes:" tag. Initialize ka_work while allocating the ctrl, and set a reasonable kato before scheduling ka_work. Signed-off-by: zhenwei pi --- drivers/nvme/target/core.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index b7b63330b5ef..3c5b2b065476 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -19,6 +19,8 @@ struct workqueue_struct *buffered_io_wq; static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX]; static DEFINE_IDA(cntlid_ida); +#define NVMET_DEFAULT_KATO 5 + /* * This read/write semaphore is used to synchronize access to configuration * information on a target system that will result in discovery log page @@ -385,6 +387,11 @@ static void nvmet_keep_alive_timer(struct work_struct *work) if (cmd_seen) { pr_debug("ctrl %d reschedule traffic based keep-alive timer\n", ctrl->cntlid); + + /* run once, trigger from nvmet_start_ctrl to detect dead link */ + if (!ctrl->kato) + return; + schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ); It would be better to just schedule/mod the ka_work if kato != 0; the other changes in the patch aren't needed IMO.
return; } @@ -403,15 +410,11 @@ static void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl) pr_debug("ctrl %d start keep-alive timer for %d secs\n", ctrl->cntlid, ctrl->kato); - INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ); } static void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl) { - if (unlikely(ctrl->kato == 0)) - return; - pr_debug("ctrl %d stop keep-alive\n", ctrl->cntlid); cancel_delayed_work_sync(&ctrl->ka_work); @@ -1107,6 +1110,8 @@ static inline u8 nvmet_cc_iocqes(u32 cc) static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) { + u32 kato = ctrl->kato ? ctrl->kato : NVMET_DEFAULT_KATO; + The controller shouldn't have a default value, it should receive the desired value from the host. lockdep_assert_held(&ctrl->lock); if (nvmet_cc_iosqes(ctrl->cc) != NVME_NVM_IOSQES || @@ -1126,7 +1131,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) * in case a host died before it enabled the controller. Hence, simply * reset the keep alive timer when the controller is enabled. */ - mod_delayed_work(system_wq, &ctrl->ka_work, ctrl->kato * HZ); + mod_delayed_work(system_wq, &ctrl->ka_work, kato * HZ); } static void nvmet_clear_ctrl(struct nvmet_ctrl *ctrl) @@ -1378,6 +1383,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, /* keep-alive timeout in seconds */ ctrl->kato = DIV_ROUND_UP(kato, 1000); + INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); ctrl->err_counter = 0; spin_lock_init(&ctrl->error_lock);
Re: [PATCH blk-next 1/2] blk-mq-rdma: Delete not-used multi-queue RDMA map queue code
Yes, basically usage of managed affinity caused people to report regressions about not being able to change irq affinity from procfs. Well, why would they change it? The whole point of the infrastructure is that there is a single sane affinity setting for a given setup. Now that setting needed some refinement from the original series (e.g. the current series about only using housekeeping cpus if cpu isolation is in use). But allowing random users to modify affinity is just a recipe for a trainwreck. Well, allowing people to mangle irq affinity settings seems to be a hard requirement from the discussions in the past. So I think we need to bring this back ASAP, as doing affinity right out of the box is an absolute requirement for sane performance without all the benchmarketing deep magic. Well, it's hard to say that setting custom irq affinity settings is deemed non-useful to anyone and hence should be prevented. I'd expect that irq settings have a sane default that works, and if someone wants to change them, they can, but there should be no guarantees on optimal performance. But IIRC this had some dependencies on drivers and some more infrastructure to handle dynamic changes...
Re: [PATCH blk-next 1/2] blk-mq-rdma: Delete not-used multi-queue RDMA map queue code
From: Leon Romanovsky The RDMA vector affinity code is not backed up by any driver and always returns NULL to every ib_get_vector_affinity() call. This means that blk_mq_rdma_map_queues() always takes the fallback path. Fixes: 9afc97c29b03 ("mlx5: remove support for ib_get_vector_affinity") Signed-off-by: Leon Romanovsky So you guys totally broke the nvme queue assignment without even telling anyone? Great job! Who is "you guys"? And it wasn't silent either. I'm sure that Sagi knows the craft. https://lore.kernel.org/linux-rdma/20181224221606.ga25...@ziepe.ca/ commit 759ace7832802eaefbca821b2b43a44ab896b449 Author: Sagi Grimberg Date: Thu Nov 1 13:08:07 2018 -0700 i40iw: remove support for ib_get_vector_affinity commit 9afc97c29b032af9a4112c2f4a02d5313b4dc71f Author: Sagi Grimberg Date: Thu Nov 1 09:13:12 2018 -0700 mlx5: remove support for ib_get_vector_affinity Thanks Yes, basically usage of managed affinity caused people to report regressions about not being able to change irq affinity from procfs. Back then I started a discussion with Thomas to make managed affinity still allow userspace to modify this, but this was dropped at some point. So currently rdma cannot do automatic irq affinitization out of the box.
Re: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null
Reviewed-by: Sagi Grimberg
Re: [PATCH AUTOSEL 4.19 127/206] nvme: Fix controller creation races with teardown flow
This causes a regression and was reverted upstream, just FYI. On 9/17/20 7:06 PM, Sasha Levin wrote: From: Israel Rukshin [ Upstream commit ce1518139e6976cf19c133b555083354fdb629b8 ] Calling nvme_sysfs_delete() when the controller is in the middle of creation may cause several bugs. If the controller is in NEW state we remove delete_controller file and don't delete the controller. The user will not be able to use nvme disconnect command on that controller again, although the controller may be active. Other bugs may happen if the controller is in the middle of create_ctrl callback and nvme_do_delete_ctrl() starts. For example, freeing I/O tagset at nvme_do_delete_ctrl() before it was allocated at create_ctrl callback. To fix all those races don't allow the user to delete the controller before it was fully created. Signed-off-by: Israel Rukshin Reviewed-by: Max Gurtovoy Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin --- drivers/nvme/host/core.c | 5 + drivers/nvme/host/nvme.h | 1 + 2 files changed, 6 insertions(+) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4b182ac15687e..faa7feebb6095 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -2856,6 +2856,10 @@ static ssize_t nvme_sysfs_delete(struct device *dev, { struct nvme_ctrl *ctrl = dev_get_drvdata(dev); + /* Can't delete non-created controllers */ + if (!ctrl->created) + return -EBUSY; + if (device_remove_file_self(dev, attr)) nvme_delete_ctrl_sync(ctrl); return count; @@ -3576,6 +3580,7 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl) queue_work(nvme_wq, &ctrl->async_event_work); nvme_start_queues(ctrl); } + ctrl->created = true; } EXPORT_SYMBOL_GPL(nvme_start_ctrl); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 31c1496f938fb..a70b997060e68 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -206,6 +206,7 @@ struct nvme_ctrl { struct nvme_command ka_cmd; struct work_struct fw_act_work; 
unsigned long events; + bool created; #ifdef CONFIG_NVME_MULTIPATH /* asymmetric namespace access: */
Re: [PATCH] nvme: tcp: fix kconfig dependency warning when !CRYPTO
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-pci: cancel nvme device request before disabling
Added to nvme-5.9-rc
Re: [PATCH] nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent in this function. Use 'spin_lock_irqsave()' also here, as there is no guarantee that interrupts are disabled at that point, according to the surrounding code. Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling") Signed-off-by: Christophe JAILLET --- Not tested, only based on what looks logical to me according to surrounding code --- drivers/nvme/target/fc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 55bafd56166a..e6861cc10e7d 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -2342,9 +2342,9 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod) return; if (fcpreq->fcp_error || fcpreq->transferred_length != fcpreq->transfer_length) { - spin_lock(&fod->flock); + spin_lock_irqsave(&fod->flock, flags); fod->abort = true; - spin_unlock(&fod->flock); + spin_unlock_irqrestore(&fod->flock, flags); nvmet_req_complete(&fod->req, NVME_SC_INTERNAL); return; James, can I get a reviewed-by from you on this?
Re: [PATCH 2/3] block: fix locking for struct block_device size updates
Reviewed-by: Sagi Grimberg
Re: [PATCH 3/3] nvme: don't call revalidate_disk from nvme_set_queue_dying
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] block: replace bd_set_size with bd_set_nr_sectors
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme-pci: cancel nvme device request before disabling
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index ba725ae47305..c4f1ce0ee1e3 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) dev_warn_ratelimited(dev->ctrl.device, "I/O %d QID %d timeout, disable controller\n", req->tag, nvmeq->qid); - nvme_dev_disable(dev, true); nvme_req(req)->flags |= NVME_REQ_CANCELLED; + nvme_dev_disable(dev, true); return BLK_EH_DONE; Shouldn't this flag have been set in nvme_cancel_request()? nvme_cancel_request() is not setting this flag to cancelled, and this is causing the issue. Right, I see that it doesn't, but I'm saying that it should. We used to do something like that, and I'm struggling to recall why we're not anymore. I also don't recall why, but I know that we rely on the status propagating back from submit_sync_cmd, which won't happen because it converts the status into -ENODEV. The driver is not reporting non-response back for all cancelled requests, and that is probably not what we should be doing. I'd think that we should modify our callers to handle nvme status codes as well rather than rely on nvme_submit_sync_cmd to return a negative code under some conditions.
Re: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
queued to nvme-5.9-rc
Re: [PATCH v2] nvme: Use spin_lock_irq() when taking the ctrl->lock
There's an unrelated whitespace change in nvme_init_identify(). Otherwise, looks fine. Oops, sorry. Can this be fixed up when it's merged? Fixed and queued.
Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the following oops :- Why LKML and not linux-nvme? My bad (+linux-nvme). It doesn't have the patch. Can you resend?
Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the following oops :- Why LKML and not linux-nvme?
Re: [PATCH] nvmet-passthru: Reject commands with non-sgl flags set
Reviewed-by: Sagi Grimberg
Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143
Greeting, FYI, we noticed the following commit (built with gcc-9): commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new shared CQ mechanism") https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next in testcase: blktests with following parameters: test: srp-group1 ucode: 0x21 on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G memory caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag Reported-by: kernel test robot user :notice: [ 44.688140] 2020-08-01 16:10:22 ./check srp/001 srp/002 srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013 srp/015 user :notice: [ 44.706657] srp/001 (Create and remove LUNs) user :notice: [ 44.718405] srp/001 (Create and remove LUNs) [passed] user :notice: [ 44.729902] runtime ... 1.972s user :notice: [ 99.038748] IPMI BMC is not supported on this machine, skip bmc-watchdog setup! user :notice: [ 3699.039790] Sat Aug 1 17:11:22 UTC 2020 detected soft_timeout user :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o /tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests Yamin and Max, can you take a look at this? The SRP tests from the blktests repository pass reliably with kernel version v5.7 and before. With label next-20200731 from linux-next however that test triggers the following hang: I will look into it. FWIW, I ran into this as well with nvme-rdma, but it also reproduces when I revert the shared CQ patch from nvme-rdma. Another data point is that my tests passes with siw.
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Thanks for the review Christoph. I think I should be able to make all the requested changes in the next week or two. On 2020-07-20 1:35 p.m., Sagi Grimberg wrote: I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before... I sort of follow this. I can try to work something up but it will probably take me a few iterations to get it to where you want it. So, roughly, we'd create a passthrough_q in core with the controller's IO tagset and then clean up the fabrics hosts to use that instead of each independently creating their connect_q? Though, I don't understand how this relates to the host xarray stuff that Sagi mentioned... passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. The only part is to check the effects, but that can probably be handled when we setup the passthru controller or something... Yes, I implemented the passthru_q (which was quite simple). Nice.. But I'm not sure how we are supposed to call nvme_command_effects() correctly without the ns. You can't possibly do that during setup for every possible opcode on every namespace. And even if we do, we'll still need the same nvme_find_get_ns() and nvme_put_ns() exports and probably another xarray to look up the information. Also, we pass the namespace's disk in order to get proper block accounting for the underlying disk. (Which is pretty important for debugging). So we need to look up the namespace for this too.
Unless there are some other ideas to solve these issues, I don't think this change will gain us anything. Let's defer it to a followup set then.
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote: On 2020-07-20 4:35 p.m., Sagi Grimberg wrote: passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. We still need a request_queue to dispatch the command. I guess you could make a generic one for the controller that isn't tied to a namespace, but we lose the fair shared tag allocation. What do you mean fair shared tag allocation? See hctx_may_queue(). Still not following your point... this queue is yet another request queue on the I/O tagset, e.g. ctrl->passthru_q = blk_mq_init_queue(ctrl->tagset);
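For reference, the fairness Keith alludes to lives in hctx_may_queue(): when a tagset is shared, each active user is throttled to roughly its proportional share of the tag depth. A hedged userspace sketch of the idea (the exact rounding and minimum in blk-mq differ; these helpers are illustrative, not kernel code):

```c
/*
 * With 'active_users' queues sharing a tagset of 'depth' tags, cap
 * each user at roughly depth / active_users (rounded up, never zero)
 * so no single user can starve the others.
 */
unsigned int shared_tag_limit(unsigned int depth, unsigned int active_users)
{
    if (active_users <= 1)
        return depth;   /* sole user gets the whole depth */
    return (depth + active_users - 1) / active_users;   /* ceiling */
}

/* A queue may allocate another tag only while under its share. */
int may_queue(unsigned int depth, unsigned int active_users,
              unsigned int tags_in_flight)
{
    return tags_in_flight < shared_tag_limit(depth, active_users);
}
```

A controller-wide passthru queue sitting on the same I/O tagset would count as one more active user, which is the fairness cost being pointed at.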
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
On 7/20/20 4:17 PM, Keith Busch wrote: On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote: On 2020-07-20 4:35 p.m., Sagi Grimberg wrote: passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. We still need a request_queue to dispatch the command. I guess you could make a generic one for the controller that isn't tied to a namespace, but we lose the fair shared tag allocation. What do you mean fair shared tag allocation?
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Thanks for the review Christoph. I think I should be able to make all the requested changes in the next week or two. On 2020-07-20 1:35 p.m., Sagi Grimberg wrote: I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before... I sort of follow this. I can try to work something up but it will probably take me a few iterations to get it to where you want it. So, roughly, we'd create a passthrough_q in core with the controller's IO tagset and then clean up the fabrics hosts to use that instead of each independently creating their connect_q? Though, I don't understand how this relates to the host xarray stuff that Sagi mentioned... passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. Thanks, that helps clarify things a bit, but which xarray were you talking about? The patch from Chaitanya. See "[PATCH V4 2/2] nvme-core: use xarray for ctrl ns tracking"
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Thanks for the review Christoph. I think I should be able to make all the requested changes in the next week or two. On 2020-07-20 1:35 p.m., Sagi Grimberg wrote: I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before... I sort of follow this. I can try to work something up but it will probably take me a few iterations to get it to where you want it. So, roughly, we'd create a passthrough_q in core with the controller's IO tagset and then cleanup the fabrics hosts to use that instead of each independently creating their connect_q? Though, I don't understand how this relates to the host xarray stuff that Sagi mentioned... passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. The only part is to check the effects, but that can probably be handled when we setup the passthru controller or something...
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Add passthru command handling capability for the NVMeOF target and export passthru APIs which are used to integrate passthru code with nvmet-core. The new file passthru.c handles passthru cmd parsing and execution. In the passthru mode, we create a block layer request from the nvmet request and map the data on to the block layer request. Admin commands and features are on a white list as there are a number of each that don't make too much sense with passthrough. We use a white list so that new commands can be considered before being blindly passed through. In both cases, vendor specific commands are always allowed. We also blacklist reservation IO commands as the underlying device cannot differentiate between multiple hosts behind a fabric. I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before...
Re: [PATCH 5/5] nvme-pci: Use standard block status macro
Reviewed-by: Sagi Grimberg
Re: [PATCH 4/5] nvme-pci: Use the consistent return type of nvme_pci_iod_alloc_size()
Reviewed-by: Sagi Grimberg
Re: [PATCH 2/5] nvme-pci: Add a blank line after declarations
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/5] nvme-pci: Fix some comments issues
Reviewed-by: Sagi Grimberg
Re: [PATCH 4.19 082/131] nvme: fix possible deadlock when I/O is blocked
Hi! From: Sagi Grimberg [ Upstream commit 3b4b19721ec652ad2c4fe51dfbe5124212b5f581 ] Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns") When adding a new namespace to the head disk (via nvme_mpath_set_live) we will see partition scan which triggers I/O on the mpath device node. This process will usually be triggered from the scan_work which holds the scan_lock. If I/O blocks (if we got ana change currently have only available paths but none are accessible) this can deadlock on the head disk bd_mutex as both partition scan I/O takes it, and head disk revalidation takes it to check for resize (also triggered from scan_work on a different path). See trace [1]. The mpath disk revalidation was originally added to detect online disk size change, but this is no longer needed since commit cb224c3af4df ("nvme: Convert to use set_capacity_revalidate_and_notify") which already updates resize info without unnecessarily revalidating the disk (the Unfortunately, v4.19-stable does not contain cb224c3af4df. According to the changelog, it seems it should be cherry-picked? You are absolutely right. The reference commit is part of the series: 78317c5d58e6 ("scsi: Convert to use set_capacity_revalidate_and_notify") cb224c3af4df ("nvme: Convert to use set_capacity_revalidate_and_notify") 3cbc28bb902b ("xen-blkfront.c: Convert to use set_capacity_revalidate_and_notify") 662155e2898d ("virtio_blk.c: Convert to use set_capacity_revalidate_and_notify") e598a72faeb5 ("block/genhd: Notify udev about capacity change") It would be cool if they are cherry-picked, although they don't qualify as stable patches per se...
Re: [PATCH 3/3] nvme: Use USEC_PER_SEC instead of magic numbers
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] nvme: Add Arbitration Burst support
From the NVMe spec, "In order to make efficient use of the non-volatile memory, it is often advantageous to execute multiple commands from a Submission Queue in parallel. For Submission Queues that are using weighted round robin with urgent priority class or round robin arbitration, host software may configure an Arbitration Burst setting". Thus add Arbitration Burst setting support. But if the user changed it to something else that better matches how they want to use queues, the driver is just going to undo that setting on the next reset. Where do we do priority class arbitration? No one sets it.
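For context on what the value encodes: in the Arbitration feature (Feature Identifier 01h), the spec carries the Arbitration Burst in the low three bits as a power of two, with 111b meaning no limit. A small decoding sketch (illustrative only, not driver code):

```c
/*
 * Decode an Arbitration Burst field (bits 2:0 of the Arbitration
 * feature): the burst is 2^AB commands, and 0x7 means "no limit"
 * (returned as 0 here to flag the unlimited case).
 */
unsigned int nvme_ab_to_burst(unsigned int ab)
{
    ab &= 0x7;                      /* AB occupies three bits */
    return ab == 0x7 ? 0 : 1u << ab;
}
```

So a driver-chosen default and a user-chosen value live in the same three bits, which is why a reset that rewrites the feature silently undoes the user's setting.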
Re: [PATCH 2/2] nvme-pci: remove empty line following nvme_should_reset()
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/2] nvmet-loop: remove unused 'target_ctrl' in nvme_loop_ctrl
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme: do not call del_gendisk() on a disk that was never added
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/1] nvme-fcloop: verify wwnn and wwpn format
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme-tcp: constify static struct blk_mq_ops
Acked-by: Sagi Grimberg
Re: remove kernel_setsockopt and kernel_getsockopt
Hi Dave, this series removes the kernel_setsockopt and kernel_getsockopt functions, and instead switches their users to small functions that implement setting (or in one case getting) a sockopt directly using a normal kernel function call with type safety and all the other benefits of not having a function call. In some cases these functions seem pretty heavy handed as they do a lock_sock even for just setting a single variable, but this mirrors the real setsockopt implementation - counter to that a few kernel drivers just set the fields directly already. Nevertheless the diffstat looks quite promising: 42 files changed, 721 insertions(+), 799 deletions(-) For the nvme-tcp bits, Acked-by: Sagi Grimberg
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
devices will benefit from the batching so maybe the flag needs to be inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION? Actually I'd rather not add any flag; we may use some algorithm (maybe EWMA or other intelligent stuff, or hctx->dispatch_busy directly) to figure out a dynamic batching number that is used to dequeue requests from the scheduler or sw queue. Then a single dispatch io code path is enough. Sounds good to me, do you plan to submit a patchset?
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
Basically, my idea is to dequeue requests one by one, and for each dequeued request:
- try to get a budget and a driver tag; if both succeed, add the request to a per-task list which can be stored in a stack variable, then continue to dequeue more requests
- if either the budget or the driver tag can't be allocated for this request, mark the last request in the per-task list as .last, and send the batched requests stored in the list to the LLD
- when queueing batched requests to the LLD, if one request isn't queued to the driver successfully, call .commit_rqs() like before, meanwhile adding the remaining requests in the per-task list back to the scheduler queue or hctx->dispatch.
Sounds good to me. One issue is that this way might degrade sequential IO performance if the LLD just tells blk-mq the queue is busy via the return value of .queue_rq(), so I guess we still may need a flag, such as BLK_MQ_F_BATCHING_SUBMISSION. Why is that degrading sequential I/O performance? Because some devices may only return BLK_STS_RESOURCE from .queue_rq(); then more requests are dequeued from the scheduler queue if we always queue batched IOs to the LLD, and the chance of IO merging is reduced, so sequential IO performance will be affected. For example, some scsi devices don't use sdev->queue_depth for throttling IOs. For virtio-scsi or virtio-blk, we may stop the queue to avoid the potential effect. Do we have a way to characterize such devices? I'd assume that most devices will benefit from the batching, so maybe the flag needs to be inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?
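The three steps above can be condensed into a userspace model (budget and tag counters stand in for .get_budget() and tag allocation; the names are illustrative, not blk-mq API):

```c
#include <stddef.h>

struct model_req {
    int last;   /* stands in for the "last request in batch" hint */
};

/*
 * Dequeue up to 'n' requests one by one, consuming one budget and one
 * driver tag per request; stop as soon as either runs out, mark the
 * tail of the per-task list as .last so the LLD knows to kick the
 * hardware, and return the batch size.  The third step -- calling
 * ->commit_rqs() and requeueing the remainder when the LLD rejects
 * part of the batch -- happens in the caller and is omitted here.
 */
size_t collect_batch(struct model_req *reqs, size_t n,
                     size_t budget, size_t tags)
{
    size_t batch = 0;

    while (batch < n && budget && tags) {
        budget--;               /* one budget per dequeued request */
        tags--;                 /* one driver tag per dequeued request */
        reqs[batch].last = 0;
        batch++;
    }
    if (batch)
        reqs[batch - 1].last = 1;   /* tell the LLD to kick the doorbell */
    return batch;
}
```

With this shape, every driver sees a real .last flag on the tail of each batch, which is the property the per-task-list idea is after.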
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
You're mostly correct. This is exactly why an I/O scheduler may be applicable here IMO, mostly because I/O schedulers tend to optimize for something specific and always present tradeoffs. Users need to understand what they are optimizing for. Hence I'd say this functionality can definitely be available to an I/O scheduler should one exist. I guess it is just that there can be multiple requests available from the scheduler queue. Actually it can be so for other non-nvme drivers in the case of none, such as SCSI. Another way is to use one per-task list (such as the plug list) to hold the requests for dispatch; then every driver may see the real .last flag, so they may get a chance to optimize batch queueing. I will think about the idea further and see if it is really doable. How about my RFC v1 patch set [1], which allows dispatching more than one request from the scheduler to support batch requests?

[1] https://lore.kernel.org/patchwork/patch/1210034/
    https://lore.kernel.org/patchwork/patch/1210035/

Basically, my idea is to dequeue requests one by one, and for each dequeued request:

- try to get a budget and a driver tag; if both succeed, add the request to a per-task list which can be stored in a stack variable, then continue to dequeue more requests
- if either the budget or the driver tag can't be allocated for this request, mark the last request in the per-task list as .last and send the batched requests stored in the list to the LLD
- when queueing batched requests to the LLD, if one request isn't queued to the driver successfully, call .commit_rqs() like before, meanwhile adding the remaining requests in the per-task list back to the scheduler queue or hctx->dispatch.

Sounds good to me. One issue is that this way might degrade sequential IO performance if the LLD just tells queue busy to blk-mq via the return value of .queue_rq(), so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION. Why is that degrading sequential I/O performance?
because the specific device will do better without batching submissions? If so, the driver is not obligated to respect the bd->last/.commit_rqs, so if that is the case, it should just ignore it.
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
Hey Ming, Would it make sense to elevate this flag to a request_queue flag (QUEUE_FLAG_ALWAYS_COMMIT)? A request queue flag usually is writable; however, this case just needs one read-only flag, so I think it may be better to make it a tagset/hctx flag. I actually intended it to be writable. I'm thinking of a possibility that an I/O scheduler may be used to activate this functionality rather than having the driver set it necessarily... Could you explain a bit why an I/O scheduler should activate this functionality? Sure, I've recently seen some academic work showing the benefits of batching in tcp/ip based block drivers. The problem with the approaches taken is that I/O scheduling is exercised deep down in the driver, which is not the direction I'd like to go if we want to adopt some of the batching concepts. I spent some (limited) time thinking about this, and it seems to me that there is an opportunity to implement this as a dedicated I/O scheduler, and tie it to driver-specific LLD stack optimizations (net-stack for example) relying on the commit_rq/bd->last hints. When scanning the scheduler code, I noticed exactly the phenomenon that this patchset is attempting to solve, and Christoph referred me to it. Now I'm thinking if we can extend this batching optimization for both use-cases. Batching submission may be good for some drivers, and currently we only do it in a limited way. One reason is that there is extra cost for full batching submission; for example, this patch requires one extra .commit_rqs() for each dispatch, and a lock is often needed in this callback. That is not necessarily the case at all. IMO it can be a win for some slow driver or device, but may cause a little performance drop for a fast driver/device, especially in non-batching submission workloads. You're mostly correct. This is exactly why an I/O scheduler may be applicable here IMO, mostly because I/O schedulers tend to optimize for something specific and always present tradeoffs.
Users need to understand what they are optimizing for. Hence I'd say this functionality can definitely be available to an I/O scheduler should one exist.
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f389d7c724bd..6a20f8e8eb85 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -391,6 +391,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,

Maybe BLK_MQ_F_ALWAYS_COMMIT might be a better name? Also this flag (just like the existing ones..) could really use a comment explaining it. Would it make sense to elevate this flag to a request_queue flag (QUEUE_FLAG_ALWAYS_COMMIT)? I'm thinking of a possibility that an I/O scheduler may be used to activate this functionality rather than having the driver set it necessarily...
Re: [PATCH] nvme: fix uninitialized return of ret when sysfs_create_link fails
This was already fixed and merged (by Dan)

On 10/2/19 5:43 AM, Colin King wrote:
From: Colin Ian King

Currently when the call to sysfs_create_link fails the error exit path returns an uninitialized value in variable ret. Fix this by returning the error code returned from the failed call to sysfs_create_link.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Colin Ian King
---
 drivers/nvme/host/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 63b37d08ac98..f6acbff3e3bc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2540,8 +2540,9 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		list_add_tail(&subsys->entry, &nvme_subsystems);
 	}

-	if (sysfs_create_link(&subsys->dev.kobj, &ctrl->device->kobj,
-			dev_name(ctrl->device))) {
+	ret = sysfs_create_link(&subsys->dev.kobj, &ctrl->device->kobj,
+			dev_name(ctrl->device));
+	if (ret) {
 		dev_err(ctrl->device,
 			"failed to create sysfs link from subsystem.\n");
 		goto out_put_subsystem;
Re: [PATCH v3] nvme: allow 64-bit results in passthru commands
Looks fine to me, Reviewed-by: Sagi Grimberg Keith, Christoph?
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Sagi, sorry it took a while to bring my system back online. With the patch, the IOPS drop is about the same as with the 1st patch. I think the excessive context switches are causing the drop in IOPS. The following are captured by "perf sched record" for 30 seconds during tests.

"perf sched latency"
With patch:    fio:(82) |  937632.706 ms | 1782255 | avg: 0.209 ms | max: 63.123 ms | max at:  768.274023 s
Without patch: fio:(82) | 2348323.432 ms |   18848 | avg: 0.295 ms | max: 28.446 ms | max at: 6447.310255 s

Without patch means the proposed hard-irq patch? It means the current upstream code without any patch, but it's prone to soft lockup. Ming's proposed hard-irq patch gets similar results to "without patch", however it fixes the soft lockup. Thanks for the clarification. The problem with what Ming is proposing in my mind (and it's an existing problem that exists today) is that nvme takes precedence over anything else until it absolutely cannot hog the cpu in hardirq. In the thread, Ming referenced a case where today, if the cpu core has net softirq activity, it cannot make forward progress. So with Ming's suggestion, net softirq will eventually make progress, but it creates an inherent fairness issue. Who said that nvme completions should come faster than the net rx/tx or another I/O device (or hrtimers or sched events...)? As much as I'd like nvme to complete as soon as possible, I might have other activities in the system that are as important if not more. So I don't think we can solve this with something that is not cooperative or fair with the rest of the system. If we are context switching too much, it means the soft-irq operation is not efficient, not necessarily that the completion path is running in soft-irq. Is your kernel compiled with full preemption or voluntary preemption? The tests are based on the Ubuntu 18.04 kernel configuration.
Here are the parameters:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

I see, so it still seems that irq_poll_softirq is not efficient in reaping completions. Reaping the completions on its own is pretty much the same in hard and soft irq, so it's really the scheduling part that is creating the overhead (which does not exist in hard irq). Question: when you test without the patch (completions are coming in hard-irq), do the fio threads that run on the cpu cores handling interrupts get substantially lower throughput than the rest of the fio threads? I would expect the fio threads running on the first 32 cores to get very low iops (overpowered by the nvme interrupts) and the rest doing much more, given that nvme has almost no limit on how much time it can spend processing completions. If need_resched() is causing us to context switch too aggressively, does changing that to local_softirq_pending() make things better?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index d8eab563fa77..05d524fcaf04 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 		/*
 		 * If softirq window is exhausted then punt.
 		 */
-		if (need_resched())
+		if (local_softirq_pending())
 			break;
 	}
--
Although this can potentially prevent other threads from making forward progress... If it is better, perhaps we also need a time limit as well. Perhaps we should add statistics/tracing on how many completions we are reaping per invocation...
Re: [PATCH v2] nvme-pci: Save PCI state before putting drive into deepest state
Keith, can we get a review from you for this?
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Hey Ming, Ok, so the real problem is per-cpu bounded tasks. I share Thomas' opinion about a NAPI-like approach. We already have that, it's irq_poll, but it seems that for this use-case we get lower performance for some reason. I'm not entirely sure why that is; maybe it's because we need to mask interrupts because we don't have an "arm" register in nvme like network devices have? Long observed that IOPS drops too much by switching to threaded irq. If softirqd is woken up for handling softirq, the performance shouldn't be better than threaded irq. It's true that it shouldn't be any faster, but what irqpoll already has, and we don't need to reinvent, is a proper budgeting mechanism that needs to occur when multiple devices map irq vectors to the same cpu core. irqpoll already maintains a percpu list and dispatches ->poll with a budget that the backend enforces, and irqpoll multiplexes between them. Having this mechanism in irq (hard or threaded) context sounds a bit unnecessary. It seems like we're attempting to stay in irq context for as long as we can instead of scheduling to softirq/thread context if we have more than a minimal amount of work to do. Without at least understanding why softirq/thread degrades us so much, this code seems like the wrong approach to me. Interrupt context will always be faster, but that is not a sufficient reason to spend as much time as possible there, is it? We should also keep in mind that the networking stack has been doing this for years; I would try to understand why this cannot work for nvme before dismissing it. Especially since Long found that context switches increased a lot after applying your irq poll patch.
http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Oh, I didn't see that one, wonder why... thanks! 5% improvement, I guess we can buy that for other users as is :) If we suffer from lots of context switches while the CPU is flooded with interrupts, then I would argue that we're re-raising softirq too much. In this use-case, my assumption is that the cpu cannot keep up with the interrupts, not that it doesn't reap enough (we also reap the first batch in interrupt context...). Perhaps making irqpoll continue until it must resched would improve things further? Although this is a latency vs. efficiency tradeoff, looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
 * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
 * certain cases, such as stop_machine(), jiffies may cease to
 * increment and so we need the MAX_SOFTIRQ_RESTART limit as
 * well to make sure we eventually return from this method.
 *
 * These limits have been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
"

Long, does this patch make any difference? Sagi, sorry it took a while to bring my system back online. With the patch, the IOPS drop is about the same as with the 1st patch. I think the excessive context switches are causing the drop in IOPS. The following are captured by "perf sched record" for 30 seconds during tests.
"perf sched latency" With patch: fio:(82) | 937632.706 ms | 1782255 | avg:0.209 ms | max: 63.123 ms | max at:768.274023 s without patch: fio:(82) |2348323.432 ms |18848 | avg:0.295 ms | max: 28.446 ms | max at: 6447.310255 s Without patch means the proposed hard-irq patch? If we are context switching too much, it means the soft-irq operation is not efficient, not necessarily the fact that the completion path is running in soft-irq.. Is your kernel compiled with full preemption or voluntary preemption? Look closer at each CPU, we can see ksoftirqd is competing CPU with fio (and effectively throttle other fio processes) (captured in /sys/kernel/debug/tracing, echo sched:* >set_event) On CPU1 with patch: (note that the prev_state for fio is "R", it's preemptively scheduled) <...>-4077 [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120 <...>-17[001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120 <...>-4077 [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120 <...>-17[001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=f
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
It seems like we're attempting to stay in irq context for as long as we can instead of scheduling to softirq/thread context if we have more than a minimal amount of work to do. Without at least understanding why softirq/thread degrades us so much this code seems like the wrong approach to me. Interrupt context will always be faster, but it is not a sufficient reason to spend as much time as possible there, is it? If extra latency is added in IO completion path, this latency will be introduced in the submission path, because the hw queue depth is fixed, which is often small. Especially in case of multiple submission vs. single(shared) completion, the whole hw queue tags can be exhausted easily. This is why the patch is reaping the first batch from hard-irq, but only if it has more will defer to softirq. So I'm not sure the short QD use case applies here... I guess no such effect for networking IO. Maybe, maybe not...
Re: [PATCH] Added QUIRKs for ADATA XPG SX8200 Pro 512GB
This does not apply on nvme-5.4, can you please respin a patch that cleanly applies?
Re: [PATCH] Added QUIRKs for ADATA XPG SX8200 Pro 512GB
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Hey Ming, Ok, so the real problem is per-cpu bounded tasks. I share Thomas' opinion about a NAPI-like approach. We already have that, it's irq_poll, but it seems that for this use-case we get lower performance for some reason. I'm not entirely sure why that is; maybe it's because we need to mask interrupts because we don't have an "arm" register in nvme like network devices have? Long observed that IOPS drops too much by switching to threaded irq. If softirqd is woken up for handling softirq, the performance shouldn't be better than threaded irq. It's true that it shouldn't be any faster, but what irqpoll already has, and we don't need to reinvent, is a proper budgeting mechanism that needs to occur when multiple devices map irq vectors to the same cpu core. irqpoll already maintains a percpu list and dispatches ->poll with a budget that the backend enforces, and irqpoll multiplexes between them. Having this mechanism in irq (hard or threaded) context sounds a bit unnecessary. It seems like we're attempting to stay in irq context for as long as we can instead of scheduling to softirq/thread context if we have more than a minimal amount of work to do. Without at least understanding why softirq/thread degrades us so much, this code seems like the wrong approach to me. Interrupt context will always be faster, but that is not a sufficient reason to spend as much time as possible there, is it? We should also keep in mind that the networking stack has been doing this for years; I would try to understand why this cannot work for nvme before dismissing it. Especially since Long found that context switches increased a lot after applying your irq poll patch. http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html Oh, I didn't see that one, wonder why... thanks! 5% improvement, I guess we can buy that for other users as is :) If we suffer from lots of context switches while the CPU is flooded with interrupts, then I would argue that we're re-raising softirq too much.
In this use-case, my assumption is that the cpu cannot keep up with the interrupts, not that it doesn't reap enough (we also reap the first batch in interrupt context...). Perhaps making irqpoll continue until it must resched would improve things further? Although this is a latency vs. efficiency tradeoff, looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
 * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
 * certain cases, such as stop_machine(), jiffies may cease to
 * increment and so we need the MAX_SOFTIRQ_RESTART limit as
 * well to make sure we eventually return from this method.
 *
 * These limits have been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
"

Long, does this patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..d8eab563fa77 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,8 +12,6 @@
 #include
 #include

-static unsigned int irq_poll_budget __read_mostly = 256;
-
 static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

 /**
@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);
 static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 {
-	struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-	int rearm = 0, budget = irq_poll_budget;
-	unsigned long start_time = jiffies;
+	struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+	LIST_HEAD(list);

 	local_irq_disable();
+	list_splice_init(irqpoll_list, &list);
+	local_irq_enable();

-	while (!list_empty(list)) {
+	while (!list_empty(&list)) {
 		struct irq_poll *iop;
 		int work, weight;

-		/*
-		 * If softirq window is exhausted then punt.
-		 */
-		if (budget <= 0 || time_after(jiffies, start_time)) {
-			rearm = 1;
-			break;
-		}
-
-		local_irq_enable();
-
 		/* Even though interrupts have been re-enabled, this
 		 * access is safe because interrupts can only add new
 		 * entries to the tail of this list, and only ->poll()
 		 * calls can remove this head entry from the list.
 		 */
-		iop = list_entry(list->next, struct irq_poll, list);
+		iop = list_first_entry(&list, struct irq_poll, list);

 		weight = iop->weight;
 		work = 0;
 		if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
 			work = iop->poll(iop, weight);

-		budget -= work;
-
-		local_irq_disable();
-
 		/*
 		 * Drivers must not modify the iopoll state, if they
 		 * consume their assigned weight (or more, some drivers can't
@@ -125,11 +110,21
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Ok, so the real problem is per-cpu bounded tasks. I share Thomas opinion about a NAPI like approach. We already have that, its irq_poll, but it seems that for this use-case, we get lower performance for some reason. I'm not entirely sure why that is, maybe its because we need to mask interrupts because we don't have an "arm" register in nvme like network devices have?
Re: [PATCH] nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
Reviewed-by: Sagi Grimberg applied to nvme-5.4
Re: [PATCHv2] nvme: Assign subsy instance from first ctrl
Looks good, Reviewed-by: Sagi Grimberg Applied to nvme-5.4
Re: [PATCH] nvme: tcp: remove redundant assignment to variable ret
Applied to nvme-5.4
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
Still don't understand how this is ok... I have /dev/nvme0 representing a network endpoint that I would discover from; it is raising me an event to do a discovery operation (namely to issue an ioctl to it), so my udev code calls a systemd script. By the time I actually get to do that, /dev/nvme0 represents a new network endpoint (where the event is no longer relevant). I would rather the discovery explicitly fail than give me something different, so we pass some arguments that we verify in the operation. It's a stretch case, but it was raised by people as a potential issue. Ok, and how do you handle this same thing for something like /dev/sda? (hint: it isn't new, and is something we solved over a decade ago) If you worry about stuff like this, use a persistent device naming scheme and have your device node be pointed to by a symlink. Create that symlink using the information in the initial 'ADD' uevent. That way, when userspace opens the device node, it "knows" exactly which one it opens. It sounds like you have a bunch of metadata to describe these uniquely, so pass that in the ADD event, not in some other crazy non-standard manner. We could send these variables when adding the device and then validate them using a rich-text-explanatory symlink. Seems slightly backwards to me, but that would work too. We create the char device using device_add (in nvme_init_subsystem); I didn't find any way to append env variables to that ADD uevent. Did you mean that we should add another flavor of device_add that accepts char *envp_ext[]? What exactly is the "standard manner" to pass these variables to the chardev KOBJ_ADD uevent? All other examples I could find use KOBJ_CHANGE to pass private stuff...
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event. The same is generally true for lots of kernel devices. We could reduce the chance by using the idr cyclic allocator. Well, it was raised by Hannes and James, so I'll ask them to respond here, because I don't mind having it this way. I personally think that this is a better approach than having a cyclic idr allocator. In general, I don't necessarily think it is a good idea to have cyclic controller enumerations if we don't absolutely have to... We hit it right and left without the cyclic allocator, but that won't necessarily remove it. Perhaps we should have had a unique token assigned to the controller, and have the event pass the name and the token. The cli would then, if the token is present, validate it via an ioctl before proceeding with other ioctls. All the connection arguments were added due to the reuse issue, and then to solve the question of how to verify and/or look up the desired controller, using the shotgun approach rather than being very pointed, which is what the name/token would do. This unique token is: trtype:traddr:trsvcid:host-traddr ...
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
You are correct that this information can be derived from sysfs, but the main reason why we add these here is because in a udev rule we can't just go ahead and start looking these up and parsing them. We could send the discovery aen with NVME_CTRL_NAME and then have systemd run something like: nvme connect-all -d nvme0 --sysfs and have nvme-cli retrieve all this stuff from sysfs? Actually that may be a problem. There could be a hypothetical case where, after the event was fired and before it was handled, the discovery controller went away and came back again with a different controller instance, and the old instance is now a different discovery controller. This is why we need this information in the event, and we verify this information in sysfs in nvme-cli. Well, that must be a usual issue with uevents, right? Don't we usually have an increasing serial number for that or something? Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event. So? You will have gotten the remove and then new addition uevent in order showing you this. So your userspace code knows that something went away and then came back properly, so you should be kept in sync. Still don't understand how this is ok... I have /dev/nvme0 representing a network endpoint that I would discover from; it is raising me an event to do a discovery operation (namely to issue an ioctl to it), so my udev code calls a systemd script. By the time I actually get to do that, /dev/nvme0 represents a new network endpoint (where the event is no longer relevant). I would rather the discovery explicitly fail than give me something different, so we pass some arguments that we verify in the operation.
Its a stretch case, but it was raised by people as a potential issue.
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event. The same is generally true for lots of kernel devices. We could reduce the chance by using the idr cyclic allocator. Well, it was raised by Hannes and James, so I'll ask them to respond here, because I don't mind having it this way. I personally think that this is a better approach than having a cyclic idr allocator. In general, I don't necessarily think it is a good idea to have cyclic controller enumerations if we don't absolutely have to...
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
You are correct that this information can be derived from sysfs, but the main reason why we add these here is because in a udev rule we can't just go ahead and start looking these up and parsing them. We could send the discovery aen with NVME_CTRL_NAME and then have systemd run something like: nvme connect-all -d nvme0 --sysfs and have nvme-cli retrieve all this stuff from sysfs? Actually that may be a problem. There could be a hypothetical case where, after the event was fired and before it was handled, the discovery controller went away and came back again with a different controller instance, and the old instance is now a different discovery controller. This is why we need this information in the event, and we verify this information in sysfs in nvme-cli. Well, that must be a usual issue with uevents, right? Don't we usually have an increasing serial number for that or something? Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event.
Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size
I'll fix it. Note that I'm going to take it out of the tree soon because it will have conflicts with Jens' for-5.4/block, so we will send it to Jens after the initial merge window, after he rebases off of Linus. Conflicts too hard to fix up at merge time? Otherwise I could just rebase on top of Jens and put it in a topic branch... The quirk enumeration conflicts with 5.3-rc. Not a big deal, just thought it'd be easier to handle that way. Rebasing on top of Jens won't help because his for-5.4/block does not have the rc code yet.
Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size
+#define NVME_NVM_ADMSQES	6
 #define NVME_NVM_IOSQES		6
 #define NVME_NVM_IOCQES		4

The NVM in the two defines here stands for the NVM command set, so this should just be named NVME_ADM_SQES or so. But except for this the patch looks good: Reviewed-by: Christoph Hellwig So maybe Sagi can just fix this up in the tree. Ah ok, I missed the meaning. Thanks. Sagi, can you fix that up or do you need me to resubmit? I'll fix it. Note that I'm going to take it out of the tree soon because it will have conflicts with Jens' for-5.4/block, so we will send it to Jens after the initial merge window, after he rebases off of Linus.
Re: [PATCH v7 08/14] nvmet-core: allow one host per passthru-ctrl
I don't understand why we don't limit a regular ctrl to single access but we do it for the PT ctrl. I guess the block layer helps to sync between multiple accesses in parallel, but we can do that as well. Also, let's say you limit the access to this subsystem to 1 user; the bdev is still accessible to a local user, and you can also create a different subsystem that will use this device (PT and non-PT ctrl). Sagi, can you explain the trouble you meant and how this limitation solves it? It's different emulating the controller with all its admin commands vs. passing them through to the nvme device (think of Format NVM). We don't need to support the format command for the PT ctrl, as we don't support other commands such as create_sq/cq. That is just an example; basically every command that we are not aware of we simply pass through to the drive without knowing the implications in a multi-host environment.
Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
Sagi, Here are the test results. Benchmark command: fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1 With your patch: 1720k IOPS With threaded interrupts: 1320k IOPS With just interrupts: 3720k IOPS Interrupts are the fastest but we need to find a way to throttle it. This is the workload that generates the flood? If so I did not expect that this would be the perf difference.. If completions keep pounding on the cpu, I would expect irqpoll to simply keep running forever and poll the cqs. There is no fundamental reason why polling would be faster in an interrupt, what matters could be: 1. we reschedule more than we need to 2. we don't reap enough completions in every poll round, which will trigger rearming the interrupt and then when it fires reschedule another softirq... Maybe we need to take care of some irq_poll optimizations? Does this (untested) patch make any difference? 
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..0e94183eba15 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,7 +12,8 @@
 #include
 #include
 
-static unsigned int irq_poll_budget __read_mostly = 256;
+static unsigned int irq_poll_budget __read_mostly = 3000;
+unsigned int __read_mostly irqpoll_budget_usecs = 2000;
 
 static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
 
@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
 
 static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 {
-	struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-	int rearm = 0, budget = irq_poll_budget;
-	unsigned long start_time = jiffies;
+	struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+	unsigned int budget = irq_poll_budget;
+	unsigned long time_limit =
+		jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
+	LIST_HEAD(list);
 
 	local_irq_disable();
+	list_splice_init(irqpoll_list, &list);
+	local_irq_enable();
 
-	while (!list_empty(list)) {
+	while (!list_empty(&list)) {
 		struct irq_poll *iop;
 		int work, weight;
 
-		/*
-		 * If softirq window is exhausted then punt.
-		 */
-		if (budget <= 0 || time_after(jiffies, start_time)) {
-			rearm = 1;
-			break;
-		}
-
-		local_irq_enable();
-
 		/* Even though interrupts have been re-enabled, this
 		 * access is safe because interrupts can only add new
 		 * entries to the tail of this list, and only ->poll()
 		 * calls can remove this head entry from the list.
 		 */
-		iop = list_entry(list->next, struct irq_poll, list);
+		iop = list_first_entry(&list, struct irq_poll, list);
 
 		weight = iop->weight;
 		work = 0;
@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 
 		budget -= work;
 
-		local_irq_disable();
-
 		/*
 		 * Drivers must not modify the iopoll state, if they
 		 * consume their assigned weight (or more, some drivers can't
@@ -125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 			if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
 				__irq_poll_complete(iop);
 			else
-				list_move_tail(&iop->list, list);
+				list_move_tail(&iop->list, &list);
 		}
+
+		/*
+		 * If softirq window is exhausted then punt.
+		 */
+		if (budget <= 0 || time_after_eq(jiffies, time_limit))
+			break;
 	}
 
-	if (rearm)
+	local_irq_disable();
+
+	list_splice_tail_init(irqpoll_list, &list);
+	list_splice(&list, irqpoll_list);
+	if (!list_empty(irqpoll_list))
 		__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
 
 	local_irq_enable();
--
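The core of the patch above — splice the per-CPU list, reap up to each poller's weight per round, stop on a combined budget/time limit, and re-raise the softirq only if work remains — can be modeled in a small userspace sketch. This is a toy model under stated assumptions, not the kernel code; all `toy_*` names are illustrative and the time limit is elided:

```c
#include <stddef.h>

/* Toy model of the reworked irq_poll_softirq() budgeting: each poller
 * reaps up to 'weight' completions per round; the loop stops when the
 * total budget is spent, and reports whether another softirq round is
 * needed (the equivalent of re-raising IRQ_POLL_SOFTIRQ). */
struct toy_poller {
    int pending;  /* completions still queued on this context */
    int weight;   /* max completions reaped per poll round */
};

/* Returns total completions reaped; *rearm is set if work remains. */
static int toy_poll_loop(struct toy_poller *p, size_t n, int budget, int *rearm)
{
    int done = 0;
    size_t i = 0;

    while (budget > 0) {
        size_t scanned;

        /* find the next poller with pending work (round-robin) */
        for (scanned = 0; scanned < n && p[i].pending == 0; scanned++)
            i = (i + 1) % n;
        if (scanned == n)
            break;  /* all queues drained */

        int work = p[i].pending < p[i].weight ? p[i].pending : p[i].weight;
        if (work > budget)
            work = budget;
        p[i].pending -= work;
        done += work;
        budget -= work;
        i = (i + 1) % n;  /* rotate, like list_move_tail() in the patch */
    }

    *rearm = 0;
    for (size_t j = 0; j < n; j++)
        if (p[j].pending)
            *rearm = 1;
    return done;
}
```

The design point the model captures: exhausting the budget mid-list no longer drops the remaining pollers; they stay queued and the softirq is re-raised, which is what the `list_splice_tail_init()`/`list_splice()` dance at the end of the patch achieves.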
Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
From: Long Li

When an NVMe hardware queue is mapped to several CPU queues, it is possible that the CPU this hardware queue is bound to is flooded by returning I/O for other CPUs. For example, consider the following scenario:
1. CPUs 0, 1, 2 and 3 share the same hardware queue
2. the hardware queue interrupts CPU 0 for I/O response
3. processes on CPUs 1, 2 and 3 keep sending I/Os

CPU 0 may be flooded with interrupts from the NVMe device that are I/O responses for CPUs 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends all its time serving NVMe and other system interrupts, but doesn't get a chance to run in process context.

To fix this, CPU 0 can schedule a work item to complete the I/O request when it detects the scheduler is not making progress. This serves multiple purposes:
1. This CPU has to be scheduled to complete the request. The other CPUs can't issue more I/Os until some previous I/Os are completed. This helps this CPU get out of NVMe interrupts.
2. This acts as a throttling mechanism for NVMe devices, in that the device cannot starve a CPU while servicing I/Os from other CPUs.
3. This CPU can make progress on RCU and other work items on its queue.

The problem is indeed real, but this is the wrong approach in my mind. We already have irqpoll, which takes care of properly budgeting polling cycles and not hogging the cpu. I've sent an rfc for this particular problem before [1]. At the time, IIRC, Christoph suggested that we poll the first batch directly from the irq context and reap the rest in the irqpoll handler.
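The throttling argument in point 1 and 2 above is a back-pressure effect: submitters can keep at most their iodepth outstanding, so once the completing CPU stops reaping (because completion was deferred to process context), new arrivals stop too. A toy userspace sketch of that invariant, with illustrative `toy_*` names (not any real kernel or blk-mq API):

```c
/* Toy model of the back-pressure throttle: submitters can keep at most
 * 'qd' I/Os outstanding, so if the completing CPU defers completion to
 * a work item, the interrupt arrival rate is capped by the iodepth. */
struct toy_queue {
    int outstanding;  /* submitted but not yet completed */
    int qd;           /* per-submitter iodepth limit */
};

/* Submit as many I/Os as the iodepth allows; returns how many were sent. */
static int toy_submit(struct toy_queue *q, int want)
{
    int room = q->qd - q->outstanding;
    int sent = want < room ? want : room;
    if (sent < 0)
        sent = 0;
    q->outstanding += sent;
    return sent;
}

/* Complete up to 'n' I/Os (what the deferred work item would reap). */
static int toy_complete(struct toy_queue *q, int n)
{
    int done = n < q->outstanding ? n : q->outstanding;
    q->outstanding -= done;
    return done;
}
```

With iodepth 64, a burst of 100 submissions sends only 64; further submissions send nothing until the (deferred) completion path reaps some, which is exactly why deferring completion lets the flooded CPU catch up.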
[1]: http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html

How about something like this instead:
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71127a366d3c..84bf16d75109 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "trace.h"
 #include "nvme.h"
@@ -32,6 +33,7 @@
 
 #define CQ_SIZE(q)	((q)->q_depth * sizeof(struct nvme_completion))
 #define SGES_PER_PAGE	(PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ	256
 
 /*
  * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
 	u32 *dbbuf_cq_ei;
+	struct irq_poll iop;
 	struct completion delete_done;
 };
 
@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
+	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+	u16 start, end;
+	int completed;
+
+	completed = nvme_process_cq(nvmeq, &start, &end, budget);
+	nvme_complete_cqes(nvmeq, start, end);
+	if (completed < budget) {
+		irq_poll_complete(&nvmeq->iop);
+		enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+	}
+
+	return completed;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1028,12 +1048,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	rmb();
 	if (nvmeq->cq_head != nvmeq->last_cq_head)
 		ret = IRQ_HANDLED;
-	nvme_process_cq(nvmeq, &start, &end, -1);
+	nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
 	nvmeq->last_cq_head = nvmeq->cq_head;
 	wmb();
 
 	if (start != end) {
 		nvme_complete_cqes(nvmeq, start, end);
+		if (nvme_cqe_pending(nvmeq)) {
+			disable_irq_nosync(irq);
+			irq_poll_sched(&nvmeq->iop);
+		}
 		return IRQ_HANDLED;
 	}
 
@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
+	irq_poll_disable(&nvmeq->iop);
 	dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
 			(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
 	if (!nvmeq->sq_cmds)
@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
 	nvmeq->dev = dev;
 	spin_lock_init(&nvmeq->sq_lock);
 	spin_lock_init(&nvmeq->cq_poll_lock);
+	irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--
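The handoff the patch above implements — reap at most NVME_POLL_BUDGET_IRQ CQEs in hard-irq context, and if more are pending, mask the interrupt and let irqpoll drain the rest until a short round re-enables it — can be sketched as a two-state machine in userspace. A toy model only; the `toy_*` names are made up and the real handlers also complete requests and ring doorbells:

```c
enum toy_state { TOY_IRQ_ENABLED, TOY_IRQ_DISABLED_POLLING };

/* Toy model of the irq/irqpoll handoff: the hard-irq handler reaps at
 * most 'budget' CQEs; if any are still pending it masks the interrupt
 * (disable_irq_nosync + irq_poll_sched), and the irqpoll handler
 * re-enables it once a round reaps less than a full budget. */
struct toy_cq {
    int pending;          /* CQEs waiting in the completion queue */
    enum toy_state state;
};

static int toy_irq_handler(struct toy_cq *cq, int budget)
{
    int done = cq->pending < budget ? cq->pending : budget;

    cq->pending -= done;
    if (cq->pending)  /* nvme_cqe_pending(): switch to polled mode */
        cq->state = TOY_IRQ_DISABLED_POLLING;
    return done;
}

static int toy_irqpoll_handler(struct toy_cq *cq, int budget)
{
    int done = cq->pending < budget ? cq->pending : budget;

    cq->pending -= done;
    if (done < budget)  /* drained: irq_poll_complete() + enable_irq() */
        cq->state = TOY_IRQ_ENABLED;
    return done;
}
```

Starting from 600 pending CQEs with a budget of 256, the irq handler reaps 256 and switches to polling; one full irqpoll round reaps another 256 and stays polled; the final short round reaps the remaining 88 and re-arms the interrupt. The masked-interrupt window is what bounds hard-irq time on the flooded CPU.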