Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API
Regardless of the API discussion, the nvme-tcp bits look straightforward: Acked-by: Sagi Grimberg
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
request tag can't be zero? I forget... Of course it can. But the reserved tags are before the normal tags, so 0 would be a reserved tag for nvme. Right.
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
It is, but in this situation, the controller is sending a second completion that results in a use-after-free, which makes the transport irrelevant. Unless there is some other flow (which is unclear to me) that causes this, which is a bug that needs to be fixed rather than hidden with a safeguard. The kernel should not crash regardless of any network traffic that is sent to the system. It should not be possible to either intentionally or mistakenly construct packets that will deny service in this way. This is not specific to nvme-tcp. I can build an rdma or pci controller that can trigger the same crash... I saw a similar patch from Hannes implemented at the scsi level, and not in the individual scsi transports.. If scsi wants this too, this could be made generic at the blk-mq level. We just need to make something like blk_mq_tag_to_rq(), but return NULL if the request isn't started. Makes sense... I would also mention that a crash is not even the scariest issue that we can see here, because if the request happened to be reused we are in the silent data corruption realm... If this does happen, I think we have to come up with some way to mitigate it. We're not utilizing the full 16 bits of the command_id, so maybe we can append something like a generation sequence number that can be checked for validity. That's actually a great idea. scsi needs unique tags so it encodes the hwq in the upper 16 bits, giving the actual tag the lower 16 bits, which is more than enough for a single queue. We can do the same with a genctr that will increment on both submission and completion, and we can validate against it. This will be useful for all transports, so maintaining it in nvme_req(rq)->genctr and introducing a helper like: rq = nvme_find_tag(tagset, cqe->command_id) that will filter by genctr and locate the request. Also: nvme_validate_request_gen(rq, cqe->command_id) that would compare against it.
And then a helper to set the command_id like: cmd->common.command_id = nvme_request_command_id(rq) that will both increment the genctr and build a command_id from it. Thoughts?
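The genctr scheme proposed above can be sketched in a few lines of plain C. This is only an illustration of the idea, not the kernel implementation: the 4/12-bit split of the 16-bit command_id, the struct, and the function names are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split of the 16-bit command_id: upper 4 bits carry a
 * per-request generation counter, lower 12 bits carry the blk-mq tag.
 * The field widths are illustrative assumptions, not from the spec. */
#define CMDID_GEN_SHIFT 12
#define CMDID_TAG_MASK  ((1u << CMDID_GEN_SHIFT) - 1)

struct fake_req {
	uint8_t genctr;   /* bumped on every submission */
	uint16_t tag;     /* blk-mq tag, stable across reuse */
};

/* Build a command_id from the request, bumping the generation first,
 * in the spirit of the proposed nvme_request_command_id() helper. */
static uint16_t request_command_id(struct fake_req *rq)
{
	rq->genctr++;
	return (uint16_t)((rq->genctr & 0xf) << CMDID_GEN_SHIFT) |
	       (rq->tag & CMDID_TAG_MASK);
}

/* Validate a completion's command_id against the request's current
 * generation, in the spirit of nvme_validate_request_gen(). */
static int command_id_is_current(const struct fake_req *rq, uint16_t command_id)
{
	return (command_id >> CMDID_GEN_SHIFT) == (rq->genctr & 0xf) &&
	       (command_id & CMDID_TAG_MASK) == rq->tag;
}
```

The point of the scheme: a stale or duplicated completion carries the previous generation, so it fails validation even though its tag still maps to a (possibly reused) request.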
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
What we can do, though, is checking the 'state' field in the tcp request, and only allow completions for commands which are in a state allowing for completions. Let's see if I can whip up a patch. That would be great. BTW in the crash dump I am looking at now, it looks like pdu->command_id was zero in nvme_tcp_recv_data(), and blk_mq_tag_to_rq() returned a request struct that had not been used. So I think we do need to check that the tag was actually allocated. request tag can't be zero? I forget...
Re: [PATCH] nvme: Export fast_io_fail_tmo to sysfs
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
It is, but in this situation, the controller is sending a second completion that results in a use-after-free, which makes the transport irrelevant. Unless there is some other flow (which is unclear to me) that causes this, which is a bug that needs to be fixed rather than hidden with a safeguard. The kernel should not crash regardless of any network traffic that is sent to the system. It should not be possible to either intentionally or mistakenly construct packets that will deny service in this way. This is not specific to nvme-tcp. I can build an rdma or pci controller that can trigger the same crash... I saw a similar patch from Hannes implemented at the scsi level, and not in the individual scsi transports.. I would also mention that a crash is not even the scariest issue that we can see here, because if the request happened to be reused we are in the silent data corruption realm...
Re: [PATCH v2] nvme: disallow passthru cmd from targeting a nsid != nsid of the block dev
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] infiniband: Fix a use after free in isert_connect_request
Acked-by: Sagi Grimberg
Re: [PATCH] nvme/rdma: Fix a use after free in nvmet_rdma_write_data_done
In nvmet_rdma_write_data_done, rsp is recovered via wc->wr_cqe and freed by nvmet_rdma_release_rsp(). But after that, pr_info() used the freed chunk's member object and could leak the freed chunk's address via wc->wr_cqe by computing the offset.

Signed-off-by: Lv Yunlong
---
 drivers/nvme/target/rdma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 06b6b742bb21..6c1f3ab7649c 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -802,9 +802,8 @@ static void nvmet_rdma_write_data_done(struct ib_cq *cq, struct ib_wc *wc)
 	nvmet_req_uninit(&rsp->req);
 	nvmet_rdma_release_rsp(rsp);
 	if (wc->status != IB_WC_WR_FLUSH_ERR) {
-		pr_info("RDMA WRITE for CQE 0x%p failed with status %s (%d).\n",
-			wc->wr_cqe, ib_wc_status_msg(wc->status),
-			wc->status);
+		pr_info("RDMA WRITE for CQE failed with status %s (%d).\n",
+			ib_wc_status_msg(wc->status), wc->status);

There is nothing wrong with this reference, wr_cqe is a valid reference and I think Israel added this for some extra information that may be useful to him. Israel? you ok with this removed?

 		nvmet_rdma_error_comp(queue);
 	}
 	return;
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
Hi Sagi, On Fri, Mar 05, 2021 at 11:57:30AM -0800, Sagi Grimberg wrote: Daniel, again, there is nothing specific about this to nvme-tcp, this is a safeguard against a funky controller (or a different bug that is hidden by this). As far as I can tell, the main difference between nvme-tcp and FC/NVMe is that nvme-tcp does not have FW or a big driver which filters out some noise from a misbehaving controller. I haven't really checked the other transports but I wouldn't be surprised if they share the same properties as FC/NVMe. The same can happen in any other transport so I would suggest that if this is a safeguard we want to put in place, we should make it a generic one. i.e. nvme_tag_to_rq() that _all_ transports call consistently. Okay, I'll review all the relevant code and see what could be made more generic and consistent. Though I think nvme-tcp plays in a different league as it is exposed to normal networking traffic and this is a very hostile environment. It is, but in this situation, the controller is sending a second completion that results in a use-after-free, which makes the transport irrelevant. Unless there is some other flow (which is unclear to me) that causes this, which is a bug that needs to be fixed rather than hidden with a safeguard.
Re: [PATCH v2] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() always returns a request if the tag id is in a valid range [0...max_tags). If the target replies with a tag for which we don't have a request, or it's not started, the host will likely corrupt data or simply crash. Add an additional check whether the request has been started; if not, reset the connection. This additional check will not protect against an invalid tag which maps to a request which has been started. There is nothing we can do about this. Though it will at least protect the host from crashing, which is generally thought to be the right thing to do. Daniel, again, there is nothing specific about this to nvme-tcp, this is a safeguard against a funky controller (or a different bug that is hidden by this). The same can happen in any other transport so I would suggest that if this is a safeguard we want to put in place, we should make it a generic one. i.e. nvme_tag_to_rq() that _all_ transports call consistently.
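The generic guard suggested above ("like blk_mq_tag_to_rq(), but return NULL if the request isn't started") can be modeled in a few lines. This is a toy userspace model to show the two rejection cases; the types and names are invented for the example, not blk-mq API.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the suggested generic helper: return NULL when the tag
 * from the wire is out of range, or when the request it maps to was
 * allocated but never started.  Illustrative only, not blk-mq. */
enum req_state { REQ_IDLE, REQ_STARTED };

struct toy_req {
	enum req_state state;
};

struct toy_tagset {
	struct toy_req *rqs;
	unsigned int nr_tags;
};

static struct toy_req *toy_tag_to_started_rq(struct toy_tagset *set,
					     unsigned int tag)
{
	struct toy_req *rq;

	if (tag >= set->nr_tags)        /* bogus tag from the wire */
		return NULL;
	rq = &set->rqs[tag];
	if (rq->state != REQ_STARTED)   /* allocated but never dispatched */
		return NULL;
	return rq;
}
```

As the thread notes, this cannot catch a bogus tag that happens to map to a started request; it only prevents completing a request that was never issued.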
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right? 'during TCP LIF toggles and aggr relocates' testing the host crashes. TBH, I do not really know what is happening or what the test does. Still trying to figure out what's going on. Well, I think we should probably figure out why that is happening first. I was just very surprised how much the code trusts the other side to behave correctly. What does pci/rdma/fc do differently? What does scsi do here? Yes, which is why I don't think this check is very useful.. I actually view that as a valid protection against spoofed frames. Without it it's easy to crash the machine by injecting fake completions with random command ids. In this test scenario it's not even a spoofed frame; maybe just a confused controller. Maybe... I am still not sure how this patch helps here though...
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right? Yes, which is why I don't think this check is very useful.. I actually view that as a valid protection against spoofed frames. Without it it's easy to crash the machine by injecting fake completions with random command ids. And this doesn't help because the command can have been easily reused and started... What is this protecting against? Note that none of the other transports checks that, why should tcp?
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request? If that is the case, then that must mean it's possible the driver could have started the command id just before the bogus completion check. Data corruption, right? Yes, which is why I don't think this check is very useful..
Re: [PATCH] nvme-tcp: Check if request has started before processing it
blk_mq_tag_to_rq() will always return a request if the command_id is in the valid range. Check if the request has been started. If we blindly process the request we might double complete a request which can be fatal. How did you get to this one? did the controller send a completion for a completed/bogus request?
Re: [PATCH] nvme: convert sysfs sprintf/snprintf family to sysfs_emit
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-multipath: Early exit if no path is available
You can't see exactly where it dies but I followed the assembly to nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head, old) which returns NULL but nvme_next_ns() is returning NULL eventually (list_next_or_null_rcu()). So there is another bug causing nvme_next_ns to behave abnormally. I reviewed the code around head->list and head->current_path, and I found 2 bugs that may cause this: First, I already sent the patch, see: https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ Second, in nvme_ns_remove, list_del_rcu is called before nvme_mpath_clear_current_path. This may cause "old" to be deleted from "head" while "old" is still in use. I'm not sure there's any other consideration here; I will check it and try to fix it. The reason why we first remove from head->list and only then clear current_path is because the other way around there is no way to guarantee that the ns won't be assigned as current_path again (because it is in head->list). ok, I see. nvme_ns_remove fences continued deletion of the ns by synchronizing the srcu such that the current_path clearance is visible for sure. The list will be like this: head->next = ns1; ns1->next = head; old->next = ns1; This may cause an infinite loop in nvme_round_robin_path. for (ns = nvme_next_ns(head, old); ns != old; ns = nvme_next_ns(head, ns)) The ns will always be ns1, and then infinite loop. Who is being removed? I'm not following
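The infinite-loop scenario described above can be reproduced with a tiny circular-list model: "old" has been unlinked, but its next pointer still leads into the list, so the `ns != old` exit condition can never be met. This is a bounded userspace simulation with invented names, not the nvme-multipath code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the scenario in the mail: the remaining list is
 * head <-> ns1, while "old" was unlinked but old->next still points
 * at ns1.  Names are illustrative. */
struct node {
	struct node *next;
};

/* Like nvme_next_ns(): advance one element, skipping the head sentinel. */
static struct node *next_ns(struct node *head, struct node *ns)
{
	ns = ns->next;
	if (ns == head)
		ns = ns->next;
	return ns;
}
```

Starting the traversal from the stale "old" yields ns1 on every step: the loop never reaches "old" again because "old" is no longer on the list.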
Re: [PATCH v2] nvme-multipath: Early exit if no path is available
You can't see exactly where it dies but I followed the assembly to nvme_round_robin_path(). Maybe it's not the initial nvme_next_ns(head, old) which returns NULL but nvme_next_ns() is returning NULL eventually (list_next_or_null_rcu()). So there is another bug causing nvme_next_ns to behave abnormally. I reviewed the code around head->list and head->current_path, and I found 2 bugs that may cause this: First, I already sent the patch, see: https://lore.kernel.org/linux-nvme/20210128033351.22116-1-lengc...@huawei.com/ Second, in nvme_ns_remove, list_del_rcu is called before nvme_mpath_clear_current_path. This may cause "old" to be deleted from "head" while "old" is still in use. I'm not sure there's any other consideration here; I will check it and try to fix it. The reason why we first remove from head->list and only then clear current_path is because the other way around there is no way to guarantee that the ns won't be assigned as current_path again (because it is in head->list). nvme_ns_remove fences continued deletion of the ns by synchronizing the srcu such that the current_path clearance is visible for sure.
Re: [PATCH] nvme: reject the ns when the block size is smaller than a sector
But this only catches a physical block size < 512 for NVMe, not any other block device. Please fix it for the general case in blk_queue_physical_block_size(). We actually call that later; it would probably be better to check here.. We had a series for that a short while ago that got lost. What was it called?
Re: [PATCH] nvme: Constify static attribute_group structs
FWIW, Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme: reject the ns when the block size is smaller than a sector
The NVMe spec (1.4a, figure 248) says: "A value smaller than 9 (i.e., 512 bytes) is not supported."

Signed-off-by: Li Feng
---
 drivers/nvme/host/core.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f320273fc672..1f02e6e49a05 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2161,6 +2161,12 @@ static int nvme_update_ns_info(struct nvme_ns *ns, struct nvme_id_ns *id)
 	blk_mq_freeze_queue(ns->disk->queue);
 	ns->lba_shift = id->lbaf[lbaf].ds;
+	if (ns->lba_shift < 9) {
+		pr_warn("%s: bad lba_shift(%d)\n", ns->disk->disk_name, ns->lba_shift);
+		ret = -1;

A meaningful errno would be useful. Also better to use dev_warn.

+		goto out_unfreeze;
+	}
+
 	nvme_set_queue_limits(ns->ctrl, ns->queue);
 	if (ns->head->ids.csi == NVME_CSI_ZNS) {

But this only catches a physical block size < 512 for NVMe, not any other block device. Please fix it for the general case in blk_queue_physical_block_size(). We actually call that later; it would probably be better to check here..
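Folding in the review feedback (meaningful errno rather than `ret = -1`), the validation itself reduces to a one-line check. A minimal userspace sketch of that check, with an invented helper name; the kernel patch would additionally use dev_warn and the unfreeze error path:

```c
#include <assert.h>
#include <errno.h>

/* NVMe 1.4a, figure 248: an LBA data size (ds) smaller than 9,
 * i.e. a block size below 512 bytes, is not supported. */
#define SECTOR_SHIFT 9

/* Hypothetical helper: return 0 if the reported lba_shift is usable,
 * or a meaningful errno (the review's suggestion) if it is too small. */
static int validate_lba_shift(unsigned int lba_shift)
{
	if (lba_shift < SECTOR_SHIFT)
		return -EINVAL;
	return 0;
}
```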
Re: [PATCH] iosched: Add i10 I/O Scheduler
Dear all: Thanks, again, for the very constructive discussions. I am writing back with quite a few updates: 1. We have now included a detailed comparison of the i10 scheduler with Kyber with NVMe-over-TCP (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf). In a nutshell, when operating with NVMe-over-TCP, i10 demonstrates its core tradeoff: higher latency, but also higher throughput. 2. We have now implemented an adaptive version of the i10 I/O scheduler, which uses the number of outstanding requests at the time of batch dispatch (and whether the dispatch was triggered by timeout or not) to adaptively set the batching size. The new results (https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf) show that i10-adaptive further improves performance for low loads, while keeping the performance for high loads. IMO, there is much to do on designing improved adaptation algorithms. 3. We have now updated the i10-evaluation document to include results for local storage access. The core take-away here is that i10-adaptive can achieve similar throughput and latency at low loads and at high loads when compared to noop, but still requires more work for lower loads. However, given that the tradeoff exposed by the i10 scheduler is particularly useful for remote storage devices (and as Jens suggested, perhaps for virtualized local storage access), I do agree with Sagi -- I think we should consider including it in the core, since this may be useful for a broad range of new use cases. We have also created a second version of the patch that includes these updates: https://github.com/i10-kernel/upstream-linux/blob/master/0002-iosched-Add-i10-I-O-Scheduler.patch As always, thank you for the constructive discussion and I look forward to working with you on this. Thanks Rachit, Would be good if you can send a formal patch for the adaptive queuing so people can have a look.
One thing that was left on the table is whether this should be a full-blown scheduler or core block infrastructure that would either be set on-demand or by default. I think that Jens and Ming expressed that this is something we should place in the block core, but I'd like to hear more; maybe Ming can elaborate on his ideas for how to do this.
Re: [PATCH] iosched: Add i10 I/O Scheduler
But if you think this has a better home, I'm assuming that the guys will be open to that. Also see the reply from Ming. It's a balancing act - don't want to add extra overhead to the core, but also don't want to carry an extra scheduler if the main change is really just variable dispatch batching. And since we already have a notion of that, seems worthwhile to explore that venue. I agree, The main difference is that this balancing is not driven from device resource pressure, but rather from an assumption of device specific optimization (and also with a specific optimization target), hence a scheduler a user would need to opt in to seemed like a good compromise. But maybe Ming has some good ideas on a different way to add it.. So here's another case - virtualized nvme. The commit overhead is suitably large there that performance suffers quite a bit, similarly to your remote storage case. If we had suitable logic in the core, then we could easily propagate this knowledge when setting up the queue. Then it could happen automatically, without needing a configuration to switch to a specific scheduler. Yes, these use-cases share characteristics. I'm not at all opposed to placing this in the core. I do think that in order to put something like this in the core, the bar needs to be higher such that an optimization target cannot be biased towards a workload (i.e. needs to be adaptive). I'm still not sure how we would build this on top of what we already have as it is really centered around the device being busy (which is not the case for nvme), but I didn't put enough thought into it yet.
Re: [PATCH] iosched: Add i10 I/O Scheduler
But if you think this has a better home, I'm assuming that the guys will be open to that. Also see the reply from Ming. It's a balancing act - don't want to add extra overhead to the core, but also don't want to carry an extra scheduler if the main change is really just variable dispatch batching. And since we already have a notion of that, seems worthwhile to explore that venue. I agree, The main difference is that this balancing is not driven from device resource pressure, but rather from an assumption of device specific optimization (and also with a specific optimization target), hence a scheduler a user would need to opt in to seemed like a good compromise. But maybe Ming has some good ideas on a different way to add it..
Re: [PATCH] iosched: Add i10 I/O Scheduler
I haven't taken a close look at the code yet so far, but one quick note that patches like this should be against the branches for 5.11. In fact, this one doesn't even compile against current -git, as blk_mq_bio_list_merge is now called blk_bio_list_merge. Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase before submitting.. Sorry about that. In any case, I did run this through some quick peak testing as I was curious, and I'm seeing about 20% drop in peak IOPS over none running this. Perf diff: 10.71% -2.44% [kernel.vmlinux] [k] read_tsc 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock You ran this with nvme? or null_blk? I guess neither would benefit from this because if the underlying device will not benefit from batching (at least enough for the extra cost of accounting for it) it will be counterproductive to use this scheduler. This is nvme, actual device. The initial posting could be a bit more explicit on the use case, it says: "For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in terms of IOPS per core over "noop" I/O scheduler." which made me very skeptical, as it sounds like it's raw device claims. You are absolutely right, that needs to be fixed. Does beg the question of why this is a new scheduler then. It's pretty basic stuff, something that could trivially just be added as a side effect of the core (and in fact we have much of it already). Doesn't really seem to warrant a new scheduler at all. There isn't really much in there. Not saying it absolutely warrants a new one, and it could I guess sit in the core, but this attempts to optimize for a specific metric while trading off others, which is exactly what I/O schedulers are for, optimizing for a specific metric. Not sure we want to build something biased towards throughput at the expense of latency into the block core. And, as mentioned, this is not well suited to all device types...
But if you think this has a better home, I'm assuming that the guys will be open to that. Was curious and wanted to look it up, but it doesn't exist. I think this is the right one: https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf We had some back and forth around the naming, hence this was probably omitted. That works, my local results were a bit worse than listed in there though. And what does this mean: "We note that Linux I/O scheduler introduces an additional kernel worker thread at the I/O dispatching stage" It most certainly does not for the common/hot case. Yes I agree, didn't see the local results. Probably some misunderstanding or a typo, I'll let them reply on this.
Re: [PATCH] iosched: Add i10 I/O Scheduler
blk-mq actually has a built-in batching (or sort of) mechanism, which is enabled if the hw queue is busy (hctx->dispatch_busy is > 0). We use EWMA to compute hctx->dispatch_busy, and it is adaptive, even though the implementation is quite coarse. But there should be much space to improve, IMO. You are correct, however nvme-tcp should be getting to dispatch_busy > 0 IIUC. It is reported that this way improves SQ high-end SCSI SSD very much[1], and MMC performance gets improved too[2]. [1] https://lore.kernel.org/linux-block/3cc3e03901dc1a63ef32e03618252...@mail.gmail.com/ [2] https://lore.kernel.org/linux-block/CADBw62o9eTQDJ9RvNgEqSpXmg6Xcq=2TxH0Hfxhp29uF2W=t...@mail.gmail.com/ Yes, the guys paid attention to the MMC related improvements that you made. The i10 I/O scheduler builds upon recent work on [6]. We have tested the i10 I/O scheduler with nvme-tcp optimizations [2,3] and batching dispatch [4], varying number of cores, varying read/write ratios, and varying request sizes, and with NVMe SSD and RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in terms of IOPS per core over "noop" I/O scheduler. These results are available at [5], and many additional results are presented in [6]. In case of none scheduler, basically the nvme driver won't provide any queue busy feedback, so the built-in batching dispatch simply doesn't work. Exactly. kyber scheduler uses io latency feedback to throttle and build io batches, can you compare i10 with kyber on nvme/nvme-tcp? I assume it should be simple to get, I'll let Rachit/Jaehyun comment. While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target in the i10 I/O scheduler is throughput maximization. Hence there is no latency target nor a need for a global tracking context, so a new scheduler is needed rather than building this functionality into an existing scheduler.
We currently use fixed default values as batching thresholds (e.g., 16 for #requests, 64KB for #bytes, and 50us for timeout). These default values are based on sensitivity tests in [6]. For our future work, we plan to support adaptive batching according to Frankly speaking, hardcoded 16 #requests or 64KB may not work everywhere, and a production environment could be much more complicated than your sensitivity tests. If possible, please start with adaptive batching. That was my feedback as well for sure. But given that this is a scheduler one would opt in to anyway, that won't be a must-have initially. I'm not sure if the guys made progress with this yet, I'll let them comment.
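The fixed-threshold dispatch decision discussed above is simple enough to sketch directly. The thresholds are the defaults quoted in the thread (16 requests, 64KB, 50us); the function and macro names are invented for illustration and are not from the i10 patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Default batching thresholds quoted in the discussion.  The adaptive
 * variant would tune these at runtime instead of hardcoding them. */
#define BATCH_NR_REQS    16
#define BATCH_NR_BYTES   (64 * 1024)
#define BATCH_TIMEOUT_US 50

/* Dispatch the accumulated batch once any one threshold is crossed:
 * enough requests, enough bytes, or the oldest request has waited
 * too long (the timeout bounds added latency at low load). */
static bool batch_should_dispatch(unsigned int nr_reqs,
				  unsigned int nr_bytes,
				  uint64_t age_us)
{
	return nr_reqs >= BATCH_NR_REQS ||
	       nr_bytes >= BATCH_NR_BYTES ||
	       age_us >= BATCH_TIMEOUT_US;
}
```

The timeout term is what makes the scheme usable at all: without it, a lightly loaded queue would hold requests indefinitely waiting for a full batch.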
Re: [PATCH] iosched: Add i10 I/O Scheduler
I haven't taken a close look at the code yet so far, but one quick note that patches like this should be against the branches for 5.11. In fact, this one doesn't even compile against current -git, as blk_mq_bio_list_merge is now called blk_bio_list_merge. Ugh, I guess that Jaehyun had this patch bottled up and didn't rebase before submitting.. Sorry about that. In any case, I did run this through some quick peak testing as I was curious, and I'm seeing about 20% drop in peak IOPS over none running this. Perf diff: 10.71% -2.44% [kernel.vmlinux] [k] read_tsc 2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock You ran this with nvme? or null_blk? I guess neither would benefit from this because if the underlying device will not benefit from batching (at least enough for the extra cost of accounting for it) it will be counter productive to use this scheduler. Also: [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf Was curious and wanted to look it up, but it doesn't exist. I think this is the right one: https://github.com/i10-kernel/upstream-linux/blob/master/i10-evaluation.pdf We had some back and forth around the naming, hence this was probably omitted.
Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
There really aren't any rules for this, and it's perfectly legit to complete from process context. Maybe you're a kthread driven driver and that's how you handle completions. The block completion path has always been hard IRQ safe, but possible to call from anywhere. I'm not trying to put restrictions and forbidding completions from a kthread. I'm trying to avoid the pointless softirq dance for no added value. We could: to not break that assumption you just mentioned, provide
|static inline void blk_mq_complete_request_local(struct request *rq)
|{
|	rq->q->mq_ops->complete(rq);
|}
so that completions issued from process context (like those from usb-storage) don't end up waking `ksoftirqd' (running at SCHED_OTHER) to complete the requests but rather perform it right away. The softirq dance makes no sense here. Agreed. But I don't think your above blk_mq_complete_request_local is all that useful either as ->complete is defined by the caller, so we could just do a direct call. Basically we should just return false from blk_mq_complete_request_remote after updating the state when called from process context. Agreed. But given that IIRC we are not supposed to check what state we are called from, we'll need a helper just for updating the state instead and ensure the driver uses the right helper. Now of course we might have process context callers that still want to bounce to the submitting CPU, but in that case we should go directly to a workqueue or similar. This would mean that it may be suboptimal for nvme-tcp to complete requests in softirq context from the network context (determined by NIC steering). Because in this case, this would trigger a workqueue schedule on a per-request basis rather than once per .data_ready call like we do today. Is that correct? Note that today completing commands in softirq context (network determined cpu) basically does IPI + local complete, not IPI + softirq or IPI + workqueue.
Either way doing this properly will probably involve an audit of all drivers, but I think that is worth it. Agree.
Re: [PATCH -next] IB/isert: Fix warning Comparison to bool
Reviewed-by: Sagi Grimberg
Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
Well, usb-storage obviously seems to do it, and the block layer does not prohibit it. Also loop, nvme-tcp and then I stopped looking. Any objections about adding local_bh_disable() around it? To me it seems like the whole IPI plus potentially softirq dance is a little pointless when completing from process context. I agree. Sagi, any opinion on that from the nvme-tcp POV? nvme-tcp should (almost) always complete from the context that matches the rq->mq_ctx->cpu as the thread that processes incoming completions (per hctx) should be affinitized to match it (unless cpus come and go). in which context? Not sure what the question is. But this is probably nr_hw_queues > 1? Yes. So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true in normal operation. That leaves the teardowns+aborts, which aren't very interesting here. The process context invocation is nvme_tcp_complete_timed_out(). Yes. I would note that nvme-tcp does not go to sleep after completing every I/O like how Sebastian indicated usb does. Having said that, today the network stack is calling nvme_tcp_data_ready in napi context (softirq) which in turn triggers the queue thread to handle network rx (and complete the I/O). It's been measured recently that running the rx context directly in softirq will save some latency (possible because the nvme-tcp rx context is non-blocking). So I'd think that patch #2 is unnecessary and just adds overhead for nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed from napi context, nvme-tcp will probably always go to the IPI path. but running it in softirq on the remote CPU would still allow other packets to come in on the remote CPU (which would block BLOCK softirq if NET_RX is already running).
Not sure I understand your comment, if napi triggers on core X and we complete from that, it will trigger IPI to core Y, and there, with patch #2, it will trigger softirq instead of calling ->complete directly, no?
Re: [PATCH 3/3] blk-mq: Use llist_head for blk_cpu_done
Well, usb-storage obviously seems to do it, and the block layer does not prohibit it. Also loop, nvme-tcp and then I stopped looking. Any objections about adding local_bh_disable() around it? To me it seems like the whole IPI plus potentially softirq dance is a little pointless when completing from process context. I agree. Sagi, any opinion on that from the nvme-tcp POV? nvme-tcp should (almost) always complete from the context that matches the rq->mq_ctx->cpu as the thread that processes incoming completions (per hctx) should be affinitized to match it (unless cpus come and go). So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true in normal operation. That leaves the teardowns+aborts, which aren't very interesting here. I would note that nvme-tcp does not go to sleep after completing every I/O like how sebastian indicated usb does. Having said that, today the network stack is calling nvme_tcp_data_ready in napi context (softirq) which in turn triggers the queue thread to handle network rx (and complete the I/O). It's been measured recently that running the rx context directly in softirq will save some latency (possible because nvme-tcp rx context is non-blocking). So I'd think that patch #2 is unnecessary and just add overhead for nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed from napi context, nvme-tcp will probably always go to the IPI path.
Re: [PATCH 1/2] nvmet: introduce transport layer keep-alive
On 10/27/20 5:15 AM, zhenwei pi wrote: In the zero KATO scenario, if initiator crashes without transport layer disconnection, target side would never reclaim resources. A target could start transport layer keep-alive to detect dead connection for the admin queue. Not sure why we should worry about kato=0, it's really more for debug purposes. I'd rather that we block this option from the host altogether. Signed-off-by: zhenwei pi --- drivers/nvme/target/admin-cmd.c | 2 +- drivers/nvme/target/core.c | 14 +++--- drivers/nvme/target/nvmet.h | 3 ++- 3 files changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index dca34489a1dc..53fbd5c193a1 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -729,7 +729,7 @@ u16 nvmet_set_feat_kato(struct nvmet_req *req) nvmet_stop_keep_alive_timer(req->sq->ctrl); req->sq->ctrl->kato = DIV_ROUND_UP(val32, 1000); - nvmet_start_keep_alive_timer(req->sq->ctrl); + nvmet_start_keep_alive_timer(req->sq->ctrl, req); nvmet_set_result(req, req->sq->ctrl->kato); diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 957b39a82431..451192f7ad6a 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -395,10 +395,18 @@ static void nvmet_keep_alive_timer(struct work_struct *work) nvmet_ctrl_fatal_error(ctrl); } -void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl) +void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl, struct nvmet_req *req) { - if (unlikely(ctrl->kato == 0)) + if (unlikely(ctrl->kato == 0)) { + if (req->ops->keep_alive) + pr_info("ctrl %d starts with transport keep-alive %s\n", ctrl->cntlid, + req->ops->keep_alive(req) ? 
"failed" : "succeed"); + else + pr_info("ctrl %d starts without both NVMeOF and transport keep-alive", + ctrl->cntlid); + return; + } pr_debug("ctrl %d start keep-alive timer for %d secs\n", ctrl->cntlid, ctrl->kato); @@ -1383,7 +1391,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, ctrl->err_counter = 0; spin_lock_init(&ctrl->error_lock); - nvmet_start_keep_alive_timer(ctrl); + nvmet_start_keep_alive_timer(ctrl, req); mutex_lock(&subsys->lock); list_add_tail(&ctrl->subsys_entry, &subsys->ctrls); diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h index 559a15ccc322..de1785ce9fcd 100644 --- a/drivers/nvme/target/nvmet.h +++ b/drivers/nvme/target/nvmet.h @@ -305,6 +305,7 @@ struct nvmet_fabrics_ops { u16 (*install_queue)(struct nvmet_sq *nvme_sq); void (*discovery_chg)(struct nvmet_port *port); u8 (*get_mdts)(const struct nvmet_ctrl *ctrl); + int (*keep_alive)(struct nvmet_req *req); }; #define NVMET_MAX_INLINE_BIOVEC 8 @@ -395,7 +396,7 @@ void nvmet_get_feat_async_event(struct nvmet_req *req); u16 nvmet_set_feat_kato(struct nvmet_req *req); u16 nvmet_set_feat_async_event(struct nvmet_req *req, u32 mask); void nvmet_execute_async_event(struct nvmet_req *req); -void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl); +void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl, struct nvmet_req *req); void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl); u16 nvmet_parse_connect_cmd(struct nvmet_req *req);
Re: [PATCH v3] nvme-rdma: handle nvme completion data length
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-rdma: handle nvme completion data length
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index 9e378d0a0c01..2ecadd309f4a 100644 --- a/drivers/nvme/host/rdma.c +++ b/drivers/nvme/host/rdma.c @@ -1767,6 +1767,21 @@ static void nvme_rdma_recv_done(struct ib_cq *cq, struct ib_wc *wc) return; } + /* received data length checking */ + if (unlikely(wc->byte_len < len)) { + /* zero bytes message could be ignored */ + if (!wc->byte_len) { + nvme_rdma_post_recv(queue, qe); + return; + } Nothing in the spec defines zero-length messages, hence we cannot support something that is not standard. If your array needs this, please submit a TPAR to the NVMe TWG.
Re: [PATCH v2] nvmet: fix uninitialized work for zero kato
Fixes: Don't run keep alive work with zero kato. "Fixes" tags need to have a git commit id followed by the commit subject. I can't find any commit with that subject, though. Fixes: 0d3b6a8d213a ("nvmet: Disable keep-alive timer when kato is cleared to 0h")
Re: [PATCH] nvmet: fix uninitialized work for zero kato
Hit a warning: WARNING: CPU: 1 PID: 241 at kernel/workqueue.c:1627 __queue_delayed_work+0x6d/0x90 with trace: mod_delayed_work_on+0x59/0x90 nvmet_update_cc+0xee/0x100 [nvmet] nvmet_execute_prop_set+0x72/0x80 [nvmet] nvmet_tcp_try_recv_pdu+0x2f7/0x770 [nvmet_tcp] nvmet_tcp_io_work+0x63f/0xb2d [nvmet_tcp] ... This could be reproduced easily with a keep-alive timeout of 0: nvme connect -t tcp -n NQN -a ADDR -s PORT --keep-alive-tmo=0 The reason: an uninitialized work is started when the initiator connects with zero kato. Although the keep-alive timer is disabled while allocating a ctrl (fixed in 0d3b6a8d213a), ka_work still has a chance to run (called from nvmet_start_ctrl to detect a dead host). This should have a "Fixes:" tag. Initialize ka_work while allocating the ctrl, and set a reasonable kato before scheduling ka_work. Signed-off-by: zhenwei pi --- drivers/nvme/target/core.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index b7b63330b5ef..3c5b2b065476 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -19,6 +19,8 @@ struct workqueue_struct *buffered_io_wq; static const struct nvmet_fabrics_ops *nvmet_transports[NVMF_TRTYPE_MAX]; static DEFINE_IDA(cntlid_ida); +#define NVMET_DEFAULT_KATO 5 + /* * This read/write semaphore is used to synchronize access to configuration * information on a target system that will result in discovery log page @@ -385,6 +387,11 @@ static void nvmet_keep_alive_timer(struct work_struct *work) if (cmd_seen) { pr_debug("ctrl %d reschedule traffic based keep-alive timer\n", ctrl->cntlid); + + /* run once, trigger from nvmet_start_ctrl to detect dead link */ + if (!ctrl->kato) + return; + schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ); It would be better to just schedule/mod the ka_work if kato != 0; the other changes in the patch aren't needed IMO.
return; } @@ -403,15 +410,11 @@ static void nvmet_start_keep_alive_timer(struct nvmet_ctrl *ctrl) pr_debug("ctrl %d start keep-alive timer for %d secs\n", ctrl->cntlid, ctrl->kato); - INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); schedule_delayed_work(&ctrl->ka_work, ctrl->kato * HZ); } static void nvmet_stop_keep_alive_timer(struct nvmet_ctrl *ctrl) { - if (unlikely(ctrl->kato == 0)) - return; - pr_debug("ctrl %d stop keep-alive\n", ctrl->cntlid); cancel_delayed_work_sync(&ctrl->ka_work); @@ -1107,6 +1110,8 @@ static inline u8 nvmet_cc_iocqes(u32 cc) static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) { + u32 kato = ctrl->kato ? ctrl->kato : NVMET_DEFAULT_KATO; + The controller shouldn't have a default value, it should receive the desired value from the host. lockdep_assert_held(&ctrl->lock); if (nvmet_cc_iosqes(ctrl->cc) != NVME_NVM_IOSQES || @@ -1126,7 +1131,7 @@ static void nvmet_start_ctrl(struct nvmet_ctrl *ctrl) * in case a host died before it enabled the controller. Hence, simply * reset the keep alive timer when the controller is enabled. */ - mod_delayed_work(system_wq, &ctrl->ka_work, ctrl->kato * HZ); + mod_delayed_work(system_wq, &ctrl->ka_work, kato * HZ); } static void nvmet_clear_ctrl(struct nvmet_ctrl *ctrl) @@ -1378,6 +1383,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, /* keep-alive timeout in seconds */ ctrl->kato = DIV_ROUND_UP(kato, 1000); + INIT_DELAYED_WORK(&ctrl->ka_work, nvmet_keep_alive_timer); ctrl->err_counter = 0; spin_lock_init(&ctrl->error_lock);
Re: [PATCH blk-next 1/2] blk-mq-rdma: Delete not-used multi-queue RDMA map queue code
Yes, basically usage of managed affinity caused people to report regressions about not being able to change irq affinity from procfs. Well, why would they change it? The whole point of the infrastructure is that there is a single sane affinity setting for a given setup. Now that setting needed some refinement from the original series (e.g. the current series about only using housekeeping cpus if cpu isolation is in use). But allowing random users to modify affinity is just a recipe for a trainwreck. Well, allowing people to mangle irq affinity settings seems to be a hard requirement from the discussions in the past. So I think we need to bring this back ASAP, as doing affinity right out of the box is an absolute requirement for sane performance without all the benchmarketing deep magic. Well, it's hard to say that setting custom irq affinity settings is deemed non-useful to anyone and hence should be prevented. I'd expect that irq settings have a sane default that works, and if someone wants to change them, they can, but there should be no guarantees on optimal performance. But IIRC this had some dependencies on drivers and some more infrastructure to handle dynamic changes...
Re: [PATCH blk-next 1/2] blk-mq-rdma: Delete not-used multi-queue RDMA map queue code
From: Leon Romanovsky The RDMA vector affinity code is not backed up by any driver and always returns NULL to every ib_get_vector_affinity() call. This means that blk_mq_rdma_map_queues() always takes the fallback path. Fixes: 9afc97c29b03 ("mlx5: remove support for ib_get_vector_affinity") Signed-off-by: Leon Romanovsky So you guys totally broke the nvme queue assignment without even telling anyone? Great job! Who is "you guys"? And it wasn't silent either. I'm sure that Sagi knows the craft. https://lore.kernel.org/linux-rdma/20181224221606.ga25...@ziepe.ca/ commit 759ace7832802eaefbca821b2b43a44ab896b449 Author: Sagi Grimberg Date: Thu Nov 1 13:08:07 2018 -0700 i40iw: remove support for ib_get_vector_affinity commit 9afc97c29b032af9a4112c2f4a02d5313b4dc71f Author: Sagi Grimberg Date: Thu Nov 1 09:13:12 2018 -0700 mlx5: remove support for ib_get_vector_affinity Thanks Yes, basically usage of managed affinity caused people to report regressions about not being able to change irq affinity from procfs. Back then I started a discussion with Thomas to make managed affinity still allow userspace to modify this, but this was dropped at some point. So currently rdma cannot do automatic irq affinitization out of the box.
Re: [PATCH] [v2] nvme: replace meaningless judgement by checking whether req is null
Reviewed-by: Sagi Grimberg
Re: [PATCH AUTOSEL 4.19 127/206] nvme: Fix controller creation races with teardown flow
This causes a regression and was reverted upstream, just FYI. On 9/17/20 7:06 PM, Sasha Levin wrote: From: Israel Rukshin [ Upstream commit ce1518139e6976cf19c133b555083354fdb629b8 ] Calling nvme_sysfs_delete() when the controller is in the middle of creation may cause several bugs. If the controller is in NEW state we remove delete_controller file and don't delete the controller. The user will not be able to use nvme disconnect command on that controller again, although the controller may be active. Other bugs may happen if the controller is in the middle of create_ctrl callback and nvme_do_delete_ctrl() starts. For example, freeing I/O tagset at nvme_do_delete_ctrl() before it was allocated at create_ctrl callback. To fix all those races don't allow the user to delete the controller before it was fully created. Signed-off-by: Israel Rukshin Reviewed-by: Max Gurtovoy Reviewed-by: Christoph Hellwig Signed-off-by: Keith Busch Signed-off-by: Sasha Levin --- drivers/nvme/host/core.c | 5 + drivers/nvme/host/nvme.h | 1 + 2 files changed, 6 insertions(+) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 4b182ac15687e..faa7feebb6095 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -2856,6 +2856,10 @@ static ssize_t nvme_sysfs_delete(struct device *dev, { struct nvme_ctrl *ctrl = dev_get_drvdata(dev); + /* Can't delete non-created controllers */ + if (!ctrl->created) + return -EBUSY; + if (device_remove_file_self(dev, attr)) nvme_delete_ctrl_sync(ctrl); return count; @@ -3576,6 +3580,7 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl) queue_work(nvme_wq, &ctrl->async_event_work); nvme_start_queues(ctrl); } + ctrl->created = true; } EXPORT_SYMBOL_GPL(nvme_start_ctrl); diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h index 31c1496f938fb..a70b997060e68 100644 --- a/drivers/nvme/host/nvme.h +++ b/drivers/nvme/host/nvme.h @@ -206,6 +206,7 @@ struct nvme_ctrl { struct nvme_command ka_cmd; struct work_struct fw_act_work; 
unsigned long events; + bool created; #ifdef CONFIG_NVME_MULTIPATH /* asymmetric namespace access: */
Re: [PATCH] nvme: tcp: fix kconfig dependency warning when !CRYPTO
Reviewed-by: Sagi Grimberg
Re: [PATCH v2] nvme-pci: cancel nvme device request before disabling
Added to nvme-5.9-rc
Re: [PATCH] nvmet-fc: Fix a missed _irqsave version of spin_lock in 'nvmet_fc_fod_op_done()'
The way 'spin_lock()' and 'spin_lock_irqsave()' are used is not consistent in this function. Use 'spin_lock_irqsave()' also here, as there is no guarantee that interrupts are disabled at that point, according to the surrounding code. Fixes: a97ec51b37ef ("nvmet_fc: Rework target side abort handling") Signed-off-by: Christophe JAILLET --- Not tested, only based on what looks logical to me according to surrounding code --- drivers/nvme/target/fc.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c index 55bafd56166a..e6861cc10e7d 100644 --- a/drivers/nvme/target/fc.c +++ b/drivers/nvme/target/fc.c @@ -2342,9 +2342,9 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod) return; if (fcpreq->fcp_error || fcpreq->transferred_length != fcpreq->transfer_length) { - spin_lock(&fod->flock); + spin_lock_irqsave(&fod->flock, flags); fod->abort = true; - spin_unlock(&fod->flock); + spin_unlock_irqrestore(&fod->flock, flags); nvmet_req_complete(&fod->req, NVME_SC_INTERNAL); return; James, can I get a reviewed-by from you on this?
Re: [PATCH 2/3] block: fix locking for struct block_device size updates
Reviewed-by: Sagi Grimberg
Re: [PATCH 3/3] nvme: don't call revalidate_disk from nvme_set_queue_dying
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] block: replace bd_set_size with bd_set_nr_sectors
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme-pci: cancel nvme device request before disabling
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c index ba725ae47305..c4f1ce0ee1e3 100644 --- a/drivers/nvme/host/pci.c +++ b/drivers/nvme/host/pci.c @@ -1249,8 +1249,8 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved) dev_warn_ratelimited(dev->ctrl.device, "I/O %d QID %d timeout, disable controller\n", req->tag, nvmeq->qid); - nvme_dev_disable(dev, true); nvme_req(req)->flags |= NVME_REQ_CANCELLED; + nvme_dev_disable(dev, true); return BLK_EH_DONE; Shouldn't this flag have been set in nvme_cancel_request()? nvme_cancel_request() is not setting this flag to cancelled, and this is causing the issue. Right, I see that it doesn't, but I'm saying that it should. We used to do something like that, and I'm struggling to recall why we're not anymore. I also don't recall why, but I know that we rely on the status propagating back from submit_sync_cmd, which won't happen because it converts the status into -ENODEV. The driver is not reporting non-response back for all cancelled requests, and that is probably not what we should be doing. I'd think that we should modify our callers to handle nvme status codes as well rather than rely on nvme_submit_sync_cmd to return a negative code under some conditions.
Re: [PATCH] nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
queued to nvme-5.9-rc
Re: [PATCH v2] nvme: Use spin_lock_irq() when taking the ctrl->lock
There's an unrelated whitespace change in nvme_init_identify(). Otherwise, looks fine. Oops, sorry. Can this be fixed up when it's merged? Fixed and queued.
Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the following oops :- Why LKML and not linux-nvme? My bad (+linux-nvme). It doesn't have the patch. Can you resend?
Re: [PATCH] nvmet: fix opps in nvmet_execute_passthru_cmd()
This patch adds a check in nvmet_execute_passthru_cmd() to prevent the following oops :- Why LKML and not linux-nvme?
Re: [PATCH] nvmet-passthru: Reject commands with non-sgl flags set
Reviewed-by: Sagi Grimberg
Re: [IB/srpt] c804af2c1d: last_state.test.blktests.exit_code.143
Greeting, FYI, we noticed the following commit (built with gcc-9): commit: c804af2c1d3152c0cf877eeb50d60c2d49ac0cf0 ("IB/srpt: use new shared CQ mechanism") https://git.kernel.org/cgit/linux/kernel/git/rdma/rdma.git for-next in testcase: blktests with following parameters: test: srp-group1 ucode: 0x21 on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 4G memory caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace): If you fix the issue, kindly add following tag Reported-by: kernel test robot user :notice: [ 44.688140] 2020-08-01 16:10:22 ./check srp/001 srp/002 srp/003 srp/004 srp/005 srp/006 srp/007 srp/008 srp/009 srp/010 srp/011 srp/012 srp/013 srp/015 user :notice: [ 44.706657] srp/001 (Create and remove LUNs) user :notice: [ 44.718405] srp/001 (Create and remove LUNs) [passed] user :notice: [ 44.729902] runtime ... 1.972s user :notice: [ 99.038748] IPMI BMC is not supported on this machine, skip bmc-watchdog setup! user :notice: [ 3699.039790] Sat Aug 1 17:11:22 UTC 2020 detected soft_timeout user :notice: [ 3699.060341] kill 960 /usr/bin/time -v -o /tmp/lkp/blktests.time /lkp/lkp/src/tests/blktests Yamin and Max, can you take a look at this? The SRP tests from the blktests repository pass reliably with kernel version v5.7 and before. With label next-20200731 from linux-next however that test triggers the following hang: I will look into it. FWIW, I ran into this as well with nvme-rdma, but it also reproduces when I revert the shared CQ patch from nvme-rdma. Another data point is that my tests passes with siw.
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Thanks for the review Christoph. I think I should be able to make all the requested changes in the next week or two. On 2020-07-20 1:35 p.m., Sagi Grimberg wrote: I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before... I sort of follow this. I can try to work something up but it will probably take me a few iterations to get it to where you want it. So, roughly, we'd create a passthrough_q in core with the controller's IO tagset and then clean up the fabrics hosts to use that instead of each independently creating their connect_q? Though, I don't understand how this relates to the host xarray stuff that Sagi mentioned... passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. The only part is to check the effects, but that can probably be handled when we setup the passthru controller or something... Yes, I implemented the passthru_q (which was quite simple). Nice.. But I'm not sure how we are supposed to call nvme_command_effects() correctly without the ns. You can't possibly do that during setup for every possible opcode on every namespace. And even if we do, we'll still need the same nvme_find_get_ns() and nvme_put_ns() exports and probably another xarray to look up the information. Also, we pass the namespace's disk in order to get proper block accounting for the underlying disk. (Which is pretty important for debugging). So we need to look up the namespace for this too.
Unless there are some other ideas to solve these issues, I don't think this change will gain us anything. Let's defer it to a followup set then.
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote: On 2020-07-20 4:35 p.m., Sagi Grimberg wrote: passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. We still need a request_queue to dispatch the command. I guess you could make a generic one for the controller that isn't tied to a namespace, but we lose the fair shared tag allocation. What do you mean fair shared tag allocation? See hctx_may_queue(). Still not following your point... this queue is yet another request queue on the I/O tagset, e.g. ctrl->passthru_q = blk_mq_init_queue(ctrl->tagset);
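For reference, the fairness Keith alludes to lives in hctx_may_queue(): when a tagset is shared, each active user is throttled to roughly its proportional share of the tag depth. A hedged userspace sketch of the idea (the exact rounding and minimum in blk-mq differ; these helpers are illustrative, not kernel code):

```c
/*
 * With 'active_users' queues sharing a tagset of 'depth' tags, cap
 * each user at roughly depth / active_users (rounded up, never zero)
 * so no single user can starve the others.
 */
unsigned int shared_tag_limit(unsigned int depth, unsigned int active_users)
{
    if (active_users <= 1)
        return depth;   /* sole user gets the whole depth */
    return (depth + active_users - 1) / active_users;   /* ceiling */
}

/* A queue may allocate another tag only while under its share. */
int may_queue(unsigned int depth, unsigned int active_users,
              unsigned int tags_in_flight)
{
    return tags_in_flight < shared_tag_limit(depth, active_users);
}
```

A controller-wide passthru queue sitting on the same I/O tagset would count as one more active user, which is the fairness cost being pointed at.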
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
On 7/20/20 4:17 PM, Keith Busch wrote: On Mon, Jul 20, 2020 at 05:01:19PM -0600, Logan Gunthorpe wrote: On 2020-07-20 4:35 p.m., Sagi Grimberg wrote: passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. We still need a request_queue to dispatch the command. I guess you could make a generic one for the controller that isn't tied to a namespace, but we lose the fair shared tag allocation. What do you mean fair shared tag allocation?
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Thanks for the review Christoph. I think I should be able to make all the requested changes in the next week or two. On 2020-07-20 1:35 p.m., Sagi Grimberg wrote: I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before... I sort of follow this. I can try to work something up but it will probably take me a few iterations to get it to where you want it. So, roughly, we'd create a passthrough_q in core with the controller's IO tagset and then clean up the fabrics hosts to use that instead of each independently creating their connect_q? Though, I don't understand how this relates to the host xarray stuff that Sagi mentioned... passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. Thanks, that helps clarify things a bit, but which xarray were you talking about? The patch from Chaitanya. See "[PATCH V4 2/2] nvme-core: use xarray for ctrl ns tracking"
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Thanks for the review Christoph. I think I should be able to make all the requested changes in the next week or two. On 2020-07-20 1:35 p.m., Sagi Grimberg wrote: I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before... I sort of follow this. I can try to work something up but it will probably take me a few iterations to get it to where you want it. So, roughly, we'd create a passthrough_q in core with the controller's IO tagset and then cleanup the fabrics hosts to use that instead of each independently creating their connect_q? Though, I don't understand how this relates to the host xarray stuff that Sagi mentioned... passthru commands are in essence REQ_OP_DRV_IN/REQ_OP_DRV_OUT, which means that the driver shouldn't need the ns at all. So if you have a dedicated request queue (mapped to the I/O tagset), you don't need the ns->queue and we can lose the ns lookup altogether. The only part is to check the effects, but that can probably be handled when we setup the passthru controller or something...
Re: [PATCH v15 7/9] nvmet-passthru: Add passthru code to process commands
Add passthru command handling capability for the NVMeOF target and export passthru APIs which are used to integrate passthru code with nvmet-core. The new file passthru.c handles passthru cmd parsing and execution. In the passthru mode, we create a block layer request from the nvmet request and map the data on to the block layer request. Admin commands and features are on a white list as there are a number of each that don't make too much sense with passthrough. We use a white list so that new commands can be considered before being blindly passed through. In both cases, vendor specific commands are always allowed. We also blacklist reservation IO commands as the underlying device cannot differentiate between multiple hosts behind a fabric. I'm still not so happy about having to look up the namespace and still wonder if we should generalize the connect_q to a passthrough_q. But I guess we can do that later and then reduce some of the exports here.. That is a neat idea! should be easy to do (and we can then lose the host xarray stuff). I don't mind having it on a later patch, but it should be easy enough to do even before...
Re: [PATCH 5/5] nvme-pci: Use standard block status macro
Reviewed-by: Sagi Grimberg
Re: [PATCH 4/5] nvme-pci: Use the consistent return type of nvme_pci_iod_alloc_size()
Reviewed-by: Sagi Grimberg
Re: [PATCH 2/5] nvme-pci: Add a blank line after declarations
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/5] nvme-pci: Fix some comments issues
Reviewed-by: Sagi Grimberg
Re: [PATCH 4.19 082/131] nvme: fix possible deadlock when I/O is blocked
Hi! From: Sagi Grimberg [ Upstream commit 3b4b19721ec652ad2c4fe51dfbe5124212b5f581 ] Revert fab7772bfbcf ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns") When adding a new namespace to the head disk (via nvme_mpath_set_live) we will see partition scan which triggers I/O on the mpath device node. This process will usually be triggered from the scan_work which holds the scan_lock. If I/O blocks (if we got ana change currently have only available paths but none are accessible) this can deadlock on the head disk bd_mutex as both partition scan I/O takes it, and head disk revalidation takes it to check for resize (also triggered from scan_work on a different path). See trace [1]. The mpath disk revalidation was originally added to detect online disk size change, but this is no longer needed since commit cb224c3af4df ("nvme: Convert to use set_capacity_revalidate_and_notify") which already updates resize info without unnecessarily revalidating the disk (the Unfortunately, v4.19-stable does not contain cb224c3af4df. According to the changelog, it seems it should be cherry-picked? You are absolutely right. The reference commit is part of the series: 78317c5d58e6 ("scsi: Convert to use set_capacity_revalidate_and_notify") cb224c3af4df ("nvme: Convert to use set_capacity_revalidate_and_notify") 3cbc28bb902b ("xen-blkfront.c: Convert to use set_capacity_revalidate_and_notify") 662155e2898d ("virtio_blk.c: Convert to use set_capacity_revalidate_and_notify") e598a72faeb5 ("block/genhd: Notify udev about capacity change") It would be cool if they are cherry-picked, although they don't qualify as stable patches per se...
Re: [PATCH 3/3] nvme: Use USEC_PER_SEC instead of magic numbers
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/3] nvme: Add Arbitration Burst support
From the NVMe spec, "In order to make efficient use of the non-volatile memory, it is often advantageous to execute multiple commands from a Submission Queue in parallel. For Submission Queues that are using weighted round robin with urgent priority class or round robin arbitration, host software may configure an Arbitration Burst setting". Thus add Arbitration Burst setting support. But if the user changed it to something else that better matches how they want to use queues, the driver is just going to undo that setting on the next reset. Where do we do priority class arbitration? No one sets it.
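For context on what the value encodes: in the Arbitration feature (Feature Identifier 01h), the spec carries the Arbitration Burst in the low three bits as a power of two, with 111b meaning no limit. A small decoding sketch (illustrative only, not driver code):

```c
/*
 * Decode an Arbitration Burst field (bits 2:0 of the Arbitration
 * feature): the burst is 2^AB commands, and 0x7 means "no limit"
 * (returned as 0 here to flag the unlimited case).
 */
unsigned int nvme_ab_to_burst(unsigned int ab)
{
    ab &= 0x7;                      /* AB occupies three bits */
    return ab == 0x7 ? 0 : 1u << ab;
}
```

So a driver-chosen default and a user-chosen value live in the same three bits, which is why a reset that rewrites the feature silently undoes the user's setting.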
Re: [PATCH 2/2] nvme-pci: remove empty line following nvme_should_reset()
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/2] nvmet-loop: remove unused 'target_ctrl' in nvme_loop_ctrl
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme: do not call del_gendisk() on a disk that was never added
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/1] nvme-fcloop: verify wwnn and wwpn format
Reviewed-by: Sagi Grimberg
Re: [PATCH] nvme-tcp: constify static struct blk_mq_ops
Acked-by: Sagi Grimberg
Re: remove kernel_setsockopt and kernel_getsockopt
Hi Dave, this series removes the kernel_setsockopt and kernel_getsockopt functions, and instead switches their users to small functions that implement setting (or in one case getting) a sockopt directly using a normal kernel function call with type safety and all the other benefits of not having a function call. In some cases these functions seem pretty heavy handed as they do a lock_sock even for just setting a single variable, but this mirrors the real setsockopt implementation - counter to that a few kernel drivers just set the fields directly already. Nevertheless the diffstat looks quite promising: 42 files changed, 721 insertions(+), 799 deletions(-) For the nvme-tcp bits, Acked-by: Sagi Grimberg
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
devices will benefit from the batching so maybe the flag needs to be inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION? Actually I'd rather not add any flag; we may use some algorithm (maybe EWMA or other intelligent stuff, or hctx->dispatch_busy directly) to figure out a dynamic batching number that is used to dequeue requests from the scheduler or sw queue. Then a single dispatch io code path is enough. Sounds good to me, do you plan to submit a patchset?
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
Basically, my idea is to dequeue requests one by one, and for each dequeued request:
- try to get a budget and a driver tag; if both succeed, add the request to a per-task list which can be stored in a stack variable, then continue to dequeue more requests
- if either the budget or the driver tag can't be allocated for this request, mark the last request in the per-task list as .last, and send the batched requests stored in the list to the LLD
- when queueing batched requests to the LLD, if one request isn't queued to the driver successfully, call .commit_rqs() like before, meanwhile adding the remaining requests in the per-task list back to the scheduler queue or hctx->dispatch.
Sounds good to me. One issue is that this way might degrade sequential IO performance if the LLD just tells blk-mq the queue is busy via the return value of .queue_rq(), so I guess we still may need a flag, such as BLK_MQ_F_BATCHING_SUBMISSION. Why is that degrading sequential I/O performance? Because some devices may only return BLK_STS_RESOURCE from .queue_rq(); then more requests are dequeued from the scheduler queue if we always queue batched IOs to the LLD, and the chance of IO merging is reduced, so sequential IO performance will be affected. For example, some scsi devices don't use sdev->queue_depth for throttling IOs. For virtio-scsi or virtio-blk, we may stop the queue to avoid the potential effect. Do we have a way to characterize such devices? I'd assume that most devices will benefit from the batching, so maybe the flag needs to be inverted? BLK_MQ_F_DONT_BATCHING_SUBMISSION?
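The three steps above can be condensed into a userspace model (budget and tag counters stand in for .get_budget() and tag allocation; the names are illustrative, not blk-mq API):

```c
#include <stddef.h>

struct model_req {
    int last;   /* stands in for the "last request in batch" hint */
};

/*
 * Dequeue up to 'n' requests one by one, consuming one budget and one
 * driver tag per request; stop as soon as either runs out, mark the
 * tail of the per-task list as .last so the LLD knows to kick the
 * hardware, and return the batch size.  The third step -- calling
 * ->commit_rqs() and requeueing the remainder when the LLD rejects
 * part of the batch -- happens in the caller and is omitted here.
 */
size_t collect_batch(struct model_req *reqs, size_t n,
                     size_t budget, size_t tags)
{
    size_t batch = 0;

    while (batch < n && budget && tags) {
        budget--;               /* one budget per dequeued request */
        tags--;                 /* one driver tag per dequeued request */
        reqs[batch].last = 0;
        batch++;
    }
    if (batch)
        reqs[batch - 1].last = 1;   /* tell the LLD to kick the doorbell */
    return batch;
}
```

With this shape, every driver sees a real .last flag on the tail of each batch, which is the property the per-task-list idea is after.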
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
You're mostly correct. This is exactly why an I/O scheduler may be applicable here IMO, mostly because I/O schedulers tend to optimize for something specific and always present tradeoffs. Users need to understand what they are optimizing for. Hence I'd say this functionality can definitely be available to an I/O scheduler should one exist. I guess it is just that there can be multiple requests available from the scheduler queue. Actually it can be so for other non-nvme drivers in the case of none, such as SCSI. Another way is to use one per-task list (such as the plug list) to hold the requests for dispatch; then every driver may see the real .last flag, so they may get a chance to optimize batch queueing. I will think about the idea further and see if it is really doable. How about my RFC v1 patch set [1], which allows dispatching more than one request from the scheduler to support batch requests?

[1] https://lore.kernel.org/patchwork/patch/1210034/
    https://lore.kernel.org/patchwork/patch/1210035/

Basically, my idea is to dequeue requests one by one, and for each dequeued request:

- try to get a budget and a driver tag; if both succeed, add the request to a per-task list which can be stored in a stack variable, then continue to dequeue more requests
- if either the budget or the driver tag can't be allocated for this request, mark the last request in the per-task list as .last and send the batched requests stored in the list to the LLD
- when queueing batched requests to the LLD, if one request isn't queued to the driver successfully, call .commit_rqs() like before, meanwhile adding the remaining requests in the per-task list back to the scheduler queue or hctx->dispatch.

Sounds good to me. One issue is that this way might degrade sequential IO performance if the LLD just tells queue busy to blk-mq via the return value of .queue_rq(), so I guess we still may need one flag, such as BLK_MQ_F_BATCHING_SUBMISSION. Why is that degrading sequential I/O performance?
because the specific device will do better without batching submissions? If so, the driver is not obligated to respect the bd->last/.commit_rqs, so if that is the case, it should just ignore it.
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
Hey Ming, Would it make sense to elevate this flag to a request_queue flag (QUEUE_FLAG_ALWAYS_COMMIT)? A request queue flag usually is writable; however, this case just needs one read-only flag, so I think it may be better to make it a tagset/hctx flag. I actually intended it to be writable. I'm thinking of a possibility that an I/O scheduler may be used to activate this functionality rather than having the driver set it necessarily... Could you explain a bit why an I/O scheduler should activate this functionality? Sure, I've recently seen some academic work showing the benefits of batching in tcp/ip based block drivers. The problem with the approaches taken is that I/O scheduling is exercised deep down in the driver, which is not the direction I'd like to go if we want to adopt some of the batching concepts. I spent some (limited) time thinking about this, and it seems to me that there is an opportunity to implement this as a dedicated I/O scheduler, and tie it to driver-specific LLD stack optimizations (net-stack for example) relying on the commit_rq/bd->last hints. When scanning the scheduler code, I noticed exactly the phenomenon that this patchset is attempting to solve, and Christoph referred me to it. Now I'm thinking if we can extend this batching optimization for both use-cases. Batching submission may be good for some drivers, and currently we only do it in a limited way. One reason is that there is extra cost for full batching submission; for example, this patch requires one extra .commit_rqs() for each dispatch, and a lock is often needed in this callback. That is not necessarily the case at all. IMO it can be a win for some slow driver or device, but may cause a little performance drop for a fast driver/device, especially in non-batching submission workloads. You're mostly correct. This is exactly why an I/O scheduler may be applicable here IMO, mostly because I/O schedulers tend to optimize for something specific and always present tradeoffs.
Users need to understand what they are optimizing for. Hence I'd say this functionality can definitely be available to an I/O scheduler should one exist.
Re: [RFC PATCH v2 1/7] block: Extand commit_rqs() to do batch processing
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index f389d7c724bd..6a20f8e8eb85 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -391,6 +391,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_FORCE_COMMIT_RQS = 1 << 3,

Maybe BLK_MQ_F_ALWAYS_COMMIT might be a better name? Also this flag (just like the existing ones..) could really use a comment explaining it. Would it make sense to elevate this flag to a request_queue flag (QUEUE_FLAG_ALWAYS_COMMIT)? I'm thinking of a possibility that an I/O scheduler may be used to activate this functionality rather than having the driver set it necessarily...
Re: [PATCH] nvme: fix uninitialized return of ret when sysfs_create_link fails
This was already fixed and merged (by Dan)

On 10/2/19 5:43 AM, Colin King wrote:
From: Colin Ian King

Currently when the call to sysfs_create_link fails the error exit path returns an uninitialized value in variable ret. Fix this by returning the error code returned from the failed call to sysfs_create_link.

Addresses-Coverity: ("Uninitialized scalar variable")
Fixes: 32fd90c40768 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: Colin Ian King
---
 drivers/nvme/host/core.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 63b37d08ac98..f6acbff3e3bc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2540,8 +2540,9 @@ static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 		list_add_tail(&subsys->entry, &nvme_subsystems);
 	}

-	if (sysfs_create_link(&subsys->dev.kobj, &ctrl->device->kobj,
-			dev_name(ctrl->device))) {
+	ret = sysfs_create_link(&subsys->dev.kobj, &ctrl->device->kobj,
+			dev_name(ctrl->device));
+	if (ret) {
 		dev_err(ctrl->device,
 			"failed to create sysfs link from subsystem.\n");
 		goto out_put_subsystem;
Re: [PATCH v3] nvme: allow 64-bit results in passthru commands
Looks fine to me, Reviewed-by: Sagi Grimberg Keith, Christoph?
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Sagi, sorry it took a while to bring my system back online. With the patch, the IOPS drop is about the same as with the 1st patch. I think the excessive context switches are causing the drop in IOPS. The following are captured by "perf sched record" for 30 seconds during tests.

"perf sched latency"
With patch:    fio:(82) |  937632.706 ms | 1782255 | avg: 0.209 ms | max: 63.123 ms | max at:  768.274023 s
Without patch: fio:(82) | 2348323.432 ms |   18848 | avg: 0.295 ms | max: 28.446 ms | max at: 6447.310255 s

Without patch means the proposed hard-irq patch? It means the current upstream code without any patch, but it's prone to soft lockup. Ming's proposed hard-irq patch gets similar results to "without patch", however it fixes the soft lockup. Thanks for the clarification. The problem with what Ming is proposing in my mind (and it's an existing problem that exists today) is that nvme takes precedence over anything else until it absolutely cannot hog the cpu in hardirq. In the thread, Ming referenced a case where today, if the cpu core has net softirq activity, it cannot make forward progress. So with Ming's suggestion, net softirq will eventually make progress, but it creates an inherent fairness issue. Who said that nvme completions should come faster than the net rx/tx or another I/O device (or hrtimers or sched events...)? As much as I'd like nvme to complete as soon as possible, I might have other activities in the system that are as important if not more. So I don't think we can solve this with something that is not cooperative or fair with the rest of the system. If we are context switching too much, it means the soft-irq operation is not efficient, not necessarily that the completion path is running in soft-irq. Is your kernel compiled with full preemption or voluntary preemption? The tests are based on the Ubuntu 18.04 kernel configuration.
Here are the parameters:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set

I see, so it still seems that irq_poll_softirq is not efficient in reaping completions. Reaping the completions on its own is pretty much the same in hard and soft irq, so it's really the scheduling part that is creating the overhead (which does not exist in hard irq). Question: when you test without the patch (completions are coming in hard-irq), do the fio threads that run on the cpu cores handling interrupts get substantially lower throughput than the rest of the fio threads? I would expect the fio threads running on the first 32 cores to get very low iops (overpowered by the nvme interrupts) and the rest doing much more, given that nvme has almost no limit on how much time it can spend processing completions. If need_resched() is causing us to context switch too aggressively, does changing that to local_softirq_pending() make things better?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index d8eab563fa77..05d524fcaf04 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 		/*
 		 * If softirq window is exhausted then punt.
 		 */
-		if (need_resched())
+		if (local_softirq_pending())
 			break;
 	}
--
Although this can potentially prevent other threads from making forward progress... If it is better, perhaps we also need a time limit as well. Perhaps we should add statistics/tracing on how many completions we are reaping per invocation...
Re: [PATCH v2] nvme-pci: Save PCI state before putting drive into deepest state
Keith, can we get a review from you for this?
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Hey Ming, Ok, so the real problem is per-cpu bounded tasks. I share Thomas' opinion about a NAPI-like approach. We already have that, it's irq_poll, but it seems that for this use-case we get lower performance for some reason. I'm not entirely sure why that is; maybe it's because we need to mask interrupts because we don't have an "arm" register in nvme like network devices have? Long observed that IOPS drops too much by switching to threaded irq. If softirqd is woken up for handling softirq, the performance shouldn't be better than threaded irq. It's true that it shouldn't be any faster, but what irqpoll already has, and we don't need to reinvent, is a proper budgeting mechanism that needs to occur when multiple devices map irq vectors to the same cpu core. irqpoll already maintains a percpu list and dispatches ->poll with a budget that the backend enforces, and irqpoll multiplexes between them. Having this mechanism in irq (hard or threaded) context sounds a bit unnecessary. It seems like we're attempting to stay in irq context for as long as we can instead of scheduling to softirq/thread context if we have more than a minimal amount of work to do. Without at least understanding why softirq/thread degrades us so much, this code seems like the wrong approach to me. Interrupt context will always be faster, but that is not a sufficient reason to spend as much time as possible there, is it? We should also keep in mind that the networking stack has been doing this for years; I would try to understand why this cannot work for nvme before dismissing it. Especially since Long found that context switches increased a lot after applying your irq poll patch.
http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html

Oh, I didn't see that one, wonder why... thanks! 5% improvement, I guess we can buy that for other users as is :) If we suffer from lots of context switches while the CPU is flooded with interrupts, then I would argue that we're re-raising softirq too much. In this use-case, my assumption is that the cpu cannot keep up with the interrupts, not that it doesn't reap enough (we also reap the first batch in interrupt context...). Perhaps making irqpoll continue until it must resched would improve things further? Although this is a latency vs. efficiency tradeoff, looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
 * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
 * certain cases, such as stop_machine(), jiffies may cease to
 * increment and so we need the MAX_SOFTIRQ_RESTART limit as
 * well to make sure we eventually return from this method.
 *
 * These limits have been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
"

Long, does this patch make any difference? Sagi, sorry it took a while to bring my system back online. With the patch, the IOPS drop is about the same as with the 1st patch. I think the excessive context switches are causing the drop in IOPS. The following are captured by "perf sched record" for 30 seconds during tests.
"perf sched latency" With patch: fio:(82) | 937632.706 ms | 1782255 | avg:0.209 ms | max: 63.123 ms | max at:768.274023 s without patch: fio:(82) |2348323.432 ms |18848 | avg:0.295 ms | max: 28.446 ms | max at: 6447.310255 s Without patch means the proposed hard-irq patch? If we are context switching too much, it means the soft-irq operation is not efficient, not necessarily the fact that the completion path is running in soft-irq.. Is your kernel compiled with full preemption or voluntary preemption? Look closer at each CPU, we can see ksoftirqd is competing CPU with fio (and effectively throttle other fio processes) (captured in /sys/kernel/debug/tracing, echo sched:* >set_event) On CPU1 with patch: (note that the prev_state for fio is "R", it's preemptively scheduled) <...>-4077 [001] d... 66456.805062: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120 <...>-17[001] d... 66456.805859: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=fio next_pid=4077 next_prio=120 <...>-4077 [001] d... 66456.844049: sched_switch: prev_comm=fio prev_pid=4077 prev_prio=120 prev_state=R ==> next_comm=ksoftirqd/1 next_pid=17 next_prio=120 <...>-17[001] d... 66456.844607: sched_switch: prev_comm=ksoftirqd/1 prev_pid=17 prev_prio=120 prev_state=S ==> next_comm=f
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
It seems like we're attempting to stay in irq context for as long as we can instead of scheduling to softirq/thread context if we have more than a minimal amount of work to do. Without at least understanding why softirq/thread degrades us so much this code seems like the wrong approach to me. Interrupt context will always be faster, but it is not a sufficient reason to spend as much time as possible there, is it? If extra latency is added in IO completion path, this latency will be introduced in the submission path, because the hw queue depth is fixed, which is often small. Especially in case of multiple submission vs. single(shared) completion, the whole hw queue tags can be exhausted easily. This is why the patch is reaping the first batch from hard-irq, but only if it has more will defer to softirq. So I'm not sure the short QD use case applies here... I guess no such effect for networking IO. Maybe, maybe not...
Re: [PATCH] Added QUIRKs for ADATA XPG SX8200 Pro 512GB
This does not apply on nvme-5.4, can you please respin a patch that cleanly applies?
Re: [PATCH] Added QUIRKs for ADATA XPG SX8200 Pro 512GB
Reviewed-by: Sagi Grimberg
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Hey Ming, Ok, so the real problem is per-cpu bounded tasks. I share Thomas' opinion about a NAPI-like approach. We already have that, it's irq_poll, but it seems that for this use-case we get lower performance for some reason. I'm not entirely sure why that is; maybe it's because we need to mask interrupts because we don't have an "arm" register in nvme like network devices have? Long observed that IOPS drops too much by switching to threaded irq. If softirqd is woken up for handling softirq, the performance shouldn't be better than threaded irq. It's true that it shouldn't be any faster, but what irqpoll already has, and we don't need to reinvent, is a proper budgeting mechanism that needs to occur when multiple devices map irq vectors to the same cpu core. irqpoll already maintains a percpu list and dispatches ->poll with a budget that the backend enforces, and irqpoll multiplexes between them. Having this mechanism in irq (hard or threaded) context sounds a bit unnecessary. It seems like we're attempting to stay in irq context for as long as we can instead of scheduling to softirq/thread context if we have more than a minimal amount of work to do. Without at least understanding why softirq/thread degrades us so much, this code seems like the wrong approach to me. Interrupt context will always be faster, but that is not a sufficient reason to spend as much time as possible there, is it? We should also keep in mind that the networking stack has been doing this for years; I would try to understand why this cannot work for nvme before dismissing it. Especially since Long found that context switches increased a lot after applying your irq poll patch. http://lists.infradead.org/pipermail/linux-nvme/2019-August/026788.html Oh, I didn't see that one, wonder why... thanks! 5% improvement, I guess we can buy that for other users as is :) If we suffer from lots of context switches while the CPU is flooded with interrupts, then I would argue that we're re-raising softirq too much.
In this use-case, my assumption is that the cpu cannot keep up with the interrupts, not that it doesn't reap enough (we also reap the first batch in interrupt context...). Perhaps making irqpoll continue until it must resched would improve things further? Although this is a latency vs. efficiency tradeoff, looks like MAX_SOFTIRQ_TIME is set to 2ms:

"
 * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
 * certain cases, such as stop_machine(), jiffies may cease to
 * increment and so we need the MAX_SOFTIRQ_RESTART limit as
 * well to make sure we eventually return from this method.
 *
 * These limits have been established via experimentation.
 * The two things to balance is latency against fairness -
 * we want to handle softirqs as soon as possible, but they
 * should not be able to lock up the box.
"

Long, does this patch make any difference?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..d8eab563fa77 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,8 +12,6 @@
 #include
 #include

-static unsigned int irq_poll_budget __read_mostly = 256;
-
 static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);

 /**
@@ -77,42 +75,29 @@ EXPORT_SYMBOL(irq_poll_complete);
 static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 {
-	struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-	int rearm = 0, budget = irq_poll_budget;
-	unsigned long start_time = jiffies;
+	struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+	LIST_HEAD(list);

 	local_irq_disable();
+	list_splice_init(irqpoll_list, &list);
+	local_irq_enable();

-	while (!list_empty(list)) {
+	while (!list_empty(&list)) {
 		struct irq_poll *iop;
 		int work, weight;

-		/*
-		 * If softirq window is exhausted then punt.
-		 */
-		if (budget <= 0 || time_after(jiffies, start_time)) {
-			rearm = 1;
-			break;
-		}
-
-		local_irq_enable();
-
 		/* Even though interrupts have been re-enabled, this
 		 * access is safe because interrupts can only add new
 		 * entries to the tail of this list, and only ->poll()
 		 * calls can remove this head entry from the list.
 		 */
-		iop = list_entry(list->next, struct irq_poll, list);
+		iop = list_first_entry(&list, struct irq_poll, list);

 		weight = iop->weight;
 		work = 0;
 		if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
 			work = iop->poll(iop, weight);

-		budget -= work;
-
-		local_irq_disable();
-
 		/*
 		 * Drivers must not modify the iopoll state, if they
 		 * consume their assigned weight (or more, some drivers can't
@@ -125,11 +110,21
Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
Ok, so the real problem is per-cpu bounded tasks. I share Thomas opinion about a NAPI like approach. We already have that, its irq_poll, but it seems that for this use-case, we get lower performance for some reason. I'm not entirely sure why that is, maybe its because we need to mask interrupts because we don't have an "arm" register in nvme like network devices have?
Re: [PATCH] nvmet: Use PTR_ERR_OR_ZERO() in nvmet_init_discovery()
Reviewed-by: Sagi Grimberg applied to nvme-5.4
Re: [PATCHv2] nvme: Assign subsy instance from first ctrl
Looks good, Reviewed-by: Sagi Grimberg Applied to nvme-5.4
Re: [PATCH] nvme: tcp: remove redundant assignment to variable ret
Applied to nvme-5.4
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
Still don't understand how this is ok... I have /dev/nvme0 representing a network endpoint that I would discover from; it is raising me an event to do a discovery operation (namely to issue an ioctl to it), so my udev code calls a systemd script. By the time I actually get to do that, /dev/nvme0 represents a new network endpoint (where the event is no longer relevant). I would rather the discovery explicitly fail than give me something different, so we pass some arguments that we verify in the operation. It's a stretch case, but it was raised by people as a potential issue. Ok, and how do you handle this same thing for something like /dev/sda? (hint: it isn't new, and is something we solved over a decade ago) If you worry about stuff like this, use a persistent device naming scheme and have your device node be pointed to by a symlink. Create that symlink using the information in the initial 'ADD' uevent. That way, when userspace opens the device node, it "knows" exactly which one it opens. It sounds like you have a bunch of metadata to describe these uniquely, so pass that in the ADD event, not in some other crazy non-standard manner. We could send these variables when adding the device and then validate them using a rich-text-explanatory symlink. Seems slightly backwards to me, but that would work too. We create the char device using device_add (in nvme_init_subsystem); I didn't find any way to append env variables to that ADD uevent. Did you mean that we should add another flavor of device_add that accepts char *envp_ext[]? What exactly is the "standard manner" to pass these variables to the chardev KOBJ_ADD uevent? All other examples I could find use KOBJ_CHANGE to pass private stuff...
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event. The same is generally true for lots of kernel devices. We could reduce the chance by using the idr cyclic allocator. Well, it was raised by Hannes and James, so I'll ask them to respond here, because I don't mind having it this way. I personally think that this is a better approach than having a cyclic idr allocator. In general, I don't necessarily think it is a good idea to have cyclic controller enumerations if we don't absolutely have to... We hit it right and left without the cyclic allocator, but that won't necessarily remove it. Perhaps we should have had a unique token assigned to the controller, and have the event pass the name and the token. The cli would then, if the token is present, validate it via an ioctl before proceeding with other ioctls. All the connection arguments were added due to the reuse issue, and then to solve the question of how to verify and/or look up the desired controller, using the shotgun approach rather than being very pointed, which is what the name/token would do. This unique token is: trtype:traddr:trsvcid:host-traddr ...
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
You are correct that this information can be derived from sysfs, but the main reason why we add these here is because in a udev rule we can't just go ahead and start looking these up and parsing them. We could send the discovery aen with NVME_CTRL_NAME and then have systemd run something like: nvme connect-all -d nvme0 --sysfs and have nvme-cli retrieve all this stuff from sysfs? Actually that may be a problem. There could be a hypothetical case where, after the event was fired and before it was handled, the discovery controller went away and came back again with a different controller instance, and the old instance is now a different discovery controller. This is why we need this information in the event, and we verify this information in sysfs in nvme-cli. Well, that must be a usual issue with uevents, right? Don't we usually have an increasing serial number for that or something? Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event. So? You will have gotten the remove and then new addition uevent in order showing you this. So your userspace code knows that something went away and then came back properly, so you should be kept in sync. Still don't understand how this is ok... I have /dev/nvme0 representing a network endpoint that I would discover from; it is raising me an event to do a discovery operation (namely to issue an ioctl to it), so my udev code calls a systemd script. By the time I actually get to do that, /dev/nvme0 represents a new network endpoint (where the event is no longer relevant). I would rather the discovery explicitly fail than give me something different, so we pass some arguments that we verify in the operation.
Its a stretch case, but it was raised by people as a potential issue.
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event. The same is generally true for lots of kernel devices. We could reduce the chance by using the idr cyclic allocator. Well, it was raised by Hannes and James, so I'll ask them to respond here, because I don't mind having it this way. I personally think that this is a better approach than having a cyclic idr allocator. In general, I don't necessarily think it is a good idea to have cyclic controller enumerations if we don't absolutely have to...
Re: [PATCH v2 3/3] nvme: fire discovery log page change events to userspace
You are correct that this information can be derived from sysfs, but the main reason why we add these here is because in a udev rule we can't just go ahead and start looking these up and parsing them. We could send the discovery aen with NVME_CTRL_NAME and then have systemd run something like: nvme connect-all -d nvme0 --sysfs and have nvme-cli retrieve all this stuff from sysfs? Actually that may be a problem. There could be a hypothetical case where, after the event was fired and before it was handled, the discovery controller went away and came back again with a different controller instance, and the old instance is now a different discovery controller. This is why we need this information in the event, and we verify this information in sysfs in nvme-cli. Well, that must be a usual issue with uevents, right? Don't we usually have an increasing serial number for that or something? Yes we do, userspace should use it to order events. Does udev not handle that properly today? The problem is not ordering of events; it's really about the fact that the chardev can be removed and reallocated for a different controller (could be a completely different discovery controller) by the time that userspace handles the event.
Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size
I'll fix it. Note that I'm going to take it out of the tree soon because it will have conflicts with Jens' for-5.4/block, so we will send it to Jens after the initial merge window, after he rebases off of Linus. Conflicts too hard to fix up at merge time? Otherwise I could just rebase on top of Jens and put it in a topic branch... The quirk enumeration conflicts with 5.3-rc. Not a big deal, just thought it'd be easier to handle that way. Rebasing on top of Jens won't help because his for-5.4/block does not have the rc code yet.
Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size
+#define NVME_NVM_ADMSQES	6
 #define NVME_NVM_IOSQES		6
 #define NVME_NVM_IOCQES		4

The NVM in the two defines here stands for the NVM command set, so this should just be named NVME_ADM_SQES or so. But except for this the patch looks good: Reviewed-by: Christoph Hellwig So maybe Sagi can just fix this up in the tree. Ah ok, I missed the meaning. Thanks. Sagi, can you fix that up or do you need me to resubmit? I'll fix it. Note that I'm going to take it out of the tree soon because it will have conflicts with Jens' for-5.4/block, so we will send it to Jens after the initial merge window, after he rebases off of Linus.
Re: [PATCH v7 08/14] nvmet-core: allow one host per passthru-ctrl
I don't understand why we don't limit a regular ctrl to single access but we do it for the PT ctrl. I guess the block layer helps to sync between multiple accesses in parallel, but we can do that as well. Also, let's say you limit the access to this subsystem to 1 user; the bdev is still accessible to a local user, and you can also create a different subsystem that will use this device (PT and non-PT ctrl). Sagi, can you explain the trouble you meant and how this limitation solves it? It's different emulating the controller with all its admin commands vs. passing them through to the nvme device (think of Format NVM). We don't need to support the format command for the PT ctrl, as we don't support other commands such as create_sq/cq. That is just an example; basically every command that we are not aware of we simply pass through to the drive without knowing the implications in a multi-host environment.
Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
Sagi, Here are the test results. Benchmark command: fio --bs=4k --ioengine=libaio --iodepth=64 --filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1 --direct=1 --runtime=90 --numjobs=80 --rw=randread --name=test --group_reporting --gtod_reduce=1 With your patch: 1720k IOPS With threaded interrupts: 1320k IOPS With just interrupts: 3720k IOPS Interrupts are the fastest but we need to find a way to throttle it. This is the workload that generates the flood? If so I did not expect that this would be the perf difference.. If completions keep pounding on the cpu, I would expect irqpoll to simply keep running forever and poll the cqs. There is no fundamental reason why polling would be faster in an interrupt, what matters could be: 1. we reschedule more than we need to 2. we don't reap enough completions in every poll round, which will trigger rearming the interrupt and then when it fires reschedule another softirq... Maybe we need to take care of some irq_poll optimizations? Does this (untested) patch make any difference? 
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2f17b488d58e..0e94183eba15 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -12,7 +12,8 @@
 #include
 #include
 
-static unsigned int irq_poll_budget __read_mostly = 256;
+static unsigned int irq_poll_budget __read_mostly = 3000;
+unsigned int __read_mostly irqpoll_budget_usecs = 2000;
 
 static DEFINE_PER_CPU(struct list_head, blk_cpu_iopoll);
 
@@ -77,32 +78,26 @@ EXPORT_SYMBOL(irq_poll_complete);
 
 static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 {
-	struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
-	int rearm = 0, budget = irq_poll_budget;
-	unsigned long start_time = jiffies;
+	struct list_head *irqpoll_list = this_cpu_ptr(&blk_cpu_iopoll);
+	unsigned int budget = irq_poll_budget;
+	unsigned long time_limit =
+		jiffies + usecs_to_jiffies(irqpoll_budget_usecs);
+	LIST_HEAD(list);
 
 	local_irq_disable();
+	list_splice_init(irqpoll_list, &list);
+	local_irq_enable();
 
-	while (!list_empty(list)) {
+	while (!list_empty(&list)) {
 		struct irq_poll *iop;
 		int work, weight;
 
-		/*
-		 * If softirq window is exhausted then punt.
-		 */
-		if (budget <= 0 || time_after(jiffies, start_time)) {
-			rearm = 1;
-			break;
-		}
-
-		local_irq_enable();
-
 		/* Even though interrupts have been re-enabled, this
 		 * access is safe because interrupts can only add new
 		 * entries to the tail of this list, and only ->poll()
 		 * calls can remove this head entry from the list.
 		 */
-		iop = list_entry(list->next, struct irq_poll, list);
+		iop = list_first_entry(&list, struct irq_poll, list);
 
 		weight = iop->weight;
 		work = 0;
@@ -111,8 +106,6 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 
 		budget -= work;
 
-		local_irq_disable();
-
 		/*
 		 * Drivers must not modify the iopoll state, if they
 		 * consume their assigned weight (or more, some drivers can't
@@ -125,11 +118,21 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 			if (test_bit(IRQ_POLL_F_DISABLE, &iop->state))
 				__irq_poll_complete(iop);
 			else
-				list_move_tail(&iop->list, list);
+				list_move_tail(&iop->list, &list);
 		}
+
+		/*
+		 * If softirq window is exhausted then punt.
+		 */
+		if (budget <= 0 || time_after_eq(jiffies, time_limit))
+			break;
 	}
 
-	if (rearm)
+	local_irq_disable();
+
+	list_splice_tail_init(irqpoll_list, &list);
+	list_splice(&list, irqpoll_list);
+	if (!list_empty(irqpoll_list))
 		__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
 
 	local_irq_enable();
--
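The core of the patch above — splice the per-CPU list, reap up to each poller's weight per round, stop on a combined budget/time limit, and re-raise the softirq only if work remains — can be modeled in a small userspace sketch. This is a toy model under stated assumptions, not the kernel code; all `toy_*` names are illustrative and the time limit is elided:

```c
#include <stddef.h>

/* Toy model of the reworked irq_poll_softirq() budgeting: each poller
 * reaps up to 'weight' completions per round; the loop stops when the
 * total budget is spent, and reports whether another softirq round is
 * needed (the equivalent of re-raising IRQ_POLL_SOFTIRQ). */
struct toy_poller {
    int pending;  /* completions still queued on this context */
    int weight;   /* max completions reaped per poll round */
};

/* Returns total completions reaped; *rearm is set if work remains. */
static int toy_poll_loop(struct toy_poller *p, size_t n, int budget, int *rearm)
{
    int done = 0;
    size_t i = 0;

    while (budget > 0) {
        size_t scanned;

        /* find the next poller with pending work (round-robin) */
        for (scanned = 0; scanned < n && p[i].pending == 0; scanned++)
            i = (i + 1) % n;
        if (scanned == n)
            break;  /* all queues drained */

        int work = p[i].pending < p[i].weight ? p[i].pending : p[i].weight;
        if (work > budget)
            work = budget;
        p[i].pending -= work;
        done += work;
        budget -= work;
        i = (i + 1) % n;  /* rotate, like list_move_tail() in the patch */
    }

    *rearm = 0;
    for (size_t j = 0; j < n; j++)
        if (p[j].pending)
            *rearm = 1;
    return done;
}
```

The design point the model captures: exhausting the budget mid-list no longer drops the remaining pollers; they stay queued and the softirq is re-raised, which is what the `list_splice_tail_init()`/`list_splice()` dance at the end of the patch achieves.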
Re: [PATCH 3/3] nvme: complete request in work queue on CPU with flooded interrupts
From: Long Li

When an NVMe hardware queue is mapped to several CPU queues, it is possible that the CPU this hardware queue is bound to is flooded by returning I/O for other CPUs. For example, consider the following scenario:
1. CPUs 0, 1, 2 and 3 share the same hardware queue
2. the hardware queue interrupts CPU 0 for I/O response
3. processes on CPUs 1, 2 and 3 keep sending I/Os

CPU 0 may be flooded with interrupts from the NVMe device that are I/O responses for CPUs 1, 2 and 3. Under heavy I/O load, it is possible that CPU 0 spends all its time serving NVMe and other system interrupts, but doesn't get a chance to run in process context.

To fix this, CPU 0 can schedule a work item to complete the I/O request when it detects the scheduler is not making progress. This serves multiple purposes:
1. This CPU has to be scheduled to complete the request. The other CPUs can't issue more I/Os until some previous I/Os are completed. This helps this CPU get out of NVMe interrupts.
2. This acts as a throttling mechanism for NVMe devices, in that the device cannot starve a CPU while servicing I/Os from other CPUs.
3. This CPU can make progress on RCU and other work items on its queue.

The problem is indeed real, but this is the wrong approach in my mind. We already have irqpoll, which takes care of properly budgeting polling cycles and not hogging the cpu. I've sent an rfc for this particular problem before [1]. At the time, IIRC, Christoph suggested that we poll the first batch directly from the irq context and reap the rest in the irqpoll handler.
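The throttling argument in point 1 and 2 above is a back-pressure effect: submitters can keep at most their iodepth outstanding, so once the completing CPU stops reaping (because completion was deferred to process context), new arrivals stop too. A toy userspace sketch of that invariant, with illustrative `toy_*` names (not any real kernel or blk-mq API):

```c
/* Toy model of the back-pressure throttle: submitters can keep at most
 * 'qd' I/Os outstanding, so if the completing CPU defers completion to
 * a work item, the interrupt arrival rate is capped by the iodepth. */
struct toy_queue {
    int outstanding;  /* submitted but not yet completed */
    int qd;           /* per-submitter iodepth limit */
};

/* Submit as many I/Os as the iodepth allows; returns how many were sent. */
static int toy_submit(struct toy_queue *q, int want)
{
    int room = q->qd - q->outstanding;
    int sent = want < room ? want : room;
    if (sent < 0)
        sent = 0;
    q->outstanding += sent;
    return sent;
}

/* Complete up to 'n' I/Os (what the deferred work item would reap). */
static int toy_complete(struct toy_queue *q, int n)
{
    int done = n < q->outstanding ? n : q->outstanding;
    q->outstanding -= done;
    return done;
}
```

With iodepth 64, a burst of 100 submissions sends only 64; further submissions send nothing until the (deferred) completion path reaps some, which is exactly why deferring completion lets the flooded CPU catch up.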
[1]: http://lists.infradead.org/pipermail/linux-nvme/2016-October/006497.html

How about something like this instead:
--
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 71127a366d3c..84bf16d75109 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 
 #include "trace.h"
 #include "nvme.h"
@@ -32,6 +33,7 @@
 
 #define CQ_SIZE(q)	((q)->q_depth * sizeof(struct nvme_completion))
 #define SGES_PER_PAGE	(PAGE_SIZE / sizeof(struct nvme_sgl_desc))
+#define NVME_POLL_BUDGET_IRQ	256
 
 /*
  * These can be higher, but we need to ensure that any command doesn't
@@ -189,6 +191,7 @@ struct nvme_queue {
 	u32 *dbbuf_cq_db;
 	u32 *dbbuf_sq_ei;
 	u32 *dbbuf_cq_ei;
+	struct irq_poll iop;
 	struct completion delete_done;
 };
 
@@ -1015,6 +1018,23 @@ static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start,
 	return found;
 }
 
+static int nvme_irqpoll_handler(struct irq_poll *iop, int budget)
+{
+	struct nvme_queue *nvmeq = container_of(iop, struct nvme_queue, iop);
+	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
+	u16 start, end;
+	int completed;
+
+	completed = nvme_process_cq(nvmeq, &start, &end, budget);
+	nvme_complete_cqes(nvmeq, start, end);
+	if (completed < budget) {
+		irq_poll_complete(&nvmeq->iop);
+		enable_irq(pci_irq_vector(pdev, nvmeq->cq_vector));
+	}
+
+	return completed;
+}
+
 static irqreturn_t nvme_irq(int irq, void *data)
 {
 	struct nvme_queue *nvmeq = data;
@@ -1028,12 +1048,16 @@ static irqreturn_t nvme_irq(int irq, void *data)
 	rmb();
 	if (nvmeq->cq_head != nvmeq->last_cq_head)
 		ret = IRQ_HANDLED;
-	nvme_process_cq(nvmeq, &start, &end, -1);
+	nvme_process_cq(nvmeq, &start, &end, NVME_POLL_BUDGET_IRQ);
 	nvmeq->last_cq_head = nvmeq->cq_head;
 	wmb();
 
 	if (start != end) {
 		nvme_complete_cqes(nvmeq, start, end);
+		if (nvme_cqe_pending(nvmeq)) {
+			disable_irq_nosync(irq);
+			irq_poll_sched(&nvmeq->iop);
+		}
 		return IRQ_HANDLED;
 	}
 
@@ -1347,6 +1371,7 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
+	irq_poll_disable(&nvmeq->iop);
 	dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
 			(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
 	if (!nvmeq->sq_cmds)
@@ -1481,6 +1506,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
 	nvmeq->dev = dev;
 	spin_lock_init(&nvmeq->sq_lock);
 	spin_lock_init(&nvmeq->cq_poll_lock);
+	irq_poll_init(&nvmeq->iop, NVME_POLL_BUDGET_IRQ, nvme_irqpoll_handler);
 	nvmeq->cq_head = 0;
 	nvmeq->cq_phase = 1;
 	nvmeq->q_db = &dev->dbs[qid * 2 * dev->db_stride];
--
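The handoff the patch above implements — reap at most NVME_POLL_BUDGET_IRQ CQEs in hard-irq context, and if more are pending, mask the interrupt and let irqpoll drain the rest until a short round re-enables it — can be sketched as a two-state machine in userspace. A toy model only; the `toy_*` names are made up and the real handlers also complete requests and ring doorbells:

```c
enum toy_state { TOY_IRQ_ENABLED, TOY_IRQ_DISABLED_POLLING };

/* Toy model of the irq/irqpoll handoff: the hard-irq handler reaps at
 * most 'budget' CQEs; if any are still pending it masks the interrupt
 * (disable_irq_nosync + irq_poll_sched), and the irqpoll handler
 * re-enables it once a round reaps less than a full budget. */
struct toy_cq {
    int pending;          /* CQEs waiting in the completion queue */
    enum toy_state state;
};

static int toy_irq_handler(struct toy_cq *cq, int budget)
{
    int done = cq->pending < budget ? cq->pending : budget;

    cq->pending -= done;
    if (cq->pending)  /* nvme_cqe_pending(): switch to polled mode */
        cq->state = TOY_IRQ_DISABLED_POLLING;
    return done;
}

static int toy_irqpoll_handler(struct toy_cq *cq, int budget)
{
    int done = cq->pending < budget ? cq->pending : budget;

    cq->pending -= done;
    if (done < budget)  /* drained: irq_poll_complete() + enable_irq() */
        cq->state = TOY_IRQ_ENABLED;
    return done;
}
```

Starting from 600 pending CQEs with a budget of 256, the irq handler reaps 256 and switches to polling; one full irqpoll round reaps another 256 and stays polled; the final short round reaps the remaining 88 and re-arms the interrupt. The masked-interrupt window is what bounds hard-irq time on the flooded CPU.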