Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-02 Thread Keith Busch
On Tue, Mar 02, 2021 at 08:18:40AM +0100, Hannes Reinecke wrote:
> On 3/1/21 9:59 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> >> On 3/1/21 5:05 PM, Keith Busch wrote:
> >>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
>  On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >> Crashing is bad, silent data corruption is worse. Is there truly no
> >> defense against that? If not, why should anyone rely on this?
> >
> > If we receive a response for which we don't have a started request, we
> > know that something is wrong. Couldn't we just reset the connection
> > in this case? We don't have to pretend nothing has happened and
> > continue normally. This would avoid a host crash and would not create
> > (more) data corruption. Or am I just too naive?
> >
>  This is actually a sensible solution.
>  Please send a patch for that.
> >>>
> >>> Is a bad frame a problem that can be resolved with a reset?
> >>>
> >>> Even if so, the reset doesn't indicate to the user if previous commands
> >>> completed with bad data, so it still seems unreliable.
> >>>
> >> We need to distinguish two cases here.
> >> The first is us receiving a frame with an invalid tag, leading to a crash.
> >> This can be easily resolved by issuing a reset, as clearly the command was
> >> garbage and we need to invoke error handling (which is reset).
> >>
> >> The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
> >> which is _currently_ valid. This is a case which will fail _even now_, as
> >> we have simply no way of detecting this.
> >>
> >> So what again do we miss by fixing the first case?
> >> Apart from a system which does _not_ crash?
> > 
> > I'm just saying each case is a symptom of the same problem. The only
> > difference from observing one vs the other is a race with the host's
> > dispatch. And since you're proposing this patch, it sounds like this
> > condition does happen on tcp compared to other transports where we don't
> > observe it. I just thought the implication that data corruption happens
> > is alarming.
> > 
> Oh yes, it is.
> But sadly TCP inherently suffers from this, as literally anyone can
> spoof frames on the network.
> Other transports like RDMA or FC do not suffer to that extent, as
> spoofing frames there is far more elaborate, and not really possible
> without dedicated hardware equipment.
> 
> That's why there is header and data digest; that will protect you
> against accidental frame corruption (as this case clearly is; the
> remainder of the frame is filled with zeroes).
> It would not protect you against deliberate frame corruption; that's why
> there is TPAR 8010 (TLS encryption for NVMe-TCP).
> 
> Be that as it may, none of these methods are in use here, and none of
> these methods can be made mandatory. So we need to deal with the case at
> hand.
> 
> And in my opinion crashing is the _worst_ option of all.
> Tear the connection down, reset the thing, whatever.
> 
> But do not crash.
> Customers tend to take a very dim view of crashing machines, and have
> very limited patience for our reasoning in these cases.

I was pushing the data corruption angle because fixing that should
address both cases. If there's really nothing you can do about
corruption, your approach here makes sense, and I defer to Sagi and
Christoph for inclusion.

I still wouldn't trust my data to storage behaving this way, though. :)


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-02 Thread Hannes Reinecke
On 3/1/21 9:59 PM, Keith Busch wrote:
> On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
>> On 3/1/21 5:05 PM, Keith Busch wrote:
>>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
 On 3/1/21 2:26 PM, Daniel Wagner wrote:
> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>> Crashing is bad, silent data corruption is worse. Is there truly no
>> defense against that? If not, why should anyone rely on this?
>
> If we receive a response for which we don't have a started request, we
> know that something is wrong. Couldn't we just reset the connection
> in this case? We don't have to pretend nothing has happened and
> continue normally. This would avoid a host crash and would not create
> (more) data corruption. Or am I just too naive?
>
 This is actually a sensible solution.
 Please send a patch for that.
>>>
>>> Is a bad frame a problem that can be resolved with a reset?
>>>
>>> Even if so, the reset doesn't indicate to the user if previous commands
>>> completed with bad data, so it still seems unreliable.
>>>
>> We need to distinguish two cases here.
>> The first is us receiving a frame with an invalid tag, leading to a crash.
>> This can be easily resolved by issuing a reset, as clearly the command was
>> garbage and we need to invoke error handling (which is reset).
>>
>> The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
>> which is _currently_ valid. This is a case which will fail _even now_, as we
>> have simply no way of detecting this.
>>
>> So what again do we miss by fixing the first case?
>> Apart from a system which does _not_ crash?
> 
> I'm just saying each case is a symptom of the same problem. The only
> difference from observing one vs the other is a race with the host's
> dispatch. And since you're proposing this patch, it sounds like this
> condition does happen on tcp compared to other transports where we don't
> observe it. I just thought the implication that data corruption happens
is alarming.
> 
Oh yes, it is.
But sadly TCP inherently suffers from this, as literally anyone can
spoof frames on the network.
Other transports like RDMA or FC do not suffer to that extent, as
spoofing frames there is far more elaborate, and not really possible
without dedicated hardware equipment.

That's why there is header and data digest; that will protect you
against accidental frame corruption (as this case clearly is; the
remainder of the frame is filled with zeroes).
It would not protect you against deliberate frame corruption; that's why
there is TPAR 8010 (TLS encryption for NVMe-TCP).
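
For reference, the header digest is just a CRC32C over the PDU header,
carried right behind it. A minimal sketch of the check, hand-rolled with
the crc32c() library helper (the driver's nvme_tcp_verify_hdgst() goes
through the crypto API instead, and hdgst_ok() is a made-up name):

#include <linux/crc32c.h>

/* true if the received header digest matches the header bytes */
static bool hdgst_ok(const void *pdu, u8 hlen)
{
        __le32 recv_digest = *(const __le32 *)((const u8 *)pdu + hlen);
        u32 calc = ~crc32c(~0, pdu, hlen);  /* crc32c seed/finalization */

        return recv_digest == cpu_to_le32(calc);
}

With digests negotiated, a zero-filled frame like this one would fail
the check before anyone looks at the command id.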

Be that as it may, none of these methods are in use here, and none of
these methods can be made mandatory. So we need to deal with the case at
hand.

And in my opinion crashing is the _worst_ option of all.
Tear the connection down, reset the thing, whatever.

But do not crash.
Customers tend to take a very dim view of crashing machines, and have
very limited patience for our reasoning in these cases.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Keith Busch
On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> On 3/1/21 5:05 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> > > On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> > > > > Crashing is bad, silent data corruption is worse. Is there truly no
> > > > > defense against that? If not, why should anyone rely on this?
> > > > 
> > > > If we receive a response for which we don't have a started request, we
> > > > know that something is wrong. Couldn't we just reset the connection
> > > > in this case? We don't have to pretend nothing has happened and
> > > > continue normally. This would avoid a host crash and would not create
> > > > (more) data corruption. Or am I just too naive?
> > > > 
> > > This is actually a sensible solution.
> > > Please send a patch for that.
> > 
> > Is a bad frame a problem that can be resolved with a reset?
> > 
> > Even if so, the reset doesn't indicate to the user if previous commands
> > completed with bad data, so it still seems unreliable.
> > 
> We need to distinguish two cases here.
> The first is us receiving a frame with an invalid tag, leading to a crash.
> This can be easily resolved by issuing a reset, as clearly the command was
> garbage and we need to invoke error handling (which is reset).
> 
> The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
> which is _currently_ valid. This is a case which will fail _even now_, as we
> have simply no way of detecting this.
> 
> So what again do we miss by fixing the first case?
> Apart from a system which does _not_ crash?

I'm just saying each case is a symptom of the same problem. The only
difference from observing one vs the other is a race with the host's
dispatch. And since you're proposing this patch, it sounds like this
condition does happen on tcp compared to other transports where we don't
observe it. I just thought the implication that data corruption happens
is alarming.


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Hannes Reinecke

On 3/1/21 5:05 PM, Keith Busch wrote:

On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:

On 3/1/21 2:26 PM, Daniel Wagner wrote:

On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:

Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?


If we receive a response for which we don't have a started request, we
know that something is wrong. Couldn't we just reset the connection
in this case? We don't have to pretend nothing has happened and
continue normally. This would avoid a host crash and would not create
(more) data corruption. Or am I just too naive?


This is actually a sensible solution.
Please send a patch for that.


Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.


We need to distinguish two cases here.
The first is us receiving a frame with an invalid tag, leading to a
crash. This can be easily resolved by issuing a reset, as clearly the 
command was garbage and we need to invoke error handling (which is reset).


The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag
which is _currently_ valid. This is a case which will fail _even now_, 
as we have simply no way of detecting this.


So what again do we miss by fixing the first case?
Apart from a system which does _not_ crash?

Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Keith Busch
On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >> Crashing is bad, silent data corruption is worse. Is there truly no
> >> defense against that? If not, why should anyone rely on this?
> > 
> > If we receive a response for which we don't have a started request, we
> > know that something is wrong. Couldn't we just reset the connection
> > in this case? We don't have to pretend nothing has happened and
> > continue normally. This would avoid a host crash and would not create
> > (more) data corruption. Or am I just too naive?
> > 
> This is actually a sensible solution.
> Please send a patch for that.

Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Hannes Reinecke
On 3/1/21 2:26 PM, Daniel Wagner wrote:
> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>> Crashing is bad, silent data corruption is worse. Is there truly no
>> defense against that? If not, why should anyone rely on this?
> 
> If we receive a response for which we don't have a started request, we
> know that something is wrong. Couldn't we just reset the connection
> in this case? We don't have to pretend nothing has happened and
> continue normally. This would avoid a host crash and would not create
> (more) data corruption. Or am I just too naive?
> 
This is actually a sensible solution.
Please send a patch for that.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-03-01 Thread Daniel Wagner
On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> Crashing is bad, silent data corruption is worse. Is there truly no
> defense against that? If not, why should anyone rely on this?

If we receive a response for which we don't have a started request, we
know that something is wrong. Couldn't we just reset the connection
in this case? We don't have to pretend nothing has happened and
continue normally. This would avoid a host crash and would not create
(more) data corruption. Or am I just too naive?
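
Something like the following, perhaps. This is an untested sketch
against nvme_tcp_process_nvme_cqe() in drivers/nvme/host/tcp.c, where
nvme_tcp_error_recovery() is the helper the driver already uses for
other fatal transport errors:

        rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
        if (!rq || !blk_mq_request_started(rq)) {
                dev_err(queue->ctrl->ctrl.device,
                        "queue %d got completion for unknown tag %#x\n",
                        nvme_tcp_queue_id(queue), cqe->command_id);
                /* bad frame: leave the request alone and reset the
                 * association instead of completing anything */
                nvme_tcp_error_recovery(&queue->ctrl->ctrl);
                return -EINVAL;
        }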


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Keith Busch
On Fri, Feb 26, 2021 at 05:42:46PM +0100, Hannes Reinecke wrote:
> On 2/26/21 5:13 PM, Keith Busch wrote:
> > 
> > That's just addressing a symptom. You can't fully verify the request is
> > valid this way because the host could have started the same command ID
> > the very moment before the code checks it, incorrectly completing an
> > in-flight command and getting data corruption.
> > 
> Oh, I am fully aware.
> 
> Bad frames are just that, bad frames.
> We can only fully validate that when digests are enabled, but I gather that
> controllers sending out bad frames wouldn't want to enable digests, either.
> So relying on that is possibly not an option.
> 
> So really what I'm trying to avoid is the host crashing on a bad frame.
> That kind of thing always resonates badly with customers.
> And tripping over an uninitialized command is just too stupid IMO.

Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Hannes Reinecke

On 2/26/21 5:13 PM, Keith Busch wrote:

On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:

On 2/26/21 1:35 PM, Daniel Wagner wrote:

On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:

Well, I think we should probably figure out why that is happening first.


I got my hands on a tcpdump trace. I've trimmed it to this:


[ .. ]

NVM Express Fabrics TCP
  Pdu Type: CapsuleResponse (5)
  Pdu Specific Flags: 0x00
   .... ...0 = PDU Header Digest: Not set
   .... ..0. = PDU Data Digest: Not set
   .... .0.. = PDU Data Last: Not set
   .... 0... = PDU Data Success: Not set
  Pdu Header Length: 24
  Pdu Data Offset: 0
  Packet Length: 24
  Unknown Data: 020004001b001f00

  00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00   .AE.
0010  00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4   .L..@.@.&...
0020  c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18   ...D.O.X2...
0030  01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3   
0040  5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00   ]...
0050  00 00 1b 00 00 00 1f 00 00 00 ..


As I suspected, we did receive an invalid frame.
Data digest would have saved us, but then it's not enabled.

So we do need to check if the request is valid before processing it.


That's just addressing a symptom. You can't fully verify the request is
valid this way because the host could have started the same command ID
the very moment before the code checks it, incorrectly completing an
in-flight command and getting data corruption.


Oh, I am fully aware.

Bad frames are just that, bad frames.
We can only fully validate that when digests are enabled, but I gather 
that controllers sending out bad frames wouldn't want to enable digests, 
either. So relying on that is possibly not an option.


So really what I'm trying to avoid is the host crashing on a bad frame.
That kind of thing always resonates badly with customers.
And tripping over an uninitialized command is just too stupid IMO.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Keith Busch
On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
> On 2/26/21 1:35 PM, Daniel Wagner wrote:
> > On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> > > Well, I think we should probably figure out why that is happening first.
> > 
> > I got my hands on a tcpdump trace. I've trimmed it to this:
> > 
> > 
> > No. Time   SourceDestination   Protocol 
> > Length Info
> >1 0.00   10.228.194.30 10.228.38.214 NVMe
> >  138NVMe Write
> >2 0.000285   10.228.38.214 10.228.194.30 
> > NVMe/TCP 90 Ready To Transfer
> >3 0.000591   10.228.194.30 10.228.38.214 NVMe
> >  4186   NVMe Write: Data
> >4 0.000673   10.228.38.214 10.228.194.30 TCP 
> >  66 4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 
> > TSecr=1497295579
> >5 0.002140   10.228.38.214 10.228.194.30 NVMe
> >  90 NVMe Write: Response
> >6 0.002511   10.228.194.30 10.228.38.175 NVMe
> >  138NVMe Write
> >7 0.002812   10.228.38.175 10.228.194.30 
> > NVMe/TCP 90 Ready To Transfer
> >8 0.003006   10.228.194.30 10.228.38.175 NVMe
> >  4186   NVMe Write: Data
> >9 0.003098   10.228.38.175 10.228.194.30 TCP 
> >  66 4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 
> > TSecr=3601034207
> >   10 0.004420   10.228.38.175 10.228.194.30 NVMe
> >  90 NVMe Write: Response
> >   11 0.004890   10.228.38.214 10.228.194.30 
> > NVMe/TCP 90
> >   12 0.004969   10.228.38.175 10.228.194.30 
> > NVMe/TCP 90
> > 
> > 
> > The last few seconds contain only normal writes and suddenly the host
> > receives two invalid packets. From what I see the host doesn't misbehave
> > at all. I wonder if it would be possible to detect the invalid packet by
> > looking at the PDU header only. If that were possible, we could
> > discard it early and not try to use the invalid command id.
> > 
> > Here the last two packets with details:
> > 
> > 
> > No. Time   SourceDestination   Protocol 
> > Length Info
> >   11 0.004890   10.228.38.214 10.228.194.30 
> > NVMe/TCP 90
> > 
> > Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> >  Encapsulation type: Ethernet (1)
> >  Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
> >  [Time shift for this packet: 0.0 seconds]
> >  Epoch Time: 1614100568.780574000 seconds
> >  [Time delta from previous captured frame: 0.00047 seconds]
> >  [Time delta from previous displayed frame: 0.00047 seconds]
> >  [Time since reference or first frame: 0.00489 seconds]
> >  Frame Number: 11
> >  Frame Length: 90 bytes (720 bits)
> >  Capture Length: 90 bytes (720 bits)
> >  [Frame is marked: False]
> >  [Frame is ignored: False]
> >  [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> >  [Coloring Rule Name: TCP]
> >  [Coloring Rule String: tcp]
> > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: 
> > Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >  Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >  Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> >   ..0.     = LG bit: Globally unique address 
> > (factory default)
> >   ...0     = IG bit: Individual address 
> > (unicast)
> >  Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >  Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> >   ..0.     = LG bit: Globally unique address 
> > (factory default)
> >   ...0     = IG bit: Individual address 
> > (unicast)
> >  Type: IPv4 (0x0800)
> > Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
> >  0100  = Version: 4
> >   0101 = Header Length: 20 bytes (5)
> >  Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> >   00.. = Differentiated Services Codepoint: Default (0)
> >   ..00 = Explicit Congestion Notification: Not ECN-Capable 
> > Transport (0)
> >  Total Length: 76
> >  Identification: 0x (0)
> >  Flags: 0x40, Don't fragment
> >  0...  = Reserved bit: Not set
> >  .1..  = Don't fragment: Set
> >  ..0.  = More fragments: Not set
> >  Fragment Offset: 0
> >  Time to Live: 64
> >  Protocol: TCP (6)
> >  Header Checksum: 0x [validation disabled]
> >  [Header checksum status: Unverified]
> >  Source Address: 10.228.38.214
> >  Destination Address: 10.228.194.30
> > Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, 
> > 

Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Hannes Reinecke

On 2/26/21 1:35 PM, Daniel Wagner wrote:

On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:

Well, I think we should probably figure out why that is happening first.


I got my hands on a tcpdump trace. I've trimmed it to this:


No. Time   SourceDestination   Protocol 
Length Info
   1 0.00   10.228.194.30 10.228.38.214 NVMe 
138NVMe Write
   2 0.000285   10.228.38.214 10.228.194.30 NVMe/TCP 90 
Ready To Transfer
   3 0.000591   10.228.194.30 10.228.38.214 NVMe 
4186   NVMe Write: Data
   4 0.000673   10.228.38.214 10.228.194.30 TCP  66 
4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 
TSecr=1497295579
   5 0.002140   10.228.38.214 10.228.194.30 NVMe 90 
NVMe Write: Response
   6 0.002511   10.228.194.30 10.228.38.175 NVMe 
138NVMe Write
   7 0.002812   10.228.38.175 10.228.194.30 NVMe/TCP 90 
Ready To Transfer
   8 0.003006   10.228.194.30 10.228.38.175 NVMe 
4186   NVMe Write: Data
   9 0.003098   10.228.38.175 10.228.194.30 TCP  66 
4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 
TSecr=3601034207
  10 0.004420   10.228.38.175 10.228.194.30 NVMe 90 
NVMe Write: Response
  11 0.004890   10.228.38.214 10.228.194.30 NVMe/TCP 90
  12 0.004969   10.228.38.175 10.228.194.30 NVMe/TCP 90


The last few seconds contain only normal writes and suddenly the host
receives two invalid packets. From what I see the host doesn't misbehave
at all. I wonder if it would be possible to detect the invalid packet by
looking at the PDU header only. If that were possible, we could
discard it early and not try to use the invalid command id.

Here the last two packets with details:


No. Time   SourceDestination   Protocol 
Length Info
  11 0.004890   10.228.38.214 10.228.194.30 NVMe/TCP 90

Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
 Encapsulation type: Ethernet (1)
 Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
 [Time shift for this packet: 0.0 seconds]
 Epoch Time: 1614100568.780574000 seconds
 [Time delta from previous captured frame: 0.00047 seconds]
 [Time delta from previous displayed frame: 0.00047 seconds]
 [Time since reference or first frame: 0.00489 seconds]
 Frame Number: 11
 Frame Length: 90 bytes (720 bits)
 Capture Length: 90 bytes (720 bits)
 [Frame is marked: False]
 [Frame is ignored: False]
 [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
 [Coloring Rule Name: TCP]
 [Coloring Rule String: tcp]
Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 
(00:00:0c:9f:f5:a8)
 Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
 Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
  ..0.     = LG bit: Globally unique address 
(factory default)
  ...0     = IG bit: Individual address (unicast)
 Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
 Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
  ..0.     = LG bit: Globally unique address 
(factory default)
  ...0     = IG bit: Individual address (unicast)
 Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
 0100  = Version: 4
  0101 = Header Length: 20 bytes (5)
 Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
  00.. = Differentiated Services Codepoint: Default (0)
  ..00 = Explicit Congestion Notification: Not ECN-Capable 
Transport (0)
 Total Length: 76
 Identification: 0x (0)
 Flags: 0x40, Don't fragment
 0...  = Reserved bit: Not set
 .1..  = Don't fragment: Set
 ..0.  = More fragments: Not set
 Fragment Offset: 0
 Time to Live: 64
 Protocol: TCP (6)
 Header Checksum: 0x [validation disabled]
 [Header checksum status: Unverified]
 Source Address: 10.228.38.214
 Destination Address: 10.228.194.30
Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, 
Len: 24
 Source Port: 4420
 Destination Port: 46909
 [Stream index: 2]
 [TCP Segment Len: 24]
 Sequence Number: 1(relative sequence number)
 Sequence Number (raw): 4175488220
 [Next Sequence Number: 25(relative sequence number)]
 Acknowledgment Number: 1(relative ack number)
 Acknowledgment number (raw): 2966903626
 1000  = Header Length: 32 bytes (8)
 Flags: 0x018 (PSH, ACK)
 000.   = 

Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-26 Thread Daniel Wagner
On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> Well, I think we should probably figure out why that is happening first.

I got my hands on a tcpdump trace. I've trimmed it to this:


No. Time   SourceDestination   Protocol 
Length Info
  1 0.00   10.228.194.30 10.228.38.214 NVMe 138 
   NVMe Write
  2 0.000285   10.228.38.214 10.228.194.30 NVMe/TCP 90  
   Ready To Transfer
  3 0.000591   10.228.194.30 10.228.38.214 NVMe 
4186   NVMe Write: Data
  4 0.000673   10.228.38.214 10.228.194.30 TCP  66  
   4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 
TSecr=1497295579
  5 0.002140   10.228.38.214 10.228.194.30 NVMe 90  
   NVMe Write: Response
  6 0.002511   10.228.194.30 10.228.38.175 NVMe 138 
   NVMe Write
  7 0.002812   10.228.38.175 10.228.194.30 NVMe/TCP 90  
   Ready To Transfer
  8 0.003006   10.228.194.30 10.228.38.175 NVMe 
4186   NVMe Write: Data
  9 0.003098   10.228.38.175 10.228.194.30 TCP  66  
   4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 
TSecr=3601034207
 10 0.004420   10.228.38.175 10.228.194.30 NVMe 90  
   NVMe Write: Response
 11 0.004890   10.228.38.214 10.228.194.30 NVMe/TCP 90  
   
 12 0.004969   10.228.38.175 10.228.194.30 NVMe/TCP 90  


The last few seconds contain only normal writes and suddenly the host
receives two invalid packets. From what I see the host doesn't misbehave
at all. I wonder if it would be possible to detect the invalid packet by
looking at the PDU header only. If that were possible, we could
discard it early and not try to use the invalid command id.
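
Maybe something in this direction. This is only a sketch and
pdu_hdr_sane() is a made-up helper; the type and field names are from
include/linux/nvme-tcp.h, the per-type length rules would need auditing
against the spec, and digest lengths are ignored since digests are off
on this connection:

/* reject frames whose declared lengths are inconsistent before
 * anyone dereferences command_id */
static bool pdu_hdr_sane(struct nvme_tcp_hdr *hdr)
{
        switch (hdr->type) {
        case nvme_tcp_rsp:
                /* response capsule: 8-byte header plus 16-byte CQE */
                return hdr->hlen == sizeof(struct nvme_tcp_rsp_pdu) &&
                       le32_to_cpu(hdr->plen) == hdr->hlen;
        case nvme_tcp_c2h_data:
        case nvme_tcp_r2t:
                /* data-bearing PDUs must declare at least the header */
                return le32_to_cpu(hdr->plen) >= hdr->hlen;
        default:
                /* terminate PDUs etc. left out of this sketch */
                return false;
        }
}

That would let us drop a frame whose declared lengths don't add up
before touching the command id. It obviously can't catch a well-formed
header carrying a bogus tag.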

Here the last two packets with details:


No. Time   SourceDestination   Protocol 
Length Info
 11 0.004890   10.228.38.214 10.228.194.30 NVMe/TCP 90  
   

Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
[Time shift for this packet: 0.0 seconds]
Epoch Time: 1614100568.780574000 seconds
[Time delta from previous captured frame: 0.00047 seconds]
[Time delta from previous displayed frame: 0.00047 seconds]
[Time since reference or first frame: 0.00489 seconds]
Frame Number: 11
Frame Length: 90 bytes (720 bits)
Capture Length: 90 bytes (720 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
[Coloring Rule Name: TCP]
[Coloring Rule String: tcp]
Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 
(00:00:0c:9f:f5:a8)
Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
 ..0.     = LG bit: Globally unique address 
(factory default)
 ...0     = IG bit: Individual address (unicast)
Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
 ..0.     = LG bit: Globally unique address 
(factory default)
 ...0     = IG bit: Individual address (unicast)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
0100  = Version: 4
 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
 00.. = Differentiated Services Codepoint: Default (0)
 ..00 = Explicit Congestion Notification: Not ECN-Capable Transport 
(0)
Total Length: 76
Identification: 0x (0)
Flags: 0x40, Don't fragment
0...  = Reserved bit: Not set
.1..  = Don't fragment: Set
..0.  = More fragments: Not set
Fragment Offset: 0
Time to Live: 64
Protocol: TCP (6)
Header Checksum: 0x [validation disabled]
[Header checksum status: Unverified]
Source Address: 10.228.38.214
Destination Address: 10.228.194.30
Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, 
Len: 24
Source Port: 4420
Destination Port: 46909
[Stream index: 2]
[TCP Segment Len: 24]
Sequence Number: 1(relative sequence number)
Sequence Number (raw): 4175488220
[Next Sequence Number: 25(relative sequence number)]
Acknowledgment Number: 1(relative ack number)
Acknowledgment number (raw): 2966903626
1000  = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
000.   = Reserved: Not set
...0   = Nonce: Not set
 0...  = Congestion Window 

Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-16 Thread Hannes Reinecke

On 2/15/21 10:23 PM, Sagi Grimberg wrote:



blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? Did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Yes, which is why I don't think this check is very useful..


I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake 
completions with random command ids.


And this doesn't help because the command can have been easily reused
and started... What is this protecting against? Note that none of the
other transports checks that, why should tcp?


Because it's particularly easy to spoof packets on tcp.
All other nvme-of transports are layered on top of other transports 
which do some sanity checks already, so it becomes really hard to inject 
invalid NVMe-oF frames for those.
NVMe-TCP has none of these protections, making it really easy to inject 
faulty frames (or, heaven forbid, running a packet fuzzer).


And crashing the machine on invalid frames is always a bad idea; I would 
have expected NVMe-TCP to drop them.


Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-15 Thread Sagi Grimberg




blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? Did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


'during TCP LIF toggles and aggr relocates' testing the host
crashes. TBH, I do not really know what is happening or what the test
does. Still trying to figure out what's going on.


Well, I think we should probably figure out why that is happening first.


I was just very
surprised how much the code trusts the other side to behave correctly.


What do pci/rdma/fc do differently? What does scsi do here?


Yes, which is why I don't think this check is very useful..


I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions with
random command ids.


In this test scenario it's not even a spoofed frame; maybe just a
confused controller.


Maybe... I am still not sure how this patch helps here though...


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-15 Thread Sagi Grimberg




blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? Did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Yes, which is why I don't think this check is very useful..


I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions 
with random command ids.


And this doesn't help because the command can have been easily reused
and started... What is this protecting against? Note that none of the
other transports checks that, why should tcp?
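
For comparison, rdma does the same bare lookup; it only handles the
NULL case, with no started check, and schedules error recovery there.
Roughly this shape, paraphrased from nvme_rdma_process_nvme_rsp() in
drivers/nvme/host/rdma.c (details vary by tree):

        rq = blk_mq_tag_to_rq(nvme_rdma_tagset(queue), cqe->command_id);
        if (!rq) {
                dev_err(queue->ctrl->ctrl.device,
                        "tag 0x%x on QP %#x not found\n",
                        cqe->command_id, queue->qp->qp_num);
                /* unknown tag: reset the association, don't complete */
                nvme_rdma_error_recovery(queue->ctrl);
                return;
        }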


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-15 Thread Daniel Wagner
On Sat, Feb 13, 2021 at 09:46:41AM +0100, Hannes Reinecke wrote:
> On 2/12/21 10:49 PM, Sagi Grimberg wrote:
> > 
> > > > > blk_mq_tag_to_rq() will always return a request if the command_id is
> > > > > in the valid range. Check if the request has been started. If we
> > > > > blindly process the request we might double complete a request which
> > > > > can be fatal.
> > > > 
> > > > How did you get to this one? Did the controller send a completion for
> > > > a completed/bogus request?
> > > 
> > > If that is the case, then that must mean it's possible the driver could
> > > have started the command id just before the bogus completion check. Data
> > > corruption, right?

'during TCP LIF toggles and aggr relocates' testing the host
crashes. TBH, I do not really know what is happening or what the test
does. Still trying to figure out what's going on. I was just very
surprised how much the code trusts the other side to behave correctly.

> > Yes, which is why I don't think this check is very useful..
> 
> I actually view that as a valid protection against spoofed frames.
> Without it it's easy to crash the machine by injecting fake completions with
> random command ids.

In this test scenario it's not even a spoofed frame; maybe just a
confused controller.


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-13 Thread Hannes Reinecke

On 2/12/21 10:49 PM, Sagi Grimberg wrote:



blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? Did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Yes, which is why I don't think this check is very useful..


I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions 
with random command ids.


Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-13 Thread Hannes Reinecke

On 2/12/21 7:17 PM, Daniel Wagner wrote:

blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.

Signed-off-by: Daniel Wagner 
---

This patch is against nvme-5.12.

There is one blk_mq_tag_to_rq() in nvme_tcp_recv_ddgst() which I
didn't update as I am not sure if it's also needed.

I guess it is; this patch is essentially a protection against invalid 
frames, and as such affects all places.
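
The hunk would presumably mirror the other four call sites. An untested
sketch of what the nvme_tcp_recv_ddgst() change might look like:

        rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue),
                              pdu->command_id);
        if (!rq || !blk_mq_request_started(rq)) {
                dev_err(queue->ctrl->ctrl.device,
                        "queue %d tag %#x not found\n",
                        nvme_tcp_queue_id(queue), pdu->command_id);
                return -ENOENT;
        }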


Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
h...@suse.de  +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Sagi Grimberg




blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? Did the controller send a completion for
a completed/bogus request?


If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Yes, which is why I don't think this check is very useful..


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Keith Busch
On Fri, Feb 12, 2021 at 12:58:27PM -0800, Sagi Grimberg wrote:
> > blk_mq_tag_to_rq() will always return a request if the command_id is
> > in the valid range. Check if the request has been started. If we
> > blindly process the request we might double complete a request which
> > can be fatal.
> 
> How did you get to this one? Did the controller send a completion for
> a completed/bogus request?

If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
corruption, right?


Re: [PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Sagi Grimberg

blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.


How did you get to this one? Did the controller send a completion for
a completed/bogus request?


[PATCH] nvme-tcp: Check if request has started before processing it

2021-02-12 Thread Daniel Wagner
blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.

Signed-off-by: Daniel Wagner 
---

This patch is against nvme-5.12.

There is one blk_mq_tag_to_rq() in nvme_tcp_recv_ddgst() which I
didn't update as I am not sure if it's also needed.

py-crash> bt
#0  0xa76a33de in arch_atomic_try_cmpxchg (new=, 
old=, v=)
at ../arch/x86/include/asm/atomic.h:200
#1  atomic_try_cmpxchg (new=, old=, v=) at ../include/asm-generic/atomic-instrumented.h:695
#2  queued_spin_lock (lock=) at 
../include/asm-generic/qspinlock.h:78
#3  do_raw_spin_lock_flags (flags=, lock=) at 
../include/linux/spinlock.h:193
#4  __raw_spin_lock_irqsave (lock=) at 
../include/linux/spinlock_api_smp.h:119
#5  _raw_spin_lock_irqsave (lock=0x8 <__UNIQUE_ID_license257+8>) at 
../kernel/locking/spinlock.c:159
#6  0xa6eea418 in complete (x=0x0 <__UNIQUE_ID_license257>) at 
../kernel/sched/completion.c:32
#7  0xa721f99c in blk_mq_force_complete_rq (rq=0x8 
<__UNIQUE_ID_license257+8>) at ../block/blk-mq.c:634
#8  0xa721fa0a in blk_mq_complete_request (rq=) at 
../block/blk-mq.c:672
#9  0xc0b092ef in nvme_end_request (result=..., status=, 
req=) at ../drivers/nvme/host/nvme.h:477
#10 nvme_tcp_process_nvme_cqe (cqe=, queue=) at 
../drivers/nvme/host/tcp.c:485
rq = 0x948b840d
hdr = 
ret = 0
queue = 0x949501dd8110
result = 0
#11 nvme_tcp_handle_comp (pdu=, queue=) at 
../drivers/nvme/host/tcp.c:542
#12 nvme_tcp_recv_pdu (len=, offset=, 
skb=, queue=) at ../drivers/nvme/host/tcp.c:660
#13 nvme_tcp_recv_skb (desc=, skb=, offset=24, 
len=0) at ../drivers/nvme/host/tcp.c:805
#14 0xa7598af5 in tcp_read_sock
(sk=0x8 <__UNIQUE_ID_license257+8>, desc=0xa <__UNIQUE_ID_license257+10>, 
recv_actor=0x1 <__UNIQUE_ID_license257+1>) at ../net/ipv4/tcp.c:1645
#15 0xc0b075b8 in nvme_tcp_try_recv (queue=0x949501dd8110) at 
../drivers/nvme/host/tcp.c:1102
#16 0xc0b08fc7 in nvme_tcp_io_work (w=0x949501dd8118) at 
../drivers/nvme/host/tcp.c:1126
#17 0xa6eba4e4 in process_one_work (worker=0x948d1b633ec0, 
work=0x949501dd8118) at ../kernel/workqueue.c:2273
#18 0xa6eba6fd in worker_thread (__worker=0x948d1b633ec0) at 
../kernel/workqueue.c:2419
#19 0xa6ec0a3d in kthread (_create=0x948d1b618ec0) at 
../kernel/kthread.c:268
#20 0xa7800215 in ret_from_fork () at ../arch/x86/entry/entry_64.S:351

py-crash>  p /x ((struct request*)0x948b840d)->state
$2 = 0x2
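
For reference, state 0x2 is MQ_RQ_COMPLETE in blk-mq's request states
(include/linux/blk-mq.h), i.e. the request this command_id mapped to
had already been completed once:

enum mq_rq_state {
        MQ_RQ_IDLE              = 0,
        MQ_RQ_IN_FLIGHT         = 1,
        MQ_RQ_COMPLETE          = 2,
};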

 drivers/nvme/host/tcp.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69f59d2c5799..4bec705ce8e6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -485,7 +485,7 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue 
*queue,
struct request *rq;
 
rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
-   if (!rq) {
+   if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag 0x%x not found\n",
nvme_tcp_queue_id(queue), cqe->command_id);
@@ -506,7 +506,7 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue 
*queue,
struct request *rq;
 
rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-   if (!rq) {
+   if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag %#x not found\n",
nvme_tcp_queue_id(queue), pdu->command_id);
@@ -610,7 +610,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
int ret;
 
rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-   if (!rq) {
+   if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag %#x not found\n",
nvme_tcp_queue_id(queue), pdu->command_id);
@@ -696,7 +696,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, 
struct sk_buff *skb,
struct request *rq;
 
rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
-   if (!rq) {
+   if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag %#x not found\n",
nvme_tcp_queue_id(queue), pdu->command_id);
-- 
2.29.2