Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-08 Thread Rebecca Cran

On 6/8/23 05:48, Warner Losh wrote:



> On Thu, Jun 8, 2023, 4:35 AM Rebecca Cran wrote:
>
> > It's ZFS, using the default options when creating it via the FreeBSD
> > installer so I presume TRIM is enabled. Without a reliable way to
> > reproduce the error I'm not sure disabling TRIM will help at the
> > moment.
> >
> > I don't think there's any newer firmware for it.
>
> PCIe Gen 4 has a higher error rate, so that needs to be managed with
> retries. There's a whole protocol to do that which Linux implements.
> I suspect the time has come for us to do so too. There's some code
> floating around I'll have to track down.


Thanks. I dropped the configuration down to PCIe gen 3 and the errors 
have so far gone away.



nda0: nvme version 1.3 x8 (max x8) lanes PCIe Gen3 (max Gen4) link
nda1: nvme version 1.3 x4 (max x4) lanes PCIe Gen3 (max Gen4) link

--
Rebecca Cran
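
For a sense of what the Gen3 fallback costs, here is a minimal sketch in C of the raw link bandwidth arithmetic. It assumes the standard per-lane rates (8 GT/s for Gen3, 16 GT/s for Gen4) and 128b/130b encoding, and ignores packet and protocol overhead; the function names are hypothetical and not taken from any driver.

```c
#include <assert.h>

/* Per-lane signalling rate in GT/s for the two generations in this thread. */
static double pcie_gtps(int gen)
{
    assert(gen == 3 || gen == 4);
    return gen == 3 ? 8.0 : 16.0;
}

/*
 * Raw link bandwidth in GB/s (10^9 bytes per second) after 128b/130b
 * line encoding. Packet headers and flow control are not accounted for,
 * so real throughput is lower.
 */
static double pcie_link_gbytes(int gen, int lanes)
{
    assert(lanes > 0);
    return pcie_gtps(gen) * lanes * (128.0 / 130.0) / 8.0;
}
```

Under these assumptions the x8 link drops from roughly 15.75 GB/s at Gen4 to roughly 7.88 GB/s at Gen3, and the x4 link at Gen3 comes out near 3.94 GB/s.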




Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-08 Thread Warner Losh
On Thu, Jun 8, 2023, 4:35 AM Rebecca Cran wrote:

> It's ZFS, using the default options when creating it via the FreeBSD
> installer so I presume TRIM is enabled. Without a reliable way to
> reproduce the error I'm not sure disabling TRIM will help at the moment.
>
> I don't think there's any newer firmware for it.
>

PCIe Gen 4 has a higher error rate, so that needs to be managed with
retries. There's a whole protocol to do that which Linux implements. I
suspect the time has come for us to do so too. There's some code floating
around I'll have to track down.

Warner
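
As a rough illustration of what managing transient link errors with retries could look like, here is a minimal sketch in C. It is not the Linux protocol mentioned above and not FreeBSD's nvme(4) code: submit_fn, submit_with_retries, and flaky_submit are hypothetical names, and the fake "link" only exists to exercise the loop. The status word layout (SC in bits 8:1, SCT in bits 11:9, DNR in bit 15, phase tag in bit 0) follows the NVMe completion status field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: submits one command, returns the 16-bit status word. */
typedef uint16_t (*submit_fn)(void *ctx);

struct retry_result {
    uint16_t status;   /* status word of the last attempt */
    int attempts;      /* total submissions, including the first */
};

/* Any non-zero status code or status code type means the command failed. */
static bool status_is_error(uint16_t st) { return ((st >> 1) & 0x7ff) != 0; }

/* Do Not Retry: the controller says a retry is expected to fail again. */
static bool status_dnr(uint16_t st) { return ((st >> 15) & 0x1) != 0; }

/* Resubmit up to max_retries times, stopping early on success or DNR. */
static struct retry_result
submit_with_retries(submit_fn submit, void *ctx, int max_retries)
{
    struct retry_result r = { 0, 0 };

    assert(submit != NULL && max_retries >= 0);
    for (;;) {
        r.status = submit(ctx);
        r.attempts++;
        if (!status_is_error(r.status) || status_dnr(r.status))
            break;
        if (r.attempts > max_retries)
            break;
    }
    return r;
}

/* Stand-in for a marginal link: fails a fixed number of times, then works. */
struct flaky_link {
    int failures_left;
    bool set_dnr;      /* if true, failures carry the DNR bit */
};

static uint16_t flaky_submit(void *ctx)
{
    struct flaky_link *l = ctx;

    if (l->failures_left > 0) {
        l->failures_left--;
        /* SCT 0, SC 0x04: Data Transfer Error, as in the report. */
        return (uint16_t)((0x04 << 1) | (l->set_dnr ? 0x8000 : 0));
    }
    return 0;
}
```

A real implementation would also need to decide how long to wait between attempts and whether the error class is worth retrying at all; this sketch only shows the bounded loop and the DNR check.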



Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-08 Thread Rebecca Cran
It's ZFS, using the default options when creating it via the FreeBSD 
installer so I presume TRIM is enabled. Without a reliable way to 
reproduce the error I'm not sure disabling TRIM will help at the moment.


I don't think there's any newer firmware for it.


--

Rebecca Cran


On 6/8/23 04:25, Tomek CEDRO wrote:
> What filesystem? Is TRIM enabled on that drive? Have you tried
> disabling TRIM? I had a similar SSD-related problem on a Samsung SSD
> a long time ago that was related to TRIM. Maybe the drive firmware
> can be updated too? :-)
>
> --
> CeDeROM, SQ7MHZ, http://www.tomek.cedro.info




Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-08 Thread Tomek CEDRO
What filesystem? Is TRIM enabled on that drive? Have you tried disabling
TRIM? I had a similar SSD-related problem on a Samsung SSD a long time ago
that was related to TRIM. Maybe the drive firmware can be updated too? :-)

--
CeDeROM, SQ7MHZ, http://www.tomek.cedro.info


Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-08 Thread Rebecca Cran

On 6/8/23 00:24, Warner Losh wrote:

> PCIe 3 or PCIe 4?


PCIe 4.


nda0 at nvme0 bus 0 scbus0 target 0 lun 1
nda0: 
nda0: Serial Number S55KNC0TC00168
nda0: nvme version 1.3 x8 (max x8) lanes PCIe Gen4 (max Gen4) link
nda0: 6104710MB (12502446768 512 byte sectors)

--

Rebecca Cran




Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-08 Thread Warner Losh
On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-7
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>

PCIe 3 or PCIe 4?

So the only documented reason for this error is that we set up the memory
wrong, such that the drive couldn't start a transfer from the specified
address. This seems weird to me... But the prior paragraph talks about
other types of aborts that need software intervention. If this is a
transient error, then maybe we should retry it as part of the data
recovery, unless the Do Not Retry bit is set, which it isn't. I wonder
whether this is retried 5 times or not before generating the error...

Warner
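
The fields printed in the quoted dmesg line, "(00/04) crd:0 m:0 dnr:0", can be unpacked from the 16-bit completion status word roughly as follows. This is a minimal sketch in C, assuming the NVMe layout with the phase tag in bit 0, SC in bits 8:1, SCT in bits 11:9, CRD in bits 13:12, More in bit 14, and Do Not Retry in bit 15; struct nvme_status, decode_status, and format_status are hypothetical names, not the driver's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decoded view of the completion status word. */
struct nvme_status {
    unsigned p;    /* phase tag */
    unsigned sc;   /* status code */
    unsigned sct;  /* status code type */
    unsigned crd;  /* command retry delay index */
    unsigned m;    /* more information available */
    unsigned dnr;  /* do not retry */
};

static struct nvme_status decode_status(uint16_t w)
{
    struct nvme_status s;

    s.p   = w & 0x1;
    s.sc  = (w >> 1) & 0xff;
    s.sct = (w >> 9) & 0x7;
    s.crd = (w >> 12) & 0x3;
    s.m   = (w >> 14) & 0x1;
    s.dnr = (w >> 15) & 0x1;
    return s;
}

/* Render the fields in the same "(sct/sc) crd m dnr" order as the log. */
static int format_status(char *buf, size_t len, struct nvme_status s)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "(%02x/%02x) crd:%u m:%u dnr:%u",
        s.sct, s.sc, s.crd, s.m, s.dnr);
}
```

A status word of 0x0008 decodes to SCT 0 (generic command status) and SC 0x04 (Data Transfer Error) with dnr clear, which matches the reported error: the controller did not forbid a retry.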


Seemingly random nvme (nda) write error on new drive (retries exhausted)

2023-06-07 Thread Rebecca Cran
I got a seemingly random nvme data transfer error on my new arm64 Ampere 
Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.


Since it's a new drive and smartctl doesn't show any errors I thought it 
might be worth mentioning here.


I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.


dmesg contains:

nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0 cdw=98085b90 0 7 0 0 0
(nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
(nda0:nvme0:0:0:1): Error 5, Retries exhausted
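
The two log lines above describe the same command: the "cdw=" values are command dwords 10 through 15 in hex, and for an NVMe write, dwords 10 and 11 hold the starting LBA while the low 16 bits of dword 12 hold a zero-based block count. A minimal sketch in C of that decoding, with hypothetical names (struct rw_extent, decode_rw, extent_bytes) and the 512-byte sector size taken from the nda0 probe output elsewhere in this thread:

```c
#include <assert.h>
#include <stdint.h>

/* Starting LBA and length of a read/write command. */
struct rw_extent {
    uint64_t slba;     /* starting logical block address */
    uint32_t nblocks;  /* number of logical blocks, one-based */
};

static struct rw_extent
decode_rw(uint32_t cdw10, uint32_t cdw11, uint32_t cdw12)
{
    struct rw_extent e;

    e.slba = ((uint64_t)cdw11 << 32) | cdw10;  /* low dword first */
    e.nblocks = (cdw12 & 0xffff) + 1;          /* field is zero-based */
    return e;
}

/* Size of the transfer in bytes for a given logical block size. */
static uint64_t extent_bytes(struct rw_extent e, uint32_t sector_size)
{
    assert(sector_size > 0);
    return (uint64_t)e.nblocks * sector_size;
}
```

With cdw10 = 0x98085b90, cdw11 = 0, and cdw12 = 7, this gives LBA 2550684560 and 8 blocks, agreeing with "lba:2550684560 len:8" in the nvme0 line, or a single 4096-byte write at 512 bytes per sector.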


nvmecontrol identify nvme0 shows:

Vendor ID:                   144d
Subsystem Vendor ID:         144d
Model Number:                SAMSUNG MZPLJ6T4HALA-7
Firmware Version:            EPK9CB5Q
Recommended Arb Burst:       8
IEEE OUI Identifier:         00 25 38
Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
Max Data Transfer Size:      131072 bytes
Sanitize Crypto Erase:       Supported
Sanitize Block Erase:        Supported
Sanitize Overwrite:          Not Supported
Sanitize NDI:                Not Supported
Sanitize NODMMAS:            Undefined
Controller ID:               0x0041
Version:                     1.3.0


--

Rebecca Cran