Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
On 6/8/23 05:48, Warner Losh wrote:
> On Thu, Jun 8, 2023, 4:35 AM Rebecca Cran wrote:
>
> > It's ZFS, using the default options when creating it via the FreeBSD
> > installer so I presume TRIM is enabled. Without a reliable way to
> > reproduce the error I'm not sure disabling TRIM will help at the moment.
> >
> > I don't think there's any newer firmware for it.
>
> PCIe gen 4 has a higher error rate, so that needs to be managed with
> retries. There's a whole protocol to do that which Linux implements. I
> suspect the time has come for us to do so too. There's some code floating
> around I'll have to track down.

Thanks. I dropped the configuration down to PCIe gen 3 and the errors
have so far gone away.

nda0: nvme version 1.3 x8 (max x8) lanes PCIe Gen3 (max Gen4) link
nda1: nvme version 1.3 x4 (max x4) lanes PCIe Gen3 (max Gen4) link

--
Rebecca Cran
Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
On Thu, Jun 8, 2023, 4:35 AM Rebecca Cran wrote:

> It's ZFS, using the default options when creating it via the FreeBSD
> installer so I presume TRIM is enabled. Without a reliable way to
> reproduce the error I'm not sure disabling TRIM will help at the moment.
>
> I don't think there's any newer firmware for it.
>

PCIe gen 4 has a higher error rate, so that needs to be managed with
retries. There's a whole protocol to do that which Linux implements. I
suspect the time has come for us to do so too. There's some code floating
around I'll have to track down.

Warner

> --
> Rebecca Cran
>
>
> On 6/8/23 04:25, Tomek CEDRO wrote:
> > what filesystem? is TRIM enabled on that drive? have you tried
> > disabling trim? i had similar ssd related problem on samsung's ssd
> > long time ago that was related to trim. maybe drive firmware can be
> > updated too? :-)
> >
> > --
> > CeDeROM, SQ7MHZ, http://www.tomek.cedro.info
> >
Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
It's ZFS, using the default options when creating it via the FreeBSD
installer so I presume TRIM is enabled. Without a reliable way to
reproduce the error I'm not sure disabling TRIM will help at the moment.

I don't think there's any newer firmware for it.

--
Rebecca Cran

On 6/8/23 04:25, Tomek CEDRO wrote:
> what filesystem? is TRIM enabled on that drive? have you tried
> disabling trim? i had similar ssd related problem on samsung's ssd
> long time ago that was related to trim. maybe drive firmware can be
> updated too? :-)
>
> --
> CeDeROM, SQ7MHZ, http://www.tomek.cedro.info
Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
what filesystem? is TRIM enabled on that drive? have you tried
disabling trim? i had similar ssd related problem on samsung's ssd
long time ago that was related to trim. maybe drive firmware can be
updated too? :-)

--
CeDeROM, SQ7MHZ, http://www.tomek.cedro.info
Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
On 6/8/23 00:24, Warner Losh wrote:
> PCIe 3 or PCIe 4?

PCIe 4.

nda0 at nvme0 bus 0 scbus0 target 0 lun 1
nda0:
nda0: Serial Number S55KNC0TC00168
nda0: nvme version 1.3 x8 (max x8) lanes PCIe Gen4 (max Gen4) link
nda0: 6104710MB (12502446768 512 byte sectors)

--
Rebecca Cran
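
As a quick cross-check of the probe output above, the reported size follows from the sector count if "MB" is read as units of 1024 * 1024 bytes and the result is truncated. This is a small sketch in Python; the variable names are illustrative, and the unit interpretation is an assumption inferred from the numbers matching.

```python
# Values copied from the nda0 probe line.
sectors = 12502446768
sector_size = 512  # bytes per sector, as printed

size_bytes = sectors * sector_size

# Assumption: the "MB" in the probe line means 1024 * 1024 bytes, truncated.
size_mb = size_bytes // (1024 * 1024)

print(size_mb)  # 6104710
```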
Re: Seemingly random nvme (nda) write error on new drive (retries exhausted)
On Wed, Jun 7, 2023 at 11:12 PM Rebecca Cran wrote:

> I got a seemingly random nvme data transfer error on my new arm64 Ampere
> Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.
>
> Since it's a new drive and smartctl doesn't show any errors I thought it
> might be worth mentioning here.
>
> I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.
>
>
> dmesg contains:
>
> nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
> nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
> (nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
> cdw=98085b90 0 7 0 0 0
> (nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
> (nda0:nvme0:0:0:1): Error 5, Retries exhausted
>
>
> nvmecontrol identify nvme0 shows:
>
> Vendor ID:                   144d
> Subsystem Vendor ID:         144d
> Model Number:                SAMSUNG MZPLJ6T4HALA-7
> Firmware Version:            EPK9CB5Q
> Recommended Arb Burst:       8
> IEEE OUI Identifier:         00 25 38
> Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
> Max Data Transfer Size:      131072 bytes
> Sanitize Crypto Erase:       Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0041
> Version:                     1.3.0
>

PCIe 3 or PCIe 4?

So the only documented reason for this error is if we set up the memory
wrong such that the drive couldn't start a transfer from the specified
address. This seems weird to me... But in the prior paragraph it talks
about other types of aborts that need software intervention. If this is a
transient error, then maybe we should retry it as part of the data
recovery, unless the Do Not Retry (dnr) bit is set, which it isn't. I
wonder whether this is retried 5 times or not before generating the
error...

Warner
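
To make the retry reasoning above concrete, here is a minimal Python sketch of how the 16-bit NVMe completion status field splits into the values printed in the log (`(00/04)` being status code type / status code, plus `crd`, `m` and `dnr`), and of a retry decision that honors the Do Not Retry bit. The names `NvmeStatus`, `decode_status`, `should_retry` and the `max_retries` parameter are hypothetical and are not taken from the FreeBSD driver; this illustrates the policy being discussed and makes no claim about what the driver currently does.

```python
from dataclasses import dataclass


@dataclass
class NvmeStatus:
    sc: int     # status code (bits 8:1)
    sct: int    # status code type (bits 11:9)
    crd: int    # command retry delay index (bits 13:12)
    more: bool  # "m" in the log (bit 14)
    dnr: bool   # Do Not Retry (bit 15)


def decode_status(status: int) -> NvmeStatus:
    """Split the 16-bit completion status field; bit 0 is the phase tag."""
    return NvmeStatus(
        sc=(status >> 1) & 0xFF,
        sct=(status >> 9) & 0x7,
        crd=(status >> 12) & 0x3,
        more=bool((status >> 14) & 1),
        dnr=bool((status >> 15) & 1),
    )


def should_retry(st: NvmeStatus, attempts: int, max_retries: int) -> bool:
    """Hypothetical policy: retry any failure unless DNR is set or the budget is spent."""
    if st.sct == 0 and st.sc == 0:
        return False  # successful completion, nothing to retry
    return (not st.dnr) and attempts < max_retries


# The logged error: SCT 0x0 (generic), SC 0x04 (data transfer error), dnr:0.
st = decode_status(0x04 << 1)
print(f"({st.sct:02x}/{st.sc:02x}) crd:{st.crd} m:{int(st.more)} dnr:{int(st.dnr)}")
print(should_retry(st, attempts=0, max_retries=4))
```

With `dnr` clear, the sketch allows a retry until the (assumed) retry budget is exhausted, which is consistent with the "Retries exhausted" line appearing only after repeated attempts.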
Seemingly random nvme (nda) write error on new drive (retries exhausted)
I got a seemingly random nvme data transfer error on my new arm64 Ampere
Altra machine, which has a Samsung PM1735 PCIe AIC NVMe drive.

Since it's a new drive and smartctl doesn't show any errors I thought it
might be worth mentioning here.

I'm running 14.0-CURRENT FreeBSD 14.0-CURRENT #0 main-n263139-baef3a5b585f.


dmesg contains:

nvme0: WRITE sqid:16 cid:126 nsid:1 lba:2550684560 len:8
nvme0: DATA TRANSFER ERROR (00/04) crd:0 m:0 dnr:0 sqid:16 cid:126 cdw0:0
(nda0:nvme0:0:0:1): WRITE. NCB: opc=1 fuse=0 nsid=1 prp1=0 prp2=0
cdw=98085b90 0 7 0 0 0
(nda0:nvme0:0:0:1): CAM status: CCB request completed with an error
(nda0:nvme0:0:0:1): Error 5, Retries exhausted


nvmecontrol identify nvme0 shows:

Vendor ID:                   144d
Subsystem Vendor ID:         144d
Model Number:                SAMSUNG MZPLJ6T4HALA-7
Firmware Version:            EPK9CB5Q
Recommended Arb Burst:       8
IEEE OUI Identifier:         00 25 38
Multi-Path I/O Capabilities: Multiple controllers, Multiple ports
Max Data Transfer Size:      131072 bytes
Sanitize Crypto Erase:       Supported
Sanitize Block Erase:        Supported
Sanitize Overwrite:          Not Supported
Sanitize NDI:                Not Supported
Sanitize NODMMAS:            Undefined
Controller ID:               0x0041
Version:                     1.3.0

--
Rebecca Cran
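
For anyone reading the log above, the two WRITE lines describe the same command. The six words after `cdw=` appear to be command dwords 10 through 15 in hex; for an NVMe WRITE, CDW10 and CDW11 carry the starting LBA and the low 16 bits of CDW12 carry a zero-based block count. The Python sketch below assumes that mapping (the helper name `decode_rw_cdws` is made up for illustration) and recovers the `lba` and `len` values from the first line.

```python
def decode_rw_cdws(cdw_text: str) -> tuple[int, int]:
    """Decode cdw10..cdw15 of an NVMe READ/WRITE, given as space-separated hex words."""
    words = [int(w, 16) for w in cdw_text.split()]
    cdw10, cdw11, cdw12 = words[0], words[1], words[2]
    lba = (cdw11 << 32) | cdw10      # 64-bit starting LBA: CDW11 is the high half
    nblocks = (cdw12 & 0xFFFF) + 1   # number of logical blocks is stored zero-based
    return lba, nblocks


lba, nblocks = decode_rw_cdws("98085b90 0 7 0 0 0")
print(lba, nblocks)  # 2550684560 8
```

The result matches `lba:2550684560 len:8`, so the failed command was an 8-sector (4 KiB with 512-byte sectors) write.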