Re: Another NVMe failure, this time with AER info

2018-05-11 Thread Ming Lei
On Sat, May 12, 2018 at 12:57 AM, Bjorn Helgaas wrote:
> Andrew wrote:
>> A friend of mine has a brand new LG laptop that has intermittent NVMe
>> failures. They mostly happen during a suspend/resume cycle
>> (apparently during suspend, not resume). Unlike the earlier
>> Dell/Samsung issue, the

Re: Another NVMe failure, this time with AER info

2018-05-11 Thread Bjorn Helgaas
On Fri, May 11, 2018 at 11:42:42AM -0600, Keith Busch wrote:
> On Fri, May 11, 2018 at 11:26:11AM -0600, Keith Busch wrote:
> > I trust you know the offsets here, but it's hard to tell what this
> > is doing with hard-coded addresses. Just to be safe and for clarity,
> > I recommend the 'CAP_*+'

Re: Another NVMe failure, this time with AER info

2018-05-11 Thread Keith Busch
On Fri, May 11, 2018 at 11:26:11AM -0600, Keith Busch wrote:
> I trust you know the offsets here, but it's hard to tell what this
> is doing with hard-coded addresses. Just to be safe and for clarity,
> I recommend the 'CAP_*+' with a mask.
>
> For example, disabling ASPM L1.2 can look like:
>
>

Re: Another NVMe failure, this time with AER info

2018-05-11 Thread Keith Busch
On Fri, May 11, 2018 at 11:57:52AM -0500, Bjorn Helgaas wrote:
> We reported several corrected errors before the nvme timeout:
>
> [12750.281158] nvme nvme0: controller is down; will reset: CSTS=0x,
> PCI_STATUS=0x10
> [12750.297594] nvme nvme0: I/O 455 QID 2 timeout, disable

Re: Another NVMe failure, this time with AER info

2018-05-11 Thread Bjorn Helgaas
Andrew wrote:
> A friend of mine has a brand new LG laptop that has intermittent NVMe
> failures. They mostly happen during a suspend/resume cycle
> (apparently during suspend, not resume). Unlike the earlier
> Dell/Samsung issue, the NVMe device isn't completely gone -- MMIO
> reads fail, but
