@kaihengfeng

So v5.7 was fine, and after many reboots the bisect points to the
commit below as the one that introduced the issue.

Do I also need to find when the issue was resolved (somewhere between
v5.8-rc1 and v5.9.10), or is this information enough?


54b2fcee1db041a83b52b51752dade6090cf952f is the first bad commit
commit 54b2fcee1db041a83b52b51752dade6090cf952f
Author: Keith Busch <kbu...@kernel.org>
Date:   Mon Apr 27 11:54:46 2020 -0700

    nvme-pci: remove last_sq_tail
    
    The nvme driver does not have enough tags to wrap the queue, and blk-mq
    will no longer call commit_rqs() when there are no new submissions to
    notify.
    
    Signed-off-by: Keith Busch <kbu...@kernel.org>
    Reviewed-by: Sagi Grimberg <s...@grimberg.me>
    Signed-off-by: Christoph Hellwig <h...@lst.de>
    Signed-off-by: Jens Axboe <ax...@kernel.dk>

 drivers/nvme/host/pci.c | 23 ++++-------------------
 1 file changed, 4 insertions(+), 19 deletions(-)


And FWIW, the output of my "git bisect log" is the following:
git bisect start
# good: [3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162] Linux 5.7
git bisect good 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
# bad: [b3a9e3b9622ae10064826dccb4f7a52bd88c7407] Linux 5.8-rc1
git bisect bad b3a9e3b9622ae10064826dccb4f7a52bd88c7407
# bad: [ee01c4d72adffb7d424535adf630f2955748fa8b] Merge branch 'akpm' (patches from Andrew)
git bisect bad ee01c4d72adffb7d424535adf630f2955748fa8b
# bad: [16d91548d1057691979de4686693f0ff92f46000] Merge tag 'xfs-5.8-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
git bisect bad 16d91548d1057691979de4686693f0ff92f46000
# good: [cfa3b8068b09f25037146bfd5eed041b78878bee] Merge tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
git bisect good cfa3b8068b09f25037146bfd5eed041b78878bee
# good: [3fd911b69b3117e03181262fc19ae6c3ef6962ce] Merge tag 'drm-misc-next-2020-05-07' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
git bisect good 3fd911b69b3117e03181262fc19ae6c3ef6962ce
# good: [1966391fa576e1fb2701be8bcca197d8f72737b7] mm/migrate.c: attach_page_private already does the get_page
git bisect good 1966391fa576e1fb2701be8bcca197d8f72737b7
# bad: [0c8d3fceade2ab1bbac68bca013e62bfdb851d19] bcache: configure the asynchronous registertion to be experimental
git bisect bad 0c8d3fceade2ab1bbac68bca013e62bfdb851d19
# bad: [84b8d0d7aa159652dc191d58c4d353b6c9173c54] nvmet: use type-name map for ana states
git bisect bad 84b8d0d7aa159652dc191d58c4d353b6c9173c54
# good: [72e6329f86c714785ac195d293cb19dd24507880] nvme-fc and nvmet-fc: revise LLDD api for LS reception and LS request
git bisect good 72e6329f86c714785ac195d293cb19dd24507880
# good: [e4fcc72c1a420bdbe425530dd19724214ceb44ec] nvmet-fc: slight cleanup for kbuild test warnings
git bisect good e4fcc72c1a420bdbe425530dd19724214ceb44ec
# good: [31fdad7be18992606078caed6ff71741fa76310a] nvme: consolodate io settings
git bisect good 31fdad7be18992606078caed6ff71741fa76310a
# bad: [2a5bcfdd41d68559567cec3c124a75e093506cc1] nvme-pci: align io queue count with allocted nvme_queue in nvme_probe
git bisect bad 2a5bcfdd41d68559567cec3c124a75e093506cc1
# good: [6623c5b3dfa5513190d729a8516db7a5163ec7de] nvme: clean up error handling in nvme_init_ns_head
git bisect good 6623c5b3dfa5513190d729a8516db7a5163ec7de
# good: [74943d45eef4db64b1e5c9f7ad1d018576e113c5] nvme-pci: remove volatile cqes
git bisect good 74943d45eef4db64b1e5c9f7ad1d018576e113c5
# bad: [54b2fcee1db041a83b52b51752dade6090cf952f] nvme-pci: remove last_sq_tail
git bisect bad 54b2fcee1db041a83b52b51752dade6090cf952f
# first bad commit: [54b2fcee1db041a83b52b51752dade6090cf952f] nvme-pci: remove last_sq_tail
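
In case it helps anyone double-checking a bisect log like this one, here
is a minimal Python sketch that counts the good/bad verdicts and pulls
out the first bad commit. It assumes the log has been saved with any
mail-wrapped comment lines rejoined onto one line. The function name
summarize_bisect_log and the shortened SAMPLE text are illustrative only
and not part of git.

```python
import re

# Matches the verdict comments that "git bisect log" emits, e.g.
# "# good: [<40-hex sha>] subject" (pattern assumed from the log above).
VERDICT = re.compile(r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$")

def summarize_bisect_log(text):
    """Count good/bad verdicts and return the first bad commit, if any."""
    counts = {"good": 0, "bad": 0}
    first_bad = None
    for line in text.splitlines():
        m = VERDICT.match(line.strip())
        if not m:
            continue  # skip "git bisect ..." command lines and anything else
        kind, sha, subject = m.groups()
        if kind == "first bad commit":
            first_bad = (sha, subject)
        else:
            counts[kind] += 1
    return counts, first_bad

# Shortened excerpt of the log above, used only as sample input.
SAMPLE = """\
git bisect start
# good: [3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162] Linux 5.7
git bisect good 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
# bad: [b3a9e3b9622ae10064826dccb4f7a52bd88c7407] Linux 5.8-rc1
git bisect bad b3a9e3b9622ae10064826dccb4f7a52bd88c7407
# first bad commit: [54b2fcee1db041a83b52b51752dade6090cf952f] nvme-pci: remove last_sq_tail
"""

counts, first_bad = summarize_bisect_log(SAMPLE)
print(counts, first_bad)
```

Lines that do not match the verdict pattern are skipped silently, so a
log that still has wrapped comment lines would be undercounted rather
than rejected.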

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1910866

Title:
  nvme drive fails after some time

Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Sorry for the vague title. I thought this was a hardware issue until
  someone else online mentioned that their NVMe drive goes "read only"
  after some time. I tend not to reboot my system much, so I have a large
  journal. Either way, this happens once in a while. The / drive is fine,
  but /home is on the NVMe drive, which just disappears. I reboot and
  everything is fine, but leave it long enough and it'll fail again.

  Here's the most recent snippet about the nvme drive before I restarted
  the system.

  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 449 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 450 QID 5 timeout, aborting
  Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 451 QID 5 timeout, aborting
  Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, reset controller
  Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 22 QID 0 timeout, reset controller
  Jan 08 19:21:04 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371
  Jan 08 19:21:25 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:25 robot kernel: nvme nvme1: Removing after probe failure status: -19
  Jan 08 19:21:41 robot kernel: INFO: task jbd2/nvme1n1p1-:731 blocked for more than 120 seconds.
  Jan 08 19:21:41 robot kernel: jbd2/nvme1n1p1- D    0   731      2 0x00004000
  Jan 08 19:21:45 robot kernel: nvme nvme1: Device not ready; aborting reset, CSTS=0x1
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993784 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123967, lost async page write
  Jan 08 19:21:45 robot kernel: EXT4-fs error (device nvme1n1p1): __ext4_find_entry:1535: inode #57278595: comm gsd-print-notif: reading directory lblock 0
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993384 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123917, lost async page write
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1920993320 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1833166472 op 0x0:(READ) flags 0x3000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 240123909, lost async page write
  Jan 08 19:21:45 robot kernel: blk_update_request: I/O error, dev nvme1n1, sector 1909398624 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
  Jan 08 19:21:45 robot kernel: Buffer I/O error on dev nvme1n1p1, logical block 0, lost sync page write
  Jan 08 19:21:45 robot kernel: EXT4-fs (nvme1n1p1): I/O error while writing superblock
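
  To see at a glance which queues the timeouts hit, a journal excerpt
  like the one above could be grouped with a small Python sketch. The
  regular expression is only inferred from the message format quoted
  here, and the name timeouts_by_queue is made up for illustration; it
  assumes each log entry is on a single line.

```python
import re
from collections import defaultdict

# Pattern assumed from the log lines quoted above:
# "nvme nvme1: I/O 448 QID 5 timeout, aborting"
TIMEOUT = re.compile(r"nvme (nvme\d+): I/O (\d+) QID (\d+) timeout, (.+)$")

def timeouts_by_queue(lines):
    """Group timed-out I/O tags by (controller, queue id)."""
    grouped = defaultdict(list)
    for line in lines:
        m = TIMEOUT.search(line.strip())
        if m:
            ctrl, tag, qid, action = m.groups()
            grouped[(ctrl, int(qid))].append((int(tag), action))
    return dict(grouped)

# A few entries copied from the excerpt above, used only as sample input.
SAMPLE = [
    "Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, aborting",
    "Jan 08 19:19:11 robot kernel: nvme nvme1: I/O 449 QID 5 timeout, aborting",
    "Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 448 QID 5 timeout, reset controller",
    "Jan 08 19:19:42 robot kernel: nvme nvme1: I/O 22 QID 0 timeout, reset controller",
    "Jan 08 19:21:04 robot kernel: nvme nvme1: Abort status: 0x371",
]

result = timeouts_by_queue(SAMPLE)
for key, events in sorted(result.items()):
    print(key, events)
```

  In this excerpt the timeouts land on I/O queue 5 and on the admin
  queue (QID 0) of nvme1.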

  ProblemType: Bug
  DistroRelease: Ubuntu 20.10
  Package: linux-image-5.8.0-34-generic 5.8.0-34.37
  ProcVersionSignature: Ubuntu 5.8.0-34.37-generic 5.8.18
  Uname: Linux 5.8.0-34-generic x86_64
  NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
  ApportVersion: 2.20.11-0ubuntu50.3
  Architecture: amd64
  CasperMD5CheckResult: skip
  CurrentDesktop: ubuntu:GNOME
  Date: Sat Jan  9 11:56:28 2021
  InstallationDate: Installed on 2020-08-15 (146 days ago)
  InstallationMedia: Ubuntu 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
  MachineType: Intel Corporation NUC8i7HVK
  ProcFB: 0 amdgpudrmfb
  ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.8.0-34-generic root=UUID=c212e9d4-a049-4da0-8e34-971cb7414e60 ro quiet splash vt.handoff=7
  RebootRequiredPkgs:
   linux-image-5.8.0-36-generic
   linux-base
  RelatedPackageVersions:
   linux-restricted-modules-5.8.0-34-generic N/A
   linux-backports-modules-5.8.0-34-generic  N/A
   linux-firmware                            1.190.2
  SourcePackage: linux
  UpgradeStatus: Upgraded to groovy on 2020-09-20 (110 days ago)
  dmi.bios.date: 12/17/2018
  dmi.bios.release: 5.6
  dmi.bios.vendor: Intel Corp.
  dmi.bios.version: HNKBLi70.86A.0053.2018.1217.1739
  dmi.board.name: NUC8i7HVB
  dmi.board.vendor: Intel Corporation
  dmi.board.version: J68196-502
  dmi.chassis.type: 3
  dmi.chassis.vendor: Intel Corporation
  dmi.chassis.version: 2.0
  dmi.modalias: dmi:bvnIntelCorp.:bvrHNKBLi70.86A.0053.2018.1217.1739:bd12/17/2018:br5.6:svnIntelCorporation:pnNUC8i7HVK:pvrJ71485-502:rvnIntelCorporation:rnNUC8i7HVB:rvrJ68196-502:cvnIntelCorporation:ct3:cvr2.0:
  dmi.product.family: Intel NUC
  dmi.product.name: NUC8i7HVK
  dmi.product.version: J71485-502
  dmi.sys.vendor: Intel Corporation
