Hi,

GROMACS has lately started dying on me with rather obscure error messages, such as the one in the subject of this mail. The errors seem to be related to the NVIDIA driver (see below for more output, and further below for the mdp file). I run a large number of short (2 ns) simulations, and this happens perhaps one time out of ten; it is non-reproducible and unpredictable.

Has anybody experienced anything like it? And is this (a) a GROMACS issue, (b) an NVIDIA driver issue, or (c) a hardware issue?

Thanks for any help!
Michael

My system is a vanilla Debian stretch with cuda-toolkit version 9.1.85-4 and nvidia-driver 390.87-2, running unmodified GROMACS 2018.3.

What I see:

gmx mdrun dies and writes in the log:

  Source file: src/gromacs/gpu_utils/cudautils.cuh (line 298)
  Fatal error:
  Unexpected cudaStreamQuery failure: unspecified launch failure

At the same time I see in syslog:

  Nov 20 09:34:56 rcpepc01797 kernel: [350452.906685] NVRM: Xid (PCI:0000:20:00): 69, Class Error: ChId 001a, Class 0000c1c0, Offset 000001b0, Data 00000041, ErrorCode 00000053

Or:

gmx mdrun dies and writes in the log:

  Source file: src/gromacs/gpu_utils/cudautils.cuh (line 298)
  Fatal error:
  Unexpected cudaStreamQuery failure: unspecified launch failure

At the same time I see in syslog:

  Nov 20 06:10:54 rcpepc01797 kernel: [338210.240084] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: EXTRA_INLINE_DATA
  Nov 20 06:10:54 rcpepc01797 kernel: [338210.240088] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ESR 0x404600=0x80000001
  Nov 20 06:10:54 rcpepc01797 kernel: [338210.240115] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ChID 001a, Class 0000c1c0, Offset 000001b4, Data 00000000

Or:

gmx mdrun dies and writes in the log:

  Source file: src/gromacs/ewald/pme.cu (line 76)
  Fatal error:
  Failed to synchronize the PME GPU stream!: unspecified launch failure

At the same time I see in syslog:

  Nov 17 03:16:15 rcpepc01797 kernel: [68519.703064] NVRM: GPU at PCI:0000:20:00: GPU-16aba4a6-68c1-44ab-47dd-7c7d06d2ddc5
  Nov 17 03:16:15 rcpepc01797 kernel: [68519.703066] NVRM: GPU Board Serial Number:
  Nov 17 03:16:15 rcpepc01797 kernel: [68519.703068] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception - INSTR_RAM_ACCESS_OUT_OF_BOUNDS
  Nov 17 03:16:15 rcpepc01797 kernel: [68519.703072] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ESR 0x404490=0x80000020
  Nov 17 03:16:15 rcpepc01797 kernel: [68519.703096] NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ChID 001b, Class 0000c197, Offset 00000000, Data 00000000

mdp file:

  integrator = md
  dt = 0.002
  nsteps = 1000000
  comm-grps = System
  ;
  nstxout = 50
  nstvout = 0
  nstfout = 0
  nstlog = 50
  nstenergy = 50
  ;
  nstlist = 50
  ns_type = grid
  pbc = xyz
  rlist = 1.2
  cutoff-scheme = Verlet
  ;
  coulombtype = PME
  rcoulomb = 1.2
  vdw_type = cut-off
  rvdw = 1.2
  ;
  constraints = h-bonds
  ;
  tcoupl = v-rescale
  tau-t = 0.1
  ref-t = 300.0
  tc-grps = System
  ;
  acc-grps = api pol
  accelerate = 0 0.5 0 0 -0.5 0
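
Since the failures are intermittent, it may help to tally which Xid codes the driver reports over many runs. The following is only a minimal sketch in Python, assuming syslog lines formatted like the NVRM excerpts above. The function name count_xids and the regular expression are illustrative choices, not part of GROMACS or the NVIDIA driver.

```python
import re
from collections import Counter

# Assumed line shape, taken from the syslog excerpts in this mail:
#   ... NVRM: Xid (PCI:0000:20:00): 13, Graphics Exception: ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xids(lines):
    """Count NVRM Xid lines, keyed by (PCI address, Xid code)."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Two lines copied from the excerpts above, used as sample input.
sample = [
    "Nov 20 09:34:56 rcpepc01797 kernel: [350452.906685] NVRM: Xid "
    "(PCI:0000:20:00): 69, Class Error: ChId 001a, Class 0000c1c0, "
    "Offset 000001b0, Data 00000041, ErrorCode 00000053",
    "Nov 20 06:10:54 rcpepc01797 kernel: [338210.240084] NVRM: Xid "
    "(PCI:0000:20:00): 13, Graphics Exception: EXTRA_INLINE_DATA",
]

for (pci, xid), n in sorted(count_xids(sample).items()):
    print(f"{pci}  Xid {xid}: {n} line(s)")
```

To run it over a real log, pass an open file instead of the sample list, e.g. count_xids(open("/var/log/syslog")), assuming that is where the kernel messages end up on the machine in question. Note that one failure can emit several Xid lines, so the counts are per line, not per crash.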