I am now wondering whether the nvidia driver, as built automatically by
Debian during recent updates/upgrades, allows correct rendering but fails
with NAMD computations.
On this point, it is not clear to me whether Debian, with its automatic
building, uses the proprietary nvidia driver. If not, I could try
downloading the proprietary nvidia driver.
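
One way to see which driver is actually loaded might be the small Python
sketch below. It assumes that the NVIDIA kernel module, when loaded, exposes
/proc/driver/nvidia/version with a line starting "NVRM version:", and that
this file is absent when another driver (for example nouveau) is in use. The
function names and the regular expression are my own, for illustration only.

```python
import re
from pathlib import Path

# Assumption: this file exists only while the NVIDIA kernel module is loaded.
VERSION_FILE = Path("/proc/driver/nvidia/version")


def parse_nvrm_version(text):
    """Return the version found on an 'NVRM version:' line, or None."""
    match = re.search(r"NVRM version:.*?Kernel Module\s+(\d+(?:\.\d+)+)", text)
    return match.group(1) if match else None


def loaded_nvidia_version(path=VERSION_FILE):
    """Return the loaded NVIDIA module version, or None if the file is unreadable."""
    try:
        return parse_nvrm_version(path.read_text())
    except OSError:
        # File missing or unreadable: the NVIDIA module is probably not loaded.
        return None


if __name__ == "__main__":
    print(loaded_nvidia_version())
```

If this prints a version matching the installed package, the NVIDIA kernel
module is the one in use; None would suggest that some other driver is loaded.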

---------- Forwarded message ---------
From: Francesco Pietra <chiendar...@gmail.com>
Date: Sun, Nov 20, 2022, 7:07 PM
Subject: CUDA error cudaStreamSynchronize(stream) and CUDA error in
ComputeBondedCUDA
To: debian-users <debian-user@lists.debian.org>


Hello
Main board GA-X79-UD3 with two 680 GPUs
Debian 10 Linux
kernel 5.10.0-19-amd64
OpenGL 4.6.0
nvidia driver 470.141.03
------------------------------
Months ago, following an update/upgrade of the amd64 system, the GPUs, while
still rendering correctly, became unable to run classical molecular dynamics
simulations. This happens when launching a minimization with NAMD on both
GPUs or on only one of them (selected in software or even by physically
removing one GPU):

namd2 +idlepoll +p12 +devices 0,1 min.conf
namd2 +idlepoll +p12 +devices 0 min.conf
namd2 +idlepoll +p12 +devices 1 min.conf

NAMD sets up the simulation correctly, but at the stage where the computation
starts and memory is accessed, it crashes with the error:

TCL: Minimizing for 3000 steps
> FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
> src/CudaTileListKernel.cu, function buildTileLists, line 1136
> on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was
> encountered
> FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
> 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
> memory access was encountered
> FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file
> src/CudaTileListKernel.cu, function buildTileLists, line 1136
> on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was
> encountered
> FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling
> 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal
> memory access was encountered
> [Partition 0][Node 0] End of program
>

The "illegal memory access" appears to be a software error rather than a
hardware fault, since it occurs identically with either of the two GPUs used
alone. It has escaped all my attempts at unraveling its origin. I got no
clues from the NAMD forum, so I hope to find some here.
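
To make the point about both GPUs concrete, here is a minimal Python sketch
that tallies which CUDA devices are named in the error output. The function
name and the regular expression are my own assumptions, based only on the
"on Pe N (host device D pci ...)" wording in the log above; nothing here is
part of NAMD.

```python
import re
from collections import Counter

# Matches e.g. "on Pe 4 (gig64 device 0 pci 0:2:0)".
# The pattern is an assumption derived from the log quoted in this message.
DEVICE_RE = re.compile(r"on Pe (\d+) \((\S+) device (\d+) pci ([0-9a-fA-F:]+)\)")


def failing_devices(log_text):
    """Count 'on Pe N (host device D pci ...)' mentions per CUDA device index."""
    # Quoted mail prefixes lines with "> " and wraps them, so strip the
    # quote marks and rejoin everything into one line before matching.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    return Counter(int(m.group(3)) for m in DEVICE_RE.finditer(flat))
```

Applied to the error output quoted above, this counts two reports for device 0
and two for device 1, that is, the same failure on both cards.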

Thanks for your kind attention

francesco pietra
