Re: [gmx-users] GPU job often stopped
the problem is still there... :-(

On 04/29/2013 06:06 PM, Szilárd Páll wrote:
> On Mon, Apr 29, 2013 at 3:51 PM, Albert wrote:
>> On 04/29/2013 03:47 PM, Szilárd Páll wrote:
>>> In that case, while it isn't very likely, the issue could be caused by
>>> some implementation detail which aims to avoid performance loss caused
>>> by an issue in the NVIDIA drivers.
>>>
>>> Try running with the GMX_CUDA_STREAMSYNC environment variable set.
>>>
>>> Btw, were there any other processes using the GPU while mdrun was running?
>>>
>>> Cheers,
>>> --
>>> Szilárd
>>
>> thanks for the kind reply.
>> There is no other process when I am running Gromacs.
>>
>> do you mean I should set GMX_CUDA_STREAMSYNC in the job script like:
>>
>> export GMX_CUDA_STREAMSYNC=/opt/cuda-5.0
>
> Sort of, but the value does not matter. So if your shell is bash, the
> above as well as simply "export GMX_CUDA_STREAMSYNC=" will work fine.
>
> Let us know if this avoided the crash - when you have simulated long
> enough to be able to judge.
>
> Cheers,
> --
> Szilárd

--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
* Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
* Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
Re: [gmx-users] GPU job often stopped
On Mon, Apr 29, 2013 at 3:51 PM, Albert wrote:
> On 04/29/2013 03:47 PM, Szilárd Páll wrote:
>> In that case, while it isn't very likely, the issue could be caused by
>> some implementation detail which aims to avoid performance loss caused
>> by an issue in the NVIDIA drivers.
>>
>> Try running with the GMX_CUDA_STREAMSYNC environment variable set.
>>
>> Btw, were there any other processes using the GPU while mdrun was running?
>>
>> Cheers,
>> --
>> Szilárd
>
> thanks for the kind reply.
> There is no other process when I am running Gromacs.
>
> do you mean I should set GMX_CUDA_STREAMSYNC in the job script like:
>
> export GMX_CUDA_STREAMSYNC=/opt/cuda-5.0

Sort of, but the value does not matter. So if your shell is bash, the
above as well as simply "export GMX_CUDA_STREAMSYNC=" will work fine.

Let us know if this avoided the crash - when you have simulated long
enough to be able to judge.

Cheers,
--
Szilárd

> ?
>
> THX
> Albert
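A minimal sketch of the suggestion above, assuming a bash job script. Per the reply, the value does not matter, so an empty value is used here. The helper name is_exported is made up for illustration, and the md.tpr file name and -gpu_id selection in the final comment are borrowed from elsewhere in this thread.

```shell
#!/bin/bash
# Set the variable with an empty value; per the reply above, the value is not used.
export GMX_CUDA_STREAMSYNC=

# Hypothetical helper: succeeds if the named variable is exported,
# i.e. visible to child processes such as mdrun.
is_exported() {
    env | grep -q "^$1="
}

if is_exported GMX_CUDA_STREAMSYNC; then
    echo "GMX_CUDA_STREAMSYNC is exported"
fi

# Launch mdrun afterwards from this same script so it inherits the variable, e.g.:
# mdrun -s md.tpr -gpu_id 0234
```

The export has to happen in the same script (or shell) that starts mdrun; setting it in a separate terminal has no effect on the job.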
Re: [gmx-users] GPU job often stopped
On 04/29/2013 03:47 PM, Szilárd Páll wrote:
> In that case, while it isn't very likely, the issue could be caused by
> some implementation detail which aims to avoid performance loss caused
> by an issue in the NVIDIA drivers.
>
> Try running with the GMX_CUDA_STREAMSYNC environment variable set.
>
> Btw, were there any other processes using the GPU while mdrun was running?
>
> Cheers,
> --
> Szilárd

thanks for the kind reply.
There is no other process when I am running Gromacs.

do you mean I should set GMX_CUDA_STREAMSYNC in the job script like:

export GMX_CUDA_STREAMSYNC=/opt/cuda-5.0

?

THX
Albert
Re: [gmx-users] GPU job often stopped
In that case, while it isn't very likely, the issue could be caused by
some implementation detail which aims to avoid performance loss caused
by an issue in the NVIDIA drivers.

Try running with the GMX_CUDA_STREAMSYNC environment variable set.

Btw, were there any other processes using the GPU while mdrun was running?

Cheers,
--
Szilárd

On Mon, Apr 29, 2013 at 3:32 PM, Albert wrote:
> On 04/29/2013 03:31 PM, Szilárd Páll wrote:
>> The segv indicates that mdrun crashed and not that the machine was
>> restarted. The GPU detection output (both on stderr and log) should
>> show whether ECC is "on" (and so does the nvidia-smi tool).
>>
>> Cheers,
>> --
>> Szilárd
>
> yes it was on:
>
> Reading file heavy.tpr, VERSION 4.6.1 (single precision)
> Using 4 MPI threads
> Using 8 OpenMP threads per tMPI thread
>
> 5 GPUs detected:
> #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
> #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat: compatible
> #2: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
> #3: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
> #4: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>
> 4 GPUs user-selected for this run: #0, #2, #3, #4
Re: [gmx-users] GPU job often stopped
On 04/29/2013 03:31 PM, Szilárd Páll wrote:
> The segv indicates that mdrun crashed and not that the machine was
> restarted. The GPU detection output (both on stderr and log) should
> show whether ECC is "on" (and so does the nvidia-smi tool).
>
> Cheers,
> --
> Szilárd

yes it was on:

Reading file heavy.tpr, VERSION 4.6.1 (single precision)
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread

5 GPUs detected:
#0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat: compatible
#2: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#3: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#4: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

4 GPUs user-selected for this run: #0, #2, #3, #4
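The ECC column in the detection output above can be pulled out with a small filter. This is only a sketch: list_ecc is a hypothetical helper name, and it assumes the detection lines keep the format shown in this message.

```shell
# Hypothetical helper: read mdrun GPU detection lines on stdin and
# print "<gpu id> <ECC state>" for every line of the form
#   #N: <name>, compute cap.: X.Y, ECC: yes|no, stat: ...
# Lines that do not match are ignored.
list_ecc() {
    sed -n 's/^ *#\([0-9][0-9]*\): .*ECC: *\([a-z]*\),.*/\1 \2/p'
}

# Example with two of the lines from the output above.
list_ecc <<'EOF'
5 GPUs detected:
#0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
#1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC: no, stat: compatible
EOF
# prints:
# 0 yes
# 1 no
```

On a real run the function could read the log file directly, for example `list_ecc < md.log`, assuming the log was written under the default name.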
Re: [gmx-users] GPU job often stopped
On Mon, Apr 29, 2013 at 2:41 PM, Albert wrote:
> On 04/28/2013 05:45 PM, Justin Lemkul wrote:
>> Frequent failures suggest instability in the simulated system. Check your
>> .log file or stderr for informative Gromacs diagnostic information.
>>
>> -Justin
>
> my log file didn't have any errors; the end of the stopped log file looks
> something like:
>
> DD step 2259 vol min/aver 0.967 load imb.: force 0.8%
>
>            Step           Time         Lambda
>            2260        45200.0            0.0
>
>    Energies (kJ/mol)
>           Angle            U-B    Proper Dih.  Improper Dih.          LJ-14
>     9.86437e+03    4.02406e+04    3.52809e+04    6.13542e+02    8.61815e+03
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     1.25055e+04    3.05477e+04   -9.05956e+03   -6.02400e+05    1.58357e+03
>  Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>     1.39149e+02   -4.72066e+05    1.37165e+05   -3.34901e+05    3.11958e+02
>  Pres. DC (bar) Pressure (bar)   Constr. rmsd
>    -2.94092e+02   -7.91535e+01    1.79812e-05
>
> also in the information file I only obtained information:
>
> step 13300, will finish Tue Apr 30 14:41
> NOTE: Turning on dynamic load balancing
>
> Probably the machine was restarted from time to time?

The segv indicates that mdrun crashed and not that the machine was
restarted. The GPU detection output (both on stderr and log) should
show whether ECC is "on" (and so does the nvidia-smi tool).

Cheers,
--
Szilárd

> best
> Albert
Re: [gmx-users] GPU job often stopped
On 04/28/2013 05:45 PM, Justin Lemkul wrote:
> Frequent failures suggest instability in the simulated system. Check your
> .log file or stderr for informative Gromacs diagnostic information.
>
> -Justin

my log file didn't have any errors; the end of the stopped log file looks
something like:

DD step 2259 vol min/aver 0.967 load imb.: force 0.8%

           Step           Time         Lambda
           2260        45200.0            0.0

   Energies (kJ/mol)
          Angle            U-B    Proper Dih.  Improper Dih.          LJ-14
    9.86437e+03    4.02406e+04    3.52809e+04    6.13542e+02    8.61815e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    1.25055e+04    3.05477e+04   -9.05956e+03   -6.02400e+05    1.58357e+03
 Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
    1.39149e+02   -4.72066e+05    1.37165e+05   -3.34901e+05    3.11958e+02
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -2.94092e+02   -7.91535e+01    1.79812e-05

also in the information file I only obtained information:

step 13300, will finish Tue Apr 30 14:41
NOTE: Turning on dynamic load balancing

Probably the machine was restarted from time to time?

best
Albert
Re: [gmx-users] GPU job often stopped
Hello:

yes, I tried the CPU-only version; it goes well and didn't stop. I am not
sure whether I have ECC on or not. There are 4 Tesla K20 and one GTX650 in
the workstation. After compilation, I simply submit the jobs with the command:

mdrun -s md.tpr -gpu_id 0234

I submitted the same system on another machine with a GTX690, and it also
goes well. I compiled Gromacs with the same options on that machine.

thank you very much
best
Albert

On 04/29/2013 01:19 PM, Szilárd Páll wrote:
> Have you tried running on CPUs only just to see if the issue persists?
> Unless the issue does not occur with the same binary on the same
> hardware running on CPUs only, I doubt it's a problem in the code.
>
> Do you have ECC on?
> --
> Szilárd
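One way to repeat the CPU-only check from the quoted question is to run the same .tpr twice, once with the non-bonded work forced onto the CPU and once on the GPUs. A sketch, assuming the GROMACS 4.6 mdrun options -nb cpu and -deffnm behave as documented; the function names and the output prefixes cpu_only and gpu are made up for illustration.

```shell
# Hypothetical wrappers around mdrun; $1 is the run input file (.tpr).
# -deffnm gives each run its own output prefix so the two runs
# do not overwrite each other's log, trajectory and checkpoint files.

# Same binary, same hardware, but non-bonded interactions computed on the CPU.
run_cpu_only() {
    mdrun -s "$1" -nb cpu -deffnm cpu_only
}

# GPU run using the selection from the message above (GPUs 0, 2, 3 and 4).
run_on_gpus() {
    mdrun -s "$1" -gpu_id 0234 -deffnm gpu
}

# Usage (not executed here):
# run_cpu_only md.tpr
# run_on_gpus md.tpr
```

If only the GPU run crashes while both use the same binary and input, that points away from the simulated system itself, which is the comparison the question is after.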
Re: [gmx-users] GPU job often stopped
Have you tried running on CPUs only just to see if the issue persists?
Unless the issue does not occur with the same binary on the same
hardware running on CPUs only, I doubt it's a problem in the code.

Do you have ECC on?
--
Szilárd

On Sun, Apr 28, 2013 at 5:27 PM, Albert wrote:
> Dear:
>
> I am running MD jobs on a workstation with 4 K20 GPUs, and I found that the
> job fails from time to time with the following messages:
>
> [tesla:03432] *** Process received signal ***
> [tesla:03432] Signal: Segmentation fault (11)
> [tesla:03432] Signal code: Address not mapped (1)
> [tesla:03432] Failing at address: 0xfffe02de67e0
> [tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f4666da1cb0]
> [tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
> [tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
> [tesla:03432] [ 3] /opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f46667904f3]
> [tesla:03432] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 3432 on node tesla exited on
> signal 11 (Segmentation fault).
> --
>
> I can continue the jobs with the mdrun options "-append -cpi", but it still
> stops from time to time. I am just wondering what the problem is?
>
> thank you very much
> Albert
Re: [gmx-users] GPU job often stopped
On 4/28/13 11:27 AM, Albert wrote:
> Dear:
>
> I am running MD jobs on a workstation with 4 K20 GPUs, and I found that the
> job fails from time to time with the following messages:
>
> [tesla:03432] *** Process received signal ***
> [tesla:03432] Signal: Segmentation fault (11)
> [tesla:03432] Signal code: Address not mapped (1)
> [tesla:03432] Failing at address: 0xfffe02de67e0
> [tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f4666da1cb0]
> [tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
> [tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
> [tesla:03432] [ 3] /opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f46667904f3]
> [tesla:03432] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 3432 on node tesla exited on
> signal 11 (Segmentation fault).
> --
>
> I can continue the jobs with the mdrun options "-append -cpi", but it still
> stops from time to time. I am just wondering what the problem is?

Frequent failures suggest instability in the simulated system. Check your
.log file or stderr for informative Gromacs diagnostic information.

-Justin

--
Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin
[gmx-users] GPU job often stopped
Dear:

I am running MD jobs on a workstation with 4 K20 GPUs, and I found that the
job fails from time to time with the following messages:

[tesla:03432] *** Process received signal ***
[tesla:03432] Signal: Segmentation fault (11)
[tesla:03432] Signal code: Address not mapped (1)
[tesla:03432] Failing at address: 0xfffe02de67e0
[tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f4666da1cb0]
[tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
[tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
[tesla:03432] [ 3] /opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f46667904f3]
[tesla:03432] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 3432 on node tesla exited on
signal 11 (Segmentation fault).
--

I can continue the jobs with the mdrun options "-append -cpi", but it still
stops from time to time. I am just wondering what the problem is?

thank you very much
Albert
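The manual restart described above can be wrapped in a small retry loop. This is a stopgap sketch, not a fix for the crash: run_with_restarts is a hypothetical helper, state.cpt is assumed to be the checkpoint file name (mdrun's default), and the retry limit is arbitrary. For an MPI build, the mdrun line would be replaced by the mpirun command actually used.

```shell
# Hypothetical helper: keep restarting mdrun from its checkpoint until it
# exits with status 0 or the attempt limit is reached.
#   $1 = run input file (.tpr), $2 = maximum number of attempts
run_with_restarts() {
    tpr=$1
    max=$2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # -cpi reads the checkpoint, -append continues the existing output files.
        if mdrun -s "$tpr" -cpi state.cpt -append; then
            return 0
        fi
        echo "mdrun attempt $attempt exited with a non-zero status" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (not executed here):
# run_with_restarts md.tpr 10
```

Counting the attempts that fail also gives a rough measure of how often the crash happens, which is useful when judging whether a change such as a different driver setting made any difference.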