Re: [gmx-users] GPU job often stopped

2013-05-02 Thread Albert

the problem is still there...

:-(



On 04/29/2013 06:06 PM, Szilárd Páll wrote:

On Mon, Apr 29, 2013 at 3:51 PM, Albert  wrote:

>On 04/29/2013 03:47 PM, Szilárd Páll wrote:

>>
>>In that case, while it isn't very likely, the issue could be caused by
>>some implementation detail which aims to avoid performance loss caused
>>by an issue in the NVIDIA drivers.
>>
>>Try running with the GMX_CUDA_STREAMSYNC environment variable set.
>>
>>Btw, were there any other processes using the GPU while mdrun was running?
>>
>>Cheers,
>>--
>>Szilárd

>
>
>Thanks for the kind reply.
>There are no other processes when I am running Gromacs.
>
>Do you mean I should set GMX_CUDA_STREAMSYNC in the job script like:
>
>export GMX_CUDA_STREAMSYNC=/opt/cuda-5.0

Sort of, but the value does not matter. So if your shell is bash, the
above as well as simply "export GMX_CUDA_STREAMSYNC=" will work fine.

Let us know if this avoided the crash - when you have simulated long
enough to be able to judge.

Cheers,
--
Szilárd



--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
* Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
* Please don't post (un)subscribe requests to the list. Use the 
www interface or send it to gmx-users-requ...@gromacs.org.

* Can't post? Read http://www.gromacs.org/Support/Mailing_Lists


Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Szilárd Páll
On Mon, Apr 29, 2013 at 3:51 PM, Albert  wrote:
> On 04/29/2013 03:47 PM, Szilárd Páll wrote:
>>
>> In that case, while it isn't very likely, the issue could be caused by
>> some implementation detail which aims to avoid performance loss caused
>> by an issue in the NVIDIA drivers.
>>
>> Try running with the GMX_CUDA_STREAMSYNC environment variable set.
>>
>> Btw, were there any other processes using the GPU while mdrun was running?
>>
>> Cheers,
>> --
>> Szilárd
>
>
> Thanks for the kind reply.
> There are no other processes when I am running Gromacs.
>
> Do you mean I should set GMX_CUDA_STREAMSYNC in the job script like:
>
> export GMX_CUDA_STREAMSYNC=/opt/cuda-5.0

Sort of, but the value does not matter. So if your shell is bash, the
above as well as simply "export GMX_CUDA_STREAMSYNC=" will work fine.
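
As a minimal sketch of the above, assuming a bash job script: the
variable is exported with an empty value, on the assumption (taken from
the statement that the value does not matter) that mdrun only checks
whether it exists. The mdrun line is the command quoted elsewhere in
this thread and is left commented out.

```shell
#!/bin/bash
# Assumption: mdrun only tests whether the variable exists,
# so an empty value is enough.
export GMX_CUDA_STREAMSYNC=

# Confirm the variable is set (even though it is empty) before launching.
if [ -n "${GMX_CUDA_STREAMSYNC+x}" ]; then
    echo "GMX_CUDA_STREAMSYNC is set"
fi

# Then start the run as usual, e.g. with the command from this thread:
# mdrun -s md.tpr -gpu_id 0234
```

The `${VAR+x}` expansion distinguishes "set but empty" from "unset",
which a plain `-n "$VAR"` test would not.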

Let us know if this avoided the crash - when you have simulated long
enough to be able to judge.

Cheers,
--
Szilárd

>
> ?
>
> THX
> Albert
>
>
>
>


Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Albert

On 04/29/2013 03:47 PM, Szilárd Páll wrote:

In that case, while it isn't very likely, the issue could be caused by
some implementation detail which aims to avoid performance loss caused
by an issue in the NVIDIA drivers.

Try running with the GMX_CUDA_STREAMSYNC environment variable set.

Btw, were there any other processes using the GPU while mdrun was running?

Cheers,
--
Szilárd


Thanks for the kind reply.
There are no other processes when I am running Gromacs.

Do you mean I should set GMX_CUDA_STREAMSYNC in the job script like:

export GMX_CUDA_STREAMSYNC=/opt/cuda-5.0

?

THX
Albert





Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Szilárd Páll
In that case, while it isn't very likely, the issue could be caused by
some implementation detail which aims to avoid performance loss caused
by an issue in the NVIDIA drivers.

Try running with the GMX_CUDA_STREAMSYNC environment variable set.

Btw, were there any other processes using the GPU while mdrun was running?

Cheers,
--
Szilárd


On Mon, Apr 29, 2013 at 3:32 PM, Albert  wrote:
> On 04/29/2013 03:31 PM, Szilárd Páll wrote:
>>
>> The segv indicates that mdrun crashed and not that the machine was
>> restarted. The GPU detection output (both on stderr and log) should
>> show whether ECC is "on" (and so does the nvidia-smi tool).
>>
>> Cheers,
>> --
>> Szilárd
>
>
> yes it was on:
>
>
> Reading file heavy.tpr, VERSION 4.6.1 (single precision)
> Using 4 MPI threads
> Using 8 OpenMP threads per tMPI thread
>
> 5 GPUs detected:
>   #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>   #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC:  no, stat: compatible
>   #2: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>   #3: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>   #4: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
>
> 4 GPUs user-selected for this run: #0, #2, #3, #4
>
>


Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Albert

On 04/29/2013 03:31 PM, Szilárd Páll wrote:

The segv indicates that mdrun crashed and not that the machine was
restarted. The GPU detection output (both on stderr and log) should
show whether ECC is "on" (and so does the nvidia-smi tool).

Cheers,
--
Szilárd


yes it was on:


Reading file heavy.tpr, VERSION 4.6.1 (single precision)
Using 4 MPI threads
Using 8 OpenMP threads per tMPI thread

5 GPUs detected:
  #0: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
  #1: NVIDIA GeForce GTX 650, compute cap.: 3.0, ECC:  no, stat: compatible
  #2: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
  #3: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible
  #4: NVIDIA Tesla K20m, compute cap.: 3.5, ECC: yes, stat: compatible

4 GPUs user-selected for this run: #0, #2, #3, #4



Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Szilárd Páll
On Mon, Apr 29, 2013 at 2:41 PM, Albert  wrote:
> On 04/28/2013 05:45 PM, Justin Lemkul wrote:
>>
>>
>> Frequent failures suggest instability in the simulated system. Check your
>> .log file or stderr for informative Gromacs diagnostic information.
>>
>> -Justin
>
>
>
> My log file didn't contain any errors; the end of the stopped log file looks
> something like:
>
> DD  step 2259  vol min/aver 0.967  load imb.: force  0.8%
>
>Step   Time Lambda
>226045200.00.0
>
>    Energies (kJ/mol)
>           Angle            U-B    Proper Dih.  Improper Dih.          LJ-14
>     9.86437e+03    4.02406e+04    3.52809e+04    6.13542e+02    8.61815e+03
>      Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
>     1.25055e+04    3.05477e+04   -9.05956e+03   -6.02400e+05    1.58357e+03
>  Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
>     1.39149e+02   -4.72066e+05    1.37165e+05   -3.34901e+05    3.11958e+02
>  Pres. DC (bar) Pressure (bar)   Constr. rmsd
>    -2.94092e+02   -7.91535e+01    1.79812e-05
>
>
> Also, the information file only contained:
>
>
> step 13300, will finish Tue Apr 30 14:41
> NOTE: Turning on dynamic load balancing
>
>
> Probably the machine was restarted from time to time?

The segv indicates that mdrun crashed and not that the machine was
restarted. The GPU detection output (both on stderr and log) should
show whether ECC is "on" (and so does the nvidia-smi tool).
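
A rough sketch of both checks, assuming the log uses the
"#N: ..., ECC: yes/no, ..." layout shown in the GPU detection output
posted in this thread. The helper name ecc_from_log and the log file
name md.log are made up for illustration; `nvidia-smi -q -d ECC` is the
driver tool's own ECC query.

```shell
#!/bin/bash
# Hypothetical helper: print "<gpu id> <ECC state>" for every GPU listed
# in the detection block of an mdrun log file.
ecc_from_log() {
    sed -n 's/^ *#\([0-9][0-9]*\): .*ECC: *\([a-z]*\),.*/\1 \2/p' "$1"
}

# Example (md.log is an assumed log file name):
# ecc_from_log md.log

# Cross-check with the driver tool when it is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -q -d ECC
else
    echo "nvidia-smi not found on this machine"
fi
```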

Cheers,
--
Szilárd


>
> best
> Albert
>
>
>


Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Albert

On 04/28/2013 05:45 PM, Justin Lemkul wrote:


Frequent failures suggest instability in the simulated system. Check 
your .log file or stderr for informative Gromacs diagnostic information.


-Justin 



My log file didn't contain any errors; the end of the stopped log file looks
something like:


DD  step 2259  vol min/aver 0.967  load imb.: force  0.8%

   Step   Time Lambda
   226045200.00.0

   Energies (kJ/mol)
          Angle            U-B    Proper Dih.  Improper Dih.          LJ-14
    9.86437e+03    4.02406e+04    3.52809e+04    6.13542e+02    8.61815e+03
     Coulomb-14        LJ (SR)  Disper. corr.   Coulomb (SR)   Coul. recip.
    1.25055e+04    3.05477e+04   -9.05956e+03   -6.02400e+05    1.58357e+03
 Position Rest.      Potential    Kinetic En.   Total Energy    Temperature
    1.39149e+02   -4.72066e+05    1.37165e+05   -3.34901e+05    3.11958e+02
 Pres. DC (bar) Pressure (bar)   Constr. rmsd
   -2.94092e+02   -7.91535e+01    1.79812e-05


Also, the information file only contained:


step 13300, will finish Tue Apr 30 14:41
NOTE: Turning on dynamic load balancing


Probably the machine was restarted from time to time?

best
Albert




Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Albert

Hello:

 Yes, I tried the CPU-only version; it runs well and didn't stop. I am
not sure whether I have ECC on or not. There are 4 Tesla K20s and one
GTX 650 in the workstation. After compilation, I simply submit the jobs
with the command:



mdrun -s md.tpr -gpu_id 0234

I submitted the same system on another machine with a GTX 690, and it
also runs well. I compiled Gromacs with the same options on that machine.


thank you very much
best
Albert



On 04/29/2013 01:19 PM, Szilárd Páll wrote:

Have you tried running on CPUs only just to see if the issue persists?
Unless the issue does not occur with the same binary on the same
hardware running on CPUs only, I doubt it's a problem in the code.

Do you have ECC on?
--
Szilárd




Re: [gmx-users] GPU job often stopped

2013-04-29 Thread Szilárd Páll
Have you tried running on CPUs only just to see if the issue persists?
Unless the issue does not occur with the same binary on the same
hardware running on CPUs only, I doubt it's a problem in the code.
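
One way to set up that comparison, sketched under assumptions: mdrun in
GROMACS 4.6 accepts "-nb cpu" to keep the non-bonded work on the CPU, so
the same binary and input can be run both ways. The file name md.tpr and
the GPU IDs are the ones quoted elsewhere in this thread; the variable
names are illustrative only, and an MPI build would need the usual
mpirun/mdrun_mpi wrapper around these commands.

```shell
#!/bin/bash
# Same binary, same input; only the non-bonded target differs.
TPR=md.tpr        # run input file name used in this thread
GPU_IDS=0234      # GPU selection used in this thread

CPU_CMD="mdrun -s $TPR -nb cpu"                    # CPU-only run
GPU_CMD="mdrun -s $TPR -nb gpu -gpu_id $GPU_IDS"   # GPU-accelerated run

# Print the two commands; run each one by hand or from the job script.
echo "$CPU_CMD"
echo "$GPU_CMD"
```

If only the GPU run crashes, the GPU code path, the driver, or the
hardware become the likelier suspects.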

Do you have ECC on?
--
Szilárd


On Sun, Apr 28, 2013 at 5:27 PM, Albert  wrote:
> Dear:
>
>   I am running MD jobs on a workstation with 4 K20 GPUs, and I found that the
> jobs fail from time to time with the following messages:
>
>
> [tesla:03432] *** Process received signal ***
> [tesla:03432] Signal: Segmentation fault (11)
> [tesla:03432] Signal code: Address not mapped (1)
> [tesla:03432] Failing at address: 0xfffe02de67e0
> [tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0)
> [0x7f4666da1cb0]
> [tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
> [tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
> [tesla:03432] [ 3]
> /opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93)
> [0x7f46667904f3]
> [tesla:03432] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 3432 on node tesla exited on
> signal 11 (Segmentation fault).
> --
>
>
> I can continue the jobs with the mdrun options "-append -cpi", but they still
> stop from time to time. I am just wondering what the problem is?
>
> thank you very much
> Albert


Re: [gmx-users] GPU job often stopped

2013-04-28 Thread Justin Lemkul



On 4/28/13 11:27 AM, Albert wrote:

Dear:

   I am running MD jobs on a workstation with 4 K20 GPUs, and I found that the
jobs fail from time to time with the following messages:


[tesla:03432] *** Process received signal ***
[tesla:03432] Signal: Segmentation fault (11)
[tesla:03432] Signal code: Address not mapped (1)
[tesla:03432] Failing at address: 0xfffe02de67e0
[tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) 
[0x7f4666da1cb0]
[tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
[tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
[tesla:03432] [ 3]
/opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) [0x7f46667904f3]
[tesla:03432] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 3432 on node tesla exited on signal
11 (Segmentation fault).
--


I can continue the jobs with the mdrun options "-append -cpi", but they still
stop from time to time. I am just wondering what the problem is?



Frequent failures suggest instability in the simulated system.  Check your .log 
file or stderr for informative Gromacs diagnostic information.


-Justin

--


Justin A. Lemkul, Ph.D.
Research Scientist
Department of Biochemistry
Virginia Tech
Blacksburg, VA
jalemkul[at]vt.edu | (540) 231-9080
http://www.bevanlab.biochem.vt.edu/Pages/Personal/justin




[gmx-users] GPU job often stopped

2013-04-28 Thread Albert

Dear:

  I am running MD jobs on a workstation with 4 K20 GPUs, and I found that
the jobs fail from time to time with the following messages:



[tesla:03432] *** Process received signal ***
[tesla:03432] Signal: Segmentation fault (11)
[tesla:03432] Signal code: Address not mapped (1)
[tesla:03432] Failing at address: 0xfffe02de67e0
[tesla:03432] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) 
[0x7f4666da1cb0]

[tesla:03432] [ 1] mdrun_mpi() [0x47dd61]
[tesla:03432] [ 2] mdrun_mpi() [0x47d8ae]
[tesla:03432] [ 3] 
/opt/intel/lib/intel64/libiomp5.so(__kmp_invoke_microtask+0x93) 
[0x7f46667904f3]

[tesla:03432] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 3432 on node tesla exited on 
signal 11 (Segmentation fault).

--


I can continue the jobs with the mdrun options "-append -cpi", but they
still stop from time to time. I am just wondering what the problem is?


thank you very much
Albert