Re: [gmx-developers] Re: [gmx-users] unexpected stop of simulation

2010-11-03 Thread Bogdan Costescu
On Thu, Nov 4, 2010 at 1:08 AM, Roland Schulz  wrote:
> BTW: is it somehow possible to print the kernel error messages that are
> shown by dmesg to the user from within GROMACS? That would let the user
> see the reason for the error directly. So I'm looking for a function
> similar to strerror, but one that returns the kernel message rather than
> just the text for the error code (which in this case was only
> "Input/output error").

Hi Roland!

In general, it's not possible to make a connection between a message
logged by the Linux kernel (which is then shown by dmesg or the system
logging) and a particular call to an I/O function. More specifically,
dmesg just dumps the kernel log buffer which, to my knowledge, doesn't
contain time information, so the last message in the buffer and the
last I/O operation cannot be correlated this way; the system logging
(syslogd and similar) attaches time information, but it's usually only
readable by root for security reasons - and even if it were
readable by the GROMACS user, there would be no way to uniquely associate an
entry in syslog with a particular I/O operation on a
multiuser/multitasking OS. It might be doable using a tracing
infrastructure in the kernel... but that's no longer generic.
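
For reference, here is a minimal sketch (not GROMACS code) of how a
program could read the tail of the kernel ring buffer, essentially what
dmesg does. It assumes glibc's klogctl() is available and that the
system permits unprivileged reads (with dmesg_restrict=1 it will not),
and it also illustrates the point above: the returned text carries no
information that ties it to a particular failed I/O call.

/* Sketch: dump the kernel ring buffer, roughly what dmesg does. */
#include <stdio.h>
#include <sys/klog.h>

#define SYSLOG_ACTION_READ_ALL 3   /* non-destructive read of the whole buffer */

int main(void)
{
    char buf[8192];
    int  n = klogctl(SYSLOG_ACTION_READ_ALL, buf, sizeof(buf));
    if (n < 0) {
        perror("klogctl");         /* typically EPERM if reads are restricted */
        return 1;
    }
    fwrite(buf, 1, (size_t)n, stdout);
    return 0;
}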

Cheers,
Bogdan


Re: [gmx-users] unexpected stop of simulation

2010-11-03 Thread Roland Schulz
Hi,

The reason turned out to be that the lock daemon (lockd) on the NFS server
was hanging; the error showed up in dmesg.
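
To illustrate (a hedged sketch, not the actual checkpoint.c code): mdrun
takes an advisory lock on the log file roughly along these lines, and on
an NFS mount that fcntl() request is serviced by the server's lock
daemon, so a hung lockd makes the call block or fail with nothing more
informative than errno on the client side. The file name is taken from
this thread.

/* Sketch of an advisory write lock on the log file; error reporting
 * here is limited to strerror(errno). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("md100ns.log", O_WRONLY | O_APPEND);
    if (fd < 0) {
        fprintf(stderr, "open: %s\n", strerror(errno));
        return 1;
    }
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    /* l_start = 0, l_len = 0: lock the whole file; F_SETLK does not block */
    if (fcntl(fd, F_SETLK, &fl) < 0) {
        fprintf(stderr, "Failed to lock md100ns.log: %s\n", strerror(errno));
        close(fd);
        return 1;
    }
    /* ... append log/checkpoint data while holding the lock ... */
    close(fd);
    return 0;
}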

BTW: is it somehow possible to print the kernel error messages that are
shown by dmesg to the user from within GROMACS? That would let the user
see the reason for the error directly. So I'm looking for a function
similar to strerror, but one that returns the kernel message rather than
just the text for the error code (which in this case was only
"Input/output error").
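
For comparison, a minimal sketch of the strerror() route mentioned
above: when the underlying problem is a wedged NFS lockd, as it was
here, the operation fails with EIO and all strerror() can report is
"Input/output error" - the detailed reason stays in the kernel log.
The file name is again taken from this thread.

/* Sketch: errno/strerror reporting after a failed write. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *fp = fopen("md100ns.log", "a");
    if (fp == NULL) {
        fprintf(stderr, "open failed: %s\n", strerror(errno));
        return 1;
    }
    if (fputs("checkpoint written\n", fp) == EOF || fflush(fp) == EOF) {
        fprintf(stderr, "write failed: %s\n", strerror(errno));  /* e.g. EIO */
        fclose(fp);
        return 1;
    }
    fclose(fp);
    return 0;
}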

Roland



On Wed, Nov 3, 2010 at 12:05 PM, Carsten Kutzner  wrote:

> Hi,
>
> there was also an issue with the locking of the general md.log
> output file which was resolved for 4.5.2. An update might help.
>
> Carsten
>
>
> On Nov 3, 2010, at 3:50 PM, Florian Dommert wrote:
>
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > On 11/03/2010 03:38 PM, Hong, Liang wrote:
> >> Dear all,
> >> I'm performing a three-day simulation. It runs well for the first day,
> but stops for the second one. The error message is below. Does anyone know
> what might be the problem? Thanks
> >> Liang
> >>
> >> Program mdrun, VERSION 4.5.1-dev-20101008-e2cbc-dirty
> >> Source code file:
> /home/z8g/download/gromacs.head/src/gmxlib/checkpoint.c, line: 1748
> >>
> >> Fatal error:
> >> Failed to lock: md100ns.log. Already running simulation?
> >> For more information and tips for troubleshooting, please check the
> GROMACS
> >> website at http://www.gromacs.org/Documentation/Errors
> >> ---
> >>
> >> "Sitting on a rooftop watching molecules collide" (A Camp)
> >>
> >> Error on node 0, will try to stop all the nodes
> >> Halting parallel program mdrun on CPU 0 out of 32
> >>
> >> gcq#348: "Sitting on a rooftop watching molecules collide" (A Camp)
> >>
> >>
> --
> >> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> >> with errorcode -1.
> >>
> >> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> >> You may or may not see output from other processes, depending on
> >> exactly when Open MPI kills them.
> >>
> --
> >> [node139:04470] [[37327,0],0]-[[37327,1],0] mca_oob_tcp_msg_recv: readv
> failed: Connection reset by peer (104)
> >>
> --
> >> mpiexec has exited due to process rank 0 with PID 4471 on
> >> node node139 exiting without calling "finalize". This may
> >> have caused other processes in the application to be
> >> terminated by signals sent by mpiexec (as reported here).
> >
> > Perhaps the queueing system of your cluster does not allow jobs longer
> > than 24h, or the default limit is 24h and you have to request a longer
> > walltime explicitly in the submission script.
> >
> > /Flo
> >
> > - --
> > Florian Dommert
> > Dipl.-Phys.
> >
> > Institute for Computational Physics
> >
> > University Stuttgart
> >
> > Pfaffenwaldring 27
> > 70569 Stuttgart
> >
> > Phone: +49(0)711/685-6-3613
> > Fax:   +49-(0)711/685-6-3658
> >
> > EMail: domm...@icp.uni-stuttgart.de
> > Home: http://www.icp.uni-stuttgart.de/~icp/Florian_Dommert
> > -BEGIN PGP SIGNATURE-
> > Version: GnuPG v1.4.10 (GNU/Linux)
> > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> >
> > iEYEARECAAYFAkzRdrEACgkQLpNNBb9GiPm1sgCg3LkRUWgiZvOOH/GIjp5ifbZI
> > bJcAn1aamCMWlWTokD1+eDCLG1WhT/rd
> > =4Vs3
> > -END PGP SIGNATURE-


-- 
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309

Re: [gmx-users] unexpected stop of simulation

2010-11-03 Thread Carsten Kutzner
Hi,

there was also an issue with the locking of the general md.log
output file which was resolved for 4.5.2. An update might help.

Carsten


On Nov 3, 2010, at 3:50 PM, Florian Dommert wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 11/03/2010 03:38 PM, Hong, Liang wrote:
>> Dear all,
>> I'm performing a three-day simulation. It runs well for the first day, but 
>> stops for the second one. The error message is below. Does anyone know what 
>> might be the problem? Thanks
>> Liang
>> 
>> Program mdrun, VERSION 4.5.1-dev-20101008-e2cbc-dirty
>> Source code file: /home/z8g/download/gromacs.head/src/gmxlib/checkpoint.c, 
>> line: 1748
>> 
>> Fatal error:
>> Failed to lock: md100ns.log. Already running simulation?
>> For more information and tips for troubleshooting, please check the GROMACS
>> website at http://www.gromacs.org/Documentation/Errors
>> ---
>> 
>> "Sitting on a rooftop watching molecules collide" (A Camp)
>> 
>> Error on node 0, will try to stop all the nodes
>> Halting parallel program mdrun on CPU 0 out of 32
>> 
>> gcq#348: "Sitting on a rooftop watching molecules collide" (A Camp)
>> 
>> --
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode -1.
>> 
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --
>> [node139:04470] [[37327,0],0]-[[37327,1],0] mca_oob_tcp_msg_recv: readv 
>> failed: Connection reset by peer (104)
>> --
>> mpiexec has exited due to process rank 0 with PID 4471 on
>> node node139 exiting without calling "finalize". This may
>> have caused other processes in the application to be
>> terminated by signals sent by mpiexec (as reported here).
> 
> Perhaps the queueing system of your cluster does not allow jobs longer
> than 24h, or the default limit is 24h and you have to request a longer
> walltime explicitly in the submission script.
> 
> /Flo
> 
> - -- 
> Florian Dommert
> Dipl.-Phys.
> 
> Institute for Computational Physics
> 
> University Stuttgart
> 
> Pfaffenwaldring 27
> 70569 Stuttgart
> 
> Phone: +49(0)711/685-6-3613
> Fax:   +49-(0)711/685-6-3658
> 
> EMail: domm...@icp.uni-stuttgart.de
> Home: http://www.icp.uni-stuttgart.de/~icp/Florian_Dommert
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAkzRdrEACgkQLpNNBb9GiPm1sgCg3LkRUWgiZvOOH/GIjp5ifbZI
> bJcAn1aamCMWlWTokD1+eDCLG1WhT/rd
> =4Vs3
> -END PGP SIGNATURE-







Re: [gmx-users] unexpected stop of simulation

2010-11-03 Thread Florian Dommert
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 11/03/2010 03:38 PM, Hong, Liang wrote:
> Dear all,
> I'm performing a three-day simulation. It runs well for the first day, but 
> stops for the second one. The error message is below. Does anyone know what 
> might be the problem? Thanks
> Liang
> 
> Program mdrun, VERSION 4.5.1-dev-20101008-e2cbc-dirty
> Source code file: /home/z8g/download/gromacs.head/src/gmxlib/checkpoint.c, 
> line: 1748
> 
> Fatal error:
> Failed to lock: md100ns.log. Already running simulation?
> For more information and tips for troubleshooting, please check the GROMACS
> website at http://www.gromacs.org/Documentation/Errors
> ---
> 
> "Sitting on a rooftop watching molecules collide" (A Camp)
> 
> Error on node 0, will try to stop all the nodes
> Halting parallel program mdrun on CPU 0 out of 32
> 
> gcq#348: "Sitting on a rooftop watching molecules collide" (A Camp)
> 
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -1.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
> [node139:04470] [[37327,0],0]-[[37327,1],0] mca_oob_tcp_msg_recv: readv 
> failed: Connection reset by peer (104)
> --
> mpiexec has exited due to process rank 0 with PID 4471 on
> node node139 exiting without calling "finalize". This may
> have caused other processes in the application to be
> terminated by signals sent by mpiexec (as reported here).

Perhaps the queueing system of your cluster does not allow jobs longer
than 24h, or the default limit is 24h and you have to request a longer
walltime explicitly in the submission script.

/Flo

- -- 
Florian Dommert
Dipl.-Phys.

Institute for Computational Physics

University Stuttgart

Pfaffenwaldring 27
70569 Stuttgart

Phone: +49(0)711/685-6-3613
Fax:   +49-(0)711/685-6-3658

EMail: domm...@icp.uni-stuttgart.de
Home: http://www.icp.uni-stuttgart.de/~icp/Florian_Dommert
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkzRdrEACgkQLpNNBb9GiPm1sgCg3LkRUWgiZvOOH/GIjp5ifbZI
bJcAn1aamCMWlWTokD1+eDCLG1WhT/rd
=4Vs3
-END PGP SIGNATURE-


[gmx-users] unexpected stop of simulation

2010-11-03 Thread Hong, Liang
Dear all,
I'm performing a three-day simulation. It runs well for the first day, but 
stops for the second one. The error message is below. Does anyone know what 
might be the problem? Thanks
Liang

Program mdrun, VERSION 4.5.1-dev-20101008-e2cbc-dirty
Source code file: /home/z8g/download/gromacs.head/src/gmxlib/checkpoint.c, 
line: 1748

Fatal error:
Failed to lock: md100ns.log. Already running simulation?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
---

"Sitting on a rooftop watching molecules collide" (A Camp)

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 32

gcq#348: "Sitting on a rooftop watching molecules collide" (A Camp)

--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
[node139:04470] [[37327,0],0]-[[37327,1],0] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)
--
mpiexec has exited due to process rank 0 with PID 4471 on
node node139 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at 
http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the
www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists