Re: [gmx-developers] Re: [gmx-users] unexpexted stop of simulation
On Thu, Nov 4, 2010 at 1:08 AM, Roland Schulz wrote:
> BTW: Is it somehow possible to print the kernel error messages that are
> shown by dmesg to the user from within GROMACS? That would help the user to
> directly see the reason of the error. Thus I'm looking for a function
> similar to strerror but which returns the kernel message not just the
> message of the error code (which in this case was just "Input/Output
> error").

Hi Roland!

In general, it's not possible to connect a message logged by the Linux kernel (which is then shown by dmesg or the system log) with a particular call to an I/O function.

More specifically, dmesg just dumps the kernel log buffer which, to my knowledge, doesn't contain time information, so the last message in the buffer and the last I/O operation cannot be correlated this way. The system logging (syslogd and similar) attaches time information, but the log is usually readable only by root for security reasons. Even if it were readable by the GROMACS user, there is no way to uniquely associate an entry in syslog with a particular I/O operation on a multiuser/multitasking OS.

It might be doable using a tracing infrastructure in the kernel, but that's no longer generic.

Cheers,
Bogdan

--
gmx-users mailing list    gmx-users@gromacs.org
http://lists.gromacs.org/mailman/listinfo/gmx-users
Please search the archive at http://www.gromacs.org/Support/Mailing_Lists/Search before posting!
Please don't post (un)subscribe requests to the list. Use the www interface or send it to gmx-users-requ...@gromacs.org.
Can't post? Read http://www.gromacs.org/Support/Mailing_Lists
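As a rough illustration of the limitation discussed in this message: the errno value is all that a failed system call hands back to user space, and strerror only maps that number to a fixed string. The sketch below is Python, not GROMACS code, and `describe_os_error` and `kernel_log_tail` are hypothetical helper names. It formats what is available and, as a best-effort extra, fetches the tail of the `dmesg` output. It assumes a `dmesg` binary on the PATH and that the user is allowed to read the kernel log; neither is guaranteed, and the returned lines cannot be attributed to any particular I/O call.

```python
import errno
import os
import subprocess


def describe_os_error(exc: OSError) -> str:
    """Format everything user space learns from a failed system call."""
    code = exc.errno if exc.errno is not None else 0
    name = errno.errorcode.get(code, "UNKNOWN")  # symbolic name, e.g. "EIO"
    text = os.strerror(code)                     # fixed string for that code
    target = f" on {exc.filename!r}" if exc.filename else ""
    return f"{name} ({code}): {text}{target}"


def kernel_log_tail(max_lines: int = 5):
    """Return the last lines printed by dmesg, or None if unavailable.

    Assumption: a `dmesg` binary exists and the caller may read the log.
    The lines are context only; they are not tied to a specific call.
    """
    try:
        result = subprocess.run(
            ["dmesg"],
            capture_output=True,
            text=True,
            errors="replace",
            timeout=5,
            check=False,
        )
    except (OSError, subprocess.SubprocessError):
        return None  # dmesg missing, not executable, or timed out
    if result.returncode != 0:
        return None  # typically: reading the kernel log is restricted
    return result.stdout.splitlines()[-max_lines:]


if __name__ == "__main__":
    try:
        open("/nonexistent/dir/md.log")
    except OSError as exc:
        print(describe_os_error(exc))
    tail = kernel_log_tail()
    print(tail if tail is not None else "kernel log not readable")
```

A program could print the kernel log tail next to the strerror text as a hint, but on a shared machine those lines may just as well belong to another process.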
Re: [gmx-users] unexpexted stop of simulation
Hi,

the reason turned out to be that the lock daemon (lockd) on the NFS server was hanging. The error could be found with dmesg.

BTW: Is it somehow possible to print the kernel error messages that are shown by dmesg to the user from within GROMACS? That would help the user see the reason for the error directly. I'm looking for a function similar to strerror that returns the kernel message, not just the message for the error code (which in this case was just "Input/Output error").

Roland

On Wed, Nov 3, 2010 at 12:05 PM, Carsten Kutzner wrote:
> Hi,
>
> there was also an issue with the locking of the general md.log
> output file which was resolved for 4.5.2. An update might help.
>
> Carsten

--
ORNL/UT Center for Molecular Biophysics cmb.ornl.gov
865-241-1537, ORNL PO BOX 2008 MS6309
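The "Failed to lock" error in this thread concerns an advisory lock on the log file. Here is a minimal sketch of that kind of check, assuming a POSIX system and using Python's `fcntl.lockf`. It is not the code in GROMACS's checkpoint.c, and `try_lock` is a hypothetical helper. It separates a genuine lock conflict (EACCES or EAGAIN) from other failures such as ENOLCK or EIO, which on NFS may point to a problem with the lock service rather than to a second running simulation.

```python
import errno
import fcntl
import os


def try_lock(fd: int):
    """Try to take an exclusive, non-blocking advisory lock on fd.

    Returns (True, None) on success or (False, reason) on failure.
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError as exc:
        code = exc.errno if exc.errno is not None else 0
        name = errno.errorcode.get(code, "UNKNOWN")
        if code in (errno.EACCES, errno.EAGAIN):
            # Another process already holds a conflicting lock.
            hint = "held by another process"
        else:
            # For example ENOLCK or EIO when remote (NFS) locking fails.
            hint = "not a lock conflict; check the file system or lock service"
        return False, f"{name}: {os.strerror(code)} ({hint})"
    return True, None


if __name__ == "__main__":
    # "md_example.log" is a placeholder file name for this sketch.
    with open("md_example.log", "a") as log:
        ok, reason = try_lock(log.fileno())
        print("locked" if ok else f"Failed to lock: {reason}")
```

Reporting the errno name alongside the message would have shown here that the failure was an I/O error rather than a lock held by another mdrun.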
Re: [gmx-users] unexpexted stop of simulation
Hi,

there was also an issue with the locking of the general md.log output file, which was resolved in 4.5.2. An update might help.

Carsten

On Nov 3, 2010, at 3:50 PM, Florian Dommert wrote:
> Perhaps the queueing system of your cluster does not allow running a job
> longer than 24h. Or the default is 24h and you have to supply the
> corresponding information to the submission script.
>
> /Flo
Re: [gmx-users] unexpexted stop of simulation
On 11/03/2010 03:38 PM, Hong, Liang wrote:
> Dear all,
> I'm performing a three-day simulation. It runs well for the first day, but
> stops for the second one. The error message is below. Does anyone know what
> might be the problem? Thanks
> Liang
>
> [...]
> Fatal error:
> Failed to lock: md100ns.log. Already running simulation?
> [...]

Perhaps the queueing system of your cluster does not allow jobs to run longer than 24 h. Or the default limit is 24 h and you have to request a longer run time in the submission script.

/Flo

--
Florian Dommert
Dipl.-Phys.

Institute for Computational Physics
University Stuttgart
Pfaffenwaldring 27
70569 Stuttgart

Phone: +49(0)711/685-6-3613
Fax: +49-(0)711/685-6-3658
EMail: domm...@icp.uni-stuttgart.de
Home: http://www.icp.uni-stuttgart.de/~icp/Florian_Dommert
[gmx-users] unexpexted stop of simulation
Dear all,

I'm performing a three-day simulation. It runs well for the first day, but stops for the second one. The error message is below. Does anyone know what might be the problem?

Thanks
Liang

Program mdrun, VERSION 4.5.1-dev-20101008-e2cbc-dirty
Source code file: /home/z8g/download/gromacs.head/src/gmxlib/checkpoint.c, line: 1748

Fatal error:
Failed to lock: md100ns.log. Already running simulation?
For more information and tips for troubleshooting, please check the GROMACS
website at http://www.gromacs.org/Documentation/Errors
---

"Sitting on a rooftop watching molecules collide" (A Camp)

Error on node 0, will try to stop all the nodes
Halting parallel program mdrun on CPU 0 out of 32

gcq#348: "Sitting on a rooftop watching molecules collide" (A Camp)

--
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--
[node139:04470] [[37327,0],0]-[[37327,1],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
--
mpiexec has exited due to process rank 0 with PID 4471 on
node node139 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).