Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Hi,

I know the pull request has not (yet) been merged, but here is a somewhat "different" output from a single sequential test (automatically) launched without mpirun last night:

[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0]
[lorien:172229] [[39075,0],0] plm:base:receive stop comm

Unfortunately, I didn't get any coredump (???). Is the line:

[lorien:172218] Signal code: Invalid permissions (2)

curious or not?

As usual, here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Will PR #1376 prevent or fix this too?

Thanks again!

Eric

On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,

a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch

the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1 ./a.out), so if applying a patch does not fit your test workflow, it might be easier for you to update it and run mpirun -np 1 ./a.out instead of ./a.out

basically, increasing verbosity runs some extra code, which includes sprintf. so yes, it is possible to crash an app by increasing verbosity and running into a bug that is hidden under normal operation. my intuition suggests this is quite unlikely ... if you can get a core file and a backtrace, we will soon find out

Damn! I did get one but it got erased last night when the automatic process started again... (which erases all directories before starting) :/

I would like to put core files in a user-specific directory, but it seems it has to be a system-wide configuration... :/ I will work around this by changing the "pwd" to a path outside the erased directory...

So as of tonight I should be able to retrieve core files even after I relaunch the process..

Thanks for all the support!

Eric

Cheers,

Gilles

On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok, one test segfaulted *but* I can't tell if it is the *same* bug, because there has been a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt

[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

stdout:
--
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons t
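For context, the "singleton mode" discussed in this thread means starting an MPI executable directly instead of through mpirun; MPI_Init_thread then has to launch a local orted daemon on its own, which is the code path that fails in the logs above (ess_singleton_module.c). Below is a minimal sketch of such a program, assuming nothing about Eric's actual test suite; the file name, requested thread level, and output are illustrative only.

/* singleton_test.c - minimal sketch (illustrative, not Eric's test code).
 * Build:  mpicc -o singleton_test singleton_test.c
 * Run either as a singleton:   ./singleton_test
 * or under mpirun:             mpirun -np 1 ./singleton_test
 * When started without mpirun, Open MPI forks/execs an orted daemon for this
 * lone process inside MPI_Init_thread, which is the path discussed here. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided = 0;

    /* Same entry point that appears in the error log above. */
    if (MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided) != MPI_SUCCESS) {
        fprintf(stderr, "MPI_Init_thread failed\n");
        return 1;
    }

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d (thread support level %d)\n", rank, size, provided);

    MPI_Finalize();
    return 0;
}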
Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...
Eric,

I expect the PR will fix this bug.

The crash occurs after the unexpected process identifier error, and this error should not happen in the first place. So at this stage, I would not worry too much about that crash (to me, it is undefined behavior anyway).

Cheers,

Gilles

On Friday, September 16, 2016, Eric Chamberland <eric.chamberl...@giref.ulaval.ca> wrote:

> Hi,
>
> I know the pull request has not (yet) been merged, but here is a somewhat
> "different" output from a single sequential test (automatically) launched
> without mpirun last night:
>
> [lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
> [lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 1366255883
> [lorien:172229] plm:base:set_hnp_name: final jobfam 39075
> [lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
> [lorien:172229] [[39075,0],0] plm:base:receive start comm
> [lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
> [lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a dynamic spawn
> [lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received unexpected process identifier [[41545,0],0] from [[39075,0],0]
> [lorien:172218] *** Process received signal ***
> [lorien:172218] Signal: Segmentation fault (11)
> [lorien:172218] Signal code: Invalid permissions (2)
> [lorien:172218] Failing at address: 0x2d07e00
> [lorien:172218] [ 0]
> [lorien:172229] [[39075,0],0] plm:base:receive stop comm
>
> Unfortunately, I didn't get any coredump (???). Is the line:
>
> [lorien:172218] Signal code: Invalid permissions (2)
>
> curious or not?
>
> As usual, here are the build logs:
>
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log
> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt
>
> Will PR #1376 prevent or fix this too?
>
> Thanks again!
>
> Eric
>
> On 15/09/16 09:32 AM, Eric Chamberland wrote:
>
>> Hi Gilles,
>>
>> On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:
>>
>>> Eric,
>>>
>>> a bug has been identified, and a patch is available at
>>> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch
>>>
>>> the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
>>> ./a.out), so if applying a patch does not fit your test workflow,
>>> it might be easier for you to update it and run mpirun -np 1 ./a.out
>>> instead of ./a.out
>>>
>>> basically, increasing verbosity runs some extra code, which includes
>>> sprintf.
>>> so yes, it is possible to crash an app by increasing verbosity and
>>> running into a bug that is hidden under normal operation.
>>> my intuition suggests this is quite unlikely ... if you can get a core
>>> file and a backtrace, we will soon find out
>>>
>> Damn! I did get one but it got erased last night when the automatic
>> process started again... (which erases all directories before starting) :/
>>
>> I would like to put core files in a user-specific directory, but it
>> seems it has to be a system-wide configuration... :/ I will work around
>> this by changing the "pwd" to a path outside the erased directory...
>>
>> So as of tonight I should be able to retrieve core files even after I
>> relaunch the process..
>>
>> Thanks for all the support!
>>
>> Eric
>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 9/15/2016 2:58 AM, Eric Chamberland wrote:
>>>
>>> Ok, one test segfaulted *but* I can't tell if it is the *same* bug,
>>> because there has been a segfault:
>>>
>>> stderr:
>>> http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt
>>>
>>> [lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>> [lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename hash 1366255883
>>> [lorien:190552] plm:base:set_hnp_name: final jobfam 53310
>>> [lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
>>> [lorien:190552] [[53310,0],0] plm:base:receive start comm
>>> *** Error in `orted': realloc(): invalid next size: 0x01e58770 ***
>>> ...
>>> ...
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 573
>>> [lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 163
>>> *** An error occurred in MPI_Init_thread
>>> *** on a NULL communicator
>>> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
>>> ***    and potentially your MPI job)
>>> [lorien:190306] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
>>>
>>> stdout:
>>> -
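On the missing core file discussed above: the pattern that controls where core files are written (kernel.core_pattern) is indeed a system-wide setting, but with the default relative pattern the core is written into the crashing process's current working directory, so changing the "pwd" as Eric plans should work. Below is a hedged sketch of what a test wrapper could do before running the tests; prepare_core_dumps is a hypothetical helper name, not part of Eric's harness or of Open MPI, and the default directory is only an example.

/* core_setup.c - hedged sketch (not from Eric's harness) of preparing a test
 * process so a crash leaves a retrievable core file:
 *  - raise the RLIMIT_CORE soft limit, which is often 0 and silently
 *    suppresses core dumps;
 *  - chdir() to a directory that the nightly cleanup does not erase, since
 *    with a relative kernel.core_pattern the core lands in the process's
 *    current working directory. */
#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int prepare_core_dumps(const char *safe_dir)   /* hypothetical helper */
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;   /* raise the soft limit as far as allowed */
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    if (chdir(safe_dir) != 0) {  /* e.g. a path outside the erased test tree */
        perror("chdir");
        return -1;
    }
    return 0;
}

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : "/tmp";  /* example path only */
    return prepare_core_dumps(dir) == 0 ? 0 : 1;
}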