Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-16 Thread Eric Chamberland

Hi,

I know the pull request has not (yet) been merged, but here is a 
somewhat "different" output from a single sequential test 
(automatically) launched without mpirun last night:


[lorien:172229] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[lorien:172229] plm:base:set_hnp_name: initial bias 172229 nodename hash 
1366255883

[lorien:172229] plm:base:set_hnp_name: final jobfam 39075
[lorien:172229] [[39075,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:172229] [[39075,0],0] plm:base:receive start comm
[lorien:172229] [[39075,0],0] plm:base:launch [39075,1] registered
[lorien:172229] [[39075,0],0] plm:base:launch job [39075,1] is not a 
dynamic spawn
[lorien:172218] [[41545,589],0] usock_peer_recv_connect_ack: received 
unexpected process identifier [[41545,0],0] from [[39075,0],0]

[lorien:172218] *** Process received signal ***
[lorien:172218] Signal: Segmentation fault (11)
[lorien:172218] Signal code: Invalid permissions (2)
[lorien:172218] Failing at address: 0x2d07e00
[lorien:172218] [ 0] [lorien:172229] [[39075,0],0] plm:base:receive stop 
comm



Unfortunately, I didn't get any coredump (???)  Is the line:

[lorien:172218] Signal code: Invalid permissions (2)

curious or not?

As usual, here are the build logs:

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_config.log

http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.16.01h16m01s_ompi_info_all.txt

Will PR #1376 prevent or fix this too?

Thanks again!

Eric



On 15/09/16 09:32 AM, Eric Chamberland wrote:

Hi Gilles,

On 15/09/16 03:38 AM, Gilles Gouaillardet wrote:

Eric,


a bug has been identified, and a patch is available at
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi-release/pull/1376.patch



the bug is specific to singleton mode (e.g. ./a.out vs mpirun -np 1
./a.out), so if applying a patch does not fit your test workflow,

it might be easier for you to update the workflow to run mpirun -np 1
./a.out instead of ./a.out.
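
As a side note, here is a minimal sketch of such a singleton-style
program (purely illustrative, not taken from the actual test suite; it
only assumes the tests call MPI_Init_thread, as the error output further
down suggests):

/* singleton_check.c : hypothetical reproducer, for illustration only */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* Same entry point that shows up in the MPI_Init_thread error below. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d (thread support level %d)\n", rank, size, provided);

    MPI_Finalize();
    return 0;
}

Launching the binary directly (./a.out) exercises the singleton code
path, where ORTE starts a local orted daemon behind the scenes; mpirun
-np 1 ./a.out goes through the normal launch path instead.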


Basically, increasing verbosity runs some extra code, which includes
sprintf.
So yes, it is possible to crash an app by increasing verbosity, by
running into a bug that is hidden under normal operation.
My intuition suggests this is quite unlikely... if you can get a core
file and a backtrace, we will soon find out.
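
To illustrate the kind of latent bug I mean (a contrived example, not
related to the actual OMPI sources): a sprintf that only executes on the
verbose path can hide a buffer overflow until verbosity is turned on:

/* verbosity_bug.c : contrived illustration of a bug hidden at normal verbosity */
#include <stdio.h>

static char name[8] = "app";            /* too small for the verbose message */

static void log_verbose(int level, int jobid)
{
    if (level > 0) {
        /* This sprintf only runs at high verbosity, so the overflow
           stays hidden during normal operation. */
        sprintf(name, "job-%d-daemon", jobid);   /* overflows name[8] */
    }
}

int main(void)
{
    log_verbose(0, 12345);   /* normal run: the buggy line never executes */
    log_verbose(1, 12345);   /* verbose run: undefined behavior, possible crash */
    printf("%s\n", name);
    return 0;
}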


Damn! I did get one, but it got erased last night when the automatic
process started again... (which erases all directories before starting) :/

I would like to put core files in a user-specific directory, but it
seems that has to be a system-wide configuration... :/  I will work
around this by changing the "pwd" to a path outside the erased
directories...

So as of tonight I should be able to retrieve core files even after the
process has been relaunched..
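
A quick sketch of that idea, with a made-up path: since by default the
kernel writes the core file into the crashing process's current working
directory (assuming the default relative core_pattern), changing
directory very early keeps the dump out of the tree that gets wiped:

/* keep_core.c : hypothetical illustration of the "change the pwd" trick */
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    /* Example path only: any directory that survives the nightly cleanup. */
    if (chdir("/tmp/ompi_cores") != 0) {
        perror("chdir");
    }

    /* ... run the usual test code here; a crash now leaves its core file
       in /tmp/ompi_cores instead of the erased test directory ... */
    return 0;
}

In practice I will simply launch the tests from such a directory rather
than patching the code, but the effect is the same.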

Thanks for all the support!

Eric



Cheers,

Gilles



On 9/15/2016 2:58 AM, Eric Chamberland wrote:

Ok,

one test segfaulted *but* I can't tell if it is the *same* bug, because
this time there was a segfault:

stderr:
http://www.giref.ulaval.ca/~cmpgiref/dernier_ompi/2016.09.14.10h38m52s.faultyCerr.Triangle.h_cte_1.txt



[lorien:190552] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh
path NULL
[lorien:190552] plm:base:set_hnp_name: initial bias 190552 nodename
hash 1366255883
[lorien:190552] plm:base:set_hnp_name: final jobfam 53310
[lorien:190552] [[53310,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[lorien:190552] [[53310,0],0] plm:base:receive start comm
*** Error in `orted': realloc(): invalid next size: 0x01e58770
***
...
...
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 573
[lorien:190306] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a
daemon on the local node in file ess_singleton_module.c at line 163
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[lorien:190306] Local abort before MPI_INIT completed completed
successfully, but am not able to aggregate error messages, and not
able to guarantee that all other processes were killed!

stdout:

--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127)
instead of ORTE_SUCCESS
--------------------------------------------------------------------------

--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons t

Re: [OMPI devel] OpenMPI 2.x: bug: violent break at beginning with (sequential) runs...

2016-09-16 Thread Gilles Gouaillardet
Eric,

I expect the PR will fix this bug.
The crash occurs after the unexpected process identifier error, and that
error should not happen in the first place. So at this stage, I would not
worry too much about the crash (to me, it is undefined behavior anyway).

Cheers,

Gilles
