Ralph,

The changeset avoids SIGSEGV by calling mpi_abort before bad things happen.
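
To make the failure mode concrete, here is a minimal sketch of the kind of guard involved. It is a toy model only: toy_obj_t, toy_new and toy_release are hypothetical names made up for illustration, not the Open MPI object system and not the code in the changeset. It assumes the crash comes from dropping the last reference to an object that was never heap-allocated, as described for MPI_COMM_WORLD in the quoted message below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy reference-counted object; NOT the real Open MPI object type. */
typedef struct {
    int  refcount;
    bool on_heap;   /* true only if the object was created by toy_new() */
} toy_obj_t;

/* Heap allocation, analogous in spirit to OBJ_NEW. */
static toy_obj_t *toy_new(void)
{
    toy_obj_t *o = malloc(sizeof(*o));
    if (o != NULL) {
        o->refcount = 1;
        o->on_heap  = true;
    }
    return o;
}

/* Drop one reference.
 * Returns 0 on success, or -1 if this call would free an object that
 * was not heap-allocated. A real guard could print an error and abort
 * at that point instead of returning. */
static int toy_release(toy_obj_t *o)
{
    assert(o != NULL && o->refcount > 0);
    if (o->refcount == 1 && !o->on_heap) {
        fprintf(stderr, "refusing to free a statically allocated object\n");
        return -1;
    }
    if (--o->refcount == 0) {
        free(o);
    }
    return 0;
}
```

Such a guard only turns the crash into a clean error; it does not explain why the extra release happens in the first place.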

The attached patch seems to fix the problem (and makes the changeset more or 
less unnecessary).
Once again, the patch has been only lightly tested and might break other parts 
of coll/ml.

Cheers,

Gilles

Ralph Castain <r...@open-mpi.org> wrote:
>Usually we have trouble with coll/ml because the process locality isn't being 
>reported sufficiently for its needs. Given the recent change in data exchange, 
>I suspect that is the root cause here - I have a note to Nathan asking for 
>clarification of the coll/ml locality requirement.
>
>Did this patch "fix" the problem by avoiding the segfault due to coll/ml 
>disqualifying itself? Or did it make everything work okay again?
>
>
>On Sep 1, 2014, at 3:16 AM, Gilles Gouaillardet 
><gilles.gouaillar...@iferc.org> wrote:
>
>> Folks,
>> 
>> mtt recently failed a bunch of times with the trunk.
>> a good suspect is the collective/ibarrier test from the ibm test suite.
>> 
>> most of the time, CHECK_AND_RECYCLE will fail
>> /* IS_COLL_SYNCMEM(coll_op) is true */
>> 
>> with this test case, we just get a glorious SIGSEGV since OBJ_RELEASE is
>> called on MPI_COMM_WORLD (which has *not* been allocated with OBJ_NEW)
>> 
>> i committed r32659 in order to:
>> - display an error message
>> - abort if the communicator is an intrinsic one
>> 
>> with attached modified version of the ibarrier test, i always get an
>> error on task 0 when invoked with
>> mpirun -np 2 -host node0,node1 --mca btl tcp,self ./ibarrier
>> 
>> the modified version adds some sleep(1) in order to work around the race
>> condition and get a reproducible crash
>> 
>> i tried to dig and could not find a correct way to fix this.
>> that being said, i tried the attached ml.patch and it did fix the
>> problem (even with NREQS=1024)
>> i did not commit it since this is very likely incorrect.
>> 
>> could someone have a look ?
>> 
>> Cheers,
>> 
>> Gilles
>> <ibarrier.c><ml.patch>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/09/15767.php
>