Folks,

MTT recently failed a number of times on the trunk.
A likely culprit is the collective/ibarrier test from the IBM test suite.

Most of the time, CHECK_AND_RECYCLE fails when IS_COLL_SYNCMEM(coll_op)
is true.

With this test case, we simply get a glorious SIGSEGV, because OBJ_RELEASE
is called on MPI_COMM_WORLD (which has *not* been allocated with OBJ_NEW).

I committed r32659 to:
- display an error message, and
- abort if the communicator is an intrinsic one.

With the attached modified version of the ibarrier test, I always get an
error on task 0 when it is invoked with
mpirun -np 2 -host node0,node1 --mca btl tcp,self ./ibarrier

The modified version adds some sleep() calls so that the race condition
is hit reliably and the crash is reproducible.

I tried to dig into this but could not find a correct fix.
That being said, I tried the attached ml.patch and it did fix the
problem (even with NREQS=1024).
I did not commit it, since it is very likely incorrect.

Could someone have a look?

Cheers,

Gilles
/*
 * $HEADER$
 */
/****************************************************************************

 MESSAGE PASSING INTERFACE TEST CASE SUITE

 Copyright IBM Corp. 1995

 IBM Corp. hereby grants a non-exclusive license to use, copy, modify, and
 distribute this software for any purpose and without fee provided that the
 above copyright notice and the following paragraphs appear in all copies.

 IBM Corp. makes no representation that the test cases comprising this
 suite are correct or are an accurate representation of any standard.

 In no event shall IBM be liable to any party for direct, indirect, special
 incidental, or consequential damage arising out of the use of this software
 even if IBM Corp. has been advised of the possibility of such damage.

 IBM CORP. SPECIFICALLY DISCLAIMS ANY WARRANTIES INCLUDING, BUT NOT LIMITED
 TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 PURPOSE.  THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS AND IBM
 CORP. HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
 ENHANCEMENTS, OR MODIFICATIONS.

****************************************************************************

 These test cases reflect an interpretation of the MPI Standard.  They are,
 in most cases, unit tests of specific MPI behaviors.  If a user of any
 test case from this set believes that the MPI Standard requires behavior
 different than that implied by the test case we would appreciate feedback.

 Comments may be sent to:
    Richard Treumann
    treum...@kgn.ibm.com

****************************************************************************
*/
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#include <mpi.h>

#include "ompitest_error.h"

#ifndef NREQS
#define NREQS 16
#endif


int main(int argc, char** argv)
{
    int i, rank;
    MPI_Request req[NREQS];
    MPI_Comm comm;

    MPI_Init(&argc,&argv);

    ompitest_check_size(__FILE__, __LINE__, 2, 1);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Comm_dup(MPI_COMM_WORLD, &comm);

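    MPI_Barrier(MPI_COMM_WORLD);
    /* Delay the non-zero ranks (added to the original test) so that rank 0
       posts its barriers first and the race is hit reliably */
    if (rank > 0) sleep(2);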

    /* Do a bunch of barriers */
    for (i = 0; i < NREQS; ++i) {
        MPI_Ibarrier(comm, &req[i]);
    }
    MPI_Waitall(NREQS, req, MPI_STATUSES_IGNORE);
    if (rank > 0) sleep(2);
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
Index: ompi/mca/coll/ml/coll_ml_inlines.h
===================================================================
--- ompi/mca/coll/ml/coll_ml_inlines.h  (revision 32658)
+++ ompi/mca/coll/ml/coll_ml_inlines.h  (working copy)
@@ -192,7 +192,7 @@
                 !out_of_resource) {
                 */
         if (((&coll_op->full_message != coll_op->fragment_data.message_descriptor) &&
-            !out_of_resource) || IS_COLL_SYNCMEM(coll_op)) {
+            !out_of_resource)) {
             /* non-zero offset ==> this is not fragment 0 */
             CHECK_AND_RECYCLE(coll_op);
         }
