We are seeing some fork issues with a simple MPI program (attached) running on
2.6.16+ kernels and OFED 1.1. We have tried both Intel MPI and mvapich2 with
the same results:

t_fork> mpiexec -n 2 t_system_fork
parent process
[0] started child process with pid=31552
send desc error
parent process
[0] Abort: [] Got completion with error 1, vendor code=69, dest rank=1
 at line 540 in file ibv_channel_manager.c
[1] I am child process with pid=25437
[1] started child process with pid=25437
[0] I am child process with pid=31552
child process
[1] finished pid=25437
child process
[0] finished pid=31552

rank 0 in job 2  svlmpicl400_32925   caused collective abort of all ranks
  exit status of rank 0: return code 252

If you run mvapich2 over uDAPL, it hangs before the second MPI_Barrier(), just
as Intel MPI does. If you use the I_MPI_RDMA_USE_EVD_FALLBACK=1 option with
Intel MPI, you get the following error, similar to mvapich2:

parent process
parent process
[0] I am child process with pid=9596
[0] started child process with pid=9596
[1] I am child process with pid=11477
[1] started child process with pid=11477
[0][rdma_iba.c:1007] Intel MPI fatal error: DTO operation completed with error. status=0x2.
cookie=0x1
[1][rdma_iba.c:1007] Intel MPI fatal error: DTO operation completed with error. status=0x2.
cookie=0x1
child process
[1] finished pid=11477
child process
[0] finished pid=9596
rank 0 in job 8  cst-19_54707   caused collective abort of all ranks
  exit status of rank 0: return code 255

Any insight would be greatly appreciated. Our assumption was that the parent
process can continue to use IB resources across a fork() after the fixes that
went into 2.6.16 and OFED 1.1. Is this true?
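
In case it matters, we never call ibv_fork_init() ourselves. If an explicit
opt-in is still required on top of the kernel fixes, here is a minimal sketch
of what we would try, assuming ibv_fork_init() from libibverbs is the intended
mechanism and that it has to run before any memory is registered (i.e., before
MPI_Init()):

#include <infiniband/verbs.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    /* Assumption: ibv_fork_init() marks registered regions with
     * madvise(MADV_DONTFORK) so the parent can keep using them after
     * fork(). It must run before the first ibv_reg_mr(), hence before
     * MPI_Init(). Returns 0 on success. */
    if (ibv_fork_init() != 0)
    {
        fprintf(stderr, "ibv_fork_init failed\n");
        exit(1);
    }

    MPI_Init(&argc, &argv);
    /* ... same fork()/system() test as attached ... */
    MPI_Finalize();
    return 0;
}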

Thanks,

-arlin
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>   /* pid_t */
#include <unistd.h>      /* fork(), getpid() */

int main(int argc, char *argv[])
{
    int   myid, numprocs;
    pid_t pid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    MPI_Barrier(MPI_COMM_WORLD);

    /* system() forks internally, so this already exercises fork()
     * while IB memory is registered. */
    system("echo parent process");

    pid = fork();
    if (pid < 0)
    {
        perror("fork");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    if (pid == 0)
    {
        /* Child: never touches MPI. */
        pid = getpid();
        printf("[%d] I am child process with pid=%d\n", myid, pid);
        system("echo child process");
    }
    else
    {
        /* Parent: keeps using MPI after the fork. */
        printf("[%d] started child process with pid=%d\n", myid, pid);
        MPI_Barrier(MPI_COMM_WORLD);   /* this is the barrier that hangs */
        MPI_Finalize();
        pid = getpid();
    }

    printf("[%d] finished pid=%d\n", myid, pid);
    return 0;
}
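
(For completeness, a variant we have not tried, in case the child returning
through main() is part of the problem. The idea, which is our assumption
rather than anything documented, is to leave the child via _exit() so it
skips stdio/atexit cleanup that might touch state inherited from the MPI
parent:)

    if (pid == 0)
    {
        pid = getpid();
        printf("[%d] I am child process with pid=%d\n", myid, pid);
        system("echo child process");
        fflush(stdout);   /* flush manually; _exit() skips stdio cleanup */
        _exit(0);         /* never return through main() in the child */
    }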