Have you tried the IMB benchmark with Bcast? I think the problem is in the app. All ranks in the communicator should enter Bcast; with an if (rank == 0) ... else split it is easy for not all of them to enter the same flow. Something like this keeps the collective calls matched on every rank:

    if (iRank == 0) {
        iLength = sizeof (acMessage);
        MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast (acMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
        printf ("Process 0: Message sent\n");
    } else {
        MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
        pMessage = (char *) malloc (iLength);
        MPI_Bcast (pMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
        printf ("Process %d: %s\n", iRank, pMessage);
    }
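For reference, a minimal self-contained sketch of that pattern (the main()/MPI_Init scaffolding and the message text are assumed here, not taken from the original post; the free() at the end addresses the receive-buffer leak Eugene mentions below):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main (int argc, char **argv)
    {
        int iRank, iLength;
        char acMessage[] = "Hello from rank 0";  /* assumed payload */
        char *pMessage = NULL;

        MPI_Init (&argc, &argv);
        MPI_Comm_rank (MPI_COMM_WORLD, &iRank);

        if (iRank == 0) {
            /* Root broadcasts the length first, then the message itself. */
            iLength = sizeof (acMessage);
            MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
            MPI_Bcast (acMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
            printf ("Process 0: Message sent\n");
        } else {
            /* Non-root ranks enter the same two Bcast calls, in the same
               order, so every rank participates in every collective. */
            MPI_Bcast (&iLength, 1, MPI_INT, 0, MPI_COMM_WORLD);
            pMessage = (char *) malloc (iLength);
            MPI_Bcast (pMessage, iLength, MPI_CHAR, 0, MPI_COMM_WORLD);
            printf ("Process %d: %s\n", iRank, pMessage);
            free (pMessage);  /* avoid leaking the receive buffer */
        }

        MPI_Finalize ();
        return 0;
    }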
Lenny.

On Mon, Jan 4, 2010 at 8:23 AM, Eugene Loh <eugene....@sun.com> wrote:
> If you're willing to try some stuff:
>
> 1) What about "-mca coll_sync_barrier_before 100"? (The default may be
> 1000, so you can try various values less than 1000. I'm suggesting 100.)
> Note that broadcast has a somewhat one-way traffic flow, which can have
> some undesirable flow-control issues.
>
> 2) What about "-mca btl_sm_num_fifos 16"? The default is 1. If the
> problem is trac ticket 2043, then this suggestion can help.
>
> P.S. There's a memory leak, right? The receive buffer is being allocated
> over and over again. It might not be that closely related to the problem
> you see here, but at a minimum it's bad style.
>
> Louis Rossi wrote:
>
> I am having a problem with Bcast hanging on a dual quad-core Opteron
> (2382, 2.6 GHz, 4 x 512 KB L2, 6 MB L3 cache) system running FC11 with
> openmpi-1.4. The LD_LIBRARY_PATH and PATH variables are correctly set. I
> have used the FC11 rpm distribution of openmpi and also built openmpi-1.4
> locally, with the same results. The problem was first observed in a
> larger, reliable CFD code, but I can reproduce it with a simple demo code
> (attached). The code attempts to execute 2000 pairs of broadcasts.
>
> The hostfile contains a single line:
>
>     <machinename> slots=8
>
> If I run it with 4 cores or fewer, the code runs fine. If I run it with
> 5 cores or more, it will hang some of the time after successfully
> executing several hundred broadcasts. The number varies from run to run.
> The code usually finishes with 5 cores. The probability of hanging seems
> to increase with the number of processes. The syntax I use is simple:
>
>     mpiexec -machinefile hostfile -np 5 bcast_example
>
> There was some discussion of a similar problem on the user list, but I
> could not find a resolution. I have tried setting the processor affinity
> (--mca mpi_paffinity_alone 1). I have tried varying the broadcast
> algorithm (--mca coll_tuned_bcast_algorithm 1-6). I have also tried
> excluding my eth1 interface (-mca oob_tcp_if_exclude; see the attached
> ifconfig.txt), which is not connected to anything. None of these changed
> the outcome.
>
> Any thoughts or suggestions would be appreciated.
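For anyone trying Eugene's two suggestions against Louis's test case, the combined command lines would look something like this (a sketch only; the values are simply the ones suggested above, and each flag can also be tried on its own):

    mpiexec -machinefile hostfile -np 5 -mca coll_sync_barrier_before 100 bcast_example
    mpiexec -machinefile hostfile -np 5 -mca btl_sm_num_fifos 16 bcast_example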