Hi,

Sorry it took so long to respond; recompiling everything across the
cluster took a while. Without the --with-threads configure flag, things
work a little better: the limit is still there and the segfault is the
same, but it now kicks in around 21,000,000 characters instead of
16,000,000.
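In the meantime, the workaround I'm considering is to chunk the
broadcast so that no single MPI_Bcast call exceeds the size where the
crash appears. This is an untested sketch; CHUNK is an arbitrary value
safely below the observed limit, not a tuned number:

#include "mpi.h"

#define CHUNK 8000000  /* bytes per MPI_Bcast; well below the ~16 MB limit */

/* Broadcast a large buffer in fixed-size pieces. Returns the first
 * MPI error code encountered, or MPI_SUCCESS. */
static int bcast_chunked(char *buf, int len, int root, MPI_Comm comm)
{
    int offset, n, err;
    for (offset = 0; offset < len; offset += CHUNK) {
        n = (len - offset < CHUNK) ? (len - offset) : CHUNK;
        err = MPI_Bcast(buf + offset, n, MPI_BYTE, root, comm);
        if (err != MPI_SUCCESS)
            return err;
    }
    return MPI_SUCCESS;
}

Calling bcast_chunked(greeting, len, 0, MPI_COMM_WORLD) in place of the
single MPI_Bcast in the test program quoted below would be the drop-in
usage.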
Any ideas?

-James

On Wed, Sep 2, 2009 at 12:55 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> Can you try without the --with-threads configure argument?
>
> On Aug 28, 2009, at 11:48 PM, James Gao wrote:
>
>> Hi everyone, I've been having a pretty odd issue with Slurm and
>> Open MPI the last few days. I just set up a heterogeneous Slurm
>> cluster consisting of 32-bit P4 machines and a few new 64-bit i7
>> machines, all running the latest version of Ubuntu Linux. I compiled
>> the latest Open MPI 1.3.3 with the flags
>>
>> ./configure --enable-heterogeneous --with-threads --with-slurm \
>>     --with-memory-manager --with-openib --without-udapl \
>>     --disable-openib-ibcm
>>
>> I also wrote a trivial test program:
>>
>> #include "mpi.h"
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> #define LEN 12000000
>>
>> int main(int argc, char *argv[]) {
>>     int rank, i, len = LEN;
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>
>>     if (argc > 1) len = atoi(argv[1]);
>>     printf("Size: %d, ", len);
>>     char *greeting = malloc(sizeof(char) * len);
>>
>>     if (rank == 0) {
>>         for (i = 0; i < len - 1; i++)
>>             greeting[i] = ' ';
>>         greeting[len - 1] = '\0';
>>     }
>>     MPI_Bcast(greeting, len, MPI_BYTE, 0, MPI_COMM_WORLD);
>>     printf("rank: %d\n", rank);
>>
>>     MPI_Finalize();
>>     free(greeting);
>>     return 0;
>> }
>>
>> I run this with "salloc -n 28 mpirun -n 28 mpitest" on my Slurm
>> cluster. At 12,000,000 characters, this command works exactly as
>> expected, no issues at all. However, beyond a certain critical limit
>> somewhere around 16,000,000 characters, the program consistently
>> segfaults with this error message:
>>
>> salloc -n 28 -p all mpiexec -n 28 mpitest 16500000
>> salloc: Granted job allocation 234
>> [ibogaine:24883] *** Process received signal ***
>> [ibogaine:24883] Signal: Segmentation fault (11)
>> [ibogaine:24883] Signal code: Address not mapped (1)
>> [ibogaine:24883] Failing at address: 0x101a60f58
>> [ibogaine:24883] [ 0] /lib/libpthread.so.0 [0x7f6c00405080]
>> [ibogaine:24883] [ 1] /usr/local/lib/openmpi/mca_pml_ob1.so [0x7f6bfd9dff68]
>> [ibogaine:24883] [ 2] /usr/local/lib/openmpi/mca_btl_tcp.so [0x7f6bfcf3ec7c]
>> [ibogaine:24883] [ 3] /usr/local/lib/libopen-pal.so.0 [0x7f6c00ed5ee8]
>> [ibogaine:24883] [ 4] /usr/local/lib/libopen-pal.so.0(opal_progress+0xa1) [0x7f6c00eca7b1]
>> [ibogaine:24883] [ 5] /usr/local/lib/libmpi.so.0 [0x7f6c013a185d]
>> [ibogaine:24883] [ 6] /usr/local/lib/openmpi/mca_coll_tuned.so [0x7f6bfc10c29c]
>> [ibogaine:24883] [ 7] /usr/local/lib/openmpi/mca_coll_tuned.so [0x7f6bfc10c9eb]
>> [ibogaine:24883] [ 8] /usr/local/lib/openmpi/mca_coll_tuned.so [0x7f6bfc10295c]
>> [ibogaine:24883] [ 9] /usr/local/lib/openmpi/mca_coll_sync.so [0x7f6bfc31b35a]
>> [ibogaine:24883] [10] /usr/local/lib/libmpi.so.0(MPI_Bcast+0xa3) [0x7f6c013b78c3]
>> [ibogaine:24883] [11] mpitest(main+0xd4) [0x400bc0]
>> [ibogaine:24883] [12] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6c000a25a6]
>> [ibogaine:24883] [13] mpitest [0x400a29]
>> [ibogaine:24883] *** End of error message ***
>>
>> As far as I can tell, the segfault occurs on the root node doing the
>> broadcast, and it only occurs when I send across heterogeneous
>> sections. If I communicate only between homogeneous subsets of the
>> cluster, I can go as far as 120,000,000 characters without issue.
>> Across the heterogeneous cluster, however, a hard "limit" kicks in
>> somewhere just under 16,000,000 characters. Any ideas?
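One more thought on the observation quoted above that homogeneous
subsets handle 120,000,000 characters fine: another untested sketch
would be a two-stage broadcast that keeps the large collective inside
homogeneous groups. The root sends the buffer point-to-point to one
leader per foreign architecture group, and each group then does an
MPI_Bcast over its own homogeneous sub-communicator. Here arch_key() is
a hypothetical helper; anything that returns the same integer on
machines of the same architecture would do. Whether a large MPI_Send
survives the heterogeneous hop is exactly the open question, so this is
only something to try, not a fix.

#include "mpi.h"
#include <stdlib.h>
#include <sys/utsname.h>

/* Hypothetical helper: hash the machine string from uname (e.g.
 * "i686" vs. "x86_64") so ranks on like architectures share a color. */
static int arch_key(void)
{
    struct utsname u;
    unsigned h = 0;
    const char *p;
    uname(&u);
    for (p = u.machine; *p; p++)
        h = h * 31 + (unsigned char)*p;
    return (int)(h & 0x7fffffff);
}

/* Leader of color c: the root if it has that color, otherwise the
 * lowest rank with that color. */
static int leader_of(int c, const int *colors, int size, int root)
{
    int r;
    if (colors[root] == c)
        return root;
    for (r = 0; r < size; r++)
        if (colors[r] == c)
            return r;
    return -1;  /* not reached: c always comes from some rank */
}

static int bcast_two_stage(char *buf, int len, int root, MPI_Comm comm)
{
    int rank, size, r, color, my_leader, *colors;
    MPI_Comm sub;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    color = arch_key();

    /* Every rank learns every other rank's architecture color. */
    colors = malloc(size * sizeof(int));
    MPI_Allgather(&color, 1, MPI_INT, colors, 1, MPI_INT, comm);
    my_leader = leader_of(color, colors, size, root);

    /* Stage 1: root -> one leader per foreign group, point-to-point.
     * This hop is still heterogeneous. */
    if (rank == root) {
        for (r = 0; r < size; r++)
            if (r != root && leader_of(colors[r], colors, size, root) == r)
                MPI_Send(buf, len, MPI_BYTE, r, 0, comm);
    } else if (rank == my_leader) {
        MPI_Recv(buf, len, MPI_BYTE, root, 0, comm, MPI_STATUS_IGNORE);
    }

    /* Stage 2: homogeneous broadcast within each group; the key puts
     * the leader at rank 0 of its sub-communicator. */
    MPI_Comm_split(comm, color, rank == my_leader ? 0 : rank + 1, &sub);
    MPI_Bcast(buf, len, MPI_BYTE, 0, sub);

    MPI_Comm_free(&sub);
    free(colors);
    return MPI_SUCCESS;
}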