Re: [OMPI users] MPI_Allgather problem
What version did you upgrade to? (We don't control the Ubuntu packaging.) I see a bullet in the soon-to-be-released 1.4.5 release notes:

- Fix obscure cases where MPI_ALLGATHER could crash. Thanks to Andrew Senin for reporting the problem.

But it would be surprising if this is what fixed your issue, especially since it's not released yet. :-)

On Jan 26, 2012, at 5:24 AM, Brett Tully wrote:

> As of two days ago, this problem has disappeared and the tests that I had written and run each night are now passing. Having looked through the update log of my machine (Ubuntu 11.10), it appears as though I got a new version of mpi-default-dev (0.6ubuntu1). I would like to understand this problem in more detail -- is it possible to see what changed in this update?
> Thanks,
> Brett.
>
> On Fri, Dec 9, 2011 at 6:43 PM, teng ma wrote:
>
> I guess your output is from different ranks. You can add rank info inside the print to tell them apart, like this:
>
> (void) printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);
>
> From my side, I did not see anything wrong with your code in Open MPI 1.4.3. After I added the rank, the output is:
> rank 5: gathered[0].node = 0
> rank 5: gathered[1].node = 1
> rank 5: gathered[2].node = 2
> rank 5: gathered[3].node = 3
> rank 5: gathered[4].node = 4
> rank 5: gathered[5].node = 5
> rank 3: gathered[0].node = 0
> rank 3: gathered[1].node = 1
> rank 3: gathered[2].node = 2
> rank 3: gathered[3].node = 3
> rank 3: gathered[4].node = 4
> rank 3: gathered[5].node = 5
> rank 1: gathered[0].node = 0
> rank 1: gathered[1].node = 1
> rank 1: gathered[2].node = 2
> rank 1: gathered[3].node = 3
> rank 1: gathered[4].node = 4
> rank 1: gathered[5].node = 5
> rank 0: gathered[0].node = 0
> rank 0: gathered[1].node = 1
> rank 0: gathered[2].node = 2
> rank 0: gathered[3].node = 3
> rank 0: gathered[4].node = 4
> rank 0: gathered[5].node = 5
> rank 4: gathered[0].node = 0
> rank 4: gathered[1].node = 1
> rank 4: gathered[2].node = 2
> rank 4: gathered[3].node = 3
> rank 4: gathered[4].node = 4
> rank 4: gathered[5].node = 5
> rank 2: gathered[0].node = 0
> rank 2: gathered[1].node = 1
> rank 2: gathered[2].node = 2
> rank 2: gathered[3].node = 3
> rank 2: gathered[4].node = 4
> rank 2: gathered[5].node = 5
>
> Is that what you expected?
>
> On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully wrote:
>
> Dear all,
>
> I have not used OpenMPI much before, but am maintaining a large legacy application. We noticed a bug to do with a call to MPI_Allgather, as summarised in this post to Stackoverflow:
> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>
> In the process of looking further into the problem, I noticed that the following function results in strange behaviour.
>
> void test_all_gather() {
>
>     struct _TEST_ALL_GATHER {
>         int node;
>     };
>
>     int ierr, size, rank;
>     ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
>     ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     struct _TEST_ALL_GATHER local;
>     struct _TEST_ALL_GATHER *gathered;
>
>     gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));
>
>     local.node = rank;
>
>     MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>         gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE, MPI_COMM_WORLD);
>
>     int i;
>     for (i = 0; i < numnodes; ++i) {
>         (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
>     }
>
>     FREE(gathered);
> }
>
> At one point, this function printed the following:
> gathered[0].node = 2
> gathered[1].node = 3
> gathered[2].node = 2
> gathered[3].node = 3
> gathered[4].node = 4
> gathered[5].node = 5
>
> Can anyone suggest a place to start looking into why this might be happening? There is a section of the code that calls MPI_Comm_split, but I am not sure if that is related...
>
> Running on Ubuntu 11.10 and a summary of ompi_info:
> Package: Open MPI buildd@allspice Distribution
> Open MPI: 1.4.3
> Open MPI SVN revision: r23834
> Open MPI release date: Oct 05, 2010
> Open RTE: 1.4.3
> Open RTE SVN revision: r23834
> Open RTE release date: Oct 05, 2010
> OPAL: 1.4.3
> OPAL SVN revision: r23834
> OPAL release date: Oct 05, 2010
> Ident string: 1.4.3
> Prefix: /usr
> Configured architecture: x86_64-pc-linux-gnu
> Configure host: allspice
> Configured by: buildd
>
> Thanks!
> Brett
>
> --
> | Teng Ma              Univ. of Tennessee |
> | t...@cs.utk.edu      Knoxville, TN      |
> | http://web.eecs.utk.edu/~tma/           |
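For anyone trying to reproduce this, below is a self-contained sketch of the test above that compiles on its own. The fragment as posted leaves numnodes and FREE undefined; here they are assumed to be the communicator size and the plain libc free(), and the printf carries the rank prefix teng ma suggested:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    struct test_entry {
        int node;
    };

    int main(int argc, char **argv)
    {
        int i, size, rank;
        struct test_entry local, *gathered;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        gathered = malloc(size * sizeof(*gathered));
        local.node = rank;

        /* Each rank contributes sizeof(struct) raw bytes, exactly as in
           the original fragment. */
        MPI_Allgather(&local, sizeof(local), MPI_BYTE,
                      gathered, sizeof(local), MPI_BYTE, MPI_COMM_WORLD);

        /* Print with a rank prefix so interleaved output from different
           ranks can be told apart. */
        for (i = 0; i < size; ++i)
            printf("rank %d: gathered[%d].node = %d\n",
                   rank, i, gathered[i].node);

        free(gathered);
        MPI_Finalize();
        return 0;
    }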
Re: [OMPI users] MPI_AllGather null terminator character
I'm not sure what you're asking. The entire contents of hostname[] will be sent -- from position 0 to position (MAX_STRING_LEN-1). If there's a \0 in there, it will be sent. If the \0 occurs after that, then it won't.

Be aware that gethostname(buf, size) will not put a \0 in the buffer if the hostname is exactly "size" bytes. So you might want to double check that your gethostname() is returning a \0-terminated string.

Does that make sense? Here's a sample I wrote to verify this:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

#define MAX_LEN 64

static void where_null(char *ptr, int len, int rank)
{
    int i;
    for (i = 0; i < len; ++i) {
        if ('\0' == ptr[i]) {
            printf("Rank %d: Null found at position %d (string: %s)\n",
                   rank, i, ptr);
            return;
        }
    }
    printf("Rank %d: Null not found! (string: ", rank);
    for (i = 0; i < len; ++i) putc(ptr[i], stdout);
    putc('\n', stdout);
}

int main()
{
    int i;
    char hostname[MAX_LEN];
    char *hostname_recv_buf;
    int rank, size;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    gethostname(hostname, MAX_LEN - 1);
    where_null(hostname, MAX_LEN, rank);

    hostname_recv_buf = calloc(size * (MAX_LEN), (sizeof(char)));
    MPI_Allgather(hostname, MAX_LEN, MPI_CHAR,
                  hostname_recv_buf, MAX_LEN, MPI_CHAR, MPI_COMM_WORLD);
    for (i = 0; i < size; ++i) {
        where_null(hostname_recv_buf + i * MAX_LEN, MAX_LEN, rank);
    }

    MPI_Finalize();
    return 0;
}

On Jan 13, 2012, at 2:32 AM, Gabriele Fatigati wrote:

> Dear OpenMPI,
>
> using MPI_Allgather with the MPI_CHAR type, I have a doubt about the null-terminated character. Imagine I want to gather the node names my program is running on:
>
> char hostname[MAX_LEN];
>
> char* hostname_recv_buf = (char*)calloc(num_procs*(MAX_STRING_LEN), (sizeof(char)));
>
> MPI_Allgather(hostname, MAX_STRING_LEN, MPI_CHAR, hostname_recv_buf, MAX_STRING_LEN, MPI_CHAR, MPI_COMM_WORLD);
>
> Now, is the null-terminated character of each local string included? Or do I have to send and receive MAX_STRING_LEN+1 elements in MPI_Allgather?
>
> Using Valgrind, a subsequent simple strcmp:
>
> for (i = 0; i < num_procs; i++) {
>     if (strcmp(&hostname_recv_buf[MAX_STRING_LEN*i], local_hostname) == 0) {
>         ... doing something
>     }
> }
>
> raises a warning:
>
> Conditional jump or move depends on uninitialised value(s)
> ==19931==    at 0x4A06E5C: strcmp (mc_replace_strmem.c:412)
>
> The same warning is not present if I use MAX_STRING_LEN+1 in MPI_Allgather.
>
> Thanks in advance.
>
> --
> Ing. Gabriele Fatigati
> HPC specialist
> SuperComputing Applications and Innovation Department
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> www.cineca.it    Tel: +39 051 6171722
> g.fatigati [AT] cineca.it

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
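A small safeguard that follows from the advice above (a sketch, not part of the original reply, shown standalone without the MPI calls): terminate the buffer explicitly before gathering, so each fixed-size slot sent through MPI_Allgather is a valid C string regardless of the hostname length.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MAX_LEN 64

    int main(void)
    {
        char hostname[MAX_LEN];

        /* gethostname() may leave the buffer unterminated when the name
         * is as long as (or longer than) the size passed in, so write
         * the terminator ourselves; every MAX_LEN-byte slot gathered is
         * then safe for strcmp() on the receive side. */
        gethostname(hostname, sizeof(hostname) - 1);
        hostname[sizeof(hostname) - 1] = '\0';

        printf("hostname: %s (len %zu)\n", hostname, strlen(hostname));
        return 0;
    }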
Re: [OMPI users] OpenMPI: How many connections?
On Jan 24, 2012, at 5:34 PM, devendra rai wrote:

> I am trying to find out how many separate connections are opened by MPI as messages are sent. Basically, I have threaded-MPI calls to a bunch of different MPI processes (who, in turn, have threaded MPI calls).
>
> The point is, with every thread added, are new ports opened (even if the sender-receiver pairs already have a connection between them)?

In Open MPI: no. The underlying connections are independent of how many threads you have.

> Is there any way to find out? I went through the MPI APIs, and the closest thing I found was related to cartographic information. This is not sufficient, since this only tells me the logical connections (or does it?).

MPI does not have a user-level concept of a connection. You send a message, a miracle occurs, and the message is received on the other side. MPI doesn't say anything about how it got there (e.g., it may even have been routed through some other process).

> Reading the Open MPI FAQ, I thought adding "--mca btl self,sm,tcp --mca btl_base_verbose 30 -display-map" to mpirun would help. But I am not getting what I need. Basically, I want to know how many ports each process is accessing (reading as well as writing).

For Open MPI's TCP implementation, it's basically one TCP socket per peer (plus a few other utility fd's). But TCP sockets are only opened lazily, meaning that we won't open the socket until you actually send to a peer.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
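One way to watch the lazy connection setup described above is to count the process's open file descriptors around the first message. The sketch below is an illustration, not from the original exchange; it is Linux-only (it reads /proc/self/fd) and assumes at least two ranks with the TCP BTL selected:

    #include <stdio.h>
    #include <dirent.h>
    #include <mpi.h>

    /* Count open file descriptors on Linux by listing /proc/self/fd.
     * The count includes the descriptor opendir() itself holds, which
     * is fine for a before/after comparison. */
    static int count_open_fds(void)
    {
        int n = 0;
        struct dirent *e;
        DIR *d = opendir("/proc/self/fd");
        if (d == NULL)
            return -1;
        while ((e = readdir(d)) != NULL)
            ++n;
        closedir(d);
        return n - 2;   /* drop "." and ".." */
    }

    int main(int argc, char **argv)
    {
        int rank, size, before, after, buf = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        before = count_open_fds();
        if (size > 1) {
            /* First traffic between ranks 0 and 1: with the TCP BTL,
             * this is where the socket to the peer should appear. */
            if (rank == 0)
                MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            else if (rank == 1)
                MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        after = count_open_fds();

        printf("rank %d: %d fds before first message, %d after\n",
               rank, before, after);
        MPI_Finalize();
        return 0;
    }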
Re: [OMPI users] Strange recursive "no" error message when compiling 1.5 series with fault tolerance enabled
It looks like Jeff beat me to it. The problem was a missing 'test' in the configure script. I'm not sure how it crept in there, but the fix is in the pipeline for the next 1.5 release. Progress on the patch is tracked in the following ticket:

https://svn.open-mpi.org/trac/ompi/ticket/2979

Thanks for the bug report!

-- Josh

On Thu, Jan 26, 2012 at 4:16 PM, Jeff Squyres wrote:

> Doh! That's a fun one. Thanks for the report!
>
> I filed a fix; we'll get this in very shortly (looks like the fix is already on the trunk, but somehow got missed on the v1.5 branch).
>
> On Jan 26, 2012, at 3:42 PM, David Akin wrote:
>
> > I can build OpenMPI with FT on my system if I'm using 1.4 source, but
> > if I use any of the 1.5 series, I get hung in a strange "no" loop at the
> > beginning of the compile (see below):
> >
> > + ./configure --build=x86_64-unknown-linux-gnu
> > --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
> > --program-prefix= --prefix=/usr/mpi/intel/openmpi-1.5-ckpt
> > --exec-prefix=/usr/mpi/intel/openmpi-1.5-ckpt
> > --bindir=/usr/mpi/intel/openmpi-1.5-ckpt/bin
> > --sbindir=/usr/mpi/intel/openmpi-1.5-ckpt/sbin
> > --sysconfdir=/usr/mpi/intel/openmpi-1.5-ckpt/etc
> > --datadir=/usr/mpi/intel/openmpi-1.5-ckpt/share
> > --includedir=/usr/mpi/intel/openmpi-1.5-ckpt/include
> > --libdir=/usr/mpi/intel/openmpi-1.5-ckpt/lib64
> > --libexecdir=/usr/mpi/intel/openmpi-1.5-ckpt/libexec
> > --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
> > --infodir=/usr/share/info --enable-ft-thread --with-ft=cr
> > --enable-opal-multi-threads
> >
> > .
> > .
> > .
> >
> > == System-specific tests
> >
> > checking checking for type of MPI_Offset... long long
> > checking checking for an MPI datatype for MPI_Offset... MPI_LONG_LONG
> > checking for _SC_NPROCESSORS_ONLN... yes
> > checking whether byte ordering is bigendian... no
> > checking for broken qsort... no
> > checking if word-sized integers must be word-size aligned... no
> > checking if C compiler and POSIX threads work as is... no
> > checking if C++ compiler and POSIX threads work as is... no
> > checking if F77 compiler and POSIX threads work as is... yes
> > checking if C compiler and POSIX threads work with -Kthread... no
> > checking if C compiler and POSIX threads work with -kthread... no
> > checking if C compiler and POSIX threads work with -pthread... yes
> > checking if C++ compiler and POSIX threads work with -Kthread... no
> > checking if C++ compiler and POSIX threads work with -kthread... no
> > checking if C++ compiler and POSIX threads work with -pthread... yes
> > checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
> > checking for PTHREAD_MUTEX_ERRORCHECK... yes
> > checking for working POSIX threads package... yes
> > checking if C compiler and Solaris threads work... no
> > checking if C++ compiler and Solaris threads work... no
> > checking if F77 compiler and Solaris threads work... no
> > checking for working Solaris threads package... no
> > checking for type of thread support... posix
> > checking if threads have different pids (pthreads on linux)... no
> > checking if want OPAL thread support... yes
> > checking if want fault tolerance thread... = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > = no
> > .
> > .
> > .
> >
> > The system just keeps repeating "no" over and over infinitely.
> >
> > I'm on RHEL6 2.6.32-220.2.1.el6.x86_64. I've tried the
> > following OpenMPI 1.5 series tarballs with the same results:
> >
> > openmpi-1.5.5rc1.tar.bz2
> > openmpi-1.5.5rc2r25765.tar.bz2
> > openmpi-1.5.5rc2r25773.tar.bz2
> >
> > Any guidance is appreciated.
> > Thanks!
> > Dave
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI users] Strange recursive "no" error message when compiling 1.5 series with fault tolerance enabled
Well that is awfully insistent. I have been able to reproduce the problem. Upon initial inspection I don't see the bug, but I'll dig into it today and hopefully have a patch in a bit. Below is a ticket for this bug:

https://svn.open-mpi.org/trac/ompi/ticket/2980

I'll let you know what I find out.

-- Josh

On Thu, Jan 26, 2012 at 3:42 PM, David Akin wrote:

> I can build OpenMPI with FT on my system if I'm using 1.4 source, but
> if I use any of the 1.5 series, I get hung in a strange "no" loop at the
> beginning of the compile (see below):
>
> + ./configure --build=x86_64-unknown-linux-gnu
> --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/usr/mpi/intel/openmpi-1.5-ckpt
> --exec-prefix=/usr/mpi/intel/openmpi-1.5-ckpt
> --bindir=/usr/mpi/intel/openmpi-1.5-ckpt/bin
> --sbindir=/usr/mpi/intel/openmpi-1.5-ckpt/sbin
> --sysconfdir=/usr/mpi/intel/openmpi-1.5-ckpt/etc
> --datadir=/usr/mpi/intel/openmpi-1.5-ckpt/share
> --includedir=/usr/mpi/intel/openmpi-1.5-ckpt/include
> --libdir=/usr/mpi/intel/openmpi-1.5-ckpt/lib64
> --libexecdir=/usr/mpi/intel/openmpi-1.5-ckpt/libexec
> --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
> --infodir=/usr/share/info --enable-ft-thread --with-ft=cr
> --enable-opal-multi-threads
>
> .
> .
> .
>
> == System-specific tests
>
> checking checking for type of MPI_Offset... long long
> checking checking for an MPI datatype for MPI_Offset... MPI_LONG_LONG
> checking for _SC_NPROCESSORS_ONLN... yes
> checking whether byte ordering is bigendian... no
> checking for broken qsort... no
> checking if word-sized integers must be word-size aligned... no
> checking if C compiler and POSIX threads work as is... no
> checking if C++ compiler and POSIX threads work as is... no
> checking if F77 compiler and POSIX threads work as is... yes
> checking if C compiler and POSIX threads work with -Kthread... no
> checking if C compiler and POSIX threads work with -kthread... no
> checking if C compiler and POSIX threads work with -pthread... yes
> checking if C++ compiler and POSIX threads work with -Kthread... no
> checking if C++ compiler and POSIX threads work with -kthread... no
> checking if C++ compiler and POSIX threads work with -pthread... yes
> checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
> checking for PTHREAD_MUTEX_ERRORCHECK... yes
> checking for working POSIX threads package... yes
> checking if C compiler and Solaris threads work... no
> checking if C++ compiler and Solaris threads work... no
> checking if F77 compiler and Solaris threads work... no
> checking for working Solaris threads package... no
> checking for type of thread support... posix
> checking if threads have different pids (pthreads on linux)... no
> checking if want OPAL thread support... yes
> checking if want fault tolerance thread... = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> .
> .
> .
>
> The system just keeps repeating "no" over and over infinitely.
>
> I'm on RHEL6 2.6.32-220.2.1.el6.x86_64. I've tried the
> following OpenMPI 1.5 series tarballs with the same results:
>
> openmpi-1.5.5rc1.tar.bz2
> openmpi-1.5.5rc2r25765.tar.bz2
> openmpi-1.5.5rc2r25773.tar.bz2
>
> Any guidance is appreciated.
> Thanks!
> Dave

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI users] Strange recursive "no" error message when compiling 1.5 series with fault tolerance enabled
Doh! That's a fun one. Thanks for the report!

I filed a fix; we'll get this in very shortly (looks like the fix is already on the trunk, but somehow got missed on the v1.5 branch).

On Jan 26, 2012, at 3:42 PM, David Akin wrote:

> I can build OpenMPI with FT on my system if I'm using 1.4 source, but
> if I use any of the 1.5 series, I get hung in a strange "no" loop at the
> beginning of the compile (see below):
>
> + ./configure --build=x86_64-unknown-linux-gnu
> --host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
> --program-prefix= --prefix=/usr/mpi/intel/openmpi-1.5-ckpt
> --exec-prefix=/usr/mpi/intel/openmpi-1.5-ckpt
> --bindir=/usr/mpi/intel/openmpi-1.5-ckpt/bin
> --sbindir=/usr/mpi/intel/openmpi-1.5-ckpt/sbin
> --sysconfdir=/usr/mpi/intel/openmpi-1.5-ckpt/etc
> --datadir=/usr/mpi/intel/openmpi-1.5-ckpt/share
> --includedir=/usr/mpi/intel/openmpi-1.5-ckpt/include
> --libdir=/usr/mpi/intel/openmpi-1.5-ckpt/lib64
> --libexecdir=/usr/mpi/intel/openmpi-1.5-ckpt/libexec
> --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
> --infodir=/usr/share/info --enable-ft-thread --with-ft=cr
> --enable-opal-multi-threads
>
> .
> .
> .
>
> == System-specific tests
>
> checking checking for type of MPI_Offset... long long
> checking checking for an MPI datatype for MPI_Offset... MPI_LONG_LONG
> checking for _SC_NPROCESSORS_ONLN... yes
> checking whether byte ordering is bigendian... no
> checking for broken qsort... no
> checking if word-sized integers must be word-size aligned... no
> checking if C compiler and POSIX threads work as is... no
> checking if C++ compiler and POSIX threads work as is... no
> checking if F77 compiler and POSIX threads work as is... yes
> checking if C compiler and POSIX threads work with -Kthread... no
> checking if C compiler and POSIX threads work with -kthread... no
> checking if C compiler and POSIX threads work with -pthread... yes
> checking if C++ compiler and POSIX threads work with -Kthread... no
> checking if C++ compiler and POSIX threads work with -kthread... no
> checking if C++ compiler and POSIX threads work with -pthread... yes
> checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
> checking for PTHREAD_MUTEX_ERRORCHECK... yes
> checking for working POSIX threads package... yes
> checking if C compiler and Solaris threads work... no
> checking if C++ compiler and Solaris threads work... no
> checking if F77 compiler and Solaris threads work... no
> checking for working Solaris threads package... no
> checking for type of thread support... posix
> checking if threads have different pids (pthreads on linux)... no
> checking if want OPAL thread support... yes
> checking if want fault tolerance thread... = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> = no
> .
> .
> .
>
> The system just keeps repeating "no" over and over infinitely.
>
> I'm on RHEL6 2.6.32-220.2.1.el6.x86_64. I've tried the
> following OpenMPI 1.5 series tarballs with the same results:
>
> openmpi-1.5.5rc1.tar.bz2
> openmpi-1.5.5rc2r25765.tar.bz2
> openmpi-1.5.5rc2r25773.tar.bz2
>
> Any guidance is appreciated.
> Thanks!
> Dave

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] Strange recursive "no" error message when compiling 1.5 series with fault tolerance enabled
I can build OpenMPI with FT on my system if I'm using 1.4 source, but if I use any of the 1.5 series, I get hung in a strange "no" loop at the beginning of the compile (see below):

+ ./configure --build=x86_64-unknown-linux-gnu
--host=x86_64-unknown-linux-gnu --target=x86_64-redhat-linux-gnu
--program-prefix= --prefix=/usr/mpi/intel/openmpi-1.5-ckpt
--exec-prefix=/usr/mpi/intel/openmpi-1.5-ckpt
--bindir=/usr/mpi/intel/openmpi-1.5-ckpt/bin
--sbindir=/usr/mpi/intel/openmpi-1.5-ckpt/sbin
--sysconfdir=/usr/mpi/intel/openmpi-1.5-ckpt/etc
--datadir=/usr/mpi/intel/openmpi-1.5-ckpt/share
--includedir=/usr/mpi/intel/openmpi-1.5-ckpt/include
--libdir=/usr/mpi/intel/openmpi-1.5-ckpt/lib64
--libexecdir=/usr/mpi/intel/openmpi-1.5-ckpt/libexec
--localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man
--infodir=/usr/share/info --enable-ft-thread --with-ft=cr
--enable-opal-multi-threads

.
.
.

== System-specific tests

checking checking for type of MPI_Offset... long long
checking checking for an MPI datatype for MPI_Offset... MPI_LONG_LONG
checking for _SC_NPROCESSORS_ONLN... yes
checking whether byte ordering is bigendian... no
checking for broken qsort... no
checking if word-sized integers must be word-size aligned... no
checking if C compiler and POSIX threads work as is... no
checking if C++ compiler and POSIX threads work as is... no
checking if F77 compiler and POSIX threads work as is... yes
checking if C compiler and POSIX threads work with -Kthread... no
checking if C compiler and POSIX threads work with -kthread... no
checking if C compiler and POSIX threads work with -pthread... yes
checking if C++ compiler and POSIX threads work with -Kthread... no
checking if C++ compiler and POSIX threads work with -kthread... no
checking if C++ compiler and POSIX threads work with -pthread... yes
checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
checking for PTHREAD_MUTEX_ERRORCHECK... yes
checking for working POSIX threads package... yes
checking if C compiler and Solaris threads work... no
checking if C++ compiler and Solaris threads work... no
checking if F77 compiler and Solaris threads work... no
checking for working Solaris threads package... no
checking for type of thread support... posix
checking if threads have different pids (pthreads on linux)... no
checking if want OPAL thread support... yes
checking if want fault tolerance thread... = no
= no
= no
= no
= no
= no
= no
= no
= no
= no
= no
= no
= no
.
.
.

The system just keeps repeating "no" over and over infinitely.

I'm on RHEL6 2.6.32-220.2.1.el6.x86_64. I've tried the following OpenMPI 1.5 series tarballs with the same results:

openmpi-1.5.5rc1.tar.bz2
openmpi-1.5.5rc2r25765.tar.bz2
openmpi-1.5.5rc2r25773.tar.bz2

Any guidance is appreciated.
Thanks!
Dave
Re: [OMPI users] Cant build OpenMPI!
To follow up for the web archives: we talked about this off-list. Upgrading to Open MPI 1.4.4 fixed the problem. I'm assuming it was some bug in 1.4.2 that was fixed in 1.4.4.

On Jan 24, 2012, at 2:13 PM, Jeff Squyres wrote:

> One more thing to check: are you building on a networked filesystem, and the client on which you are building is not time-synchronized with the file server?
>
> If you are not building on a networked file system, or if you are building on NFS and the time is NTP-synchronized between client and server, then please send everything listed here:
>
>    http://www.open-mpi.org/community/help/
>
> On Jan 24, 2012, at 1:48 PM, devendra rai wrote:
>
>> Hello Jeff,
>>
>> No. I did not run autogen.sh. I just did the three steps that you showed.
>>
>> Will log files be of any help?
>>
>> (Also, if the log files are not generated by just pipe'ing or tee'ing, please let me know.)
>>
>> Thanks a lot.
>>
>> Best
>>
>> Devendra
>>
>> From: Jeff Squyres
>> To: devendra rai ; Open MPI Users
>> Sent: Tuesday, 24 January 2012, 19:40
>> Subject: Re: [OMPI users] Cant build OpenMPI!
>>
>> Did you try running autogen.sh?
>>
>> You should not need to -- you should only need to:
>>
>> ./configure ...
>> make all
>> make install
>>
>> On Jan 24, 2012, at 1:38 PM, devendra rai wrote:
>>
>>> Hello All,
>>>
>>> I am trying to build openMPI on a server (I do not have sudo on this server).
>>>
>>> When running make, I get this error:
>>>
>>> libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../../opal/include
>>> -I../../orte/include -I../../ompi/include
>>> -I../../opal/mca/paffinity/linux/plpa/src/libplpa -I../.. -I/usr/include -g
>>> -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -MT
>>> dt_module.lo -MD -MP -MF .deps/dt_module.Tpo -c dt_module.c -fPIC -DPIC -o
>>> .libs/dt_module.o
>>> dt_module.c:177: error: expected expression before ‘)’ token
>>> dt_module.c:182: error: expected expression before ‘)’ token
>>> dt_module.c:187: error: expected expression before ‘)’ token
>>> dt_module.c:192: error: expected expression before ‘)’ token
>>> dt_module.c:203: error: expected expression before ‘)’ token
>>> dt_module.c:208: error: expected expression before ‘)’ token
>>> dt_module.c:219: error: expected expression before ‘)’ token
>>> dt_module.c:224: error: expected expression before ‘)’ token
>>> dt_module.c:229: error: expected expression before ‘)’ token
>>> dt_module.c:234: error: expected expression before ‘)’ token
>>> dt_module.c:250: error: expected expression before ‘)’ token
>>> make[2]: *** [dt_module.lo] Error 1
>>> make[2]: Leaving directory `/home/raid/private/openmpi-1.4.2/ompi/datatype'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory `/home/raid/private/openmpi-1.4.2/ompi'
>>> make: *** [all-recursive] Error 1
>>>
>>> Before this, I had some warnings:
>>>
>>> WARNING: `aclocal-1.10' is missing on your system. You should only need it if
>>> you modified `acinclude.m4' or `configure.in'. You might want
>>> to install the `Automake' and `Perl' packages. Grab them from
>>> any GNU archive site.
>>> cd . && /bin/bash /home/raid/private/openmpi-1.4.2/ompi/contrib/vt/vt/missing --run automake-1.10 --foreign
>>> /home/raid/private/openmpi-1.4.2/ompi/contrib/vt/vt/missing: line 54: automake-1.10: command not found
>>> WARNING: `automake-1.10' is missing on your system. You should only need it if
>>> you modified `Makefile.am', `acinclude.m4' or `configure.in'.
>>> You might want to install the `Automake' and `Perl' packages.
>>> Grab them from any GNU archive site.
>>> cd . && /bin/bash /home/raid/private/openmpi-1.4.2/ompi/contrib/vt/vt/missing --run autoconf
>>> aclocal.m4:16: warning: this file was generated for autoconf 2.63.
>>> You have another version of autoconf. It may work, but is not guaranteed to.
>>> If you have problems, you may need to regenerate the build system entirely.
>>> To do so, use the procedure documented by the package, typically `autoreconf'.
>>> /bin/bash ./config.status --recheck
>>>
>>> Is there any relationship? If not, what else am I missing?
>>>
>>> Thanks a lot for pointers.
>>>
>>> Best
>>>
>>> Devendra

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] How to determine MPI rank/process number local to a socket/node
We don't provide a mechanism for determining the node number -- it never came up before, as you can use gethostname to find out what node you are on. We do provide an envar that tells you the process rank within the node: OMPI_COMM_WORLD_LOCAL_RANK is what you are probably looking for.

On Jan 26, 2012, at 10:51 AM, Frank wrote:

> Say I run a parallel program using MPI. Execution command
>
> mpirun -n 8 -npernode 2
>
> launches 8 processes in total, that is, 2 processes per node and 4 nodes in total (OpenMPI 1.5), where a node comprises 1 CPU (dual core) and the network interconnect between nodes is InfiniBand.
>
> Now, the rank number (or process number) can be determined with
>
> int myrank;
> MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
>
> This returns a number between 0 and 7.
>
> But how can I determine the node number (in this case a number between 0 and 3) and the process number within a node (a number between 0 and 1)?
>
> You can find this question on stackoverflow (if you prefer to answer through their interface):
>
> http://stackoverflow.com/questions/9022496/how-to-determine-mpi-rank-process-number-local-to-a-socket-node
>
> Best,
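A minimal sketch of both lookups (an illustration, not from the original reply): OMPI_COMM_WORLD_LOCAL_RANK is the Open MPI-specific envar mentioned above, and the hostname stands in for a node number, which no envar provides.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int world_rank, local_rank = -1;
        char host[64];
        char *s;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Rank of this process among the processes on the same node
         * (set by Open MPI's launcher; absent under other MPIs). */
        s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
        if (s != NULL)
            local_rank = atoi(s);

        /* No "node number" is provided; identify the node by hostname. */
        gethostname(host, sizeof(host) - 1);
        host[sizeof(host) - 1] = '\0';

        printf("world rank %d: local rank %d on node %s\n",
               world_rank, local_rank, host);
        MPI_Finalize();
        return 0;
    }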
[OMPI users] How to determine MPI rank/process number local to a socket/node
Say I run a parallel program using MPI. Execution command

mpirun -n 8 -npernode 2

launches 8 processes in total, that is, 2 processes per node and 4 nodes in total (OpenMPI 1.5), where a node comprises 1 CPU (dual core) and the network interconnect between nodes is InfiniBand.

Now, the rank number (or process number) can be determined with

int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

This returns a number between 0 and 7.

But how can I determine the node number (in this case a number between 0 and 3) and the process number within a node (a number between 0 and 1)?

You can find this question on stackoverflow (if you prefer to answer through their interface):

http://stackoverflow.com/questions/9022496/how-to-determine-mpi-rank-process-number-local-to-a-socket-node

Best,
Re: [OMPI users] MPI_Comm_split and intercommunicator - Problem
Hi there,

I tried to understand the behavior Thatyene described, and I think it is a bug in the Open MPI implementation. I do not know exactly what is happening because I am not an expert in the ompi code, but I could see that when one process defines its color as MPI_UNDEFINED, one of the processes on the inter-communicator blocks in the call to the function below:

    /* Step 3: set up the communicator                           */
    /* --------------------------------------------------------- */
    /* Create the communicator finally */
    rc = ompi_comm_set ( &newcomp,                /* new comm */
                         comm,                    /* old comm */
                         my_size,                 /* local_size */
                         lranks,                  /* local_ranks */
                         my_rsize,                /* remote_size */
                         rranks,                  /* remote_ranks */
                         NULL,                    /* attrs */
                         comm->error_handler,     /* error handler */
                         (pass_on_topo)?
                             (mca_base_component_t *)comm->c_topo_component:
                             NULL,                /* topo component */
                         NULL,                    /* local group */
                         NULL                     /* remote group */
                       );

This function is called inside ompi_comm_split, in the file ompi/communicator/comm.c.

Is there a solution for this problem in some revision? I insist on this problem because I need to use this function for a similar purpose. Any idea?

On Wed, Jan 25, 2012 at 4:50 PM, Thatyene Louise Alves de Souza Ramos <thaty...@gmail.com> wrote:

> It seems the split is blocking when it must return MPI_COMM_NULL, in the case where I have one process with a color that does not exist in the other group or with color = MPI_UNDEFINED.
>
> On Wed, Jan 25, 2012 at 4:28 PM, Rodrigo Oliveira <rsilva.olive...@gmail.com> wrote:
>
>> Hi Thatyene,
>>
>> I took a look at your code and it seems to be logically correct. Maybe there is some problem when you call the split function with one client process having color = MPI_UNDEFINED. I understood you are trying to isolate one of the client processes to do something applicable only to it, am I wrong? According to the Open MPI documentation, this function can be used to do that, but it is not working. Does anyone have any idea what it can be?
>>
>> Best regards
>>
>> Rodrigo Oliveira
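For comparison, on a plain intra-communicator the MPI_UNDEFINED case works as documented: the opting-out process simply gets MPI_COMM_NULL back. A minimal sketch follows (an illustration only; the hang reported above involves an inter-communicator, which this does not exercise):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, color;
        MPI_Comm newcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* The last rank opts out of the split entirely. */
        color = (rank == size - 1) ? MPI_UNDEFINED : 0;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &newcomm);

        if (newcomm == MPI_COMM_NULL)
            printf("rank %d: got MPI_COMM_NULL, as expected\n", rank);
        else
            MPI_Comm_free(&newcomm);

        MPI_Finalize();
        return 0;
    }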
Re: [OMPI users] MPI_Allgather problem
As of two days ago, this problem has disappeared and the tests that I had written and run each night are now passing. Having looked through the update log of my machine (Ubuntu 11.10), it appears as though I got a new version of mpi-default-dev (0.6ubuntu1). I would like to understand this problem in more detail -- is it possible to see what changed in this update?

Thanks,
Brett.

On Fri, Dec 9, 2011 at 6:43 PM, teng ma wrote:

>> I guess your output is from different ranks. You can add rank info inside the print to tell them apart, like this:
>>
>> (void) printf("rank %d: gathered[%d].node = %d\n", rank, i, gathered[i].node);
>>
>> From my side, I did not see anything wrong with your code in Open MPI 1.4.3. After I added the rank, the output is:
>> rank 5: gathered[0].node = 0
>> rank 5: gathered[1].node = 1
>> rank 5: gathered[2].node = 2
>> rank 5: gathered[3].node = 3
>> rank 5: gathered[4].node = 4
>> rank 5: gathered[5].node = 5
>> rank 3: gathered[0].node = 0
>> rank 3: gathered[1].node = 1
>> rank 3: gathered[2].node = 2
>> rank 3: gathered[3].node = 3
>> rank 3: gathered[4].node = 4
>> rank 3: gathered[5].node = 5
>> rank 1: gathered[0].node = 0
>> rank 1: gathered[1].node = 1
>> rank 1: gathered[2].node = 2
>> rank 1: gathered[3].node = 3
>> rank 1: gathered[4].node = 4
>> rank 1: gathered[5].node = 5
>> rank 0: gathered[0].node = 0
>> rank 0: gathered[1].node = 1
>> rank 0: gathered[2].node = 2
>> rank 0: gathered[3].node = 3
>> rank 0: gathered[4].node = 4
>> rank 0: gathered[5].node = 5
>> rank 4: gathered[0].node = 0
>> rank 4: gathered[1].node = 1
>> rank 4: gathered[2].node = 2
>> rank 4: gathered[3].node = 3
>> rank 4: gathered[4].node = 4
>> rank 4: gathered[5].node = 5
>> rank 2: gathered[0].node = 0
>> rank 2: gathered[1].node = 1
>> rank 2: gathered[2].node = 2
>> rank 2: gathered[3].node = 3
>> rank 2: gathered[4].node = 4
>> rank 2: gathered[5].node = 5
>>
>> Is that what you expected?
>>
>> On Fri, Dec 9, 2011 at 12:03 PM, Brett Tully wrote:
>>
>>> Dear all,
>>>
>>> I have not used OpenMPI much before, but am maintaining a large legacy application. We noticed a bug to do with a call to MPI_Allgather, as summarised in this post to Stackoverflow:
>>> http://stackoverflow.com/questions/8445398/mpi-allgather-produces-inconsistent-results
>>>
>>> In the process of looking further into the problem, I noticed that the following function results in strange behaviour.
>>>
>>> void test_all_gather() {
>>>
>>>     struct _TEST_ALL_GATHER {
>>>         int node;
>>>     };
>>>
>>>     int ierr, size, rank;
>>>     ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>     ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>
>>>     struct _TEST_ALL_GATHER local;
>>>     struct _TEST_ALL_GATHER *gathered;
>>>
>>>     gathered = (struct _TEST_ALL_GATHER*) malloc(size * sizeof(*gathered));
>>>
>>>     local.node = rank;
>>>
>>>     MPI_Allgather(&local, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE,
>>>         gathered, sizeof(struct _TEST_ALL_GATHER), MPI_BYTE, MPI_COMM_WORLD);
>>>
>>>     int i;
>>>     for (i = 0; i < numnodes; ++i) {
>>>         (void) printf("gathered[%d].node = %d\n", i, gathered[i].node);
>>>     }
>>>
>>>     FREE(gathered);
>>> }
>>>
>>> At one point, this function printed the following:
>>> gathered[0].node = 2
>>> gathered[1].node = 3
>>> gathered[2].node = 2
>>> gathered[3].node = 3
>>> gathered[4].node = 4
>>> gathered[5].node = 5
>>>
>>> Can anyone suggest a place to start looking into why this might be happening? There is a section of the code that calls MPI_Comm_split, but I am not sure if that is related...
>>>
>>> Running on Ubuntu 11.10 and a summary of ompi_info:
>>> Package: Open MPI buildd@allspice Distribution
>>> Open MPI: 1.4.3
>>> Open MPI SVN revision: r23834
>>> Open MPI release date: Oct 05, 2010
>>> Open RTE: 1.4.3
>>> Open RTE SVN revision: r23834
>>> Open RTE release date: Oct 05, 2010
>>> OPAL: 1.4.3
>>> OPAL SVN revision: r23834
>>> OPAL release date: Oct 05, 2010
>>> Ident string: 1.4.3
>>> Prefix: /usr
>>> Configured architecture: x86_64-pc-linux-gnu
>>> Configure host: allspice
>>> Configured by: buildd
>>>
>>> Thanks!
>>> Brett
>>
>> --
>> | Teng Ma              Univ. of Tennessee |
>> | t...@cs.utk.edu      Knoxville, TN      |
>> | http://web.eecs.utk.edu/~tma/           |
Re: [OMPI users] MPI_AllGather null terminator character
Dear OpenMPI users/developers, can anybody help with this problem?

2012/1/13 Gabriele Fatigati

> Dear OpenMPI,
>
> using MPI_Allgather with the MPI_CHAR type, I have a doubt about the null-terminated character. Imagine I want to gather the node names my program is running on:
>
> char hostname[MAX_LEN];
>
> char* hostname_recv_buf = (char*)calloc(num_procs*(MAX_STRING_LEN), (sizeof(char)));
>
> MPI_Allgather(hostname, MAX_STRING_LEN, MPI_CHAR, hostname_recv_buf, MAX_STRING_LEN, MPI_CHAR, MPI_COMM_WORLD);
>
> Now, is the null-terminated character of each local string included? Or do I have to send and receive MAX_STRING_LEN+1 elements in MPI_Allgather?
>
> Using Valgrind, a subsequent simple strcmp:
>
> for (i = 0; i < num_procs; i++) {
>     if (strcmp(&hostname_recv_buf[MAX_STRING_LEN*i], local_hostname) == 0) {
>         ... doing something
>     }
> }
>
> raises a warning:
>
> Conditional jump or move depends on uninitialised value(s)
> ==19931==    at 0x4A06E5C: strcmp (mc_replace_strmem.c:412)
>
> The same warning is not present if I use MAX_STRING_LEN+1 in MPI_Allgather.
>
> Thanks in advance.
>
> --
> Ing. Gabriele Fatigati
> HPC specialist
> SuperComputing Applications and Innovation Department
> Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
> www.cineca.it    Tel: +39 051 6171722
> g.fatigati [AT] cineca.it

--
Ing. Gabriele Fatigati
HPC specialist
SuperComputing Applications and Innovation Department
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it    Tel: +39 051 6171722
g.fatigati [AT] cineca.it
Re: [OMPI users] Possible bug in finalize, OpenMPI v1.5, head revision
So far it has not happened again; I will report back if it does.

On Tue, Jan 24, 2012 at 5:10 PM, Jeff Squyres wrote:

> Ralph's fix has now been committed to the v1.5 trunk (yesterday).
>
> Did that fix it?
>
> On Jan 22, 2012, at 3:40 PM, Mike Dubman wrote:
>
>> It was compiled with the same ompi. We see it occasionally on different clusters with different ompi folders (all v1.5).
>>
>> On Thu, Jan 19, 2012 at 5:44 PM, Ralph Castain wrote:
>>
>>> I didn't commit anything to the v1.5 branch yesterday -- just the trunk.
>>>
>>> As I told Mike off-list, I think it may have been that the binary was compiled against a different OMPI version by mistake. It looks very much like what I'd expect to have happen in that scenario.
>>>
>>> On Jan 19, 2012, at 7:52 AM, Jeff Squyres wrote:
>>>
>>>> Did you "svn up"? I ask because Ralph committed some stuff yesterday that may have fixed this.
>>>>
>>>> On Jan 18, 2012, at 5:19 PM, Andrew Senin wrote:
>>>>
>>>>> No, nothing specific. Only basic settings (--mca btl openib,self --npernode 1, etc.).
>>>>>
>>>>> Actually, I'm very confused by this error because today it just disappeared. I had 2 separate folders where it was reproduced in 100% of test runs. Today I recompiled the source and it is gone in both folders. But yesterday I tried recompiling multiple times with no effect. So I believe this must somehow be related to some unknown settings in the lab which have been changed. Trying to reproduce the crash now...
>>>>>
>>>>> Regards,
>>>>> Andrew Senin.
>>>>>
>>>>> On Thu, Jan 19, 2012 at 12:05 AM, Jeff Squyres wrote:
>>>>>
>>>>>> Jumping in pretty late in this thread here...
>>>>>>
>>>>>> I see that it's failing in opal_hwloc_base_close(). That's a little worrisome.
>>>>>>
>>>>>> I do see an odd path through the hwloc initialization that *could* cause an error during finalization -- but it would involve you setting an invalid value for an MCA parameter. Are you setting hwloc_base_mem_bind_failure_action or hwloc_base_mem_alloc_policy, perchance?
>>>>>>
>>>>>> On Jan 16, 2012, at 1:56 PM, Andrew Senin wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I think I've found a bug in the head revision of the OpenMPI 1.5 branch. If it is configured with --disable-debug, it crashes in finalize on the hello_c.c example. Did I miss something out?
>>>>>>>
>>>>>>> Configure options:
>>>>>>> ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
>>>>>>> --disable-debug --enable-mpirun-prefix-by-default
>>>>>>> --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
>>>>>>>
>>>>>>> Runtime command and output:
>>>>>>> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl openib,self
>>>>>>> --npernode 1 --host mir1,mir2 ./hello
>>>>>>>
>>>>>>> Hello, world, I am 0 of 2
>>>>>>> Hello, world, I am 1 of 2
>>>>>>> [mir1:05542] *** Process received signal ***
>>>>>>> [mir1:05542] Signal: Segmentation fault (11)
>>>>>>> [mir1:05542] Signal code: Address not mapped (1)
>>>>>>> [mir1:05542] Failing at address: 0xe8
>>>>>>> [mir2:10218] *** Process received signal ***
>>>>>>> [mir2:10218] Signal: Segmentation fault (11)
>>>>>>> [mir2:10218] Signal code: Address not mapped (1)
>>>>>>> [mir2:10218] Failing at address: 0xe8
>>>>>>> [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
>>>>>>> [mir1:05542] [ 1] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8) [0x7f4588cee6a8]
>>>>>>> [mir1:05542] [ 2] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32) [0x7f4588cee700]
>>>>>>> [mir1:05542] [ 3] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73) [0x7f4588d1beb2]
>>>>>>> [mir1:05542] [ 4] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe) [0x7f4588c81eb5]
>>>>>>> [mir1:05542] [ 5] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a) [0x7f4588c217c3]
>>>>>>> [mir1:05542] [ 6] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59) [0x7f4588c39959]
>>>>>>> [mir1:05542] [ 7] ./hello(main+0x69) [0x4008fd]
>>>>>>> [mir1:05542] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd) [0x390ca1ec5d]
>>>>>>> [mir1:05542] [ 9] ./hello() [0x4007d9]
>>>>>>> [mir1:05542] *** End of error message ***
>>>>>>> [mir2:10218] [ 0] /lib64/libpthread.so.0() [0x3a6dc0f4c0]
>>>>>>> [mir2:10218] [ 1] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8) [0x7f409f31d6a8]
>>>>>>> [mir2:10218] [ 2] /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/li