already working on it, together with a move_request....

Edgar --

Can you make a patch for the 1.2 series?

I found the problem in the inter-allgather and fixed it in patch 17849.
The same test, but using MPI_Intercomm_create (just to simplify my
life compared to Connect/Accept), with 2 vs. 4 processes in the two
groups now passes for me, and it did fail with the previous version.


Hi Jeff,

As I said in my last message (see below), the patch (or at least the patch I got) doesn't fix the problem for me. Whether I apply it over OpenMPI 1.2.5 or 1.2.6rc2, I still get the same problem:

The client aborts with a truncation error message while the server freezes when, for example, the server is started on 3 processes and the client on 2 processes.

Feel free to try the two small client and server programs I posted in my first message yourself.



Hi Georges,

Thanks for your patch, but I'm not sure I got it correctly. The patch I got modifies a few arguments passed to isend()/irecv()/recv() in coll_basic_allgather.c. Here is the patch I applied:

Index: ompi/mca/coll/basic/coll_basic_allgather.c
--- ompi/mca/coll/basic/coll_basic_allgather.c (revision 17814)
+++ ompi/mca/coll/basic/coll_basic_allgather.c (working copy)
@@ -149,7 +149,7 @@

/* Do a send-recv between the two root procs. to avoid deadlock */
- err = MCA_PML_CALL(isend(sbuf, scount, sdtype, 0,
+ err = MCA_PML_CALL(isend(sbuf, scount, sdtype, root,
                                 comm, &reqs[rsize]));
@@ -157,7 +157,7 @@
            return err;

- err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, 0,
+ err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, root,
                                 MCA_COLL_BASE_TAG_ALLGATHER, comm,
        if (OMPI_SUCCESS != err) {
@@ -186,14 +186,14 @@
            return err;

- err = MCA_PML_CALL(isend(rbuf, rsize * rcount, rdtype, 0,
+ err = MCA_PML_CALL(isend(rbuf, rsize * scount, sdtype, root,
        if (OMPI_SUCCESS != err) {
            goto exit;

- err = MCA_PML_CALL(recv(tmpbuf, size * scount, sdtype, 0,
+ err = MCA_PML_CALL(recv(tmpbuf, size * rcount, rdtype, root,
                                MCA_COLL_BASE_TAG_ALLGATHER, comm,
        if (OMPI_SUCCESS != err) {

However, with this patch I still have the problem. If I start the server with three processes and the client with two, the client prints:

[audet_at_linux15 dyn_connect]$ mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
intercomm_flag = 1
intercomm_remote_size = 3
rem_rank_tbl[3] = { 0 1 2}
[linux15:26114] *** An error occurred in MPI_Allgather
[linux15:26114] *** on communicator
[linux15:26114] *** MPI_ERR_TRUNCATE: message truncated
[linux15:26114] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 26113 on node linux15 exited on signal 15 (Terminated).
[audet_at_linux15 dyn_connect]$

and aborts. The server on the other side simply hangs (as before).



Yes, please let us know if this fixes it.  We're working on a 1.2.6
release; we can definitely put this fix in there if it's correct.


I dug into the sources and I think you correctly pinpointed the bug.
It seems we have a mismatch between the local and remote sizes in
the inter-communicator allgather in the 1.2 series (which explains
the message truncation error when the local and remote groups have a
different number of processes). Attached to this email you can find
a patch that [hopefully] solves this problem. If you can, please test
it and let me know if it solves your problem.



After re-checking the MPI standard (and MPI - The
Complete Reference), I'm more and more convinced that my small
example programs, which establish an intercommunicator with
MPI_Comm_connect()/MPI_Comm_accept() over an MPI port and
exchange data over it with MPI_Allgather(), are correct. In
particular, calling MPI_Allgather() with recvcount=1 (its fifth
argument) instead of the total number of MPI_INT that will be
received (i.e. intercomm_remote_size in the examples) is both
correct and consistent with the behavior of MPI_Allgather() on an
intracommunicator (i.e. a "normal" communicator).

MPI_Allgather(&comm_rank,   1, MPI_INT,
              rem_rank_tbl, 1, MPI_INT,
              intercomm);

Also, the recvbuf argument (the fourth argument) of MPI_Allgather()
in the examples should have a size of intercomm_remote_size (i.e.
the size of the remote group), not the sum of the local and remote
groups, in both the client and server processes. The standard says
that for all-to-all types of operations over an intercommunicator,
each process sends data to and receives data from the remote group
only (in any case, it is not possible to exchange data with
processes of the local group over an intercommunicator).

So, for me, there is no reason for stopping the process with an
error message complaining about message truncation. There should be
no truncation: the sendcount, sendtype, recvcount and recvtype
arguments of MPI_Allgather() are correct and consistent.

So again, for me the OpenMPI behavior with my example looks more and
more like a bug...

Concerning George's comment about valgrind and TCP/IP, I totally
agree: messages reported by valgrind are only a clue of a bug,
especially in this context, not a proof of one. Another clue is that
my small examples work perfectly with mpich2 ch3:sock.


Martin Audet


I think the recvcount argument you pass to MPI_Allgather should not
be 1 but instead the number of MPI_INTs your buffer rem_rank_tbl
can contain.
As it stands now, you tell MPI_Allgather that it may only receive 1

Furthermore, I'm not sure, but I think your receive buffer should be
large enough to contain messages from *all* processes, and not just
from the "far side".




I am not aware of any problems with the allreduce/allgather. But we
are aware of the problem with valgrind, which reports uninitialized
values when used with TCP. It's a long story, but I can guarantee
this should not affect a correct MPI application.


PS: For those who want to know the details: we have to send a header
over TCP which contains some very basic information, including the
of the fragment. Unfortunately, we have a 2-byte gap in the header.
Since we never initialize these 2 unused bytes but still send them
over the wire, valgrind correctly detects the uninitialized data
transfer.

Hi again,

Thanks Pak for the link and for suggesting to start an "orted" daemon.
By doing so, my client and server jobs were able to establish an
intercommunicator between them.

However, I modified my programs to perform an MPI_Allgather() of a
single "int" over the new intercommunicator to test communication a
little bit, and I did encounter problems. I am now wondering if
there is a problem in MPI_Allgather() itself for intercommunicators.
Note that the same program runs without problems with mpich2

For example if I start orted as follows:

orted --persistent --seed --scope public --universe univ1

and then start the server with three processes:

mpiexec --universe univ1 -n 3 ./aserver

it prints:

Server port = '0.2.0:2000'

Now if I start the client with two processes as follows (using the
server port):

mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'

The server prints:

intercomm_flag = 1
intercomm_remote_size = 2
rem_rank_tbl[2] = { 0 1}

which is the correct output. The client then prints:

intercomm_flag = 1
intercomm_remote_size = 3
rem_rank_tbl[3] = { 0 1 2}
[linux15:30895] *** An error occurred in MPI_Allgather
[linux15:30895] *** on communicator
[linux15:30895] *** MPI_ERR_TRUNCATE: message truncated
[linux15:30895] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 30894 on node linux15
exited on signal 15 (Terminated).

As you can see, the first messages are correct but the client job
terminates with an error (and the server hangs).

After re-reading the documentation about MPI_Allgather() over an
intercommunicator, I don't see anything wrong in my simple code.
Also if I run the client and server process with valgrind, I get a
few messages like:

==29821== Syscall param writev(vector[...]) points to uninitialised
==29821==    at 0x36235C2130: writev (in /lib64/
==29821== by 0x7885583: mca_btl_tcp_frag_send (in /home/ publique/
==29821==    by 0x788501B: mca_btl_tcp_endpoint_send (in /home/
==29821==    by 0x7467947: mca_pml_ob1_send_request_start_prepare
(in /home/publique/openmpi-1.2.5/lib/openmpi/
==29821==    by 0x7461494: mca_pml_ob1_isend (in /home/publique/
==29821== by 0x798BF9D: mca_coll_basic_allgather_inter (in / home/
==29821==    by 0x4A5069C: PMPI_Allgather (in /home/publique/
==29821==    by 0x400EED: main (aserver.c:53)
==29821== Address 0x40d6cac is not stack'd, malloc'd or (recently)

in both the MPI_Allgather() and MPI_Comm_disconnect() calls, for
client and server, with valgrind always reporting that the addresses
in question are "not stack'd, malloc'd or (recently) free'd".

So is there a problem with MPI_Allgather() on intercommunicators, or
am I doing something wrong?



/* aserver.c */
#include <stdio.h>
#include <mpi.h>

#include <assert.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
   int       comm_rank,comm_size;
   char      port_name[MPI_MAX_PORT_NAME];
   MPI_Comm intercomm;
   int      ok_flag;

   int      intercomm_flag;
   int      intercomm_remote_size;
   int     *rem_rank_tbl;
   int      ii;

   MPI_Init(&argc, &argv);

   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

   /* The server takes no argument; rank 0 checks and informs the others. */
   ok_flag = (comm_rank != 0) || (argc == 1);
   MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);

   if (!ok_flag) {
      if (comm_rank == 0) {
         fprintf(stderr,"Usage: %s\n",argv[0]);
      }
      /* Assumed error exit: every rank sees ok_flag == 0 and leaves. */
      MPI_Finalize();
      return EXIT_FAILURE;
   }

   /* Open a port and wait for the client job to connect to it. */
   MPI_Open_port(MPI_INFO_NULL, port_name);

   if (comm_rank == 0) {
      printf("Server port = '%s'\n", port_name);
   }
   MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                   &intercomm);

   MPI_Comm_test_inter(intercomm, &intercomm_flag);
   if (comm_rank == 0) {
      printf("intercomm_flag = %d\n", intercomm_flag);
   }
   assert(intercomm_flag != 0);

   MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
   if (comm_rank == 0) {
      printf("intercomm_remote_size = %d\n", intercomm_remote_size);
   }

   /* One int is received from each process of the remote group. */
   rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
   assert(rem_rank_tbl != NULL);
   MPI_Allgather(&comm_rank,   1, MPI_INT,
                 rem_rank_tbl, 1, MPI_INT,
                 intercomm);
   if (comm_rank == 0) {
      printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
      for (ii=0; ii < intercomm_remote_size; ii++) {
         printf(" %d", rem_rank_tbl[ii]);
      }
      printf("}\n");
   }

   free(rem_rank_tbl);

   MPI_Comm_disconnect(&intercomm);
   MPI_Close_port(port_name);

   MPI_Finalize();

   return 0;
}

/* aclient.c */
#include <stdio.h>
#include <unistd.h>

#include <mpi.h>

#include <assert.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
   int      comm_rank,comm_size;
   int      ok_flag;
   MPI_Comm intercomm;

   int      intercomm_flag;
   int      intercomm_remote_size;
   int     *rem_rank_tbl;
   int      ii;

   MPI_Init(&argc, &argv);

   MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
   MPI_Comm_size(MPI_COMM_WORLD, &comm_size);

   /* Rank 0 needs exactly one non-empty argument: the server port. */
   ok_flag = (comm_rank != 0)  || ((argc == 2)  &&  argv[1]  &&
                                   (*argv[1] != '\0'));
   MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);

   if (!ok_flag) {
      if (comm_rank == 0) {
         fprintf(stderr,"Usage: %s mpi_port\n", argv[0]);
      }
      /* Assumed error exit: every rank sees ok_flag == 0 and leaves. */
      MPI_Finalize();
      return EXIT_FAILURE;
   }

   /* The port name is significant on the root (rank 0) only. */
   while (MPI_Comm_connect((comm_rank == 0) ? argv[1] : 0,
                           MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                           &intercomm) != MPI_SUCCESS) {
      if (comm_rank == 0) {
         printf("MPI_Comm_connect() failed, sleeping and retrying...\n");
      }
      sleep(1);   /* retry delay of 1 second is an assumed value */
   }

   MPI_Comm_test_inter(intercomm, &intercomm_flag);
   if (comm_rank == 0) {
      printf("intercomm_flag = %d\n", intercomm_flag);
   }
   assert(intercomm_flag != 0);

   MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
   if (comm_rank == 0) {
      printf("intercomm_remote_size = %d\n", intercomm_remote_size);
   }

   /* One int is received from each process of the remote group. */
   rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
   assert(rem_rank_tbl != NULL);
   MPI_Allgather(&comm_rank,   1, MPI_INT,
                 rem_rank_tbl, 1, MPI_INT,
                 intercomm);
   if (comm_rank == 0) {
      printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
      for (ii=0; ii < intercomm_remote_size; ii++) {
         printf(" %d", rem_rank_tbl[ii]);
      }
      printf("}\n");
   }

   free(rem_rank_tbl);

   MPI_Comm_disconnect(&intercomm);

   MPI_Finalize();

   return 0;
}

