Re: [OMPI users] RE : MPI_Comm_connect() fails

Audet, Martin Mon, 17 Mar 2008 14:46:06 -0400

Hi Jeff,

As I said in my last message (see below), the patch (or at least the patch I 
got) does not fix the problem for me. Whether I apply it to OpenMPI 1.2.5 or 
1.2.6rc2, I still get the same problem:


  The client aborts with a truncation error message while the server freezes, 
for example when the server is started on 3 processes and the client on 2 
processes.

Feel free to try the two small client and server programs I posted in my first 
message yourself.

Thanks,

Martin


Subject: [OMPI users] RE : users Digest, Vol 841, Issue 3
From: Audet, Martin (Martin.Audet_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2008-03-13 17:04:25

Hi George,

Thanks for your patch, but I'm not sure I got it correctly. The patch I got 
modifies a few arguments passed to isend()/irecv()/recv() in 
coll_basic_allgather.c. Here is the patch I applied:

Index: ompi/mca/coll/basic/coll_basic_allgather.c
===================================================================
--- ompi/mca/coll/basic/coll_basic_allgather.c (revision 17814)
+++ ompi/mca/coll/basic/coll_basic_allgather.c (working copy)
@@ -149,7 +149,7 @@
         }

         /* Do a send-recv between the two root procs. to avoid deadlock */
-        err = MCA_PML_CALL(isend(sbuf, scount, sdtype, 0,
+        err = MCA_PML_CALL(isend(sbuf, scount, sdtype, root,
                                  MCA_COLL_BASE_TAG_ALLGATHER,
                                  MCA_PML_BASE_SEND_STANDARD,
                                  comm, &reqs[rsize]));
@@ -157,7 +157,7 @@
             return err;
         }

-        err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, 0,
+        err = MCA_PML_CALL(irecv(rbuf, rcount, rdtype, root,
                                  MCA_COLL_BASE_TAG_ALLGATHER, comm,
                                  &reqs[0]));
         if (OMPI_SUCCESS != err) {
@@ -186,14 +186,14 @@
             return err;
         }

-        err = MCA_PML_CALL(isend(rbuf, rsize * rcount, rdtype, 0,
+        err = MCA_PML_CALL(isend(rbuf, rsize * scount, sdtype, root,
                                  MCA_COLL_BASE_TAG_ALLGATHER,
                                  MCA_PML_BASE_SEND_STANDARD, comm, &req));
         if (OMPI_SUCCESS != err) {
             goto exit;
         }

-        err = MCA_PML_CALL(recv(tmpbuf, size * scount, sdtype, 0,
+        err = MCA_PML_CALL(recv(tmpbuf, size * rcount, rdtype, root,
                                 MCA_COLL_BASE_TAG_ALLGATHER, comm,
                                 MPI_STATUS_IGNORE));
         if (OMPI_SUCCESS != err) {

However, with this patch I still have the problem. If I start the server 
with three processes and the client with two, the client prints:

[audet_at_linux15 dyn_connect]$ mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
intercomm_flag = 1
intercomm_remote_size = 3
rem_rank_tbl[3] = { 0 1 2}
[linux15:26114] *** An error occurred in MPI_Allgather
[linux15:26114] *** on communicator
[linux15:26114] *** MPI_ERR_TRUNCATE: message truncated
[linux15:26114] *** MPI_ERRORS_ARE_FATAL (goodbye)
mpiexec noticed that job rank 0 with PID 26113 on node linux15 exited on signal 15 (Terminated).
[audet_at_linux15 dyn_connect]$

and aborts. The server on the other side simply hangs (as before).
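
One possible reason the patch makes no difference for these test programs: both 
call MPI_Allgather() with sendcount = recvcount = 1 and MPI_INT on both sides, so 
swapping scount/sdtype with rcount/rdtype leaves the message sizes unchanged. The 
sketch below is plain C with no MPI calls. The struct and function names are made 
up for illustration; only the two count expressions are taken from the diff above.

```c
#include <assert.h>

/* Element counts used in the root-to-root exchange. Hypothetical helper
 * type: only the count expressions below come from the diff. */
struct exchange_counts {
    int send_elems;  /* count passed to isend() */
    int recv_elems;  /* count passed to recv()  */
};

/* Counts as written before the patch. */
struct exchange_counts counts_before_patch(int size, int rsize,
                                           int scount, int rcount)
{
    struct exchange_counts c;
    c.send_elems = rsize * rcount;  /* isend(rbuf, rsize * rcount, rdtype, ...) */
    c.recv_elems = size * scount;   /* recv(tmpbuf, size * scount, sdtype, ...) */
    return c;
}

/* Counts as written after the patch. */
struct exchange_counts counts_after_patch(int size, int rsize,
                                          int scount, int rcount)
{
    struct exchange_counts c;
    c.send_elems = rsize * scount;  /* isend(rbuf, rsize * scount, sdtype, ...) */
    c.recv_elems = size * rcount;   /* recv(tmpbuf, size * rcount, rdtype, ...) */
    return c;
}

/* Returns 1 when the patch changes neither count, 0 otherwise. */
int patch_is_noop(int size, int rsize, int scount, int rcount)
{
    assert(size > 0 && rsize > 0);
    struct exchange_counts a = counts_before_patch(size, rsize, scount, rcount);
    struct exchange_counts b = counts_after_patch(size, rsize, scount, rcount);
    return a.send_elems == b.send_elems && a.recv_elems == b.recv_elems;
}
```

With the client's values (size 2, rsize 3, both counts 1), patch_is_noop(2, 3, 1, 1) 
returns 1. Replacing the literal rank 0 with root would likewise change nothing if 
root is 0 in that function, which the diff does not show.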

Regards,

Martin

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: March 14, 2008 19:45
To: Open MPI Users
Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails

Yes, please let us know if this fixes it.  We're working on a 1.2.6
release; we can definitely put this fix in there if it's correct.

Thanks!


On Mar 13, 2008, at 4:07 PM, George Bosilca wrote:

> I dug into the sources and I think you correctly pinpointed the bug.
> It seems we have a mismatch between the local and remote sizes in
> the inter-communicator allgather in the 1.2 series (which explains
> the message truncation error when the local and remote groups have a
> different number of processes). Attached to this email you can find
> a patch that [hopefully] solves this problem. If you can, please test
> it and let me know if this solves your problem.
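
To make the truncation error concrete: MPI reports MPI_ERR_TRUNCATE when an 
incoming message holds more elements than the matching receive was posted for, 
while a shorter message is accepted. The toy model below is plain C with no MPI 
calls. The names are hypothetical and this is not the Open MPI code; it only 
shows why a receive sized by the wrong group would fail on the smaller group.

```c
#include <assert.h>

enum recv_result { RECV_OK = 0, RECV_TRUNCATED = 1 };

/* Model of the MPI matching rule: a receive posted for recv_capacity
 * elements fails if the incoming message carries more than that. */
enum recv_result model_recv(int sent_elems, int recv_capacity)
{
    assert(sent_elems >= 0 && recv_capacity >= 0);
    return (sent_elems > recv_capacity) ? RECV_TRUNCATED : RECV_OK;
}

/* A process in a group of local_size processes is sent one element per
 * remote process. If the receive were sized by the local group instead
 * of the remote group (the assumed mistake), this is the outcome. */
enum recv_result recv_sized_by_local_group(int local_size, int remote_size)
{
    return model_recv(remote_size, local_size);
}
```

Under that assumption, recv_sized_by_local_group(2, 3) gives RECV_TRUNCATED for 
a group of 2 facing a group of 3, while recv_sized_by_local_group(3, 2) gives 
RECV_OK for the larger group. This illustrates the kind of size mismatch 
described above; it does not show where the mismatch sits in the real code.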
>
>  Thanks,
>    george.
>
> <inter_allgather.patch>
>
>
> On Mar 13, 2008, at 1:11 PM, Audet, Martin wrote:
>
>>
>> Hi,
>>
>> After re-checking the MPI standard (www.mpi-forum.org and MPI - The
>> Complete Reference), I'm more and more convinced that my small
>> example programs, which establish an intercommunicator with
>> MPI_Comm_connect()/MPI_Comm_accept() over an MPI port and exchange
>> data over it with MPI_Allgather(), are correct. In particular,
>> calling MPI_Allgather() with recvcount=1 (its fifth argument)
>> instead of the total number of MPI_INT that will be received (e.g.
>> intercomm_remote_size in the examples) is both correct and
>> consistent with MPI_Allgather() behavior on an intracommunicator
>> (i.e. a "normal" communicator).
>>
>>  MPI_Allgather(&comm_rank,   1, MPI_INT,
>>                rem_rank_tbl, 1, MPI_INT,
>>                intercomm);
>>
>> Also the recvbuf argument (the fourth argument) of MPI_Allgather()
>> in the examples should have a size of intercomm_remote_size (i.e.
>> the size of the remote group), not the sum of the local and remote
>> groups, in both the client and server processes. The standard says
>> that for all-to-all type operations over an intercommunicator, each
>> process sends data to and receives data from the remote group only
>> (in any case, it is not possible to exchange data with processes of
>> the local group over an intercommunicator).
>>
>> So, for me there is no reason for stopping the process with an
>> error message complaining about message truncation. There should be
>> no truncation: the sendcount, sendtype, recvcount and recvtype
>> arguments of MPI_Allgather() are correct and consistent.
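
The buffer sizing argued for here can be sketched without MPI. In the following 
plain-C model (the function names and the flat-array view of a group are 
assumptions for illustration), each process receives recvcount elements from 
every process of the remote group, so its receive buffer holds 
remote_size * recvcount elements:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Elements needed in recvbuf for an intercommunicator allgather:
 * recvcount elements from each process of the REMOTE group only. */
size_t inter_allgather_recv_elems(int remote_size, int recvcount)
{
    assert(remote_size >= 0 && recvcount >= 0);
    return (size_t)remote_size * (size_t)recvcount;
}

/* Model of what one process ends up with: the contributions of the
 * remote group, concatenated in remote rank order. remote_data holds
 * remote_size * count ints; recvbuf must hold at least as many. */
void inter_allgather_model(const int *remote_data, int remote_size,
                           int count, int *recvbuf)
{
    size_t n = inter_allgather_recv_elems(remote_size, count);
    memcpy(recvbuf, remote_data, n * sizeof(int));
}
```

For a 3-process server and a 2-process client, a client process needs room for 
3 ints and a server process for 2, each with recvcount = 1.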
>>
>> So again, for me the OpenMPI behavior with my example looks more and
>> more like a bug...
>>
>> Concerning George's comment about valgrind and TCP/IP, I totally
>> agree: messages reported by valgrind are only a clue of a bug,
>> especially in this context, not proof of a bug. Another clue is that
>> my small examples work perfectly with mpich2 ch3:sock.
>>
>> Regards,
>>
>> Martin Audet
>>
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Thu, 13 Mar 2008 08:21:51 +0100
>> From: jody <jody....@gmail.com>
>> Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
>> To: "Open MPI Users" <us...@open-mpi.org>
>> Message-ID:
>>       <9b0da5ce0803130021l4ead0f91qaf43e4ac7d332...@mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> HI
>> I think the recvcount argument you pass to MPI_Allgather should not
>> be 1 but instead the number of MPI_INTs your buffer rem_rank_tbl can
>> contain. As it stands now, you tell MPI_Allgather that it may only
>> receive 1 MPI_INT.
>>
>> Furthermore, I'm not sure, but I think your receive buffer should be
>> large enough to contain messages from *all* processes, and not just
>> from the "far side".
>>
>> Jody
>>
>>
>>
>> ------------------------------
>>
>> Message: 6
>> Date: Thu, 13 Mar 2008 09:06:47 -0500
>> From: George Bosilca <bosi...@eecs.utk.edu>
>> Subject: Re: [OMPI users] RE : MPI_Comm_connect() fails
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <82e9ff28-fb87-4ffb-a492-dde472d5d...@eecs.utk.edu>
>> Content-Type: text/plain; charset="us-ascii"
>>
>> I am not aware of any problems with the allreduce/allgather. But we
>> are aware of the problem with valgrind, which reports uninitialized
>> values when used with TCP. It's a long story, but I can guarantee
>> that this should not affect a correct MPI application.
>>
>>  george.
>>
>> PS: For those who want to know the details: we have to send a header
>> over TCP which contains some very basic information, including the
>> size of the fragment. Unfortunately, we have a 2-byte gap in the
>> header. As we never initialize these 2 unused bytes, but we send
>> them over the wire, valgrind correctly detects the uninitialized
>> data transfer.
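
The kind of gap described in this PS can be reproduced with a small struct. The 
header layout below is hypothetical (it is not the actual Open MPI TCP header), 
and the size of the gap depends on the platform's alignment rules. The sketch 
only shows one way to keep the unused bytes defined: build the outgoing bytes in 
a zeroed buffer and copy each field to its offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fragment header: on common ABIs the compiler inserts
 * padding between 'type' and 'size' so that 'size' is 4-byte aligned. */
struct frag_hdr {
    uint16_t type;
    uint32_t size;
};

/* Number of padding bytes between the two fields (platform dependent). */
size_t frag_hdr_gap(void)
{
    return offsetof(struct frag_hdr, size)
         - (offsetof(struct frag_hdr, type) + sizeof(uint16_t));
}

/* Fill 'wire' (sizeof(struct frag_hdr) bytes) with the header contents.
 * The buffer is zeroed first, so any padding bytes are initialized. */
void frag_hdr_to_wire(const struct frag_hdr *h, unsigned char *wire)
{
    assert(h != NULL && wire != NULL);
    memset(wire, 0, sizeof(struct frag_hdr));
    memcpy(wire + offsetof(struct frag_hdr, type), &h->type, sizeof(h->type));
    memcpy(wire + offsetof(struct frag_hdr, size), &h->size, sizeof(h->size));
}

/* Returns 1 if every byte in the gap between the fields is zero. */
int wire_gap_is_zero(const unsigned char *wire)
{
    size_t start = offsetof(struct frag_hdr, type) + sizeof(uint16_t);
    size_t end = offsetof(struct frag_hdr, size);
    for (size_t i = start; i < end; i++) {
        if (wire[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

Sending the struct itself without such a step would put whatever happens to be 
in the padding onto the wire, which is the situation valgrind flags.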
>>
>>
>> On Mar 12, 2008, at 3:58 PM, Audet, Martin wrote:
>>
>>> Hi again,
>>>
>>> Thanks Pak for the link and for suggesting to start an "orted"
>>> daemon; by doing so my client and server jobs were able to
>>> establish an intercommunicator between them.
>>>
>>> However, I modified my programs to perform an MPI_Allgather() of a
>>> single "int" over the new intercommunicator to test communication a
>>> little bit, and I did encounter problems. I am now wondering if
>>> there is a problem in MPI_Allgather() itself for intercommunicators.
>>> Note that the same program runs without problems with mpich2
>>> (ch3:sock).
>>>
>>> For example if I start orted as follows:
>>>
>>> orted --persistent --seed --scope public --universe univ1
>>>
>>> and then start the server with three processes:
>>>
>>> mpiexec --universe univ1 -n 3 ./aserver
>>>
>>> it prints:
>>>
>>> Server port = '0.2.0:2000'
>>>
>>> Now if I start the client with two processes as follows (using the
>>> server port):
>>>
>>> mpiexec --universe univ1 -n 2 ./aclient '0.2.0:2000'
>>>
>>> The server prints:
>>>
>>> intercomm_flag = 1
>>> intercomm_remote_size = 2
>>> rem_rank_tbl[2] = { 0 1}
>>>
>>> which is the correct output. The client then prints:
>>>
>>> intercomm_flag = 1
>>> intercomm_remote_size = 3
>>> rem_rank_tbl[3] = { 0 1 2}
>>> [linux15:30895] *** An error occurred in MPI_Allgather
>>> [linux15:30895] *** on communicator
>>> [linux15:30895] *** MPI_ERR_TRUNCATE: message truncated
>>> [linux15:30895] *** MPI_ERRORS_ARE_FATAL (goodbye)
>>> mpiexec noticed that job rank 0 with PID 30894 on node linux15 exited on signal 15 (Terminated).
>>>
>>> As you can see, the first messages are correct but the client job
>>> terminates with an error (and the server hangs).
>>>
>>> After re-reading the documentation about MPI_Allgather() over an
>>> intercommunicator, I don't see anything wrong in my simple code.
>>> Also, if I run the client and server processes with valgrind, I get
>>> a few messages like:
>>>
>>> ==29821== Syscall param writev(vector[...]) points to uninitialised byte(s)
>>> ==29821==    at 0x36235C2130: writev (in /lib64/libc-2.3.5.so)
>>> ==29821==    by 0x7885583: mca_btl_tcp_frag_send (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_btl_tcp.so)
>>> ==29821==    by 0x788501B: mca_btl_tcp_endpoint_send (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_btl_tcp.so)
>>> ==29821==    by 0x7467947: mca_pml_ob1_send_request_start_prepare (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_pml_ob1.so)
>>> ==29821==    by 0x7461494: mca_pml_ob1_isend (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_pml_ob1.so)
>>> ==29821==    by 0x798BF9D: mca_coll_basic_allgather_inter (in /home/publique/openmpi-1.2.5/lib/openmpi/mca_coll_basic.so)
>>> ==29821==    by 0x4A5069C: PMPI_Allgather (in /home/publique/openmpi-1.2.5/lib/libmpi.so.0.0.0)
>>> ==29821==    by 0x400EED: main (aserver.c:53)
>>> ==29821==  Address 0x40d6cac is not stack'd, malloc'd or (recently) free'd
>>>
>>> in both MPI_Allgather() and MPI_Comm_disconnect() calls for client
>>> and server, with valgrind always reporting that the addresses in
>>> question are "not stack'd, malloc'd or (recently) free'd".
>>>
>>> So is there a problem with MPI_Allgather() on intercommunicators or
>>> am I doing something wrong ?
>>>
>>> Thanks,
>>>
>>> Martin
>>>
>>>
>>> /* aserver.c */
>>> #include <stdio.h>
>>> #include <mpi.h>
>>>
>>> #include <assert.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>> int       comm_rank,comm_size;
>>> char      port_name[MPI_MAX_PORT_NAME];
>>> MPI_Comm intercomm;
>>> int      ok_flag;
>>>
>>> int      intercomm_flag;
>>> int      intercomm_remote_size;
>>> int     *rem_rank_tbl;
>>> int      ii;
>>>
>>> MPI_Init(&argc, &argv);
>>>
>>> MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
>>> MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
>>>
>>> ok_flag = (comm_rank != 0) || (argc == 1);
>>> MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>>
>>> if (!ok_flag) {
>>>    if (comm_rank == 0) {
>>>       fprintf(stderr,"Usage: %s\n",argv[0]);
>>>    }
>>>    MPI_Abort(MPI_COMM_WORLD, 1);
>>> }
>>>
>>> MPI_Open_port(MPI_INFO_NULL, port_name);
>>>
>>> if (comm_rank == 0) {
>>>    printf("Server port = '%s'\n", port_name);
>>> }
>>> MPI_Comm_accept(port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD,
>>> &intercomm);
>>>
>>> MPI_Close_port(port_name);
>>>
>>> MPI_Comm_test_inter(intercomm, &intercomm_flag);
>>> if (comm_rank == 0) {
>>>    printf("intercomm_flag = %d\n", intercomm_flag);
>>> }
>>> assert(intercomm_flag != 0);
>>> MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
>>> if (comm_rank == 0) {
>>>    printf("intercomm_remote_size = %d\n", intercomm_remote_size);
>>> }
>>> rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
>>> MPI_Allgather(&comm_rank,   1, MPI_INT,
>>>               rem_rank_tbl, 1, MPI_INT,
>>>               intercomm);
>>> if (comm_rank == 0) {
>>>    printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
>>>    for (ii=0; ii < intercomm_remote_size; ii++) {
>>>        printf(" %d", rem_rank_tbl[ii]);
>>>    }
>>>    printf("}\n");
>>> }
>>> free(rem_rank_tbl);
>>>
>>> MPI_Comm_disconnect(&intercomm);
>>>
>>> MPI_Finalize();
>>>
>>> return 0;
>>> }
>>>
>>> /* aclient.c */
>>> #include <stdio.h>
>>> #include <unistd.h>
>>>
>>> #include <mpi.h>
>>>
>>> #include <assert.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char **argv)
>>> {
>>> int      comm_rank,comm_size;
>>> int      ok_flag;
>>> MPI_Comm intercomm;
>>>
>>> int      intercomm_flag;
>>> int      intercomm_remote_size;
>>> int     *rem_rank_tbl;
>>> int      ii;
>>>
>>> MPI_Init(&argc, &argv);
>>>
>>> MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
>>> MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
>>>
>>> ok_flag = (comm_rank != 0)  || ((argc == 2)  &&  argv[1]  &&
>>> (*argv[1] != '\0'));
>>> MPI_Bcast(&ok_flag, 1, MPI_INT, 0, MPI_COMM_WORLD);
>>>
>>> if (!ok_flag) {
>>>    if (comm_rank == 0) {
>>>       fprintf(stderr,"Usage: %s mpi_port\n", argv[0]);
>>>    }
>>>    MPI_Abort(MPI_COMM_WORLD, 1);
>>> }
>>>
>>> while (MPI_Comm_connect((comm_rank == 0) ? argv[1] : 0,
>>> MPI_INFO_NULL, 0, MPI_COMM_WORLD, &intercomm) != MPI_SUCCESS) {
>>>    if (comm_rank == 0) {
>>>       printf("MPI_Comm_connect() failed, sleeping and retrying...\n");
>>>    }
>>>    sleep(1);
>>> }
>>>
>>> MPI_Comm_test_inter(intercomm, &intercomm_flag);
>>> if (comm_rank == 0) {
>>>    printf("intercomm_flag = %d\n", intercomm_flag);
>>> }
>>> assert(intercomm_flag != 0);
>>> MPI_Comm_remote_size(intercomm, &intercomm_remote_size);
>>> if (comm_rank == 0) {
>>>    printf("intercomm_remote_size = %d\n", intercomm_remote_size);
>>> }
>>> rem_rank_tbl = malloc(intercomm_remote_size*sizeof(*rem_rank_tbl));
>>> MPI_Allgather(&comm_rank,   1, MPI_INT,
>>>               rem_rank_tbl, 1, MPI_INT,
>>>               intercomm);
>>> if (comm_rank == 0) {
>>>    printf("rem_rank_tbl[%d] = {", intercomm_remote_size);
>>>    for (ii=0; ii < intercomm_remote_size; ii++) {
>>>        printf(" %d", rem_rank_tbl[ii]);
>>>    }
>>>    printf("}\n");
>>> }
>>> free(rem_rank_tbl);
>>>
>>> MPI_Comm_disconnect(&intercomm);
>>>
>>> MPI_Finalize();
>>>
>>> return 0;
>>> }
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>


--
Jeff Squyres
Cisco Systems
