Re: [OMPI devel] init_thread + spawn error
I believe we have stated several times that we are not thread safe at this time. You are welcome to try it, but you shouldn't be surprised when it fails.

Ralph

On 4/3/08 4:18 PM, "Joao Vicente Lima" wrote:

> Hi,
> I'm getting an error calling init_thread and comm_spawn in this code:
>
> #include "mpi.h"
> #include <stdio.h>
>
> int
> main (int argc, char *argv[])
> {
>   int provided;
>   MPI_Comm parentcomm, intercomm;
>
>   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
>   MPI_Comm_get_parent (&parentcomm);
>
>   if (parentcomm == MPI_COMM_NULL)
>     {
>       printf ("spawning ... \n");
>       MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 1,
>           MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
>       MPI_Comm_disconnect (&intercomm);
>     }
>   else
>     {
>       printf ("child!\n");
>       MPI_Comm_disconnect (&parentcomm);
>     }
>
>   MPI_Finalize ();
>   return 0;
> }
>
> and the error is:
>
> spawning ...
> opal_mutex_lock(): Resource deadlock avoided
> [localhost:18718] *** Process received signal ***
> [localhost:18718] Signal: Aborted (6)
> [localhost:18718] Signal code: (-6)
> [localhost:18718] [ 0] /lib/libpthread.so.0 [0x2b6e5d9fced0]
> [localhost:18718] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b6e5dc3b3c5]
> [localhost:18718] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b6e5dc3c73e]
> [localhost:18718] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ff]
> [localhost:18718] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95601d]
> [localhost:18718] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ac]
> [localhost:18718] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c956a93]
> [localhost:18718] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9569dd]
> [localhost:18718] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95797d]
> [localhost:18718] [ 9] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec) [0x2b6e5c957dd9]
> [localhost:18718] [10] /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b6e607f05cf]
> [localhost:18718] [11] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(MPI_Comm_spawn+0x459) [0x2b6e5c98ede9]
> [localhost:18718] [12] ./spawn1(main+0x7a) [0x400ae2]
> [localhost:18718] [13] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b6e5dc28b74]
> [localhost:18718] [14] ./spawn1 [0x4009d9]
> [localhost:18718] *** End of error message ***
> opal_mutex_lock(): Resource deadlock avoided
> [localhost:18719] *** Process received signal ***
> [localhost:18719] Signal: Aborted (6)
> [localhost:18719] Signal code: (-6)
> [localhost:18719] [ 0] /lib/libpthread.so.0 [0x2b9317a17ed0]
> [localhost:18719] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b9317c563c5]
> [localhost:18719] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b9317c5773e]
> [localhost:18719] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ff]
> [localhost:18719] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697101d]
> [localhost:18719] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ac]
> [localhost:18719] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316971a93]
> [localhost:18719] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169719dd]
> [localhost:18719] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697297d]
> [localhost:18719] [ 9] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec) [0x2b9316972dd9]
> [localhost:18719] [10] /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80b5cf]
> [localhost:18719] [11] /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80dad7]
> [localhost:18719] [12] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316977207]
> [localhost:18719] [13] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(PMPI_Init_thread+0x166) [0x2b93169b8622]
> [localhost:18719] [14] ./spawn1(main+0x25) [0x400a8d]
> [localhost:18719] [15] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b9317c43b74]
> [localhost:18719] [16] ./spawn1 [0x4009d9]
> [localhost:18719] *** End of error message ***
> --
> mpirun noticed that process rank 0 with PID 18719 on node localhost
> exited on signal 6 (Aborted).
> --
>
> If I change MPI_Init_thread to MPI_Init, everything works.
> Any suggestions?
> The attachments contain my ompi_info (r18077) and config.log.
>
> thanks in advance,
> Joao.
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] init_thread + spawn error
Hi,

I'm getting an error calling init_thread and comm_spawn in this code:

#include "mpi.h"
#include <stdio.h>

int
main (int argc, char *argv[])
{
  int provided;
  MPI_Comm parentcomm, intercomm;

  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_get_parent (&parentcomm);

  if (parentcomm == MPI_COMM_NULL)
    {
      printf ("spawning ... \n");
      MPI_Comm_spawn ("./spawn1", MPI_ARGV_NULL, 1,
          MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
      MPI_Comm_disconnect (&intercomm);
    }
  else
    {
      printf ("child!\n");
      MPI_Comm_disconnect (&parentcomm);
    }

  MPI_Finalize ();
  return 0;
}

and the error is:

spawning ...
opal_mutex_lock(): Resource deadlock avoided
[localhost:18718] *** Process received signal ***
[localhost:18718] Signal: Aborted (6)
[localhost:18718] Signal code: (-6)
[localhost:18718] [ 0] /lib/libpthread.so.0 [0x2b6e5d9fced0]
[localhost:18718] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b6e5dc3b3c5]
[localhost:18718] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b6e5dc3c73e]
[localhost:18718] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ff]
[localhost:18718] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95601d]
[localhost:18718] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9560ac]
[localhost:18718] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c956a93]
[localhost:18718] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c9569dd]
[localhost:18718] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b6e5c95797d]
[localhost:18718] [ 9] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec) [0x2b6e5c957dd9]
[localhost:18718] [10] /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b6e607f05cf]
[localhost:18718] [11] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(MPI_Comm_spawn+0x459) [0x2b6e5c98ede9]
[localhost:18718] [12] ./spawn1(main+0x7a) [0x400ae2]
[localhost:18718] [13] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b6e5dc28b74]
[localhost:18718] [14] ./spawn1 [0x4009d9]
[localhost:18718] *** End of error message ***
opal_mutex_lock(): Resource deadlock avoided
[localhost:18719] *** Process received signal ***
[localhost:18719] Signal: Aborted (6)
[localhost:18719] Signal code: (-6)
[localhost:18719] [ 0] /lib/libpthread.so.0 [0x2b9317a17ed0]
[localhost:18719] [ 1] /lib/libc.so.6(gsignal+0x35) [0x2b9317c563c5]
[localhost:18719] [ 2] /lib/libc.so.6(abort+0x10e) [0x2b9317c5773e]
[localhost:18719] [ 3] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ff]
[localhost:18719] [ 4] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697101d]
[localhost:18719] [ 5] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169710ac]
[localhost:18719] [ 6] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316971a93]
[localhost:18719] [ 7] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b93169719dd]
[localhost:18719] [ 8] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b931697297d]
[localhost:18719] [ 9] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(ompi_proc_unpack+0x1ec) [0x2b9316972dd9]
[localhost:18719] [10] /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80b5cf]
[localhost:18719] [11] /usr/local/mpi/ompi-svn/lib/openmpi/mca_dpm_orte.so [0x2b931a80dad7]
[localhost:18719] [12] /usr/local/mpi/ompi-svn/lib/libmpi.so.0 [0x2b9316977207]
[localhost:18719] [13] /usr/local/mpi/ompi-svn/lib/libmpi.so.0(PMPI_Init_thread+0x166) [0x2b93169b8622]
[localhost:18719] [14] ./spawn1(main+0x25) [0x400a8d]
[localhost:18719] [15] /lib/libc.so.6(__libc_start_main+0xf4) [0x2b9317c43b74]
[localhost:18719] [16] ./spawn1 [0x4009d9]
[localhost:18719] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 18719 on node localhost
exited on signal 6 (Aborted).
--

If I change MPI_Init_thread to MPI_Init, everything works.
Any suggestions?
The attachments contain my ompi_info (r18077) and config.log.

thanks in advance,
Joao.

config.log.gz — GNU Zip compressed data
ompi_info.txt.gz — GNU Zip compressed data
Re: [OMPI devel] MPI_Comm_connect/Accept
Take a gander at ompi/tools/ompi-server - I believe I put a man page in there. You might just try "man ompi-server" and see if it shows up. Holler if you have a question - not sure I documented it very thoroughly at the time.

On 4/3/08 3:10 PM, "Aurélien Bouteiller" wrote:

> Ralph,
>
> I am using trunk. Is there documentation for ompi-server? Sounds
> exactly like what I need to fix point 1.
>
> Aurelien
>
> Le 3 avr. 08 à 17:06, Ralph Castain a écrit :
>> I guess I'll have to ask the basic question: what version are you
>> using?
>>
>> If you are talking about the trunk, there no longer is a "universe"
>> concept anywhere in the code. Two mpiruns can connect/accept to each
>> other as long as they can make contact. To facilitate that, we
>> created an "ompi-server" tool that is supposed to be run by the
>> sys-admin (or a user, doesn't matter which) on the head node - there
>> are various ways to tell mpirun how to contact the server, or it can
>> self-discover it.
>>
>> I have tested publish/lookup pretty thoroughly and it seems to work.
>> I haven't spent much time testing connect/accept except via
>> comm_spawn, which seems to be working. Since that uses the same
>> mechanism, I would have expected connect/accept to work as well.
>>
>> If you are talking about 1.2.x, then the story is totally different.
>>
>> Ralph
>>
>> On 4/3/08 2:29 PM, "Aurélien Bouteiller" wrote:
>>
>>> Hi everyone,
>>>
>>> I'm trying to figure out how complete the implementation of
>>> Comm_connect/Accept is. I found two problematic cases.
>>>
>>> 1) Two different programs are started in two different mpiruns. One
>>> calls accept, the second one uses connect. I would not expect
>>> MPI_Publish_name/Lookup_name to work because they do not share the
>>> HNP. Still I would expect to be able to connect by copying (with
>>> printf-scanf) the port_name string generated by Open_port, especially
>>> considering that in Open MPI the port_name is a string containing the
>>> tcp address and port of rank 0 in the server communicator.
>>> However, doing so results in "no route to host" and the connecting
>>> application aborts. Is the problem related to an explicit check of
>>> the universes on the accept HNP? Do I expect too much from the MPI
>>> standard? Is it because my two applications do not share the same
>>> universe? Should we (re)add the ability to use the same universe for
>>> several mpiruns?
>>>
>>> 2) The second issue is when the program sets up a port and then
>>> accepts multiple clients on this port. Everything works fine for the
>>> first client, and then accept stalls forever while waiting for the
>>> second one. My understanding of the standard is that it should work:
>>> 5.4.2 states "it must call MPI_Open_port to establish a port [...] it
>>> must call MPI_Comm_accept to accept connections from clients". I
>>> understand that for one MPI_Open_port I should be able to manage
>>> several MPI clients. Am I understanding the standard correctly here,
>>> and should we fix this?
>>>
>>> Here is a copy of the non-working code for reference.
>>>
>>> /*
>>>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>>>  * All rights reserved.
>>>  * $COPYRIGHT$
>>>  *
>>>  * Additional copyrights may follow
>>>  *
>>>  * $HEADER$
>>>  */
>>> #include <mpi.h>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>>
>>> int main(int argc, char *argv[])
>>> {
>>>     char port[MPI_MAX_PORT_NAME];
>>>     int rank;
>>>     int np;
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>>
>>>     if(rank)
>>>     {
>>>         MPI_Comm comm;
>>>         /* client */
>>>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>>         printf("Read port: %s\n", port);
>>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>>>
>>>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>>         MPI_Comm_disconnect(&comm);
>>>     }
>>>     else
>>>     {
>>>         int nc = np - 1;
>>>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>>>         MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>>>         int *event = (int *) calloc(nc, sizeof(int));
>>>         int i;
>>>
>>>         MPI_Open_port(MPI_INFO_NULL, port);
>>>         /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
>>>         printf("Port name: %s\n", port);
>>>         for(i = 1; i < np; i++)
>>>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>>                      MPI_COMM_WORLD);
>>>
>>>         for(i = 0; i < nc; i++)
>>>         {
>>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>>                             &comm_nodes[i]);
>>>             printf("Accept %d\n", i);
>>>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>>>                       &reqs[i]);
>>>             printf("IRecv %d\n",
Re: [OMPI devel] MPI_Comm_connect/Accept
Ralph,

I am using trunk. Is there documentation for ompi-server? Sounds exactly like what I need to fix point 1.

Aurelien

Le 3 avr. 08 à 17:06, Ralph Castain a écrit :
> I guess I'll have to ask the basic question: what version are you
> using?
>
> If you are talking about the trunk, there no longer is a "universe"
> concept anywhere in the code. Two mpiruns can connect/accept to each
> other as long as they can make contact. To facilitate that, we created
> an "ompi-server" tool that is supposed to be run by the sys-admin (or
> a user, doesn't matter which) on the head node - there are various
> ways to tell mpirun how to contact the server, or it can self-discover
> it.
>
> I have tested publish/lookup pretty thoroughly and it seems to work. I
> haven't spent much time testing connect/accept except via comm_spawn,
> which seems to be working. Since that uses the same mechanism, I would
> have expected connect/accept to work as well.
>
> If you are talking about 1.2.x, then the story is totally different.
>
> Ralph
>
> On 4/3/08 2:29 PM, "Aurélien Bouteiller" wrote:
>
>> Hi everyone,
>>
>> I'm trying to figure out how complete the implementation of
>> Comm_connect/Accept is. I found two problematic cases.
>>
>> 1) Two different programs are started in two different mpiruns. One
>> calls accept, the second one uses connect. I would not expect
>> MPI_Publish_name/Lookup_name to work because they do not share the
>> HNP. Still I would expect to be able to connect by copying (with
>> printf-scanf) the port_name string generated by Open_port, especially
>> considering that in Open MPI the port_name is a string containing the
>> tcp address and port of rank 0 in the server communicator. However,
>> doing so results in "no route to host" and the connecting application
>> aborts. Is the problem related to an explicit check of the universes
>> on the accept HNP? Do I expect too much from the MPI standard? Is it
>> because my two applications do not share the same universe? Should we
>> (re)add the ability to use the same universe for several mpiruns?
>>
>> 2) The second issue is when the program sets up a port and then
>> accepts multiple clients on this port. Everything works fine for the
>> first client, and then accept stalls forever while waiting for the
>> second one. My understanding of the standard is that it should work:
>> 5.4.2 states "it must call MPI_Open_port to establish a port [...] it
>> must call MPI_Comm_accept to accept connections from clients". I
>> understand that for one MPI_Open_port I should be able to manage
>> several MPI clients. Am I understanding the standard correctly here,
>> and should we fix this?
>>
>> Here is a copy of the non-working code for reference.
>>
>> /*
>>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>>  * All rights reserved.
>>  * $COPYRIGHT$
>>  *
>>  * Additional copyrights may follow
>>  *
>>  * $HEADER$
>>  */
>> #include <mpi.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(int argc, char *argv[])
>> {
>>     char port[MPI_MAX_PORT_NAME];
>>     int rank;
>>     int np;
>>
>>     MPI_Init(&argc, &argv);
>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>>
>>     if(rank)
>>     {
>>         MPI_Comm comm;
>>         /* client */
>>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>         printf("Read port: %s\n", port);
>>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>>
>>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>>         MPI_Comm_disconnect(&comm);
>>     }
>>     else
>>     {
>>         int nc = np - 1;
>>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>>         MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>>         int *event = (int *) calloc(nc, sizeof(int));
>>         int i;
>>
>>         MPI_Open_port(MPI_INFO_NULL, port);
>>         /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
>>         printf("Port name: %s\n", port);
>>         for(i = 1; i < np; i++)
>>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>>                      MPI_COMM_WORLD);
>>
>>         for(i = 0; i < nc; i++)
>>         {
>>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>>                             &comm_nodes[i]);
>>             printf("Accept %d\n", i);
>>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>>                       &reqs[i]);
>>             printf("IRecv %d\n", i);
>>         }
>>         MPI_Close_port(port);
>>         MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>>         for(i = 0; i < nc; i++)
>>         {
>>             printf("event[%d] = %d\n", i, event[i]);
>>             MPI_Comm_disconnect(&comm_nodes[i]);
>>             printf("Disconnect %d\n", i);
>>         }
>>     }
>>
>>     MPI_Finalize();
>>     return EXIT_SUCCESS;
>> }
>>
>> --
>> * Dr. Aurélien Bouteiller
>> * Sr. Research Associate at Innovative Computing Laboratory
>> * University of Tennessee
>> * 1122 Volunteer Boulevard, suite 350
>> * Knoxville, TN 37996
>> * 865 974 6321

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] MPI_Comm_connect/Accept
I guess I'll have to ask the basic question: what version are you using?

If you are talking about the trunk, there no longer is a "universe" concept anywhere in the code. Two mpiruns can connect/accept to each other as long as they can make contact. To facilitate that, we created an "ompi-server" tool that is supposed to be run by the sys-admin (or a user, doesn't matter which) on the head node - there are various ways to tell mpirun how to contact the server, or it can self-discover it.

I have tested publish/lookup pretty thoroughly and it seems to work. I haven't spent much time testing connect/accept except via comm_spawn, which seems to be working. Since that uses the same mechanism, I would have expected connect/accept to work as well.

If you are talking about 1.2.x, then the story is totally different.

Ralph

On 4/3/08 2:29 PM, "Aurélien Bouteiller" wrote:

> Hi everyone,
>
> I'm trying to figure out how complete the implementation of
> Comm_connect/Accept is. I found two problematic cases.
>
> 1) Two different programs are started in two different mpiruns. One
> calls accept, the second one uses connect. I would not expect
> MPI_Publish_name/Lookup_name to work because they do not share the
> HNP. Still I would expect to be able to connect by copying (with
> printf-scanf) the port_name string generated by Open_port, especially
> considering that in Open MPI the port_name is a string containing the
> tcp address and port of rank 0 in the server communicator. However,
> doing so results in "no route to host" and the connecting application
> aborts. Is the problem related to an explicit check of the universes
> on the accept HNP? Do I expect too much from the MPI standard? Is it
> because my two applications do not share the same universe? Should we
> (re)add the ability to use the same universe for several mpiruns?
>
> 2) The second issue is when the program sets up a port and then
> accepts multiple clients on this port. Everything works fine for the
> first client, and then accept stalls forever while waiting for the
> second one. My understanding of the standard is that it should work:
> 5.4.2 states "it must call MPI_Open_port to establish a port [...] it
> must call MPI_Comm_accept to accept connections from clients". I
> understand that for one MPI_Open_port I should be able to manage
> several MPI clients. Am I understanding the standard correctly here,
> and should we fix this?
>
> Here is a copy of the non-working code for reference.
>
> /*
>  * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
>  * All rights reserved.
>  * $COPYRIGHT$
>  *
>  * Additional copyrights may follow
>  *
>  * $HEADER$
>  */
> #include <mpi.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char *argv[])
> {
>     char port[MPI_MAX_PORT_NAME];
>     int rank;
>     int np;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &np);
>
>     if(rank)
>     {
>         MPI_Comm comm;
>         /* client */
>         MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>         printf("Read port: %s\n", port);
>         MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);
>
>         MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
>         MPI_Comm_disconnect(&comm);
>     }
>     else
>     {
>         int nc = np - 1;
>         MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
>         MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
>         int *event = (int *) calloc(nc, sizeof(int));
>         int i;
>
>         MPI_Open_port(MPI_INFO_NULL, port);
>         /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
>         printf("Port name: %s\n", port);
>         for(i = 1; i < np; i++)
>             MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
>                      MPI_COMM_WORLD);
>
>         for(i = 0; i < nc; i++)
>         {
>             MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
>                             &comm_nodes[i]);
>             printf("Accept %d\n", i);
>             MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
>                       &reqs[i]);
>             printf("IRecv %d\n", i);
>         }
>         MPI_Close_port(port);
>         MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
>         for(i = 0; i < nc; i++)
>         {
>             printf("event[%d] = %d\n", i, event[i]);
>             MPI_Comm_disconnect(&comm_nodes[i]);
>             printf("Disconnect %d\n", i);
>         }
>     }
>
>     MPI_Finalize();
>     return EXIT_SUCCESS;
> }
>
> --
> * Dr. Aurélien Bouteiller
> * Sr. Research Associate at Innovative Computing Laboratory
> * University of Tennessee
> * 1122 Volunteer Boulevard, suite 350
> * Knoxville, TN 37996
> * 865 974 6321
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] MPI_Comm_connect/Accept
Hi everyone,

I'm trying to figure out how complete the implementation of Comm_connect/Accept is. I found two problematic cases.

1) Two different programs are started in two different mpiruns. One calls accept, the second one uses connect. I would not expect MPI_Publish_name/Lookup_name to work because they do not share the HNP. Still I would expect to be able to connect by copying (with printf-scanf) the port_name string generated by Open_port, especially considering that in Open MPI the port_name is a string containing the tcp address and port of rank 0 in the server communicator. However, doing so results in "no route to host" and the connecting application aborts. Is the problem related to an explicit check of the universes on the accept HNP? Do I expect too much from the MPI standard? Is it because my two applications do not share the same universe? Should we (re)add the ability to use the same universe for several mpiruns?

2) The second issue is when the program sets up a port and then accepts multiple clients on this port. Everything works fine for the first client, and then accept stalls forever while waiting for the second one. My understanding of the standard is that it should work: 5.4.2 states "it must call MPI_Open_port to establish a port [...] it must call MPI_Comm_accept to accept connections from clients". I understand that for one MPI_Open_port I should be able to manage several MPI clients. Am I understanding the standard correctly here, and should we fix this?

Here is a copy of the non-working code for reference.

/*
 * Copyright (c) 2004-2007 The Trustees of the University of Tennessee.
 * All rights reserved.
 * $COPYRIGHT$
 *
 * Additional copyrights may follow
 *
 * $HEADER$
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    char port[MPI_MAX_PORT_NAME];
    int rank;
    int np;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    if(rank)
    {
        MPI_Comm comm;
        /* client */
        MPI_Recv(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Read port: %s\n", port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &comm);

        MPI_Send(&rank, 1, MPI_INT, 0, 1, comm);
        MPI_Comm_disconnect(&comm);
    }
    else
    {
        int nc = np - 1;
        MPI_Comm *comm_nodes = (MPI_Comm *) calloc(nc, sizeof(MPI_Comm));
        MPI_Request *reqs = (MPI_Request *) calloc(nc, sizeof(MPI_Request));
        int *event = (int *) calloc(nc, sizeof(int));
        int i;

        MPI_Open_port(MPI_INFO_NULL, port);
        /*MPI_Publish_name("test_service_el", MPI_INFO_NULL, port);*/
        printf("Port name: %s\n", port);
        for(i = 1; i < np; i++)
            MPI_Send(port, MPI_MAX_PORT_NAME, MPI_CHAR, i, 0,
                     MPI_COMM_WORLD);

        for(i = 0; i < nc; i++)
        {
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF,
                            &comm_nodes[i]);
            printf("Accept %d\n", i);
            MPI_Irecv(&event[i], 1, MPI_INT, 0, 1, comm_nodes[i],
                      &reqs[i]);
            printf("IRecv %d\n", i);
        }
        MPI_Close_port(port);
        MPI_Waitall(nc, reqs, MPI_STATUSES_IGNORE);
        for(i = 0; i < nc; i++)
        {
            printf("event[%d] = %d\n", i, event[i]);
            MPI_Comm_disconnect(&comm_nodes[i]);
            printf("Disconnect %d\n", i);
        }
    }

    MPI_Finalize();
    return EXIT_SUCCESS;
}

--
* Dr. Aurélien Bouteiller
* Sr. Research Associate at Innovative Computing Laboratory
* University of Tennessee
* 1122 Volunteer Boulevard, suite 350
* Knoxville, TN 37996
* 865 974 6321
Re: [OMPI devel] RFC: changes to modex
On Apr 3, 2008, at 11:16 AM, Jeff Squyres wrote:

> The size of the openib modex is explained in btl_openib_component.c in
> the branch. It's a packed message now; we don't just blindly copy an
> entire struct. Here's the comment:
>
> /* The message is packed into multiple parts:
>  * 1. a uint8_t indicating the number of modules (ports) in the message
>  * 2. for each module:
>  *    a. the common module data
>  *    b. a uint8_t indicating how many CPCs follow
>  *    c. for each CPC:
>  *       a. a uint8_t indicating the index of the CPC in the all[]
>  *          array in btl_openib_connect_base.c
>  *       b. a uint8_t indicating the priority of this CPC
>  *       c. a uint8_t indicating the length of the blob to follow
>  *       d. a blob that is only meaningful to that CPC
>  */
>
> The common module data is what I sent in the other message.

Gaa.. I forgot to finish explaining the spreadsheet before I sent this; sorry...

The 4 lines of oob/xoob/ibcm/rdmacm cpc sizes are how many bytes those cpc's contribute (on a per-port basis) to the modex. "size 1" is what they currently contribute. "size 2" is if Jon and I are able to shave off a few more bytes (not entirely sure that's possible yet).

The machine 1 and machine 2 are three configurations each of two sample machines.

The first block of numbers is how big the openib part of the modex is when only using the ibcm cpc, when only using the rdmacm cpc, and when using both the ibcm and rdmacm cpc's (i.e., both are sent in the modex; one will "win" and be used at run-time). The overall number is the result of plugging the machine parameters (nodes, ppn, num ports) and the ibcm/rdmacm cpc sizes into the formula at the top of the spreadsheet.

The second block of numbers comes from modifying the formula at the top of the spreadsheet to calculate basically sending the per-port information only once (this modified formula did not include sending a per-port bitmap, as came up later in the thread). The green numbers in that block are the differences between these numbers and the first block.

The third block of numbers is the same as the second block, but using the "size 2" cpc sizes. The green numbers are the differences between these numbers and the first block; the blue numbers are the differences between these numbers and the second block.

Note: based on what came up later in the thread (e.g., not taking into account carto and whatnot), the 2nd and 3rd blocks of numbers are not entirely accurate. But they're likely still in the right ballpark. My point was that the size differences between the 1st block and the 2nd/3rd blocks seemed to be significant enough to warrant moving ahead with a "reduce replication in the modex" scheme.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] RFC: changes to modex
On Apr 3, 2008, at 8:52 AM, Gleb Natapov wrote:

>> It'll increase it compared to the optimization that we're about to
>> make. But it will certainly be a large decrease compared to what
>> we're doing today
> May be I don't understand something in what you propose then.
> Currently when I run two procs on the same node and each proc uses a
> different HCA, each one of them sends a message that describes the HCA
> in use by the proc. Each proc sends one of those, so there are two
> messages total on the wire. You propose that one of them should send a
> description of both available ports (that is, one of them sends two
> messages of the form above) and then each proc sends an additional
> message with the index of the HCA that it is going to use. And this is
> more data on the wire after the proposed optimization than we have
> now.

I guess what I'm trying to address is optimizing the common case. What I perceive the common case to be:

- high PPN values (4, 8, 16, ...)
- PPN larger than the number of verbs-capable ports
- homogeneous OpenFabrics network

Yes, you will definitely find other cases. But I'd guess that this is, by far, the most common case (especially at scale). I don't want to penalize the common case for the sake of some one-off installations. I'm basing this optimization on the assumption that PPNs will be larger than the number of available ports, so there is guaranteed to be duplication in the modex message. Removing that duplication is the main goal of this optimization.

>> (see the spreadsheet that I sent last week).
> I've looked at it but I could not decipher it :( I don't understand
> where all these numbers come from.

Why didn't you ask? :-)

The size of the openib modex is explained in btl_openib_component.c in the branch. It's a packed message now; we don't just blindly copy an entire struct. Here's the comment:

/* The message is packed into multiple parts:
 * 1. a uint8_t indicating the number of modules (ports) in the message
 * 2. for each module:
 *    a. the common module data
 *    b. a uint8_t indicating how many CPCs follow
 *    c. for each CPC:
 *       a. a uint8_t indicating the index of the CPC in the all[]
 *          array in btl_openib_connect_base.c
 *       b. a uint8_t indicating the priority of this CPC
 *       c. a uint8_t indicating the length of the blob to follow
 *       d. a blob that is only meaningful to that CPC
 */

The common module data is what I sent in the other message.

>> I guess I don't see the problem...?
> I like things to be simple. KISS principle I guess.

I can see your point that this is getting fairly complicated. :-\ See below.

> And I do care about heterogeneous include/exclude too.

How much? I still think we can support it just fine; I just want to make [what I perceive to be] the common case better.

> I looked at what kind of data we send during the openib modex and I
> created a file with 1 openib modex messages. mtu, subnet id, and cpc
> list were the same in each message, but lid/apm_lid were different;
> this is a pretty close approximation of the data that is sent from the
> HN to each process. The uncompressed file size is 489K; the compressed
> file size is 43K. More than 10 times smaller.

Was this the full modex message, or just the openib part?

Those are promising sizes (43k), though; how long does it take to compress/uncompress this data in memory? That also must be factored into the overall time.

How about a revised and combined proposal:

- openib: use a simplified "send all ACTIVE ports" per-host message (i.e., before include/exclude and carto are applied)
- openib: send a small bitmap for each proc indicating which ports each btl module will use
- modex: compress the result (probably only if it's larger than some threshold size?) when sending, and decompress upon receive

This keeps it simple -- no special cases for heterogeneous include/exclude, etc. And if compression is cheap (can you do some experiments to find out?), perhaps we can link against libz (I see that the libz in at least RHEL4 is BSD licensed, so there's no issue there) and de/compress in memory.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Ssh tunnelling broken in trunk?
On Wednesday 02 April 2008 08:04:10 pm Ralph Castain wrote: > Hmmm...something isn't making sense. Can I see the command line you used to > generate this? mpirun --n 2 --host vic12,vic20 -mca btl openib,self --mca btl_openib_receive_queues P,65536,256,128,128 -d xterm -e gdb /usr/mpi/gcc/openmpi-trunk/tests/IMB-2.3/IMB-MPI1 > I'll tell you why I'm puzzled. If orte_debug_flag is set, then the > "--daemonize" should NOT be there, and you should see "--debug" on that > command line. What I see is the reverse, which implies to me that > orte_debug_flag is NOT being set to "true". > > When I tested here and on odin, though, I found that the -d option > correctly set the flag and everything works just fine. > > So there is something in your environment or setup that is messing up that > orte_debug_flag. I have no idea what it could be - the command line should > override anything in your environment, but you could check. Otherwise, if > this diagnostic output came from a command line that included -d or > --debug-devel, or had OMPI_MCA_orte_debug=1 in the environment, then I am > at a loss - everywhere I've tried it, it works fine. I'll double check and do a completely fresh svn pull and install and see where that gets me. Thanks for the help, Jon > Ralph > > On 4/2/08 5:41 PM, "Jon Mason" wrote: > > On Wednesday 02 April 2008 05:04:47 pm Ralph Castain wrote: > >> Here's a real simple diagnostic you can do: set -mca plm_base_verbose 1 > >> and look at the cmd line being executed (send it here). It will look > >> like: > >> > >> [[xxx,1],0] plm:rsh: executing: jjkljks;jldfsaj; > >> > >> If the cmd line has --daemonize on it, then the ssh will close and xterm > >> won't work. 
> > > > [vic20:01863] [[40388,0],0] plm:rsh: executing: (//usr/bin/ssh) > > [/usr/bin/ssh vic12 orted --daemonize -mca ess env -mca orte_ess_jobid > > 2646867968 -mca orte_ess_vpid 1 -mca orte_ess_num_procs > > 2 --hnp-uri > > "2646867968.0;tcp://192.168.70.150:39057;tcp://10.10.0.150:39057;tcp://86.75.30.10:39057" --nodename > > vic12 -mca btl openib,self --mca btl_openib_receive_queues > > P,65536,256,128,128 -mca plm_base_verbose 1 -mca > > mca_base_param_file_path > > /usr/mpi/gcc/ompi-trunk/share/openmpi/amca-param-sets:/root -mca > > mca_base_param_file_path_force /root] > > > > > > It looks like what you say is happening. Is this configured somewhere, > > so that I can remove it? > > > > Thanks, > > Jon > > > >> Ralph > >> > >> On 4/2/08 3:14 PM, "Jeff Squyres" wrote: > >>> Can you diagnose a little further: > >>> > >>> 1. in the case where it works, can you verify that the ssh to launch > >>> the orteds is still running? > >>> > >>> 2. in the case where it doesn't work, can you verify that the ssh to > >>> launch the orteds has actually died? > >>> > >>> On Apr 2, 2008, at 4:58 PM, Jon Mason wrote: > On Wednesday 02 April 2008 01:21:31 pm Jon Mason wrote: > > On Wednesday 02 April 2008 11:54:50 am Ralph H Castain wrote: > >> I remember that someone had found a bug that caused > >> orte_debug_flag to not > >> get properly set (local var covering over a global one) - could be > >> that > >> your tmp-public branch doesn't have that patch in it. > >> > >> You might try updating to the latest trunk > > > > I updated my ompi-trunk tree, did a clean build, and I still see > > the same > > problem. I regressed trunk to rev 17589 and everything works as I > > expect. > > So I think the problem is still there in the top of trunk. > > I stepped through the revs of trunk and found the first failing rev > to be > 17632. It's a big patch, so I'll defer to those more in the know to > determine > what is breaking in there. 
> > > I don't discount user error, but I don't think I am doing anything > > different. > > Did some setting change that perhaps I did not modify? > > > > Thanks, > > Jon > > > >> On 4/2/08 10:41 AM, "George Bosilca" wrote: > >>> I'm using this feature on the trunk with the version from > >>> yesterday. > >>> It works without problems ... > >>> > >>> george. > >>> > >>> On Apr 2, 2008, at 12:14 PM, Jon Mason wrote: > On Wednesday 02 April 2008 11:07:18 am Jeff Squyres wrote: > > Are these r numbers relevant on the /tmp-public branch, or the > > trunk? > > I pulled it out of the command used to update the branch, which > was: > svn merge -r 17590:17917 https://svn.open-mpi.org/svn/ompi/trunk . > > In the cpc tmp branch, it happened at r17920. > > Thanks, > Jon > > > On Apr 2, 2008, at 11:59 AM, Jon Mason wrote: > >> I regressed my tree and it looks like it happened between > >> 17590:17917 > >> > >> On Wednesday 02 April 2008 10:22:52 am Jon Mason wrote: > >>> I am noticing that ssh seems to be br
Re: [OMPI devel] RFC: changes to modex
On Apr 3, 2008, at 9:18 AM, Gleb Natapov wrote: I am talking about openib part of the modex. The "garbage" I am referring to is this: FWIW, on the openib-cpc2 branch, the base data that is sent in the modex is this:

    uint64_t subnet_id;
    /** LID of this port */
    uint16_t lid;
    /** APM LID for this port */
    uint16_t apm_lid;
    /** The MTU used by this port */
    uint8_t mtu;

lid is used by both the xoob and ibcm cpc's. We can skip packing the apm_lid if apm support is not used if you really want to. The MTU has been changed to the 8-bit enum value. -- Jeff Squyres Cisco Systems
Re: [OMPI devel] RFC: changes to modex
On Thu, Apr 03, 2008 at 07:05:28AM -0600, Ralph H Castain wrote: > H...since I have no control nor involvement in what gets sent, perhaps I > can be a disinterested third party. ;-) > > Could you perhaps explain this comment: > > > BTW I looked at how we do modex now on the trunk. For OOB case more > > than half the data we send for each proc is garbage. > > > What "garbage" are you referring to? I am working to remove the stuff > inserted by proc.c - mostly hostname, hopefully arch, etc. If you are > running a "debug" version, there will also be type descriptors for each > entry, but those are eliminated for optimized builds. > > So are you referring to other things? I am talking about the openib part of the modex. The "garbage" I am referring to is this. This is the structure that is sent by modex for each openib BTL; we send the entire structure by copying it into a message:

struct mca_btl_openib_port_info {
    uint32_t mtu;
#if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
    uint8_t padding[4];
#endif
    uint64_t subnet_id;
    uint16_t lid;     /* used only in xrc */
    uint16_t apm_lid; /* the lid is used for APM to different port */
    char *cpclist;
};

The sizeof() of the struct is 32 bytes, but how much useful info does it actually contain?

mtu - should really be uint8 since this is an encoded value (1,2,3,4)
padding - is garbage
subnet_id - is ok
lid - should be sent only for the XRC case
apm_lid - should be sent only if APM is enabled
cpclist - is pure garbage and should not be in this struct at all

So we send 32 bytes with only 9 bytes of useful info (for the non-XRC case without APM enabled). -- Gleb.
Re: [OMPI devel] RFC: changes to modex
H...since I have no control nor involvement in what gets sent, perhaps I can be a disinterested third party. ;-) Could you perhaps explain this comment: > BTW I looked at how we do modex now on the trunk. For OOB case more > than half the data we send for each proc is garbage. What "garbage" are you referring to? I am working to remove the stuff inserted by proc.c - mostly hostname, hopefully arch, etc. If you are running a "debug" version, there will also be type descriptors for each entry, but those are eliminated for optimized builds. So are you referring to other things? Thanks Ralph On 4/3/08 6:52 AM, "Gleb Natapov" wrote: > On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote: that it's the same for all procs on all hosts. I guess there's a few cases: 1. homogeneous include/exclude, no carto: send all in node info; no proc info 2. homogeneous include/exclude, carto is used: send all ports in node info; send index in proc info for which node info port index it will use >>> This may actually increase modex size. Think about two procs using two >>> different hcas. We'll send all the data we send today + indexes. >> >> It'll increase it compared to the optimization that we're about to >> make. But it will certainly be a large decrease compared to what >> we're doing today > > Maybe I don't understand something in what you propose then. Currently > when I run two procs on the same node and each proc uses a different HCA > each one of them sends a message that describes the HCA in use by the > proc. The message is of the form . > Each proc sends one of those so there are two messages total on the wire. > You propose that one of them should send a description of both > available ports (that is, one of them sends two messages of the form > above) and then each proc sends an additional message with the index of the > HCA that it is going to use. And this is more data on the wire after > the proposed optimization than we have now. 
> > >> (see the spreadsheet that I sent last week). > I've looked at it but I could not decipher it :( I don't understand > where all these numbers come from. > >> >> Indeed, we can even put in the optimization that if there's only one >> process on a host, it can only publish the ports that it will use (and >> therefore there's no need for the proc data). > More special cases :( > >> 3. heterogeneous include/exclude, no cart: need user to tell us that this situation exists (e.g., use another MCA param), but then is same as #2 4. heterogeneous include/exclude, cart is used, same as #3 Right? >>> Looks like it. FWIW I don't like the idea to code all those special >>> cases. The way it works now I can be pretty sure that any crazy setup >>> I'll come up with will work. >> >> And so it will with the new scheme. The only place it won't work is >> if the user specifies a heterogeneous include/exclude (i.e., we'll >> require that the user tells us when they do that), which nobody does. >> >> I guess I don't see the problem...? > I like things to be simple. KISS principle I guess. And I do care about > heterogeneous include/exclude too. > > BTW I looked at how we do modex now on the trunk. For OOB case more > than half the data we send for each proc is garbage. > >> >>> By the way how much data are moved during modex stage? What if modex >>> will use compression? >> >> >> The spreadsheet I listed was just the openib part of the modex, and it >> was fairly hefty. I have no idea how well (or not) it would compress. >> > I looked at what kind of data we send during openib modex and I created > file with 1 openib modex messages. mtu, subnet id and cpc list were > the same in each message but lid/apm_lid were different, this is > a pretty close approximation of the data that is sent from HN to each > process. The uncompressed file size is 489K, the compressed file size is 43K. > More than 10 times smaller. > > -- > Gleb. 
Re: [OMPI devel] RFC: changes to modex
On Wed, Apr 02, 2008 at 08:41:14PM -0400, Jeff Squyres wrote: > >> that it's the same for all procs on all hosts. I guess there's a few > >> cases: > >> > >> 1. homogeneous include/exclude, no carto: send all in node info; no > >> proc info > >> 2. homogeneous include/exclude, carto is used: send all ports in node > >> info; send index in proc info for which node info port index it > >> will use > > This may actually increase modex size. Think about two procs using two > > different hcas. We'll send all the data we send today + indexes. > > It'll increase it compared to the optimization that we're about to > make. But it will certainly be a large decrease compared to what > we're doing today Maybe I don't understand something in what you propose then. Currently when I run two procs on the same node and each proc uses a different HCA, each one of them sends a message that describes the HCA in use by the proc. The message is of the form . Each proc sends one of those so there are two messages total on the wire. You propose that one of them should send a description of both available ports (that is, one of them sends two messages of the form above) and then each proc sends an additional message with the index of the HCA that it is going to use. And this is more data on the wire after the proposed optimization than we have now. > (see the spreadsheet that I sent last week). I've looked at it but I could not decipher it :( I don't understand where all these numbers come from. > > Indeed, we can even put in the optimization that if there's only one > process on a host, it can only publish the ports that it will use (and > therefore there's no need for the proc data). More special cases :( > > >> 3. heterogeneous include/exclude, no cart: need user to tell us that > >> this situation exists (e.g., use another MCA param), but then is same > >> as #2 > >> 4. heterogeneous include/exclude, cart is used, same as #3 > >> > >> Right? > >> > > Looks like it. 
FWIW I don't like the idea to code all those special > > cases. The way it works now I can be pretty sure that any crazy setup > > I'll come up with will work. > > And so it will with the new scheme. The only place it won't work is > if the user specifies a heterogeneous include/exclude (i.e., we'll > require that the user tells us when they do that), which nobody does. > > I guess I don't see the problem...? I like things to be simple. KISS principle I guess. And I do care about heterogeneous include/exclude too. BTW I looked at how we do modex now on the trunk. For OOB case more than half the data we send for each proc is garbage. > > > By the way how much data are moved during modex stage? What if modex > > will use compression? > > > The spreadsheet I listed was just the openib part of the modex, and it > was fairly hefty. I have no idea how well (or not) it would compress. > I looked at what kind of data we send during openib modex and I created file with 1 openib modex messages. mtu, subnet id and cpc list were the same in each message but lid/apm_lid were different, this is a pretty close approximation of the data that is sent from HN to each process. The uncompressed file size is 489K, the compressed file size is 43K. More than 10 times smaller. -- Gleb.