[OMPI users] MPI_Comm_spawn question
Hi,

I am trying to write a trivial master-slave program. The master simply creates slaves and sends them a string; they print it out and exit. Everything works just fine; however, when I add a delay (more than 2 sec) before calling MPI_Init on the slave, MPI fails with MPI_ERR_SPAWN. I am pretty sure that MPI_Comm_spawn has some kind of timeout while waiting for the slaves to call MPI_Init, and if they fail to respond in time, it returns an error.

I believe there is a way to change this behaviour, but I wasn't able to find any suggestions or ideas on the internet. I would appreciate it if someone could help with this.

--- The terminal command I use to run the program:
mpirun -n 1 hello 2 2    // the first argument to "hello" is the number of slaves, the second is the delay in seconds

--- Error message I get when the delay is >= 2 sec:
[host:2231] *** An error occurred in MPI_Comm_spawn
[host:2231] *** reported by process [3453419521,0]
[host:2231] *** on communicator MPI_COMM_SELF
[host:2231] *** MPI_ERR_SPAWN: could not spawn processes
[host:2231] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host:2231] ***    and potentially your MPI job)

--- The program itself:
#include "stdlib.h"
#include "stdio.h"
#include "mpi.h"
#include "unistd.h"

MPI_Comm slave_comm;
MPI_Comm new_world;
#define MESSAGE_SIZE 40

void slave() {
    printf("Slave initialized; ");
    MPI_Comm_get_parent(&slave_comm);
    MPI_Intercomm_merge(slave_comm, 1, &new_world);

    int slave_rank;
    MPI_Comm_rank(new_world, &slave_rank);

    char message[MESSAGE_SIZE];
    MPI_Bcast(message, MESSAGE_SIZE, MPI_CHAR, 0, new_world);
    printf("Slave %d received message from master: %s\n", slave_rank, message);
}

void master(int slave_count, char* executable, char* delay) {
    char* slave_argv[] = { delay, NULL };
    MPI_Comm_spawn(executable,
                   slave_argv,
                   slave_count,
                   MPI_INFO_NULL,
                   0,
                   MPI_COMM_SELF,
                   &slave_comm,
                   MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(slave_comm, 0, &new_world);
    /* use a writable buffer of MESSAGE_SIZE bytes, since MPI_Bcast
       reads that many bytes from it */
    char helloWorld[MESSAGE_SIZE] = "Hello New World!";
    MPI_Bcast(helloWorld, MESSAGE_SIZE, MPI_CHAR, 0, new_world);
    printf("Processes spawned!\n");
}

int main(int argc, char* argv[]) {
    if (argc > 2) {
        MPI_Init(&argc, &argv);
        master(atoi(argv[1]), argv[0], argv[2]);
    } else {
        sleep(atoi(argv[1])); /* delay */
        MPI_Init(&argc, &argv);
        slave();
    }
    MPI_Comm_free(&new_world);
    MPI_Comm_free(&slave_comm);
    MPI_Finalize();
}

Thank you,

Andrew Elistratov

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
[OMPI users] mpi_comm_spawn question
Hi,

I am trying to run the following setup in Fortran without much success.

I have an MPI program that uses mpi_comm_spawn to spawn an interface program that communicates with the one that spawned it. The spawned program then prepares some data and uses a call system() statement in Fortran. If the program called from system() is not an MPI program itself, everything runs OK. But I want to run that program with something like mpirun -n X ..., and then this is a no-go.

Different versions of Open MPI give different messages before they either die or hang. I googled all the messages, but all I get is links to some Open MPI sources, so I would appreciate it if someone could help me explain how to run the above setup. Given so many MCA options, I hope there is one that can run it?

The message for 1.6 is the following:
... routed:binomial: connection to lifeline lost (+ PIDs and port numbers)

The message for 1.8.1 is:
... FORKING HNP: orted --hnp --set-sid --report-uri 18 --singleton-died-pipe 19 -mca state_novm_select 1 -mca ess_base_jobid 3378249728

If this is not a trivial problem to solve, I can provide simple test programs (we need 3) that show all of this.

Thanks,

Milan Hodoscek
--
National Institute of Chemistry    tel: +386-1-476-0278
Hajdrihova 19                      fax: +386-1-476-0300
SI-1000 Ljubljana                  e-mail: mi...@cmm.ki.si
Slovenia                           web: http://a.cmm.ki.si
Re: [OMPI users] MPI_Comm_spawn question
What version of OMPI are you using?

> On Jan 31, 2017, at 7:33 AM, elistrato...@info.sgu.ru wrote:
>
> Hi,
>
> I am trying to write trivial master-slave program. Master simply creates
> slaves, sends them a string, they print it out and exit. Everything works
> just fine, however, when I add a delay (more than 2 sec) before calling
> MPI_Init on slave, MPI fails with MPI_ERR_SPAWN. ...
Re: [OMPI users] MPI_Comm_spawn question
I am using Open MPI version 2.0.1.
Re: [OMPI users] MPI_Comm_spawn question
We know v2.0.1 has problems with comm_spawn, so you may be encountering one of those. Regardless, there is indeed a timeout mechanism in there. It was added because people would execute a comm_spawn that would then hang and eat up their entire allocation time for nothing.

In v2.0.2, I see it is still hardwired at 60 seconds. I believe we eventually realized we needed to make that a variable, but that didn't get into the 2.0.2 release.

> On Feb 1, 2017, at 1:00 AM, elistrato...@info.sgu.ru wrote:
>
> I am using Open MPI version 2.0.1.
Re: [OMPI users] MPI_Comm_spawn question
Andrew,

The 2-second timeout is very likely a bug that has since been fixed, so I strongly suggest you try the latest 2.0.2, which was released earlier this week.

Ralph is referring to another timeout, which is hard-coded (fwiw, the MPI standard says nothing about timeouts, so we hard-coded one to prevent jobs from hanging forever) to 600 seconds in master but is still 60 seconds in the v2.0.x branch. IIRC, the hard-coded timeout is in MPI_Comm_{accept,connect}, and I do not know whether it is somehow involved in MPI_Comm_spawn.

Cheers,

Gilles

On Saturday, February 4, 2017, r...@open-mpi.org wrote:
> We know v2.0.1 has problems with comm_spawn, and so you may be
> encountering one of those. Regardless, there is indeed a timeout mechanism
> in there. ...
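[Editorial note: since the diagnosis above is that the timeout covers the window between MPI_Comm_spawn and the children entering MPI_Init, one workaround is simply to call MPI_Init as the very first thing in the spawned program and put the artificial delay after it. This is a hedged sketch of the slave path only, not an official fix, and whether it fully sidesteps the hard-coded limit depends on exactly where the timer runs.]

/* Sketch: slave enters MPI_Init immediately, then delays.
 * The parent's MPI_Comm_spawn should stop waiting once the
 * children have called MPI_Init. */
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);   /* connect back to the parent first */
    sleep(atoi(argv[1]));     /* the delay now happens after init */
    /* ... rest of the slave logic (get parent, merge, broadcast) ... */
    MPI_Finalize();
    return 0;
}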
Re: [OMPI users] mpi_comm_spawn question
Unfortunately, that has never been supported. The problem is that the embedded mpirun picks up all the MCA params that were provided to the original application process and gets hopelessly confused. We have tried in the past to figure out a solution, but it has proved difficult to separate the params that were set during launch of the original child from the ones you are trying to provide to the embedded mpirun. So it remains an "unsupported" operation.

On Jul 3, 2014, at 7:34 AM, Milan Hodoscek wrote:
> Hi,
>
> I am trying to run the following setup in fortran without much
> success:
>
> I have an MPI program, that uses mpi_comm_spawn which spawns some
> interface program that communicates with the one that spawned it. ...
>
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24744.php
Re: [OMPI users] mpi_comm_spawn question
Why are you using system() the second time? As you want to spawn an MPI application, calling MPI_Comm_spawn would make everything simpler.

George

On Jul 3, 2014 4:34 PM, "Milan Hodoscek" wrote:
> Hi,
>
> I am trying to run the following setup in fortran without much
> success:
>
> I have an MPI program, that uses mpi_comm_spawn which spawns some
> interface program that communicates with the one that spawned it. ...
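[Editorial note: George's suggestion — spawning the worker with MPI_Comm_spawn instead of shelling out to a nested mpirun via system() — might look like the following in C (the Fortran equivalent uses MPI_COMM_SPAWN from the mpi module). This is an illustrative sketch only; the "worker" executable name and the process count are made up.]

/* Sketch: instead of system("mpirun -n 4 worker"), spawn the worker
 * directly from the already-running MPI process. The children connect
 * back through an intercommunicator, so no nested mpirun (and no
 * confused embedded HNP) is involved. */
#include "mpi.h"

int main(int argc, char* argv[]) {
    MPI_Comm worker_comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &worker_comm, MPI_ERRCODES_IGNORE);
    /* ... exchange data with the workers over worker_comm ... */
    MPI_Comm_disconnect(&worker_comm);
    MPI_Finalize();
    return 0;
}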
Re: [OMPI users] mpi_comm_spawn question
> "George" == George Bosilca writes:

George> Why are you using system() the second time? As you want
George> to spawn an MPI application, calling MPI_Comm_spawn would
George> make everything simpler.

Yes, this works! Very good trick... The system routine would be more flexible, but for the method we are working on now, mpi_comm_spawn is also OK.

Thanks,

--
Milan