I see - well, I hope to work on it this weekend and may get it fixed. If I do, I can provide you with a patch for the 1.6 series that you can use until the actual release is issued, if that helps.
On Aug 31, 2012, at 2:33 PM, Brian Budge <brian.bu...@gmail.com> wrote:

> Hi Ralph -
>
> This is true, but we may not know until well into the process whether we need MPI at all. We have an SMP/NUMA mode that is designed to run faster on a single machine. We also may build our application on machines where there is no MPI, and in that case we simply don't build the code that runs the MPI functionality. We have scripts all over the place that need to start this application, and it would be much easier to simply run the program than to figure out when, or if, mpirun needs to start it.
>
> Previously, we went so far as to fork and exec a full mpirun when we ran in clustered mode. This resulted in an additional process running, and we had to use sockets to get the data to the new master process. I very much like the idea of having our process become the MPI master instead, so I have been very excited about your work on this singleton fork/exec under the hood.
>
> Once I get my new infrastructure designed to work with mpirun -n 1 + spawn, I will try some previous Open MPI versions to see if I can find a version with this singleton functionality intact.
>
> Thanks again,
> Brian
>
> On Thu, Aug 30, 2012 at 4:51 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Not off the top of my head. However, as noted earlier, there is absolutely no advantage to a singleton vs mpirun start - all the singleton does is immediately fork/exec "mpirun" to support the rest of the job. In both cases you have a daemon running the job - the only difference is the number of characters the user types to start it.
>>
>> On Aug 30, 2012, at 8:44 AM, Brian Budge <brian.bu...@gmail.com> wrote:
>>
>>> In the event that I need to get this up and running soon (I do need something working within 2 weeks), can you recommend an older version where this is expected to work?
>>>
>>> Thanks,
>>> Brian
>>>
>>> On Tue, Aug 28, 2012 at 4:58 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>> Thanks!
>>>>
>>>> On Tue, Aug 28, 2012 at 4:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Yeah, I'm seeing the hang as well when running across multiple machines. Let me dig a little and get this fixed.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Aug 28, 2012, at 4:51 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>
>>>>>> Hmmm, I went to the Open MPI build directories on my two machines, went into the orte/test/mpi directory, and made the executables on both machines. I set the hostfile env variable on the "master" machine.
>>>>>>
>>>>>> Here's the output:
>>>>>>
>>>>>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile ./simple_spawn
>>>>>> Parent [pid 97504] starting up!
>>>>>> 0 completed MPI_Init
>>>>>> Parent [pid 97504] about to spawn!
>>>>>> Parent [pid 97507] starting up!
>>>>>> Parent [pid 97508] starting up!
>>>>>> Parent [pid 30626] starting up!
>>>>>> ^C
>>>>>> zsh: interrupt OMPI_MCA_orte_default_hostfile= ./simple_spawn
>>>>>>
>>>>>> I had to ^C to kill the hung process.
>>>>>>
>>>>>> When I run using mpirun:
>>>>>>
>>>>>> OMPI_MCA_orte_default_hostfile=/home/budgeb/p4/pseb/external/install/openmpi-1.6.1/orte/test/mpi/hostsfile mpirun -np 1 ./simple_spawn
>>>>>> Parent [pid 97511] starting up!
>>>>>> 0 completed MPI_Init
>>>>>> Parent [pid 97511] about to spawn!
>>>>>> Parent [pid 97513] starting up!
>>>>>> Parent [pid 30762] starting up!
>>>>>> Parent [pid 30764] starting up!
>>>>>> Parent done with spawn
>>>>>> Parent sending message to child
>>>>>> 1 completed MPI_Init
>>>>>> Hello from the child 1 of 3 on host budgeb-sandybridge pid 97513
>>>>>> 0 completed MPI_Init
>>>>>> Hello from the child 0 of 3 on host budgeb-interlagos pid 30762
>>>>>> 2 completed MPI_Init
>>>>>> Hello from the child 2 of 3 on host budgeb-interlagos pid 30764
>>>>>> Child 1 disconnected
>>>>>> Child 0 received msg: 38
>>>>>> Child 0 disconnected
>>>>>> Parent disconnected
>>>>>> Child 2 disconnected
>>>>>> 97511: exiting
>>>>>> 97513: exiting
>>>>>> 30762: exiting
>>>>>> 30764: exiting
>>>>>>
>>>>>> As you can see, I'm using Open MPI v1.6.1, freshly installed on both machines with the default configure options.
>>>>>>
>>>>>> Thanks for all your help.
>>>>>>
>>>>>> Brian
>>>>>>
>>>>>> On Tue, Aug 28, 2012 at 4:39 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Looks to me like it didn't find your executable - could be a question of where it exists relative to where you are running. If you look in your OMPI source tree at the orte/test/mpi directory, you'll see an example program "simple_spawn.c" there. Just "make simple_spawn" and execute that with your default hostfile set - does it work okay?
>>>>>>>
>>>>>>> It works fine for me, hence the question.
>>>>>>>
>>>>>>> Also, what OMPI version are you using?
>>>>>>>
>>>>>>> On Aug 28, 2012, at 4:25 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I see. Okay. So I just tried removing the check for universe size and set the universe size to 2. Here's my output:
>>>>>>>>
>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>>> [budgeb-interlagos:29965] [[4156,0],0] ORTE_ERROR_LOG: Fatal in file base/plm_base_receive.c at line 253
>>>>>>>> [budgeb-interlagos:29963] [[4156,1],0] ORTE_ERROR_LOG: The specified application failed to start in file dpm_orte.c at line 785
>>>>>>>>
>>>>>>>> The corresponding run with mpirun still works.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Brian
>>>>>>>>
>>>>>>>> On Tue, Aug 28, 2012 at 2:46 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> I see the issue - it's here:
>>>>>>>>>
>>>>>>>>>> MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>>>>
>>>>>>>>>> if(!flag) {
>>>>>>>>>>   std::cerr << "no universe size" << std::endl;
>>>>>>>>>>   return -1;
>>>>>>>>>> }
>>>>>>>>>> universeSize = *puniverseSize;
>>>>>>>>>> if(universeSize == 1) {
>>>>>>>>>>   std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The universe size is set to 1 on a singleton because the attribute gets set at the beginning of time - we haven't any way to go back and change it. The sequence of events explains why. The singleton starts up and sets its attributes, including universe_size. It also spins off an orte daemon to act as its own private "mpirun" in case you call comm_spawn. At this point, however, no hostfile has been read - the singleton is just an MPI proc doing its own thing, and the orte daemon is just sitting there on "stand-by".
>>>>>>>>>
>>>>>>>>> When your app calls comm_spawn, the orte daemon gets called to launch the new procs. At that time, it (not the original singleton!) reads the hostfile to find out how many nodes are around, and then does the launch.
>>>>>>>>>
>>>>>>>>> You are trying to check the number of nodes from within the singleton, which won't work - it has no way of discovering that info.
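>>>>>>>>>
>>>>>>>>> So when you start as a singleton, don't derive the slave count from MPI_UNIVERSE_SIZE - pick it yourself (command line, envar, whatever) and hand it straight to comm_spawn. Something along these lines (an untested sketch; "./slave_exe" and the default count are just placeholders):
>>>>>>>>>
>>>>>>>>> #include <mpi.h>
>>>>>>>>> #include <cstdlib>
>>>>>>>>>
>>>>>>>>> int main(int argc, char **argv) {
>>>>>>>>>   MPI_Init(&argc, &argv);
>>>>>>>>>
>>>>>>>>>   // a singleton cannot discover how many nodes exist, so take
>>>>>>>>>   // the number of slaves from the command line instead
>>>>>>>>>   int nslaves = (argc > 1) ? std::atoi(argv[1]) : 2;
>>>>>>>>>
>>>>>>>>>   char cmd[] = "./slave_exe";
>>>>>>>>>   MPI_Comm everyone;
>>>>>>>>>   MPI_Comm_spawn(cmd, MPI_ARGV_NULL, nslaves, MPI_INFO_NULL,
>>>>>>>>>                  0, MPI_COMM_SELF, &everyone, MPI_ERRCODES_IGNORE);
>>>>>>>>>
>>>>>>>>>   MPI_Finalize();
>>>>>>>>>   return 0;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> The daemon still reads your hostfile at spawn time, so the slaves get mapped across whatever nodes it finds there.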
>>>>>>>>>
>>>>>>>>> On Aug 28, 2012, at 2:38 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>> echo hostsfile
>>>>>>>>>> localhost
>>>>>>>>>> budgeb-sandybridge
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Brian
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 28, 2012 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>> Hmmm...what is in your "hostsfile"?
>>>>>>>>>>>
>>>>>>>>>>> On Aug 28, 2012, at 2:33 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Ralph -
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for confirming this is possible. I'm trying this and currently failing. Perhaps there's something I'm missing in the code to make this work. Here are the two invocations and their outputs:
>>>>>>>>>>>>
>>>>>>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile ./master_exe
>>>>>>>>>>>> cannot start slaves... not enough nodes
>>>>>>>>>>>>
>>>>>>>>>>>>> LD_LIBRARY_PATH=/home/budgeb/p4/pseb/external/lib.dev:/usr/local/lib OMPI_MCA_orte_default_hostfile=`pwd`/hostsfile mpirun -n 1 ./master_exe
>>>>>>>>>>>> master spawned 1 slaves...
>>>>>>>>>>>> slave responding...
>>>>>>>>>>>>
>>>>>>>>>>>> The code:
>>>>>>>>>>>>
>>>>>>>>>>>> //master.cpp
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>> #include <boost/filesystem.hpp>
>>>>>>>>>>>> #include <iostream>
>>>>>>>>>>>> #include <cstring>  // memcpy
>>>>>>>>>>>> #include <alloca.h> // alloca
>>>>>>>>>>>>
>>>>>>>>>>>> int main(int argc, char **args) {
>>>>>>>>>>>>   int worldSize, universeSize, *puniverseSize, flag;
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Comm everyone; // intercomm
>>>>>>>>>>>>   boost::filesystem::path curPath = boost::filesystem::absolute(boost::filesystem::current_path());
>>>>>>>>>>>>   std::string toRun = (curPath / "slave_exe").string();
>>>>>>>>>>>>
>>>>>>>>>>>>   int ret = MPI_Init(&argc, &args);
>>>>>>>>>>>>   if(ret != MPI_SUCCESS) {
>>>>>>>>>>>>     std::cerr << "failed init" << std::endl;
>>>>>>>>>>>>     return -1;
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
>>>>>>>>>>>>   if(worldSize != 1) {
>>>>>>>>>>>>     std::cerr << "too many masters" << std::endl;
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Attr_get(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &puniverseSize, &flag);
>>>>>>>>>>>>   if(!flag) {
>>>>>>>>>>>>     std::cerr << "no universe size" << std::endl;
>>>>>>>>>>>>     return -1;
>>>>>>>>>>>>   }
>>>>>>>>>>>>   universeSize = *puniverseSize;
>>>>>>>>>>>>   if(universeSize == 1) {
>>>>>>>>>>>>     std::cerr << "cannot start slaves... not enough nodes" << std::endl;
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>>   char *buf = (char*)alloca(toRun.size() + 1);
>>>>>>>>>>>>   memcpy(buf, toRun.c_str(), toRun.size());
>>>>>>>>>>>>   buf[toRun.size()] = '\0';
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Comm_spawn(buf, MPI_ARGV_NULL, universeSize-1, MPI_INFO_NULL,
>>>>>>>>>>>>                  0, MPI_COMM_SELF, &everyone, MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>
>>>>>>>>>>>>   std::cerr << "master spawned " << universeSize-1 << " slaves..." << std::endl;
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Finalize();
>>>>>>>>>>>>   return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> //slave.cpp
>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>> #include <iostream> // std::cerr
>>>>>>>>>>>>
>>>>>>>>>>>> int main(int argc, char **args) {
>>>>>>>>>>>>   int size;
>>>>>>>>>>>>   MPI_Comm parent;
>>>>>>>>>>>>   MPI_Init(&argc, &args);
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Comm_get_parent(&parent);
>>>>>>>>>>>>   if(parent == MPI_COMM_NULL) {
>>>>>>>>>>>>     std::cerr << "slave has no parent" << std::endl;
>>>>>>>>>>>>   }
>>>>>>>>>>>>   MPI_Comm_remote_size(parent, &size);
>>>>>>>>>>>>   if(size != 1) {
>>>>>>>>>>>>     std::cerr << "parent size is " << size << std::endl;
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>>   std::cerr << "slave responding..." << std::endl;
>>>>>>>>>>>>
>>>>>>>>>>>>   MPI_Finalize();
>>>>>>>>>>>>   return 0;
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Any ideas? Thanks for any help.
>>>>>>>>>>>>
>>>>>>>>>>>> Brian
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 22, 2012 at 9:03 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>> It really is just that simple :-)
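>>>>>>>>>>>>>
>>>>>>>>>>>>> For the childExe1/childExe2 case in your example below, the only extra piece is a "host" info key per command so each executable lands on the machine you want. Roughly like this (an untested sketch - the names and addresses come from your example, and every host must also appear in the hostfile):
>>>>>>>>>>>>>
>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>
>>>>>>>>>>>>> int main(int argc, char **argv) {
>>>>>>>>>>>>>   MPI_Init(&argc, &argv);
>>>>>>>>>>>>>
>>>>>>>>>>>>>   // one command slot per executable, each with its own target host
>>>>>>>>>>>>>   char *cmds[2] = { (char *)"childExe1", (char *)"childExe2" };
>>>>>>>>>>>>>   int nprocs[2] = { 1, 1 };
>>>>>>>>>>>>>   MPI_Info infos[2];
>>>>>>>>>>>>>   MPI_Info_create(&infos[0]);
>>>>>>>>>>>>>   MPI_Info_set(infos[0], (char *)"host", (char *)"192.168.0.11");
>>>>>>>>>>>>>   MPI_Info_create(&infos[1]);
>>>>>>>>>>>>>   MPI_Info_set(infos[1], (char *)"host", (char *)"192.168.0.12");
>>>>>>>>>>>>>
>>>>>>>>>>>>>   MPI_Comm children;
>>>>>>>>>>>>>   MPI_Comm_spawn_multiple(2, cmds, MPI_ARGVS_NULL, nprocs, infos,
>>>>>>>>>>>>>                           0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>
>>>>>>>>>>>>>   MPI_Info_free(&infos[0]);
>>>>>>>>>>>>>   MPI_Info_free(&infos[1]);
>>>>>>>>>>>>>   MPI_Finalize();
>>>>>>>>>>>>>   return 0;
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> One spawn_multiple call gives all the children a single intercomm and a shared MPI_COMM_WORLD; two separate comm_spawn calls would also work, but the children would not share a world.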
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Aug 22, 2012, at 8:56 AM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Okay. Is there a tutorial or FAQ for setting everything up? Or is it really just that simple? I don't need to run a copy of the orte server somewhere?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If my current IP is 192.168.0.1,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 0 > echo 192.168.0.11 > /tmp/hostfile
>>>>>>>>>>>>>> 1 > echo 192.168.0.12 >> /tmp/hostfile
>>>>>>>>>>>>>> 2 > export OMPI_MCA_orte_default_hostfile=/tmp/hostfile
>>>>>>>>>>>>>> 3 > ./mySpawningExe
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> At this point, mySpawningExe will be the master, running on 192.168.0.1, and I can have spawned, for example, childExe on 192.168.0.11 and 192.168.0.12? Or childExe1 on 192.168.0.11 and childExe2 on 192.168.0.12?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for the help.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 22, 2012 at 7:15 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>> Sure, that's still true on all 1.3 or above releases. All you need to do is set the hostfile envar so we pick it up:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> OMPI_MCA_orte_default_hostfile=<foo>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Aug 21, 2012, at 7:23 PM, Brian Budge <brian.bu...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi. I know this is an old thread, but I'm curious if there are any tutorials describing how to set this up? Is this still available in newer Open MPI versions?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Brian
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Jan 4, 2008 at 7:57 AM, Ralph Castain <r...@lanl.gov> wrote:
>>>>>>>>>>>>>>>>> Hi Elena
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm copying this to the user list just to correct a misstatement on my part in an earlier message that went there. I had stated that a singleton could comm_spawn onto other nodes listed in a hostfile by setting an environmental variable that pointed us to the hostfile.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This is incorrect in the 1.2 code series. That series does not allow singletons to read a hostfile at all. Hence, any comm_spawn done by a singleton can only launch child processes on the singleton's local host.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This situation has been corrected for the upcoming 1.3 code series. For the 1.2 series, though, you will have to do it via an mpirun command line.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Sorry for the confusion - I sometimes have too many code families to keep straight in this old mind!
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 1/4/08 5:10 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thank you very much for the explanations. But I still do not get it running...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the case
>>>>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>>>>>>>>>>>>>>>>> everything works.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> For the case
>>>>>>>>>>>>>>>>>> ./my_master.exe
>>>>>>>>>>>>>>>>>> it does not.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I did:
>>>>>>>>>>>>>>>>>> - create my_hostfile and put it in $HOME/.openmpi/components/
>>>>>>>>>>>>>>>>>> my_hostfile:
>>>>>>>>>>>>>>>>>> bollenstreek slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> octocore01 slots=8 max_slots=8
>>>>>>>>>>>>>>>>>> octocore02 slots=8 max_slots=8
>>>>>>>>>>>>>>>>>> clstr000 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr001 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr002 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr003 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr004 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr005 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr006 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> clstr007 slots=2 max_slots=3
>>>>>>>>>>>>>>>>>> - setenv OMPI_MCA_rds_hostfile_path my_hostfile (I put it in .tcshrc and then sourced .tcshrc)
>>>>>>>>>>>>>>>>>> - in my_master.cpp I did:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> MPI_Info info1;
>>>>>>>>>>>>>>>>>> MPI_Info_create(&info1);
>>>>>>>>>>>>>>>>>> char* hostname = "clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02";
>>>>>>>>>>>>>>>>>> MPI_Info_set(info1, "host", hostname);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> _intercomm = intracomm.Spawn("./childexe", argv1, _nProc, info1, 0, MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> - After I call the executable, I get this error message:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> bollenstreek: > ./my_master
>>>>>>>>>>>>>>>>>> number of processes to run: 1
>>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> Some of the requested hosts are not included in the current allocation for the application:
>>>>>>>>>>>>>>>>>>   ./childexe
>>>>>>>>>>>>>>>>>> The requested hosts were:
>>>>>>>>>>>>>>>>>>   clstr002,clstr003,clstr005,clstr006,clstr007,octocore01,octocore02
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Verify that you have mapped the allocated resources properly using the --host specification.
>>>>>>>>>>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file base/rmaps_base_support_fns.c at line 225
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmaps_rr.c at line 478
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file base/rmaps_base_map_job.c at line 210
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file rmgr_urm.c at line 372
>>>>>>>>>>>>>>>>>> [bollenstreek:21443] [0,0,0] ORTE_ERROR_LOG: Out of resource in file communicator/comm_dyn.c at line 608
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Did I miss something? Thanks for help!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>>>>>>> Sent: Tuesday, December 18, 2007 3:50 PM
>>>>>>>>>>>>>>>>>> To: Elena Zhebel; Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 12/18/07 7:35 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks a lot! Now it works! The solution is to use mpirun -n 1 -hostfile my.hosts *.exe and pass the MPI_Info key to the Spawn function!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> One more question: is it necessary to start my "master" program with mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe ?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> No, it isn't necessary - assuming that my_master_host is the first host listed in your hostfile! If you are only executing one my_master.exe (i.e., you gave -n 1 to mpirun), then we will automatically map that process onto the first host in your hostfile.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> If you want my_master.exe to go on someone other than the first host in the file, then you have to give us the -host option.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Are there other possibilities for easy start? I would say just run ./my_master.exe, but then the master process doesn't know about the hosts available on the network.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can set the hostfile parameter in your environment instead of on the command line. Just set OMPI_MCA_rds_hostfile_path = my.hosts.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> You can then just run ./my_master.exe on the host where you want the master to reside - everything should work the same.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Just as an FYI: the name of that environmental variable is going to change in the 1.3 release, but everything will still work the same.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>> From: Ralph H Castain [mailto:r...@lanl.gov]
>>>>>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 5:49 PM
>>>>>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>; Elena Zhebel
>>>>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 12/17/07 8:19 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hello Ralph,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thank you for your answer.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm using OpenMPI 1.2.3, compiler glibc232, Linux Suse 10.0. My "master" executable runs only on the one local host; then it spawns "slaves" (with MPI::Intracomm::Spawn). My question was: how do I determine the hosts where these "slaves" will be spawned? You said: "You have to specify all of the hosts that can be used by your job in the original hostfile". How can I specify the hostfile? I cannot find it in the documentation.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hmmm...sorry about the lack of documentation. I always assumed that the MPI folks in the project would document such things since it has little to do with the underlying run-time, but I guess that fell through the cracks.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> There are two parts to your question:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1. How to specify the hosts to be used for the entire job. I believe that is somewhat covered here:
>>>>>>>>>>>>>>>>>>> http://www.open-mpi.org/faq/?category=running#simple-spmd-run
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> That FAQ tells you what a hostfile should look like, though you may already know that. Basically, we require that you list -all- of the nodes that both your master and slave programs will use.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 2. How to specify which nodes are available for the master, and which for the slave.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> You would specify the host for your master on the mpirun command line with something like:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> This directs Open MPI to map that specified executable onto the specified host - note that my_master_host must have been in my_hostfile.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Inside your master, you would create an MPI_Info key "host" that has a value consisting of a string "host1,host2,host3" identifying the hosts you want your slave to execute upon. Those hosts must have been included in my_hostfile. Include that key in the MPI_Info array passed to your Spawn.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> We don't currently support providing a hostfile for the slaves (as opposed to the host-at-a-time string above). This may become available in a future release - TBD.
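>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Putting those two parts together, the master side would look something like this (a rough, untested sketch - the host names and ./my_slave.exe are placeholders, and every host must appear in my_hostfile):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> int main(int argc, char **argv) {
>>>>>>>>>>>>>>>>>>>   MPI_Init(&argc, &argv);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   // hosts for the slaves; each must also be listed in my_hostfile
>>>>>>>>>>>>>>>>>>>   MPI_Info info;
>>>>>>>>>>>>>>>>>>>   MPI_Info_create(&info);
>>>>>>>>>>>>>>>>>>>   MPI_Info_set(info, (char *)"host", (char *)"host1,host2,host3");
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   char cmd[] = "./my_slave.exe";
>>>>>>>>>>>>>>>>>>>   MPI_Comm slaves;
>>>>>>>>>>>>>>>>>>>   MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 3, info, 0,
>>>>>>>>>>>>>>>>>>>                  MPI_COMM_SELF, &slaves, MPI_ERRCODES_IGNORE);
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>   MPI_Info_free(&info);
>>>>>>>>>>>>>>>>>>>   MPI_Finalize();
>>>>>>>>>>>>>>>>>>>   return 0;
>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> with the master itself launched via: mpirun -n 1 -hostfile my_hostfile -host my_master_host my_master.exe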
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hope that helps
>>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Thanks and regards,
>>>>>>>>>>>>>>>>>>>> Elena
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>>>>>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph H Castain
>>>>>>>>>>>>>>>>>>>> Sent: Monday, December 17, 2007 3:31 PM
>>>>>>>>>>>>>>>>>>>> To: Open MPI Users <us...@open-mpi.org>
>>>>>>>>>>>>>>>>>>>> Cc: Ralph H Castain
>>>>>>>>>>>>>>>>>>>> Subject: Re: [OMPI users] MPI::Intracomm::Spawn and cluster configuration
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 12/12/07 5:46 AM, "Elena Zhebel" <ezhe...@fugro-jason.com> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm working on an MPI application where I'm using OpenMPI instead of MPICH.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In my "master" program I call the function MPI::Intracomm::Spawn, which spawns "slave" processes. It is not clear to me how to spawn the "slave" processes over the network. Currently "master" creates "slaves" on the same host.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> If I use 'mpirun --hostfile openmpi.hosts' then processes are spawned over the network as expected. But now I need to spawn processes over the network from my own executable using MPI::Intracomm::Spawn; how can I achieve it?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm not sure from your description exactly what you are trying to do, nor in what environment this is all operating within or what version of Open MPI you are using. Setting aside the environment and version issue, I'm guessing that you are running your executable over some specified set of hosts, but want to provide a different hostfile that specifies the hosts to be used for the "slave" processes. Correct?
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> If that is correct, then I'm afraid you can't do that in any version of Open MPI today. You have to specify all of the hosts that can be used by your job in the original hostfile. You can then specify a subset of those hosts to be used by your original "master" program, and then specify a different subset to be used by the "slaves" when calling Spawn.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> But the system requires that you tell it -all- of the hosts that are going to be used at the beginning of the job.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> At the moment, there is no plan to remove that requirement, though there has been occasional discussion about doing so at some point in the future. No promises that it will happen, though - managed environments, in particular, currently object to the idea of changing the allocation on-the-fly. We may, though, make a provision for purely hostfile-based environments (i.e., unmanaged) at some time in the future.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks in advance for any help.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Elena