Smells like a bug - I'll take a look.

On Aug 16, 2011, at 9:10 AM, Simone Pellegrini wrote:

> On 08/16/2011 02:11 PM, Ralph Castain wrote:
>> That should work, then. When you set the "host" property, did you give the same name as was in your machine file?
>>
>> Debug options that might help:
>>
>> -mca plm_base_verbose 5 -mca rmaps_base_verbose 5
>>
>> You'll need to configure --enable-debug to get the output, but that should help tell us what is happening.
> To be clear, here is the code I am using to spawn the MPI job:
>
> // create the info object
> MPI_Info info;
> MPI_Info_create(&info);
> MPI_Info_set(info, "host", const_cast<char*>(hostname.c_str()));
> LOG(ERROR) << hostname;
> LOG(DEBUG) << "Invoking task ID '" << task_id << "': '" << exec_name << "'";
>
> MPI_Comm_spawn( const_cast<char*>(exec_name.c_str()), cargs, num_procs,
>                 info, 0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE );
>
> delete[] cargs;
> MPI_Info_free(&info);
>
> and here is the log output:
> In this case MPI_Spawn creates a job with 3 MPI processes. As you can see, MPI_Spawn doesn't care about my "host" setting; it just goes ahead and maps the processes to node b05 and node b06, which are in my machinefile (which is the same as before).
>
> Is there any way to override this behaviour?
>
> DEBUG 14628:R<0> 17:00:13] Spawning new MPI processes...
> DEBUG 14628:R<0> 17:00:13] Serving event 'TASK_CREATED', (number of registered handlers: 1)
> ERROR 14628:R<0> 17:00:13] b01
> DEBUG 14628:R<0> 17:00:13] Invoking task ID '4': './simulator'
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive got message from [[34621,1],0]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job launch command
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: setting up job [34621,4]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:setup_job for job [34621,4]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created new proc [[34621,4],INVALID]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping proc in job [34621,4] to node b02
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b02 to map
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for job [34621,4] to node b02
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created new proc [[34621,4],INVALID]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping proc in job [34621,4] to node b01
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: adding node b01 to map
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for job [34621,4] to node b01
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot: created new proc [[34621,4],INVALID]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:claim_slot mapping proc in job [34621,4] to node b02
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base: mapping proc for job [34621,4] to node b02
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:compute_usage
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons existing daemon [[34621,0],2] already launched
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] rmaps:base:define_daemons existing daemon [[34621,0],1] already launched
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:rsh: no new daemons to launch
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch_apps for job [34621,4]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched for job [34621,4]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch from daemon [[34621,0],0]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch completed processing
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch reissuing non-blocking recv
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch from daemon [[34621,0],1]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched for proc [[34621,4],1] from daemon [[34621,0],1]: pid 14646 state 2 exit 0
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch completed processing
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch reissuing non-blocking recv
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch from daemon [[34621,0],2]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched for proc [[34621,4],0] from daemon [[34621,0],2]: pid 9803 state 2 exit 0
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launched for proc [[34621,4],2] from daemon [[34621,0],2]: pid 9804 state 2 exit 0
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:app_report_launch completed processing
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:report_launched all apps reported
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch wiring up iof
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:launch completed for job [34621,4]
> [kreusspitze.dps.uibk.ac.at:02647] [[34621,0],0] plm:base:receive job [34621,4] launched
>
> cheers, Simone P.
>>
>>
>> On Aug 16, 2011, at 5:09 AM, Simone Pellegrini wrote:
>>
>>> On 08/16/2011 12:30 PM, Ralph Castain wrote:
>>>> What version are you using?
>>> OpenMPI 1.4.3
>>>
>>>>
>>>> On Aug 16, 2011, at 3:19 AM, Simone Pellegrini wrote:
>>>>
>>>>> Dear all,
>>>>> I am developing a system to manage MPI tasks on top of MPI. The architecture is rather simple: I have a set of scheduler processes which take care of managing the resources of a node. The idea is to have one (or more) of these schedulers allocated on each node of a cluster and then create new MPI processes (on demand) as computation is needed. Allocation of processes is done using MPI_Spawn.
>>>>>
>>>>> The system now works fine on a single node, allocating the main scheduler with the following mpirun command:
>>>>> mpirun --np 1 ./scheduler ...
>>>>>
>>>>> Now when I scale to multiple nodes, problems with the default MPI behaviour start. For example, let's assume I have 2 nodes with 8 CPU cores each. I therefore set up a machine file in the following way:
>>>>>
>>>>> s01 slots=1
>>>>> s02 slots=1
>>>>>
>>>>> and start the node schedulers in the following way:
>>>>> mpirun --np 2 --machinefile machinefile ./scheduler ...
>>>>>
>>>>> This allocates the processes correctly; the problem starts when I invoke MPI_Spawn. Basically, MPI_Spawn also uses the information from the machinefile, so if 4 MPI processes are spawned, 2 are allocated on s01 and 2 on s02. What I want is to always allocate the spawned processes on the same node.
>>>>>
>>>>> I tried to do this by specifying an MPI_Info object which is then passed to the MPI_Spawn routine. I tried to set the "host" property to the hostname of the machine where the scheduler is running, but this didn't help.
>>>>>
>>>>> Unfortunately there is very little documentation on this.
>>>>>
>>>>> Thanks for the help,
>>>>> Simone
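
For anyone following along, a minimal, self-contained sketch of the pattern under discussion is below. It is not the scheduler code from the thread: the program simply spawns three copies of itself, asks for them on the local node via the "host" info key, and has each child report where it actually landed via MPI_Get_processor_name, so the placement can be checked directly. The file name, build line, and run line are placeholders, not taken from the thread.

// spawn_host_check.cpp -- illustrative sketch only
//
//   build:  mpicxx spawn_host_check.cpp -o spawn_host_check
//   run:    mpirun -np 1 --machinefile machinefile ./spawn_host_check
//
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Distinguish the original process from the spawned children.
    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    // Note: as Ralph points out above, the name used for the "host" key
    // should match what appears in the machine file.
    char node[MPI_MAX_PROCESSOR_NAME];
    int len = 0;
    MPI_Get_processor_name(node, &len);

    if (parent != MPI_COMM_NULL) {
        // Spawned child: report which node the mapper actually chose.
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        std::printf("child %d is running on %s\n", rank, node);
        MPI_Comm_disconnect(&parent);
    } else {
        // Parent: request that all children be placed on *this* node via the
        // "host" info key -- the setting the thread reports being ignored.
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", node);

        MPI_Comm intercomm;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 3, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Comm_disconnect(&intercomm);
    }

    MPI_Finalize();
    return 0;
}

If the "host" key is honoured, all three children print the parent's node; in a run like the one logged above they would instead come up spread across b01 and b02.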