Hi, I agree that my examples are not very clear. What I want to do is launch a multi-executable (master/slave) application and still benefit from processor affinity. Could you show me how to convert the following command to use the -rf option (whatever the actual affinity layout ends up being)?

mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002 slave.x options4
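My own guess at the conversion is sketched below. I am assuming here that the rankfile is global to the whole mpirun command (i.e. ranks are numbered across the colon-separated app contexts in order); the file name rankfile.all and the slot numbers are only placeholders, since I do not yet know which cores each rank should be pinned to, and I am not sure whether the per-context -host options are still needed (or even allowed) once every rank is pinned to a node in the rankfile:

cat rankfile.all
rank 0=r001n001 slot=0
rank 1=r001n002 slot=0
rank 2=r001n001 slot=1
rank 3=r001n002 slot=1

mpirun -rf rankfile.all -n 1 -host r001n001 master.x options1 : \
                        -n 1 -host r001n002 master.x options2 : \
                        -n 1 -host r001n001 slave.x options3 : \
                        -n 1 -host r001n002 slave.x options4

Is that the intended usage, or does each app context need its own rankfile? (I have also put a couple of quick follow-up notes below your quoted message.)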
Thanks for your help

Geoffroy

> Message: 2
> Date: Sun, 12 Apr 2009 18:26:35 +0300
> From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
>
> The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
> while n=1, which means only rank 0 is present and can be allocated.
>
> NP must be >= the largest rank in rankfile.
>
> What exactly are you trying to do?
>
> I tried to recreate your segv but all I got was
>
> ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> supported MCA v2.0.0) -- ignored
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_carto_base_select failed
>   --> Returned value -13 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../orte/runtime/orte_init.c at line 78
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../orte/orted/orted_main.c at line 344
> --------------------------------------------------------------------------
> A daemon (pid 11629) died unexpectedly with status 243 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> Lenny.
>
>
> On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
> >
> > Hi,
> >
> > I am currently testing the process affinity capabilities of openmpi and I
> > would like to know if the rankfile behaviour I will describe below is
> > normal or not?
> >
> > cat hostfile.0
> > r011n002 slots=4
> > r011n003 slots=4
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> > ##################################################################################
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
> > r011n002
> > r011n003
> >
> > ##################################################################################
> > but
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### CRASHED
> >
> > --------------------------------------------------------------------------
> > Error, invalid rank (1) in the rankfile (rankfile.0)
> > --------------------------------------------------------------------------
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > rmaps_rank_file.c at line 404
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > base/rmaps_base_map_job.c at line 87
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > base/plm_base_launch_support.c at line 77
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > plm_rsh_module.c at line 985
> > --------------------------------------------------------------------------
> > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> > launch so we are aborting.
> >
> > There may be more information reported by the environment (see above).
> >
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > orterun noticed that the job aborted, but has no info as to the process
> > that caused that situation.
> > --------------------------------------------------------------------------
> > orterun: clean termination accomplished
> >
> > It seems that the rankfile option is not propagated to the second command
> > line; there is no global understanding of the ranking inside an mpirun
> > command.
> >
> > ##################################################################################
> >
> > Assuming that, I tried to provide a rankfile to each command line:
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> >
> > cat rankfile.1
> > rank 0=r011n003 slot=1
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname   ### CRASHED
> >
> > [r011n002:28778] *** Process received signal ***
> > [r011n002:28778] Signal: Segmentation fault (11)
> > [r011n002:28778] Signal code: Address not mapped (1)
> > [r011n002:28778] Failing at address: 0x34
> > [r011n002:28778] [ 0] [0xffffe600]
> > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
> > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
> > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
> > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
> > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
> > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
> > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
> > [r011n002:28778] *** End of error message ***
> > Segmentation fault (core dumped)
> >
> > I hope that I've found a bug, because this capability is very important to
> > me: launching a multi-executable mpirun command line and being able to
> > bind my executables to sockets.
> >
> > Thanks in advance for your help
> >
> > Geoffroy
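Coming back to your point that NP must be >= the largest rank in the rankfile, I want to be sure I read it correctly (please correct me if I am wrong). With rankfile.0 defining ranks 0 and 1, my understanding is:

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname                   ### np=2, ranks 0 and 1 can both be placed
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### still np=2 in total, so if the ranking is global this should map too

The second form is exactly the one that crashed for me with "invalid rank (1)", which is why I suspect the rankfile mapper is looking at each app context separately instead of at the whole job.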
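Independently of the mapping question, once a multi-executable run actually launches I plan to double-check the resulting binding myself. This is just my own idea (not something taken from the Open MPI docs): have every process print the CPUs it is allowed to run on, along the lines of:

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 sh -c 'echo "$(hostname) $(grep Cpus_allowed_list /proc/self/status)"'

If the rankfile is honoured, each line should show only the CPU(s) requested for that rank. Does that match what you had in mind?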