Hi

The result: everything works fine with MPI executables, which is logical!

What I was trying to do was to run non-MPI executables via mpirun. In that
case, Open MPI is not able to bind the processes to a particular CPU.
My conclusion is that the process affinity is set in MPI_Init; is that right?

Would it be possible to have the paffinity features work without any
MPI_Init call, using taskset for example? I agree it's not your job to
support running arbitrary executables, but it would be nice!
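
For instance, here is roughly what I have in mind (just a sketch, assuming
that mpirun exports OMPI_COMM_WORLD_RANK to each launched process and that
taskset is available on the compute nodes):

% cat bind_wrap.sh
#!/bin/sh
# Hypothetical wrapper: derive a CPU number from the Open MPI rank and
# bind the (non-MPI) executable to it before exec'ing it.
cpu=${OMPI_COMM_WORLD_RANK:-0}
exec taskset -c "$cpu" "$@"

% chmod +x bind_wrap.sh
% mpirun -n 4 --hostfile hostf ./bind_wrap.sh ./serial_exe

With taskset doing the binding, serial_exe itself never needs to call
MPI_Init (on several nodes one would want a node-local rank instead of the
global one, of course).
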
Thanks again for all your efforts; I really appreciate them.

I am looking forward to downloading, trying, and deploying the next
official release.

Regards

Geoffroy



2009/5/4 Geoffroy Pignot <geopig...@gmail.com>

> Hi Ralph
>
> Thanks for your extra tests. Before leaving, I just pointed out a problem
> that comes from running PLPA across different RHEL distributions (i.e.,
> different Linux kernels). Indeed, I configured and compiled Open MPI on
> RHEL4, then ran it on RHEL5. I think my problem comes from this mismatch.
> I'll do a few more tests tomorrow morning (France) and keep you informed.
>
> Regards
>
> Geoffroy
>
>
>>
>> Message: 2
>> Date: Mon, 4 May 2009 13:34:40 -0600
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID:
>>        <71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hmmm...I'm afraid I can't replicate the problem. All seems to be working
>> just fine on the RHEL systems available to me. The procs indeed bind to
>> the
>> specified processors in every case.
>>
>> [rhc@odin ~/trunk]$ cat rankfile
>> rank 0=odin001 slot=0
>> rank 1=odin002 slot=1
>>
>> [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session-attached -mca paffinity_base_verbose 5 ./mpi_spin
>> [odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
>> [odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
>>
>> Suspended
>> [rhc@odin mpi]$ ssh odin001
>> [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> S    rhc        0  9296  0.0 orted
>> RLl  rhc        0  9297  100 mpi_spin
>>
>> [rhc@odin mpi]$ ssh odin002
>> [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> S    rhc        0 13562  0.0 orted
>> RLl  rhc        1 13566  102 mpi_spin
>>
>>
>> Not sure where to go from here...perhaps someone else can spot the
>> problem?
>> Ralph
>>
>>
>> On Mon, May 4, 2009 at 8:28 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> > Unfortunately, I didn't write any of that code - I was just fixing the
>> > mapper so it would properly map the procs. From what I can tell, the
>> > proper things are happening there.
>> >
>> > I'll have to dig into the code that specifically deals with parsing the
>> > results to bind the processes. Afraid that will take awhile longer -
>> > pretty dark in that hole.
>> >
>> >
>> >
>> > On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot <geopig...@gmail.com> wrote:
>>
>> >
>> >> Hi,
>> >>
>> >> So, there are no more crashes with my "crazy" mpirun command, but the
>> >> paffinity feature seems to be broken: I am not able to pin my
>> >> processes.
>> >>
>> >> Here is a simple test with a program that uses your PLPA library:
>> >>
>> >> r011n006% cat hostf
>> >> r011n006 slots=4
>> >>
>> >> r011n006% cat rankf
>> >> rank 0=r011n006 slot=0   ----> bind to CPU 0, correct?
>> >>
>> >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile rankf --wdir /tmp -n 1 a.out
>> >>  >>> PLPA Number of processors online: 4
>> >>  >>> PLPA Number of processor sockets: 2
>> >>  >>> PLPA Socket 0 (ID 0): 2 cores
>> >>  >>> PLPA Socket 1 (ID 3): 2 cores
>> >>
>> >> Ctrl+Z
>> >> r011n006% bg
>> >>
>> >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
>> >> R+   gpignot    3  9271 97.8 a.out
>> >>
>> >> In fact, whatever slot number I put in my rankfile, a.out always runs
>> >> on CPU 3. I was expecting it on CPU 0, according to my cpuinfo file
>> >> (see below).
>> >> The result is the same if I try the other syntax (rank 0=r011n006
>> >> slot=0:0, i.e. bind to socket 0, core 0, correct?).
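>> >>
>> >> (A side note on checking this - a sketch of a cross-check, not output
>> >> from this run: the "psr" column from ps only shows the CPU the process
>> >> last ran on, not the mask it is bound to. The kernel's actual affinity
>> >> mask can be read directly, for example:
>> >>
>> >> r011n006% taskset -p 9271
>> >> r011n006% grep Cpus_allowed /proc/9271/status
>> >>
>> >> which distinguishes "not bound at all" from "bound to the wrong CPU".)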
>> >>
>> >> Thanks in advance
>> >>
>> >> Geoffroy
>> >>
>> >> PS: I run on rhel5
>> >>
>> >> r011n006% uname -a
>> >> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT
>> >> 2008 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> My configure is :
>> >>  ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64'
>> >> --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
>> >>
>> >>
>> >> r011n006% cat /proc/cpuinfo
>> >> processor       : 0
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 0
>> >> siblings        : 2
>> >> core id         : 0
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5323.68
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >> processor       : 1
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 3
>> >> siblings        : 2
>> >> core id         : 0
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5320.03
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >> processor       : 2
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 0
>> >> siblings        : 2
>> >> core id         : 1
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5319.39
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >> processor       : 3
>> >> vendor_id       : GenuineIntel
>> >> cpu family      : 6
>> >> model           : 15
>> >> model name      : Intel(R) Xeon(R) CPU            5150  @ 2.66GHz
>> >> stepping        : 6
>> >> cpu MHz         : 2660.007
>> >> cache size      : 4096 KB
>> >> physical id     : 3
>> >> siblings        : 2
>> >> core id         : 1
>> >> cpu cores       : 2
>> >> fpu             : yes
>> >> fpu_exception   : yes
>> >> cpuid level     : 10
>> >> wp              : yes
>> >> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
>> >> bogomips        : 5320.03
>> >> clflush size    : 64
>> >> cache_alignment : 64
>> >> address sizes   : 36 bits physical, 48 bits virtual
>> >> power management:
>> >>
>> >>
>> >>> ------------------------------
>> >>>
>> >>> Message: 2
>> >>> Date: Mon, 4 May 2009 04:45:57 -0600
>> >>> From: Ralph Castain <r...@open-mpi.org>
>> >>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> >>> To: Open MPI Users <us...@open-mpi.org>
>> >>> Message-ID: <d01d7b16-4b47-46f3-ad41-d1a90b2e4...@open-mpi.org>
>> >>>
>> >>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> >>>        DelSp="yes"
>> >>>
>> >>> My apologies - I wasn't clear enough. You need a tarball from r21111
>> >>> or greater...such as:
>> >>>
>> >>> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
>> >>>
>> >>> HTH
>> >>> Ralph
>> >>>
>> >>>
>> >>> On May 4, 2009, at 2:14 AM, Geoffroy Pignot wrote:
>> >>>
>> >>> > Hi,
>> >>> >
>> >>> > I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my
>> >>> > command doesn't work
>> >>> >
>> >>> > cat rankf:
>> >>> > rank 0=node1 slot=*
>> >>> > rank 1=node2 slot=*
>> >>> >
>> >>> > cat hostf:
>> >>> > node1 slots=2
>> >>> > node2 slots=2
>> >>> >
>> >>> > mpirun --rankfile rankf --hostfile hostf --host node1 -n 1 hostname : --host node2 -n 1 hostname
>> >>> >
>> >>> > Error, invalid rank (1) in the rankfile (rankf)
>> >>> >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> >>> > rmaps_rank_file.c at line 403
>> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> >>> > base/rmaps_base_map_job.c at line 86
>> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> >>> > base/plm_base_launch_support.c at line 86
>> >>> > [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> >>> > plm_rsh_module.c at line 1016
>> >>> >
>> >>> >
>> >>> > Ralph, could you tell me whether my command syntax is correct? If
>> >>> > not, could you give me the expected one?
>> >>> >
>> >>> > Regards
>> >>> >
>> >>> > Geoffroy
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > 2009/4/30 Geoffroy Pignot <geopig...@gmail.com>
>> >>> > Immediately Sir !!! :)
>> >>> >
>> >>> > Thanks again Ralph
>> >>> >
>> >>> > Geoffroy
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>> > ------------------------------
>> >>> >
>> >>> > Message: 2
>> >>> > Date: Thu, 30 Apr 2009 06:45:39 -0600
>> >>> > From: Ralph Castain <r...@open-mpi.org>
>> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> >>> > To: Open MPI Users <us...@open-mpi.org>
>> >>> > Message-ID:
>> >>> >        <71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com
>> >
>> >>> > Content-Type: text/plain; charset="iso-8859-1"
>> >>> >
>> >>> > I believe this is fixed now in our development trunk - you can
>> >>> > download any
>> >>> > tarball starting from last night and give it a try, if you like. Any
>> >>> > feedback would be appreciated.
>> >>> >
>> >>> > Ralph
>> >>> >
>> >>> >
>> >>> > On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>> >>> >
>> >>> > Ah now, I didn't say it -worked-, did I? :-)
>> >>> >
>> >>> > Clearly a bug exists in the program. I'll try to take a look at it
>> >>> > (if Lenny
>> >>> > doesn't get to it first), but it won't be until later in the week.
>> >>> >
>> >>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>> >>> >
>> >>> > I agree with you, Ralph, and that's what I expect from Open MPI,
>> >>> > but my second example shows that it's not working:
>> >>> >
>> >>> > cat hostfile.0
>> >>> >   r011n002 slots=4
>> >>> >   r011n003 slots=4
>> >>> >
>> >>> >  cat rankfile.0
>> >>> >    rank 0=r011n002 slot=0
>> >>> >    rank 1=r011n003 slot=1
>> >>> >
>> >>> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>> >>> > hostname
>> >>> > ### CRASHED
>> >>> >
>> >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > rmaps_rank_file.c at line 404
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > base/rmaps_base_map_job.c at line 87
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > base/plm_base_launch_support.c at line 77
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > plm_rsh_module.c at line 985
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > A daemon (pid unknown) died unexpectedly on signal 1  while
>> >>> > > attempting to
>> >>> > > > launch so we are aborting.
>> >>> > > >
>> >>> > > > There may be more information reported by the environment (see
>> >>> > > above).
>> >>> > > >
>> >>> > > > This may be because the daemon was unable to find all the needed
>> >>> > > shared
>> >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>> to
>> >>> > > have the
>> >>> > > > location of the shared libraries on the remote nodes and this
>> will
>> >>> > > > automatically be forwarded to the remote nodes.
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > orterun noticed that the job aborted, but has no info as to the
>> >>> > > process
>> >>> > > > that caused that situation.
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > orterun: clean termination accomplished
>> >>> >
>> >>> >
>> >>> >
>> >>> > Message: 4
>> >>> > Date: Tue, 14 Apr 2009 06:55:58 -0600
>> >>> > From: Ralph Castain <r...@lanl.gov>
>> >>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> >>> > To: Open MPI Users <us...@open-mpi.org>
>> >>> > Message-ID: <f6290ada-a196-43f0-a853-cbcb802d8...@lanl.gov>
>> >>> > Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>> >>> >       DelSp="yes"
>> >>> >
>> >>> > The rankfile cuts across the entire job - it isn't applied on an
>> >>> > app_context basis. So the ranks in your rankfile must correspond to
>> >>> > the eventual rank of each process in the cmd line.
>> >>> >
>> >>> > Unfortunately, that means you have to count ranks. In your case, you
>> >>> > only have four, so that makes life easier. Your rankfile would look
>> >>> > something like this:
>> >>> >
>> >>> > rank 0=r001n001 slot=0
>> >>> > rank 1=r001n002 slot=1
>> >>> > rank 2=r001n001 slot=1
>> >>> > rank 3=r001n002 slot=2
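>> >>> >
>> >>> > Matched against the four app contexts from your command (a sketch
>> >>> > based on the command quoted below, not something I have run), the
>> >>> > ranks are assigned in order across the colon-separated contexts:
>> >>> >
>> >>> > mpirun -rf myrankfile -n 1 -host r001n001 master.x options1 : \
>> >>> >        -n 1 -host r001n002 master.x options2 : \
>> >>> >        -n 1 -host r001n001 slave.x options3 : \
>> >>> >        -n 1 -host r001n002 slave.x options4
>> >>> >
>> >>> > so the two master.x procs get ranks 0 and 1, and the two slave.x
>> >>> > procs get ranks 2 and 3.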
>> >>> >
>> >>> > HTH
>> >>> > Ralph
>> >>> >
>> >>> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>> >>> >
>> >>> > > Hi,
>> >>> > >
>> >>> > > I agree that my examples are not very clear. What I want to do is
>> >>> > > to launch a multi-executable application (masters-slaves) and
>> >>> > > benefit from processor affinity.
>> >>> > > Could you show me how to convert this command using the -rf option
>> >>> > > (whatever the affinity is)?
>> >>> > >
>> >>> > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>> >>> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
>> >>> > > -host r001n002 slave.x options4
>> >>> > >
>> >>> > > Thanks for your help
>> >>> > >
>> >>> > > Geoffroy
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > >
>> >>> > > Message: 2
>> >>> > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>> >>> > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>> >>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> >>> > > To: Open MPI Users <us...@open-mpi.org>
>> >>> > > Message-ID:
>> >>> > >        <
>> 453d39990904120826t2e1d1d33l7bb1fe3de65b5...@mail.gmail.com>
>> >>> > > Content-Type: text/plain; charset="iso-8859-1"
>> >>> > >
>> >>> > > Hi,
>> >>> > >
>> >>> > > The first "crash" is OK, since your rankfile has ranks 0 and 1
>> >>> > > defined, while n=1, which means only rank 0 is present and can be
>> >>> > > allocated.
>> >>> > >
>> >>> > > NP must be greater than the largest rank in the rankfile (ranks
>> >>> > > are numbered from 0).
>> >>> > >
>> >>> > > What exactly are you trying to do?
>> >>> > >
>> >>> > > I tried to recreate your segv, but all I got was:
>> >>> > >
>> >>> > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
>> >>> > > hostfile.0
>> >>> > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> >>> > > [witch19:30798] mca: base: component_find: paffinity
>> >>> > > "mca_paffinity_linux"
>> >>> > > uses an MCA interface that is not recognized (component MCA
>> >>> > v1.0.0 !=
>> >>> > > supported MCA v2.0.0) -- ignored
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > It looks like opal_init failed for some reason; your parallel
>> >>> > > process is
>> >>> > > likely to abort. There are many reasons that a parallel process
>> can
>> >>> > > fail during opal_init; some of which are due to configuration or
>> >>> > > environment problems. This failure appears to be an internal
>> >>> > failure;
>> >>> > > here's some additional information (which may only be relevant to
>> an
>> >>> > > Open MPI developer):
>> >>> > >
>> >>> > >  opal_carto_base_select failed
>> >>> > >  --> Returned value -13 instead of OPAL_SUCCESS
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>> >>> > file
>> >>> > > ../../orte/runtime/orte_init.c at line 78
>> >>> > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
>> >>> > file
>> >>> > > ../../orte/orted/orted_main.c at line 344
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > A daemon (pid 11629) died unexpectedly with status 243 while
>> >>> > > attempting
>> >>> > > to launch so we are aborting.
>> >>> > >
>> >>> > > There may be more information reported by the environment (see
>> >>> > above).
>> >>> > >
>> >>> > > This may be because the daemon was unable to find all the needed
>> >>> > > shared
>> >>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>> >>> > > have the
>> >>> > > location of the shared libraries on the remote nodes and this will
>> >>> > > automatically be forwarded to the remote nodes.
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > mpirun noticed that the job aborted, but has no info as to the
>> >>> > process
>> >>> > > that caused that situation.
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > mpirun: clean termination accomplished
>> >>> > >
>> >>> > >
>> >>> > > Lenny.
>> >>> > >
>> >>> > >
>> >>> > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>> >>> > > >
>> >>> > > > Hi,
>> >>> > > >
>> >>> > > > I am currently testing the process affinity capabilities of
>> >>> > > > Open MPI, and I would like to know whether the rankfile
>> >>> > > > behaviour I describe below is normal or not.
>> >>> > > >
>> >>> > > > cat hostfile.0
>> >>> > > > r011n002 slots=4
>> >>> > > > r011n003 slots=4
>> >>> > > >
>> >>> > > > cat rankfile.0
>> >>> > > > rank 0=r011n002 slot=0
>> >>> > > > rank 1=r011n003 slot=1
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> ##################################################################################
>> >>> > > >
>> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
>> >>> > > > r011n002
>> >>> > > > r011n003
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> ##################################################################################
>> >>> > > > but
>> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname   ### CRASHED
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > rmaps_rank_file.c at line 404
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > base/rmaps_base_map_job.c at line 87
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > base/plm_base_launch_support.c at line 77
>> >>> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
>> >>> > file
>> >>> > > > plm_rsh_module.c at line 985
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > A daemon (pid unknown) died unexpectedly on signal 1  while
>> >>> > > attempting to
>> >>> > > > launch so we are aborting.
>> >>> > > >
>> >>> > > > There may be more information reported by the environment (see
>> >>> > > above).
>> >>> > > >
>> >>> > > > This may be because the daemon was unable to find all the needed
>> >>> > > shared
>> >>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
>> to
>> >>> > > have the
>> >>> > > > location of the shared libraries on the remote nodes and this
>> will
>> >>> > > > automatically be forwarded to the remote nodes.
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > orterun noticed that the job aborted, but has no info as to the
>> >>> > > process
>> >>> > > > that caused that situation.
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> --------------------------------------------------------------------------
>> >>> > > > orterun: clean termination accomplished
>> >>> > > > It seems that the rankfile option is not propagated to the
>> >>> > > > second command line; there is no global understanding of the
>> >>> > > > ranking inside an mpirun command.
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> ##################################################################################
>> >>> > > >
>> >>> > > > Given that, I tried to provide a rankfile to each command
>> >>> > > > line:
>> >>> > > >
>> >>> > > > cat rankfile.0
>> >>> > > > rank 0=r011n002 slot=0
>> >>> > > >
>> >>> > > > cat rankfile.1
>> >>> > > > rank 0=r011n003 slot=1
>> >>> > > >
>> >>> > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname   ### CRASHED
>> >>> > > > [r011n002:28778] *** Process received signal ***
>> >>> > > > [r011n002:28778] Signal: Segmentation fault (11)
>> >>> > > > [r011n002:28778] Signal code: Address not mapped (1)
>> >>> > > > [r011n002:28778] Failing at address: 0x34
>> >>> > > > [r011n002:28778] [ 0] [0xffffe600]
>> >>> > > > [r011n002:28778] [ 1]
>> >>> > > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>> >>> > > 0(orte_odls_base_default_get_add_procs_data+0x55d)
>> >>> > > > [0x5557decd]
>> >>> > > > [r011n002:28778] [ 2]
>> >>> > > > /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.
>> >>> > > 0(orte_plm_base_launch_apps+0x117)
>> >>> > > > [0x555842a7]
>> >>> > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/
>> >>> > > mca_plm_rsh.so
>> >>> > > > [0x556098c0]
>> >>> > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> >>> > > [0x804aa27]
>> >>> > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> >>> > > [0x804a022]
>> >>> > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc)
>> >>> > > [0x9f1dec]
>> >>> > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun
>> >>> > > [0x8049f71]
>> >>> > > > [r011n002:28778] *** End of error message ***
>> >>> > > > Segmentation fault (core dumped)
>> >>> > > >
>> >>> > > >
>> >>> > > >
>> >>> > > > I hope that I've found a bug, because this kind of capability
>> >>> > > > would be very important for me: launching a multi-executable
>> >>> > > > mpirun command line and being able to bind my executables to
>> >>> > > > sockets.
>> >>> > > >
>> >>> > > > Thanks in advance for your help
>> >>> > > >
>> >>> > > > Geoffroy