Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-16 Thread Geoffroy Pignot
Hi,

I did my classic test (see below) with the 1.3.3 and, unfortunately, it
doesn't work. It seems that the modifications you made in
openmpi-1.4a1r21142 (where the test passed) were not carried over into this
release. Could you confirm that?

Thanks

Geoffroy

***** BASIC TEST *****
cat rankf:
 rank 0=node1 slot=*
 rank 1=node2 slot=*

cat hostf:
  node1 slots=2
  node2 slots=2

mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
--host node2 -n 1 hostname

--
Error, invalid rank (1) in the rankfile (rankf)

--
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 404
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 87
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/plm_base_launch_support.c at line 77
[rdmftd02:01726] [[60757,0],0] ORTE_ERROR_LOG: Bad parameter in file
plm_rsh_module.c at line 990
--


2009/7/15 Geoffroy Pignot 

> Hi Lenny and Ralph,
>
> I saw nothing about rankfile in the 1.3.3 press release. Does that mean
> that the bug fixes are not included there?
> Thanks
>
> Geoffroy
>
> 2009/7/15 
>
>> Message: 1
>> Date: Wed, 15 Jul 2009 15:08:39 +0300
>> From: Lenny Verkhovsky 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users 
>>
>> Same result.
>> I still suspect that the rankfile looks for the node in the small hostlist
>> provided by the appfile line, and not in the hostlist provided to mpirun
>> on the HNP node.
>> Given that suspicion, your proposal should not work (and it does not),
>> since in the appfile line I provide np=1 and one host, while the rankfile
>> tries to allocate all ranks (np=2).
>>
>> In $orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 338:
>>
>>   if (ORTE_SUCCESS != (rc = orte_rmaps_base_get_target_nodes(&node_list,
>>                                                              &num_slots, app,
>>                                                              map->policy))) {
>>
>> node_list will be partial, built per app context, and not the full list
>> provided on the mpirun command line. If I don't provide a hostlist in the
>> appfile line, mpirun uses the local host and not the hosts from the
>> hostfile.
>>
>>
>> Tell me if I am wrong to expect the following behavior:
>>
>> I provide to mpirun NP, the full hostlist, the full rankfile, and an appfile;
>> in each appfile line I provide only a partial NP and a partial hostlist;
>> and it works.
>>
>> Currently, in order to get it working I need to provide the full hostlist
>> in the appfile, which is quite problematic.
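>>
>> For illustration, a minimal sketch of that expected usage (host and
>> program names are hypothetical):
>>
>> $ cat appfile
>> -np 1 -host host1 ./prog_a
>> -np 1 -host host2 ./prog_b
>> $ cat full_hostfile
>> host1 slots=2
>> host2 slots=2
>> $ mpirun -np 2 -hostfile full_hostfile -rf full_rankfile -app appfile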
>>
>>
>> $mpirun -np 2 -rf rankfile -app appfile
>> --
>> Rankfile claimed host +n1 by index that is bigger than number of allocated
>> hosts.
>> --
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c at line 422
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/rmaps/base/rmaps_base_map_job.c at line 85
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../orte/mca/plm/base/plm_base_launch_support.c at line 103
>> [dellix7:17277] [[23928,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
>>
>>
>> Thanks
>> Lenny.
>>
>>
>> On Wed, Jul 15, 2009 at 2:02 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-15 Thread Geoffroy Pignot
 line 103
> >>>> [dellix7:13414] [[10851,0],0] ORTE_ERROR_LOG: Bad parameter in file
> >>>> ../../../../../orte/mca/plm/rsh/plm_rsh_module.c at line 1001
> >>>>
> >>>>
> >>>> The problem is that the rankfile mapper tries to find an appropriate
> >>>> host in the partial (and not the full) hostlist.
> >>>>
> >>>> Any suggestions how to fix it?
> >>>>
> >>>> Thanks
> >>>> Lenny.
> >>>>
> >>>> On Wed, May 13, 2009 at 1:55 AM, Ralph Castain  >wrote:
> >>>>
> >>>>> Okay, I fixed this today too - r21219
> >>>>>
> >>>>>
> >>>>> On May 11, 2009, at 11:27 PM, Anton Starikov wrote:
> >>>>>
> >>>>> Now there is another problem :)
> >>>>>>
> >>>>>> You can oversubscribe a node, at least by one task.
> >>>>>> If your hostfile and rankfile limit you to N procs, you can ask
> >>>>>> mpirun for N+1 and it will not be rejected, although in reality
> >>>>>> there will be only N tasks.
> >>>>>> So, if your hostfile limit is 4, then "mpirun -np 4" and "mpirun -np
> >>>>>> 5" both work, but in both cases there are only 4 tasks. It isn't
> >>>>>> crucial, because there is no real oversubscription, but there is
> >>>>>> still a bug which could affect something in the future.
> >>>>>>
> >>>>>> --
> >>>>>> Anton Starikov.
> >>>>>>
> >>>>>> On May 12, 2009, at 1:45 AM, Ralph Castain wrote:
> >>>>>>
> >>>>>> This is fixed as of r21208.
> >>>>>>>
> >>>>>>> Thanks for reporting it!
> >>>>>>> Ralph
> >>>>>>>
> >>>>>>>
> >>>>>>> On May 11, 2009, at 12:51 PM, Anton Starikov wrote:
> >>>>>>>
> >>>>>>>> Although removing this check solves the problem of having more
> >>>>>>>> slots in the rankfile than necessary, there is another problem.
> >>>>>>>>
> >>>>>>>> If I set rmaps_base_no_oversubscribe=1 then if, for example:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> hostfile:
> >>>>>>>>
> >>>>>>>> node01
> >>>>>>>> node01
> >>>>>>>> node02
> >>>>>>>> node02
> >>>>>>>>
> >>>>>>>> rankfile:
> >>>>>>>>
> >>>>>>>> rank 0=node01 slot=1
> >>>>>>>> rank 1=node01 slot=0
> >>>>>>>> rank 2=node02 slot=1
> >>>>>>>> rank 3=node02 slot=0
> >>>>>>>>
> >>>>>>>> mpirun -np 4 ./something
> >>>>>>>>
> >>>>>>>> complains with:
> >>>>>>>>
> >>>>>>>> "There are not enough slots available in the system to satisfy the
> 4
> >>>>>>>> slots
> >>>>>>>> that were requested by the application"
> >>>>>>>>
> >>>>>>>> but "mpirun -np 3 ./something" will work. It works when you ask
> >>>>>>>> for 1 CPU less, and the behavior is the same in every case (shared
> >>>>>>>> nodes, non-shared nodes, multi-node).
> >>>>>>>>
> >>>>>>>> If you switch off rmaps_base_no_oversubscribe, then it works and
> >>>>>>>> all affinities are set as requested in the rankfile; there is no
> >>>>>>>> oversubscription.
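> >>>>>>>>
> >>>>>>>> (For reference, a sketch of how that switch can be set on the
> >>>>>>>> command line - not tested here:
> >>>>>>>> mpirun -mca rmaps_base_no_oversubscribe 1 -rf rankfile -np 4 ./something)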
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Anton.
> >>>>>>>>
> >>>>>>>> On May 5, 2009, at 3:08 PM, Ralph Castain wrote:
> >>>>>>>>
> >>>>>>>> Ah - thx for catching that, I'll remove that check. It no longer
> is
> >>>>>>>>> required.
> >>>>>>>>>
> >>>>>>>>> Thx!

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Geoffroy Pignot
Hi

The result: everything works fine with MPI executables - logical!

What I was trying to do was to run non-MPI exes through mpirun. There,
openmpi is not able to bind these processes to a particular CPU.
My conclusion is that the process affinity is set in MPI_Init, right?

Would it be possible to have the paffinity features work without any
MPI_Init call, using taskset for example? I agree it's not your job to
support the execution of any kind of exe, but it would be nice!
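
In the meantime, a workaround sketch that seems plausible (CPU numbers and
program names are only examples): wrap each non-MPI exe in taskset, so the
pinning happens before the exec, e.g.

mpirun -n 1 -host node1 taskset -c 0 ./serial_a.x : \
       -n 1 -host node1 taskset -c 1 ./serial_b.x
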
Thanks again for all your efforts; I really appreciate it.

I am looking forward to downloading, trying, and deploying the next official
release.

Regards

Geoffroy



2009/5/4 Geoffroy Pignot 

> Hi Ralph
>
> Thanks for your extra tests. Before leaving, I just pointed out a problem
> coming from running plpa across different RH distribs (<=> different Linux
> kernels). Indeed, I configure and compile openmpi on rhel4, then I run on
> rhel5. I think my problem comes from this mismatch. I'll do a few more
> tests tomorrow morning (France) and keep you informed.
>
> Regards
>
> Geoffroy
>
>
>>
>> Message: 2
>> Date: Mon, 4 May 2009 13:34:40 -0600
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users 
>>
>> Hmmm...I'm afraid I can't replicate the problem. All seems to be working
>> just fine on the RHEL systems available to me. The procs indeed bind to
>> the
>> specified processors in every case.
>>
>> rhc@odin ~/trunk]$ cat rankfile
>> rank 0=odin001 slot=0
>> rank 1=odin002 slot=1
>>
>> [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2
>> --leave-session-attached
>> -mca paffinity_base_verbose 5 ./mpi_spin
>> [odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
>> [odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on
>> cpu #0 (#0)
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on
>> cpu #1 (#1)
>>
>> Suspended
>> [rhc@odin mpi]$ ssh odin001
>> [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> S    rhc    0  9296  0.0 orted
>> RLl  rhc    0  9297  100 mpi_spin
>>
>> [rhc@odin mpi]$ ssh odin002
>> [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> S    rhc    0 13562  0.0 orted
>> RLl  rhc    1 13566  102 mpi_spin
>>
>>
>> Not sure where to go from here...perhaps someone else can spot the
>> problem?
>> Ralph
>>
>>
>> On Mon, May 4, 2009 at 8:28 AM, Ralph Castain  wrote:
>>
>> > Unfortunately, I didn't write any of that code - I was just fixing the
>> > mapper so it would properly map the procs. From what I can tell, the
>> proper
>> > things are happening there.
>> >
>> > I'll have to dig into the code that specifically deals with parsing the
>> > results to bind the processes. Afraid that will take awhile longer -
>> pretty
>> > dark in that hole.
>> >
>> >
>> >
>> > On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot > >wrote:
>>
>> >
>> >> Hi,
>> >>
>> >> So, there are no more crashes with my "crazy" mpirun command. But the
>> >> paffinity feature seems to be broken. Indeed I am not able to pin my
>> >> processes.
>> >>
>> >> Simple test with a program using your plpa library :
>> >>
>> >> r011n006% cat hostf
>> >> r011n006 slots=4
>> >>
>> >> r011n006% cat rankf
>> >> rank 0=r011n006 slot=0   ---> bind to CPU 0, right?
>> >>
>> >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf
>> --rankfile
>> >> rankf --wdir /tmp -n 1 a.out
>> >>  >>> PLPA Number of processors online: 4
>> >>  >>> PLPA Number of processor sockets: 2
>> >>  >>> PLPA Socket 0 (ID 0): 2 cores
>> >>  >>> PLPA Socket 1 (ID 3): 2 cores
>> >>
>> >> Ctrl+Z
>> >> r011n006%bg
>> >>
>> >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
>> >> R+   gpignot    3  9271 97.8 a.out

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-04 Thread Geoffroy Pignot
Hi Ralph

Thanks for your extra tests. Before leaving, I just pointed out a problem
coming from running plpa across different RH distribs (<=> different Linux
kernels). Indeed, I configure and compile openmpi on rhel4, then I run on
rhel5. I think my problem comes from this mismatch. I'll do a few more
tests tomorrow morning (France) and keep you informed.

Regards

Geoffroy


>
>
> Message: 2
> Date: Mon, 4 May 2009 13:34:40 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> Hmmm...I'm afraid I can't replicate the problem. All seems to be working
> just fine on the RHEL systems available to me. The procs indeed bind to the
> specified processors in every case.
>
> rhc@odin ~/trunk]$ cat rankfile
> rank 0=odin001 slot=0
> rank 1=odin002 slot=1
>
> [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session-attached
> -mca paffinity_base_verbose 5 ./mpi_spin
> [odin001.cs.indiana.edu:09297] paffinity slot assignment: slot_list == 0
> [odin001.cs.indiana.edu:09297] paffinity slot assignment: rank 0 runs on
> cpu #0 (#0)
> [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
> [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on
> cpu #1 (#1)
>
> Suspended
> [rhc@odin mpi]$ ssh odin001
> [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
> S    rhc    0  9296  0.0 orted
> RLl  rhc    0  9297  100 mpi_spin
>
> [rhc@odin mpi]$ ssh odin002
> [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
> S    rhc    0 13562  0.0 orted
> RLl  rhc    1 13566  102 mpi_spin
>
>
> Not sure where to go from here...perhaps someone else can spot the problem?
> Ralph
>
>
> On Mon, May 4, 2009 at 8:28 AM, Ralph Castain  wrote:
>
> > Unfortunately, I didn't write any of that code - I was just fixing the
> > mapper so it would properly map the procs. From what I can tell, the
> proper
> > things are happening there.
> >
> > I'll have to dig into the code that specifically deals with parsing the
> > results to bind the processes. Afraid that will take awhile longer -
> pretty
> > dark in that hole.
> >
> >
> >
> > On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot  >wrote:
> >
> >> Hi,
> >>
> >> So, there are no more crashes with my "crazy" mpirun command. But the
> >> paffinity feature seems to be broken. Indeed I am not able to pin my
> >> processes.
> >>
> >> Simple test with a program using your plpa library :
> >>
> >> r011n006% cat hostf
> >> r011n006 slots=4
> >>
> >> r011n006% cat rankf
> >> rank 0=r011n006 slot=0   ---> bind to CPU 0, right?
> >>
> >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf
> --rankfile
> >> rankf --wdir /tmp -n 1 a.out
> >>  >>> PLPA Number of processors online: 4
> >>  >>> PLPA Number of processor sockets: 2
> >>  >>> PLPA Socket 0 (ID 0): 2 cores
> >>  >>> PLPA Socket 1 (ID 3): 2 cores
> >>
> >> Ctrl+Z
> >> r011n006%bg
> >>
> >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
> >> R+   gpignot    3  9271 97.8 a.out
> >>
> >> In fact, whatever the slot number I put in my rankfile, a.out always
> >> runs on CPU 3. I expected it on CPU 0 according to my cpuinfo file
> >> (see below).
> >> The result is the same if I try the other syntax (rank 0=r011n006
> >> slot=0:0   ---> bind to socket 0, core 0, right?).
> >>
> >> Thanks in advance
> >>
> >> Geoffroy
> >>
> >> PS: I run on rhel5
> >>
> >> r011n006% uname -a
> >> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT
> >> 2008 x86_64 x86_64 x86_64 GNU/Linux
> >>
> >> My configure is :
> >>  ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64'
> >> --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
> >>
> >>
> >> r011n006% cat /proc/cpuinfo
> >> processor   : 0
> >> vendor_id   : GenuineIntel
> >> cpu family  : 6
> >> model   : 15
> >> model name  : Intel(R) Xeon(

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-04 Thread Geoffroy Pignot
Hi,

So, there are no more crashes with my "crazy" mpirun command. But the
paffinity feature seems to be broken. Indeed I am not able to pin my
processes.

Simple test with a program using your plpa library :

r011n006% cat hostf
r011n006 slots=4

r011n006% cat rankf
rank 0=r011n006 slot=0   ---> bind to CPU 0, right?

r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf --rankfile
rankf --wdir /tmp -n 1 a.out
 >>> PLPA Number of processors online: 4
 >>> PLPA Number of processor sockets: 2
 >>> PLPA Socket 0 (ID 0): 2 cores
 >>> PLPA Socket 1 (ID 3): 2 cores

Ctrl+Z
r011n006%bg

r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
R+   gpignot    3  9271 97.8 a.out

In fact, whatever the slot number I put in my rankfile, a.out always runs on
CPU 3. I expected it on CPU 0 according to my cpuinfo file (see below).
The result is the same if I try the other syntax (rank 0=r011n006 slot=0:0
---> bind to socket 0, core 0, right?).
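
(While I am at it, my reading of the rankfile slot syntaxes used so far - to
be confirmed:

rank 0=r011n006 slot=0     ---> logical processor 0
rank 0=r011n006 slot=0:0   ---> socket 0, core 0
rank 0=r011n006 slot=0:*   ---> any core of socket 0
rank 0=r011n006 slot=*     ---> any processor )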

Thanks in advance

Geoffroy

PS: I run on rhel5

r011n006% uname -a
Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT 2008
x86_64 x86_64 x86_64 GNU/Linux

My configure is :
 ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64'
--disable-dlopen --disable-mpi-cxx --enable-heterogeneous


r011n006% cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5150  @ 2.66GHz
stepping: 6
cpu MHz : 2660.007
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips: 5323.68
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5150  @ 2.66GHz
stepping: 6
cpu MHz : 2660.007
cache size  : 4096 KB
physical id : 3
siblings: 2
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips: 5320.03
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 2
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5150  @ 2.66GHz
stepping: 6
cpu MHz : 2660.007
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips: 5319.39
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 3
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5150  @ 2.66GHz
stepping: 6
cpu MHz : 2660.007
cache size  : 4096 KB
physical id : 3
siblings: 2
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl vmx est tm2 cx16 xtpr lahf_lm
bogomips: 5320.03
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:


> --
>
> Message: 2
> Date: Mon, 4 May 2009 04:45:57 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> My apologies - I wasn't clear enough. You need a tarball from r2
> or greater...such as:
>
> http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21142.tar.gz
>
> HTH
> Ralph
>
>
> On May 4, 2009, at 2:14 AM, Geoffroy Pignot w

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-04 Thread Geoffroy Pignot
Hi,

I got the openmpi-1.4a1r21095.tar.gz tarball
(http://www.open-mpi.org/nightly/trunk/openmpi-1.4a1r21095.tar.gz),
but unfortunately my command doesn't work:

cat rankf:
rank 0=node1 slot=*
rank 1=node2 slot=*

cat hostf:
node1 slots=2
node2 slots=2

mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
--host node2 -n 1 hostname

Error, invalid rank (1) in the rankfile (rankf)

--
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 403
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/plm_base_launch_support.c at line 86
[r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
plm_rsh_module.c at line 1016


Ralph, could you tell me whether my command syntax is correct? If not, could
you give me the expected one?

Regards

Geoffroy




2009/4/30 Geoffroy Pignot 

> Immediately Sir !!! :)
>
> Thanks again Ralph
>
> Geoffroy
>
>
>
>>
>>
>> --
>>
>> Message: 2
>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users 
>>
>> I believe this is fixed now in our development trunk - you can download
>> any
>> tarball starting from last night and give it a try, if you like. Any
>> feedback would be appreciated.
>>
>> Ralph
>>
>>
>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>
>> Ah now, I didn't say it -worked-, did I? :-)
>>
>> Clearly a bug exists in the program. I'll try to take a look at it (if
>> Lenny
>> doesn't get to it first), but it won't be until later in the week.
>>
>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>
>> I agree with you Ralph , and that 's what I expect from openmpi but my
>> second example shows that it's not working
>>
>> cat hostfile.0
>>   r011n002 slots=4
>>   r011n003 slots=4
>>
>>  cat rankfile.0
>>rank 0=r011n002 slot=0
>>rank 1=r011n003 slot=1
>>
>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> ### CRASHED
>>
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > >
>> >
>> --
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > rmaps_rank_file.c at line 404
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/rmaps_base_map_job.c at line 87
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/plm_base_launch_support.c at line 77
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > plm_rsh_module.c at line 985
>> > >
>> >
>> --
>> > > A daemon (pid unknown) died unexpectedly on signal 1  while
>> > attempting to
>> > > launch so we are aborting.
>> > >
>> > > There may be more information reported by the environment (see
>> > above).
>> > >
>> > > This may be because the daemon was unable to find all the needed
>> > shared
>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>> > have the
>> > > location of the shared libraries on the remote nodes and this will
>> > > automatically be forwarded to the remote nodes.
>> > >
>> >
>> --
>> > >
>> >
>> --
>> > > orterun noticed that the job aborted, but has no info as to the
>> > process
>> > > that caused that situation.
>> > >
>> >
>> --
>> > > orterun: clean termination accomplished
>>
>>
>>
>> Message: 4
>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To:

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-30 Thread Geoffroy Pignot
Immediately Sir !!! :)

Thanks again Ralph

Geoffroy


>
>
> --
>
> Message: 2
> Date: Thu, 30 Apr 2009 06:45:39 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> I believe this is fixed now in our development trunk - you can download any
> tarball starting from last night and give it a try, if you like. Any
> feedback would be appreciated.
>
> Ralph
>
>
> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>
> Ah now, I didn't say it -worked-, did I? :-)
>
> Clearly a bug exists in the program. I'll try to take a look at it (if
> Lenny
> doesn't get to it first), but it won't be until later in the week.
>
> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>
> I agree with you Ralph , and that 's what I expect from openmpi but my
> second example shows that it's not working
>
> cat hostfile.0
>   r011n002 slots=4
>   r011n003 slots=4
>
>  cat rankfile.0
>rank 0=r011n002 slot=0
>rank 1=r011n003 slot=1
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> ### CRASHED
>
> > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > >
> >
> --
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > rmaps_rank_file.c at line 404
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > base/rmaps_base_map_job.c at line 87
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > base/plm_base_launch_support.c at line 77
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > plm_rsh_module.c at line 985
> > >
> >
> --
> > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > attempting to
> > > launch so we are aborting.
> > >
> > > There may be more information reported by the environment (see
> > above).
> > >
> > > This may be because the daemon was unable to find all the needed
> > shared
> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
> > have the
> > > location of the shared libraries on the remote nodes and this will
> > > automatically be forwarded to the remote nodes.
> > >
> >
> --
> > >
> >
> --
> > > orterun noticed that the job aborted, but has no info as to the
> > process
> > > that caused that situation.
> > >
> >
> --
> > > orterun: clean termination accomplished
>
>
>
> Message: 4
> Date: Tue, 14 Apr 2009 06:55:58 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> The rankfile cuts across the entire job - it isn't applied on an
> app_context basis. So the ranks in your rankfile must correspond to
> the eventual rank of each process in the cmd line.
>
> Unfortunately, that means you have to count ranks. In your case, you
> only have four, so that makes life easier. Your rankfile would look
> something like this:
>
> rank 0=r001n001 slot=0
> rank 1=r001n002 slot=1
> rank 2=r001n001 slot=1
> rank 3=r001n002 slot=2
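>
> (Presumably paired with the single command line so that ranks count left to
> right across the app contexts, e.g.:
>
> mpirun -rf rankfile -n 1 -host r001n001 master.x : -n 1 -host r001n002
> master.x : -n 1 -host r001n001 slave.x : -n 1 -host r001n002 slave.x
>
> i.e. the two master.x instances get ranks 0-1 and the two slave.x
> instances get ranks 2-3.)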
>
> HTH
> Ralph
>
> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>
> > Hi,
> >
> > I agree that my examples are not very clear. What I want to do is to
> > launch a multiexes application (masters-slaves) and benefit from the
> > processor affinity.
> > Could you show me how to convert this command , using -rf option
> > (whatever the affinity is)
> >
> > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host r001n002
> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
> > host r001n002 slave.x options4
> >
> > Thanks for your help
> >
> > Geoffroy
> >
> >
> >
> >
> >
> > Message: 2
> > Date: Sun, 12 Apr 2009 18:26:35 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-21 Thread Geoffroy Pignot
Hi Lenny,

Here is the basic mpirun command I would like to run :

mpirun -rf rankfile -n 1 -host r001n001 master.x options1  : -n 1 -host
r001n002 master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1
-host r001n002 slave.x options4

with cat rankfile
rank 0=r001n001 slot=0:*
rank 1=r001n002 slot=0:*
rank 2=r001n001 slot=1:*
rank 3=r001n002 slot=1:*

It should be equivalent and more elegant to run :
mpirun -hostfile myhostfile -rf rankfile -n 1 master.x options1 : -n 1
master.x options2 : -n 1 slave.x options3 : -n 1 slave.x options4

with cat myhostfile
r001n001 slots=2
r001n002 slots=2

I hope these examples make clear what I want to do.

Regards

Geoffroy


>
> It's something in the base, right.
> I tried to investigate it yesterday and saw that for some reason
> jdata->bookmark->index is 2 instead of 1 (in this example).
>
> [dellix7:28454] [ ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c
> +417 ]  node->index = 1, jdata->bookmark->index=2
> [dellix7:28454] [ ../../../../../orte/mca/rmaps/rank_file/rmaps_rank_file.c
> +417 ]  node->index = 2, jdata->bookmark->index=2
> I am not so familiar with this part of the code, since it appears in all
> rmaps components and I just copied it :).
>
> I also don't quite understand what Geoffroy is trying to run, so I can't
> yet think of a workaround.
> Lenny.
>
>


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Geoffroy Pignot
Thanks,

I am not in a hurry but it would be nice if I could benefit from this
feature in the next release.
Regards

Geoffroy



2009/4/20 

> Message: 1
> Date: Mon, 20 Apr 2009 05:59:52 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> Honestly haven't had time to look at it yet - hopefully in the next
> couple of days...
>
> Sorry for the delay.
>
>
> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>
> > Do you have any news about this bug?
> > Thanks
> >
> > Geoffroy
> >
> >
> > Message: 1
> > Date: Tue, 14 Apr 2009 07:57:44 -0600
> > From: Ralph Castain 
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users 
> >
> > Ah now, I didn't say it -worked-, did I? :-)
> >
> > Clearly a bug exists in the program. I'll try to take a look at it (if
> > Lenny doesn't get to it first), but it won't be until later in the
> > week.
> >
> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
> >
> > > I agree with you Ralph , and that 's what I expect from openmpi but
> > > my second example shows that it's not working
> > >
> > > cat hostfile.0
> > >r011n002 slots=4
> > >r011n003 slots=4
> > >
> > >  cat rankfile.0
> > > rank 0=r011n002 slot=0
> > > rank 1=r011n003 slot=1
> > >
> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> > > hostname
> > > ### CRASHED
> > >
> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > > > >
> > > >
> > >
> >
> --
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > rmaps_rank_file.c at line 404
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > base/rmaps_base_map_job.c at line 87
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > base/plm_base_launch_support.c at line 77
> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > > file
> > > > > plm_rsh_module.c at line 985
> > > > >
> > > >
> > >
> >
> --
> > > > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > > > attempting to
> > > > > launch so we are aborting.
> > > > >
> > > > > There may be more information reported by the environment (see
> > > > above).
> > > > >
> > > > > This may be because the daemon was unable to find all the needed
> > > > shared
> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH
> > to
> > > > have the
> > > > > location of the shared libraries on the remote nodes and this
> > will
> > > > > automatically be forwarded to the remote nodes.
> > > > >
> > > >
> > >
> >
> --
> > > > >
> > > >
> > >
> >
> --
> > > > > orterun noticed 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-20 Thread Geoffroy Pignot
Do you have any news about this bug?
Thanks

Geoffroy


>
> Message: 1
> Date: Tue, 14 Apr 2009 07:57:44 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> Ah now, I didn't say it -worked-, did I? :-)
>
> Clearly a bug exists in the program. I'll try to take a look at it (if
> Lenny doesn't get to it first), but it won't be until later in the week.
>
> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>
> > I agree with you Ralph , and that 's what I expect from openmpi but
> > my second example shows that it's not working
> >
> > cat hostfile.0
> >r011n002 slots=4
> >r011n003 slots=4
> >
> >  cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> > hostname
> > ### CRASHED
> >
> > > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > > >
> > >
> >
> --
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > rmaps_rank_file.c at line 404
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > base/rmaps_base_map_job.c at line 87
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > base/plm_base_launch_support.c at line 77
> > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> > file
> > > > plm_rsh_module.c at line 985
> > > >
> > >
> >
> --
> > > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > > attempting to
> > > > launch so we are aborting.
> > > >
> > > > There may be more information reported by the environment (see
> > > above).
> > > >
> > > > This may be because the daemon was unable to find all the needed
> > > shared
> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
> > > have the
> > > > location of the shared libraries on the remote nodes and this will
> > > > automatically be forwarded to the remote nodes.
> > > >
> > >
> >
> --
> > > >
> > >
> >
> --
> > > > orterun noticed that the job aborted, but has no info as to the
> > > process
> > > > that caused that situation.
> > > >
> > >
> >
> --
> > > > orterun: clean termination accomplished
> >
> >
> >
> > Message: 4
> > Date: Tue, 14 Apr 2009 06:55:58 -0600
> > From: Ralph Castain 
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users 
> >
> > The rankfile cuts across the entire job - it isn't applied on an
> > app_context basis. So the ranks in your rankfile must correspond to
> > the eventual rank of each process in the cmd line.
> >
> > Unfortunately, that means you have to count ranks. In your case, you
> > only have four, so that makes life easier. Your rankfile would look
> > something like this:
> >
> > rank 0=r001n001 slot=0
> > rank 1=r001n002 slot=1
> > rank 2=r001n001 slot=1
> > rank 3=r001n002 slot=2
> >
> > HTH
> > Ralph
> >
> > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
> >
> > > Hi,
> > >
> > > I agree that my examples are not very clear. What I want to do is to
> > > launch a multiexes application (masters-slaves) and benefit from the
> > > processor affinity.
> > > Could you show me how to convert this command , using -rf option
> > > (whatever the affinity is)
> > >
> > > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host r001n002
> > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
> > > host r0

[OMPI users] openmpi 1.3.1 : mpirun status is 0 after receiving TERM signal

2009-04-14 Thread Geoffroy Pignot
Hi,

I am not sure it's a bug, but I think we expect something else when we kill
a process - by the way, the signal propagation works well.
I read an explanation in a previous thread:
http://www.open-mpi.org/community/lists/users/2009/03/8514.php
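
For illustration, a minimal way to observe it (the program name is just an
example):

mpirun -np 2 ./a.out &
kill -TERM $!
wait $!
echo $?    # prints 0, although a non-zero status could be expected here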

It's not important, but fixing it could help make openmpi better!

Geoffroy


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-14 Thread Geoffroy Pignot
I agree with you, Ralph, and that's what I expect from openmpi, but my
second example shows that it's not working:

cat hostfile.0
   r011n002 slots=4
   r011n003 slots=4

 cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
### CRASHED

> > Error, invalid rank (1) in the rankfile (rankfile.0)
> > >
> >
> --
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > rmaps_rank_file.c at line 404
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > base/rmaps_base_map_job.c at line 87
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > base/plm_base_launch_support.c at line 77
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > > plm_rsh_module.c at line 985
> > >
> >
> --
> > > A daemon (pid unknown) died unexpectedly on signal 1  while
> > attempting to
> > > launch so we are aborting.
> > >
> > > There may be more information reported by the environment (see
> > above).
> > >
> > > This may be because the daemon was unable to find all the needed
> > shared
> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
> > have the
> > > location of the shared libraries on the remote nodes and this will
> > > automatically be forwarded to the remote nodes.
> > >
> >
> --
> > >
> >
> --
> > > orterun noticed that the job aborted, but has no info as to the
> > process
> > > that caused that situation.
> > >
> >
> --
> > > orterun: clean termination accomplished




>
> Message: 4
> Date: Tue, 14 Apr 2009 06:55:58 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> The rankfile cuts across the entire job - it isn't applied on an
> app_context basis. So the ranks in your rankfile must correspond to
> the eventual rank of each process in the cmd line.
>
> Unfortunately, that means you have to count ranks. In your case, you
> only have four, so that makes life easier. Your rankfile would look
> something like this:
>
> rank 0=r001n001 slot=0
> rank 1=r001n002 slot=1
> rank 2=r001n001 slot=1
> rank 3=r001n002 slot=2
>
> HTH
> Ralph
>
> On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>
> > Hi,
> >
> > I agree that my examples are not very clear. What I want to do is to
> > launch a multiexes application (masters-slaves) and benefit from the
> > processor affinity.
> > Could you show me how to convert this command , using -rf option
> > (whatever the affinity is)
> >
> > mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host r001n002
> > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -
> > host r001n002 slave.x options4
> >
> > Thanks for your help
> >
> > Geoffroy
> >
> >
> >
> >
> >
> > Message: 2
> > Date: Sun, 12 Apr 2009 18:26:35 +0300
> > From: Lenny Verkhovsky 
> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> > To: Open MPI Users 
> >
> > Hi,
> >
> > The first "crash" is OK, since your rankfile has ranks 0 and 1
> > defined,
> > while n=1, which means only rank 0 is present and can be allocated.
> >
> > NP must be >= the largest rank in rankfile.
> >
> > What exactly are you trying to do ?
> >
> > I tried to recreate your segv but all I got was
> >
> > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile
> > hostfile.0
> > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> > [witch19:30798] mca: base: component_find: paffinity
> > "mca_paffinity_linux"
> > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> > supported MCA 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-14 Thread Geoffroy Pignot
Hi,

I agree that my examples are not very clear. What I want to do is to launch
a multi-exe application (masters-slaves) and benefit from processor
affinity.
Could you show me how to convert this command using the -rf option (whatever
the affinity is):

mpirun -n 1 -host r001n001 master.x options1  : -n 1 -host r001n002 master.x
options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host r001n002
slave.x options4

Thanks for your help

Geoffroy





>
> Message: 2
> Date: Sun, 12 Apr 2009 18:26:35 +0300
> From: Lenny Verkhovsky 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
>
> Hi,
>
> The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
> while n=1, which means only rank 0 is present and can be allocated.
>
> NP must be >= the largest rank in rankfile.
>
> What exactly are you trying to do ?
>
> I tried to recreate your segv but all I got was:
>
> ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
> -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
> [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
> uses an MCA interface that is not recognized (component MCA v1.0.0 !=
> supported MCA v2.0.0) -- ignored
> --
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>  opal_carto_base_select failed
>  --> Returned value -13 instead of OPAL_SUCCESS
> --
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../orte/runtime/orte_init.c at line 78
> [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> ../../orte/orted/orted_main.c at line 344
> --
> A daemon (pid 11629) died unexpectedly with status 243 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> mpirun: clean termination accomplished
>
>
> Lenny.
>
>
> On 4/10/09, Geoffroy Pignot  wrote:
> >
> > Hi ,
> >
> > I am currently testing the process affinity capabilities of openmpi and I
> > would like to know if the rankfile behaviour I will describe below is
> normal
> > or not ?
> >
> > cat hostfile.0
> > r011n002 slots=4
> > r011n003 slots=4
> >
> > cat rankfile.0
> > rank 0=r011n002 slot=0
> > rank 1=r011n003 slot=1
> >
> >
> >
> ##
> >
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2  hostname ### OK
> > r011n002
> > r011n003
> >
> >
> >
> ##
> > but
> > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
> > ### CRASHED
> >
>  --
> > Error, invalid rank (1) in the rankfile (rankfile.0)
> >
> --
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > rmaps_rank_file.c at line 404
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > base/rmaps_base_map_job.c at line 87
> > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
> > base/plm_base_launch_support.c at line 77
> > [r011n002:25129] [[639

[OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-04-10 Thread Geoffroy Pignot
Hi ,

I am currently testing the process affinity capabilities of openmpi, and I
would like to know whether the rankfile behaviour I describe below is normal
or not.

cat hostfile.0
r011n002 slots=4
r011n003 slots=4

cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

##

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2  hostname ### OK
r011n002
r011n003

##
but
mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
### CRASHED
--
Error, invalid rank (1) in the rankfile (rankfile.0)
--
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
rmaps_rank_file.c at line 404
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/rmaps_base_map_job.c at line 87
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
base/plm_base_launch_support.c at line 77
[r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
plm_rsh_module.c at line 985
--
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
orterun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
orterun: clean termination accomplished
It seems that the rankfile option is not propagated to the second command
line; there is no global understanding of the ranking inside a single mpirun
command.

##

Given that, I tried to provide a rankfile to each command line:

cat rankfile.0
rank 0=r011n002 slot=0

cat rankfile.1
rank 0=r011n003 slot=1

mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
-n 1 hostname ### CRASHED
[r011n002:28778] *** Process received signal ***
[r011n002:28778] Signal: Segmentation fault (11)
[r011n002:28778] Signal code: Address not mapped (1)
[r011n002:28778] Failing at address: 0x34
[r011n002:28778] [ 0] [0xe600]
[r011n002:28778] [ 1]
/tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d)
[0x5557decd]
[r011n002:28778] [ 2]
/tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117)
[0x555842a7]
[r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so
[0x556098c0]
[r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
[r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
[r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
[r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
[r011n002:28778] *** End of error message ***
Segmentation fault (core dumped)



I hope that I've found a bug, because this kind of capability would be very
important for me: launching a multi-exe mpirun command line and being able
to bind my exes to sockets.

Thanks in advance for your help

Geoffroy


2009/4/9 

> Message: 1
> Date: Thu, 9 Apr 2009 00:15:03 -0400
> From: Robert Kubrick 
> Subject: [OMPI users] mpirun self,sm
> To: Open MPI Users 
>
> How is this possible?
>
> dx:~> mpirun -v -np 2 --mca btl self,sm --host dx,sx hostname
> dx
> sx
>
> dx:~> netstat -i
> Kernel Interface table
> Iface   MTU Met   RX-OK RX

Re: [OMPI users] 1.3 hangs running 2 exes with different names (Ralph Castain)

2009-01-23 Thread Geoffroy Pignot
Hi Ralph,

Thanks for taking the time to look into my problem. As you can see, it
happens when I don't have both exes available on both nodes.
When I do (test 3), it works. I don't know whether my particular libdir
causes the problem, but I'll try on Monday with a more classical setup.

I'll keep you informed.

Geoffroy



>
> Hi Geoffroy
>
> Hmmm...well, I redid my tests to mirror yours, and still cannot
> replicate this problem. I tried it with both slurm and ssh
> environments - no difference in the results.
>
> % make hello
>
> % cp hello hello2
>
> % ls
> hello hello2
>
> % mpirun -n 1 -host odin038 ./hello : -n 1 -host odin039 ./hello2
> Hello World, I am 0 of 2
> Hello World, I am 1 of 2
>
> I have tried a variety of combinations, including giving a fake
> executable as one of the apps, and have not been able to replicate
> your observed behavior. In all cases, it works correctly.
>
> It looks like you are using rsh/ssh as your launch environment. All I
> can advise at this stage is to again check to ensure that
> the .login/.cshrc (or whatever) on your remote nodes isn't setting
> your path to point at another OMPI installation. The fact that you can
> run at all would seem to indicate that things are okay, but I honestly
> have no ideas at this stage as to why you are seeing this behavior.
>
> Sorry I can't be of more help...
> Ralph
>
> On Jan 23, 2009, at 12:57 AM, Geoffroy Pignot wrote:
>
> > Hello
> >
> > I redid few tests with my hello world , here are my results.
> >
> > First of all my config :
> > configure --prefix=/tmp/openmpi-1.3 --libdir=/tmp/openmpi-1.3/lib64
> > --enable-heterogeneous . you will find attached my ompi_info -param
> > all all
> > compil02 and compil03 are identical Rh43 64 bits nodes.
> >
> > Test 1 :
> > compil02% ls /tmp
> > a.out  openmpi-1.3
> >
> > compil03% ls /tmp
> > a.out  openmpi-1.3
> >
> > /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n 1
> > -host compil02 /tmp/a.out
> > WORKS
> >
> > Test 2 :
> > compil02% mv a.out a.out_64 ; ls /tmp
> > a.out_64  openmpi-1.3
> >
> > compil03% ls /tmp
> > a.out  openmpi-1.3
> >
> > compil03% /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/
> > a.out : -n 1 -host compil02 /tmp/a.out_64
> > [compil03:03774] procdir: /tmp/openmpi-sessions-
> > gpignot@compil03_0/20717/0/0
> > [compil03:03774] jobdir: /tmp/openmpi-sessions-
> > gpignot@compil03_0/20717/0
> > [compil03:03774] top: openmpi-sessions-gpignot@compil03_0
> > [compil03:03774] tmp: /tmp
> > [compil03:03774] mpirun: reset PATH: /tmp/openmpi-1.3/bin:/u/gpignot/
> > jobmgr/bin:.:/cgg/lv5000/jobmgr/bin:/cgg/lv5000/jobmgr/exec/Linux2.6-
> > x86_64/PIV:/cgg/jobmgr/bin:/cgg/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/
> > lv5000/bin:/cgg/lv5000/exec/Linux2.6-x86_64/PIV:/cgg/util:/bin:/usr/
> > bin:/usr/sbin:/etc:/usr/etc:/usr/local/bin:/usr/bin/X11:/nfs/softs/
> > TOOLS/bin:/nfs/netapp1/DEVTOOLS/bin:/nfs/netapp1/DEVTOOLS/free/
> > Linux2.6-x86_64/bin:/cgg/localdev:/cgg/Applis/bin
> > [compil03:03774] mpirun: reset LD_LIBRARY_PATH: /tmp/openmpi-1.3/
> > lib64:/tmp/openmpi-1.3/lib64
> > [compil02:10684] procdir: /tmp/openmpi-sessions-
> > gpignot@compil02_0/20717/0/1
> > [compil02:10684] jobdir: /tmp/openmpi-sessions-
> > gpignot@compil02_0/20717/0
> > [compil02:10684] top: openmpi-sessions-gpignot@compil02_0
> > [compil02:10684] tmp: /tmp
> > [compil03:03774] [[20717,0],0] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil03:03774] [[20717,0],0] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil02:10684] [[20717,0],1] node[0].name compil03 daemon 0 arch
> > ffc91200
> > [compil02:10684] [[20717,0],1] node[1].name compil02 daemon 1 arch
> > ffc91200
> > [compil03:03774] Info: Setting up debugger process table for
> > applications
> >   MPIR_being_debugged = 0
> >   MPIR_debug_state = 1
> >   MPIR_partial_attach_ok = 1
> >   MPIR_i_am_starter = 0
> >   MPIR_proctable_size = 2
> >   MPIR_proctable:
> > (i, host, exe, pid) = (0, compil03, /tmp/a.out, 0)
> > (i, host, exe, pid) = (1, compil02, /tmp/a.out_64, 0)
> >
> > HANGS : both exe have pid 0
> >
> > Test 3 :
> >
> > compil02% cp a.out_64 a.out ; ls /tmp
> > a.out_64  a.out  openmpi-1.3
> >
> > compil03% ls /tmp
> > a.out  openmpi-1.3
> >
> > [compil03:03777] procdir: /tmp/openmpi-sessions-
> > gpignot@compil03_0/20626/0/0
> > [compil03:03777] jobdir: /tmp/openmp

Re: [OMPI users] 1.3 hangs running 2 exes with different names (Ralph Castain)

2009-01-23 Thread Geoffroy Pignot
Hello

I redid a few tests with my hello world; here are my results.

First of all, my config:
configure --prefix=/tmp/openmpi-1.3 --libdir=/tmp/openmpi-1.3/lib64
--enable-heterogeneous . You will find my ompi_info -param all all attached.
compil02 and compil03 are identical RH43 64-bit nodes.

Test 1 :
compil02% ls /tmp
a.out  openmpi-1.3

compil03% ls /tmp
a.out  openmpi-1.3

/tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n 1 -host
compil02 /tmp/a.out
WORKS

Test 2 :
compil02% mv a.out a.out_64 ; ls /tmp
a.out_64  openmpi-1.3

compil03% ls /tmp
a.out  openmpi-1.3

compil03% /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n
1 -host compil02 /tmp/a.out_64
[compil03:03774] procdir: /tmp/openmpi-sessions-gpignot@compil03_0/20717/0/0
[compil03:03774] jobdir: /tmp/openmpi-sessions-gpignot@compil03_0/20717/0
[compil03:03774] top: openmpi-sessions-gpignot@compil03_0
[compil03:03774] tmp: /tmp
[compil03:03774] mpirun: reset PATH:
/tmp/openmpi-1.3/bin:/u/gpignot/jobmgr/bin:.:/cgg/lv5000/jobmgr/bin:/cgg/lv5000/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/jobmgr/bin:/cgg/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/lv5000/bin:/cgg/lv5000/exec/Linux2.6-x86_64/PIV:/cgg/util:/bin:/usr/bin:/usr/sbin:/etc:/usr/etc:/usr/local/bin:/usr/bin/X11:/nfs/softs/TOOLS/bin:/nfs/netapp1/DEVTOOLS/bin:/nfs/netapp1/DEVTOOLS/free/Linux2.6-x86_64/bin:/cgg/localdev:/cgg/Applis/bin
[compil03:03774] mpirun: reset LD_LIBRARY_PATH:
/tmp/openmpi-1.3/lib64:/tmp/openmpi-1.3/lib64
[compil02:10684] procdir: /tmp/openmpi-sessions-gpignot@compil02_0/20717/0/1
[compil02:10684] jobdir: /tmp/openmpi-sessions-gpignot@compil02_0/20717/0
[compil02:10684] top: openmpi-sessions-gpignot@compil02_0
[compil02:10684] tmp: /tmp
[compil03:03774] [[20717,0],0] node[0].name compil03 daemon 0 arch ffc91200
[compil03:03774] [[20717,0],0] node[1].name compil02 daemon 1 arch ffc91200
[compil02:10684] [[20717,0],1] node[0].name compil03 daemon 0 arch ffc91200
[compil02:10684] [[20717,0],1] node[1].name compil02 daemon 1 arch ffc91200
[compil03:03774] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
(i, host, exe, pid) = (0, compil03, /tmp/a.out, 0)
(i, host, exe, pid) = (1, compil02, /tmp/a.out_64, 0)

HANGS : both exe have pid 0

Test 3 :

compil02% cp a.out_64 a.out ; ls /tmp
a.out_64  a.out  openmpi-1.3

compil03% ls /tmp
a.out  openmpi-1.3

[compil03:03777] procdir: /tmp/openmpi-sessions-gpignot@compil03_0/20626/0/0
[compil03:03777] jobdir: /tmp/openmpi-sessions-gpignot@compil03_0/20626/0
[compil03:03777] top: openmpi-sessions-gpignot@compil03_0
[compil03:03777] tmp: /tmp
[compil03:03777] mpirun: reset PATH:
/tmp/openmpi-1.3/bin:/u/gpignot/jobmgr/bin:.:/cgg/lv5000/jobmgr/bin:/cgg/lv5000/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/jobmgr/bin:/cgg/jobmgr/exec/Linux2.6-x86_64/PIV:/cgg/lv5000/bin:/cgg/lv5000/exec/Linux2.6-x86_64/PIV:/cgg/util:/bin:/usr/bin:/usr/sbin:/etc:/usr/etc:/usr/local/bin:/usr/bin/X11:/nfs/softs/TOOLS/bin:/nfs/netapp1/DEVTOOLS/bin:/nfs/netapp1/DEVTOOLS/free/Linux2.6-x86_64/bin:/cgg/localdev:/cgg/Applis/bin
[compil03:03777] mpirun: reset LD_LIBRARY_PATH:
/tmp/openmpi-1.3/lib64:/tmp/openmpi-1.3/lib64
[compil02:10786] procdir: /tmp/openmpi-sessions-gpignot@compil02_0/20626/0/1
[compil02:10786] jobdir: /tmp/openmpi-sessions-gpignot@compil02_0/20626/0
[compil02:10786] top: openmpi-sessions-gpignot@compil02_0
[compil02:10786] tmp: /tmp
[compil03:03777] [[20626,0],0] node[0].name compil03 daemon 0 arch ffc91200
[compil03:03777] [[20626,0],0] node[1].name compil02 daemon 1 arch ffc91200
[compil02:10786] [[20626,0],1] node[0].name compil03 daemon 0 arch ffc91200
[compil02:10786] [[20626,0],1] node[1].name compil02 daemon 1 arch ffc91200
[compil03:03777] Info: Setting up debugger process table for applications
  MPIR_being_debugged = 0
  MPIR_debug_state = 1
  MPIR_partial_attach_ok = 1
  MPIR_i_am_starter = 0
  MPIR_proctable_size = 2
  MPIR_proctable:
(i, host, exe, pid) = (0, compil03, /tmp/a.out, 0)
(i, host, exe, pid) = (1, compil02, /tmp/a.out_64, 10787)
[compil02:10787] procdir: /tmp/openmpi-sessions-gpignot@compil02_0/20626/1/1
[compil02:10787] jobdir: /tmp/openmpi-sessions-gpignot@compil02_0/20626/1
[compil02:10787] top: openmpi-sessions-gpignot@compil02_0
[compil02:10787] tmp: /tmp
[compil02:10787] [[20626,1],1] node[0].name compil03 daemon 0 arch ffc91200
[compil02:10787] [[20626,1],1] node[1].name compil02 daemon 1 arch ffc91200

HANGS: gets a little further, but one pid is still 0
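
A quick sanity check for these mixed-name tests is to confirm what each
binary actually is on every node (a diagnostic sketch, not part of the
original transcript):

compil02% file /tmp/a.out /tmp/a.out_64
compil03% file /tmp/a.out /tmp/a.out_64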

*Test 4:*

compil02% ls /tmp
a.out_64  a.out  openmpi-1.3

compil03% cp a.out a.out_64 ; ls /tmp
a.out_64  a.out  openmpi-1.3

compil03% /tmp/openmpi-1.3/bin/mpirun -d -n 1 -host compil03 /tmp/a.out : -n
1 -host compil02 /tmp/a.out_64
[compil03:03789] procdir: /tmp/openmpi-sessions-gpignot@compil03_0/20638/0/0
[compil03:03789] jobdir: /tmp/openmpi-sessions-gpignot@compil03_0/20638/0

[OMPI users] 1.3 and --preload-files and --preload-binary

2009-01-22 Thread Geoffroy Pignot
Hello,

As you can see, I am trying out the work done in this new release. The
preload-files and preload-binary options are very interesting to me because
I work on a cluster without any shared filesystem between the nodes.
I tried them in a basic way, but with no success; you will find the error
messages below.
If I did something wrong, would it be possible to get simple examples
showing how these options work?

Thanks

Geoffroy

/tmp/openmpi-1.3/bin/mpirun --preload-files hello.c --hostfile
/tmp/hostlist -np 2 hostname
--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/hello.c
Host: compil03

Will continue attempting to launch the process.

--
[compil03:26657] filem:rsh: get(): Failed to preare the request structure
(-1)
--
WARNING: Could not preload the requested files and directories.

Fileset:
Fileset: hello.c

Will continue attempting to launch the process.

--
[compil03:26657] [[13938,0],0] ORTE_ERROR_LOG: Error in file
base/odls_base_state.c at line 127
[compil03:26657] [[13938,0],0] ORTE_ERROR_LOG: Error in file
base/odls_base_default_fns.c at line 831
[compil03:26657] *** Process received signal ***
[compil03:26657] Signal: Segmentation fault (11)
[compil03:26657] Signal code: Address not mapped (1)
[compil03:26657] Failing at address: 0x395eb15000
[compil03:26657] [ 0] /lib64/tls/libpthread.so.0 [0x395f80c420]
[compil03:26657] [ 1] /lib64/tls/libc.so.6(memcpy+0x3f) [0x395ed718df]
[compil03:26657] [ 2] /tmp/openmpi-1.3/lib64/libopen-pal.so.0 [0x2a956b0a10]
[compil03:26657] [ 3]
/tmp/openmpi-1.3/lib64/libopen-rte.so.0(orte_odls_base_default_launch_local+0x55c)
[0x2a955809cc]
[compil03:26657] [ 4] /tmp/openmpi-1.3/lib64/openmpi/mca_odls_default.so
[0x2a963655f2]
[compil03:26657] [ 5]
/tmp/openmpi-1.3/lib64/libopen-rte.so.0(orte_daemon_cmd_processor+0x57d)
[0x2a9557812d]
[compil03:26657] [ 6] /tmp/openmpi-1.3/lib64/libopen-pal.so.0 [0x2a956b9828]
[compil03:26657] [ 7]
/tmp/openmpi-1.3/lib64/libopen-pal.so.0(opal_progress+0xb0) [0x2a956ae820]
[compil03:26657] [ 8]
/tmp/openmpi-1.3/lib64/libopen-rte.so.0(orte_plm_base_launch_apps+0x1ed)
[0x2a95584e7d]
[compil03:26657] [ 9] /tmp/openmpi-1.3/lib64/openmpi/mca_plm_rsh.so
[0x2a95c3ed98]
[compil03:26657] [10] /tmp/openmpi-1.3/bin/mpirun [0x403330]
[compil03:26657] [11] /tmp/openmpi-1.3/bin/mpirun [0x402ad3]
[compil03:26657] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb)
[0x395ed1c4bb]
[compil03:26657] [13] /tmp/openmpi-1.3/bin/mpirun [0x402a2a]
[compil03:26657] *** End of error message ***
Segmentation fault

And it's no better with --preload-binary (a.out_32):

compil03% /tmp/openmpi-1.3/bin/mpirun -s --hostfile /tmp/hostlist -wdir
/tmp -np 2 a.out_32
--
mpirun was unable to launch the specified application as it could not find
an executable:

Executable: a.out_32
Node: compil02

while attempting to start process rank 1.
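
For what it's worth, here is a sketch of how I understand these options
are meant to be used (an assumption on my part, not a confirmed fix):
the file must not already exist at the destination, and the binary
should be given as a path mpirun can resolve on the launching node:

compil03% rm -f /tmp/hello.c
compil03% /tmp/openmpi-1.3/bin/mpirun --preload-files hello.c --hostfile
/tmp/hostlist -np 2 hostname

compil03% /tmp/openmpi-1.3/bin/mpirun -s --hostfile /tmp/hostlist -wdir
/tmp -np 2 /tmp/a.out_32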


[OMPI users] 1.3 hangs running 2 exes with different names

2009-01-22 Thread Geoffroy Pignot
> >>> ... MPI-SPAWN,
> >>> which implies the use of dynamic process management
> >>> realised in MPI2. It got compiled and tested successfully.
> >>> However when it is spawning on different nodes (machine) one
> >>> additional process on each node appears, i.e. if nodes=2:ppn=2
> >>> then on each node there are 3 running processes. In the case
> >>> when it runs just on one pc with a few cores (let say
> >>> nodes=1:ppn=4),
> >>> the number of processes exactly equals the number of cpus (ppn)
> >>> requested and there is no additional process.
> >>> I am wondering whether it is normal behavior. Thanks!
> >>>
> >>> Best regards,
> >>> Evgeniy
> >>>
> >>>
> >>>
> >
> >
> > --
> > ___
> > Dr. Evgeniy Gromov
> > Theoretische Chemie
> > Physikalisch-Chemisches Institut
> > Im Neuenheimer Feld 229
> > D-69120 Heidelberg
> > Germany
> >
> > Telefon: +49/(0)6221/545263
> > Fax: +49/(0)6221/545221
> > E-mail: evge...@pci.uni-heidelberg.de
> > ___
> >
> >
> >
>
>
>
> --
>
> Message: 5
> Date: Wed, 21 Jan 2009 11:40:28 -0700
> From: Ralph Castain 
> Subject: Re: [OMPI users] openmpi 1.3 and --wdir problem
> To: Open MPI Users 
> Message-ID: 
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> This is now fixed in the trunk and will be in the 1.3.1 release.
>
> Thanks again for the heads-up!
> Ralph
>
> On Jan 21, 2009, at 8:45 AM, Ralph Castain wrote:
>
> > You are correct - that is a bug in 1.3.0. I'm working on a fix for
> > it now and will report back.
> >
> > Thanks for catching it!
> > Ralph
> >
> >
> > On Jan 21, 2009, at 3:22 AM, Geoffroy Pignot wrote:
> >
> >> Hello
> >>
> >>   I'm currently trying the new release but I can't reproduce the
> >> 1.2.8 behaviour
> >>   concerning --wdir option
> >>
> >>   Then
> >>   %% /tmp/openmpi-1.2.8/bin/mpirun -n 1 --wdir /tmp --host r003n030
> >> pwd :   --wdir /scr1 -n 1 --host r003n031 pwd
> >>   /scr1
> >>   /tmp
> >>
> >>   but
> >>   %% /tmp/openmpi-1.3/bin/mpirun -n 1 --wdir /tmp --host r003n030
> >> pwd : --wdir  /scr1 -n 1 --host r003n031 pwd
> >>   /scr1
> >>   /scr1
> >>   Thanks in advance
> >>   Regards
> >>   Geoffroy
> >>
> >
>
>
>
> --
>
> Message: 6
> Date: Wed, 21 Jan 2009 14:06:42 -0500
> From: Jeff Squyres 
> Subject: Re: [OMPI users] Problem compiling open mpi 1.3 with
>sunstudio12 express
> To: Open MPI Users 
> Message-ID: <36fcdf58-9138-46a9-a432-cdf2a99a1...@cisco.com>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> FWIW, I have run with my LD_LIBRARY_PATH set to a combination of
> multiple OMPI installations; it ended up using the leftmost entry in
> the LD_LIBRARY_PATH (as I intended).  I'm not quite sure why it
> wouldn't do that for you.  :-(
>
>
> On Jan 21, 2009, at 4:53 AM, Olivier Marsden wrote:
>
> >
> >>
> >> - Check that /opt/mpi_sun and /opt/mpi_gfortran* are actually
> >> distinct subdirectories; there's no hidden sym/hard links in there
> >> somewhere (where directories and/or individual files might
> >> accidentally be pointing to the other tree)
> >>
> >
> > no hidden links in the directories
> >
> >> - does "env | grep mpi_" show anything

[OMPI users] openmpi 1.3 and --wdir problem

2009-01-21 Thread Geoffroy Pignot
Hello

   I'm currently trying the new release but I can't reproduce the 1.2.8
behaviour
   concerning --wdir option

   Then
   %% /tmp/openmpi-1.2.8/bin/mpirun -n 1 --wdir /tmp --host r003n030 pwd :
--wdir /scr1 -n 1 --host r003n031 pwd
   /scr1
   /tmp

   but
   %% /tmp/openmpi-1.3/bin/mpirun -n 1 --wdir /tmp --host r003n030 pwd :
--wdir  /scr1 -n 1 --host r003n031 pwd
   /scr1
   /scr1
   Thanks in advance
   Regards
   Geoffroy
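
For completeness, the same per-context working directories can be
expressed through an appfile; a sketch with the same hosts (identical
semantics expected, untested here):

cat appfile:
 -n 1 --wdir /tmp --host r003n030 pwd
 -n 1 --wdir /scr1 --host r003n031 pwd

%% /tmp/openmpi-1.3/bin/mpirun -app appfile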


Re: [OMPI users] Problem with gateway between 2 hosts

2008-07-03 Thread Geoffroy Pignot
Hi,
To answer your question: the 172.x.y.z nodes are not behind a NAT.
Moreover, I checked with netstat on the remote host, and the connection
looks established:

tcp0  0 10.160.x.x:39794172.x.y.z:50858
ESTABLISHED 20956/orted
unix  3  [ ] STREAM CONNECTED 76348311 20956/orted
unix  3  [ ] STREAM CONNECTED 76348310 20956/orted

I hope this helps.
Does anyone run Open MPI programs in such an environment?
Thanks again

Geoffroy




2008/7/2 Geoffroy Pignot :

> are the 172.x.y.z nodes behind a NAT (hence the communication back
>> isn't possible - only the stdout from the rsh/ssh is working in this
>> case)?
>>
>> -- Reuti
>
>
> Actually I don't know exactly; I am asking my network architect for more
> information.
> An interesting thing to notice is that LAM worked in this kind of network
> configuration.
> I will keep you informed as soon as I have more information.
>
>
> Regards
> Geoffroy
>
>


Re: [OMPI users] Problem with gateway between 2 hosts

2008-07-02 Thread Geoffroy Pignot
>
> are the 172.x.y.z nodes behind a NAT (hence the communication back
> isn't possible - only the stdout from the rsh/ssh is working in this
> case)?
>
> -- Reuti


Actually I don't know exactly; I am asking my network architect for more
information.
An interesting thing to notice is that LAM worked in this kind of network
configuration.
I will keep you informed as soon as I have more information.


Regards
Geoffroy


[OMPI users] Problem with gateway between 2 hosts

2008-06-30 Thread Geoffroy Pignot
Hi,

Does anybody face problems running Open MPI on two hosts on different
networks (with a gateway to reach each other)?
Let's say compil02's IP address is 172.3.9.10 and r009n001's is 10.160.4.1.

There is no problem with MPI_Init-free executables (for example, hostname):

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix
/tmp/HALMPI/openmpi-1.2.2 -np 1 -host compil02 hostname : -np 1 -host
r009n001 hostname
r009n001
compil02

But as soon as I try a simple hello world, it crashes with the following
error message.
Please note that when I run hello between r009n001 (10.160.4.1) and
r009n002 (10.160.4.2), it works fine.

Thanks in advance for your help.
Regards

Geoffroy


PS: same error with openmpi v1.2.5


compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix
/tmp/HALMPI/openmpi-1.2.2 -np 1 -host compil02 /tmp/hello : -np 1 -host
r009n001 /tmp/hello
--
Process 0.1.0 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
--
Process 0.1.1 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
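
For multi-homed hosts like these, a common first step is to pin the TCP
BTL to an interface that both sides can actually reach, via an MCA
parameter; a sketch, assuming the interface is named eth0 (the runtime
OOB layer has an analogous include parameter whose exact name varies by
release):

compil02% /tmp/HALMPI/openmpi-1.2.2/bin/mpirun --prefix
/tmp/HALMPI/openmpi-1.2.2 --mca btl_tcp_if_include eth0 -np 1 -host
compil02 /tmp/hello : -np 1 -host r009n001 /tmp/hello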


[OMPI users] homogeneous environment

2007-08-07 Thread Geoffroy PIGNOT
Does Open MPI's mpirun command have an option equivalent to LAM's "-O"
(homogeneous universe)?
I would like to avoid automatic byte-swapping in a heterogeneous execution
environment.

Thanks in advance

Geoffroy
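
As far as I know, there is no direct mpirun-level equivalent of LAM's -O
in Open MPI: heterogeneous data conversion is a configure-time option,
and a library built with it disabled performs no byte-swapping at all.
A minimal sketch, assuming a from-source build:

% ./configure --prefix=/tmp/openmpi --disable-heterogeneous
% make all install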