Re: [OMPI users] MPI_Comm_spawn and exported variables

2013-12-19 Thread Tim Miller
Hi Ralph,

That's correct. All of the original processes see the -x values, but
spawned ones do not.

Regards,
Tim


On Thu, Dec 19, 2013 at 6:09 PM, Ralph Castain  wrote:

>
> On Dec 19, 2013, at 2:57 PM, Tim Miller  wrote:
>
> > Hi All,
> >
> > I have a question similar (but not identical to) the one asked by Tom
> Fogel a week or so back...
> >
> > I have a code that uses MPI_Comm_spawn to launch different processes.
> The executables for these use libraries in non-standard locations, so what
> I've done is add the directories containing them to my LD_LIBRARY_PATH
> environment variable, and then call mpirun with "-x LD_LIBRARY_PATH".
> This works well for me on OpenMPI 1.6.3 and earlier. However, I've been
> playing with OpenMPI 1.7.3 and this no longer seems to work. As soon as my
> code MPI_Comm_spawns, all the spawned processes die complaining that they
> can't find the correct libraries to start the executable.
> >
> > Has there been a change in the way exported variables are passed to spawned
> processes between OpenMPI 1.6 and 1.7?
>
> Not intentionally, though it is possible that some bug crept into the
> code. If I understand correctly, the -x values are being seen by the
> original procs, but not by the comm_spawned ones?
>
>
> > Is there something else I'm doing wrong here?
> >
> > Best Regards,
> > Tim
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-19 Thread Ralph Castain
Actually, it looks like it would happen with hetero-nodes set - it only requires 
that at least two nodes have the same architecture. So you might want to give 
the trunk a shot, as it may well now be fixed.


On Dec 19, 2013, at 8:35 AM, Ralph Castain  wrote:

> Hmmm...not having any luck tracking this down yet. If anything, based on what 
> I saw in the code, I would have expected it to fail when hetero-nodes was 
> false, not the other way around.
> 
> I'll keep poking around - just wanted to provide an update.
> 
> On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote:
> 
>> 
>> 
>> Hi Ralph, sorry for the overlapping post.
>> 
>> Your advice about -hetero-nodes in the other thread gave me a hint.
>> 
>> I already put "orte_hetero_nodes = 1" in my mca-params.conf, because
>> you told me a month ago that my environment would need this option.
>> 
>> Removing this line from mca-params.conf makes it work.
>> In other words, you can replicate the problem by adding -hetero-nodes as
>> shown below.
>> 
>> qsub: job 8364.manage.cluster completed
>> [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
>> qsub: waiting for job 8365.manage.cluster to start
>> qsub: job 8365.manage.cluster ready
>> 
>> [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
>>   MCA orte: parameter "orte_hetero_nodes" (current value:
>> "false", data source: default, level: 9 dev/all,
>> type: bool)
>> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> myprog
>> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket
>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> Hello world from process 0 of 4
>> Hello world from process 1 of 4
>> Hello world from process 2 of 4
>> Hello world from process 3 of 4
>> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> -hetero-nodes myprog
>> --
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>> 
>>  Bind to: CORE
>>  Node:node12
>>  #processes:  2
>>  #cpus:  1
>> 
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> --
>> 
>> 
>> As far as I checked, data->num_bound seems to become invalid in bind_downwards
>> when I add "-hetero-nodes". I hope you can clear up the problem.
>> 
>> Regards,
>> Tetsuya Mishima
>> 
>> 
>>> Yes, it's very strange. But I don't think there's any chance that
>>> I have < 8 actual cores on the node. I guess that you can replicate
>>> it with SLURM; please try it again.
>>> 
>>> I changed to use node10 and node11, then I got the warning against
>>> node11.
>>> 
>>> Furthermore, just for your information, I tried adding
>>> "-bind-to core:overload-allowed", and then it worked as shown below.
>>> But I think node11 is never overloaded because it has 8 cores.
>>> 
>>> qsub: job 8342.manage.cluster completed
>>> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
>>> qsub: waiting for job 8343.manage.cluster to start
>>> qsub: job 8343.manage.cluster ready
>>> 
>>> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>>> [mishima@node10 demos]$ cat $PBS_NODEFILE
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node10
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> node11
>>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>>> myprog
>>> 
>> --
>>> A request was made to bind to that would result in binding more
>>> processes than cpus on a resource:
>>> 
>>> Bind to: CORE
>>> Node:node11
>>> #processes:  2
>>> #cpus:  1
>>> 
>>> You can override this protection by adding the "overload-allowed"
>>> option to your binding directive.
>>> 
>> --
>>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>>> -bind-to core:overload-allowed myprog
>>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]],
>> socket
>>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>>> [node10.cluster:27020] MCW ra

Re: [OMPI users] MPI_Comm_spawn and exported variables

2013-12-19 Thread Ralph Castain

On Dec 19, 2013, at 2:57 PM, Tim Miller  wrote:

> Hi All,
> 
> I have a question similar (but not identical to) the one asked by Tom Fogel a 
> week or so back...
> 
> I have a code that uses MPI_Comm_spawn to launch different processes. The 
> executables for these use libraries in non-standard locations, so what I've 
> done is add the directories containing them to my LD_LIBRARY_PATH environment 
> variable, and then call mpirun with "-x LD_LIBRARY_PATH". This works well 
> for me on OpenMPI 1.6.3 and earlier. However, I've been playing with OpenMPI 
> 1.7.3 and this no longer seems to work. As soon as my code MPI_Comm_spawns, 
> all the spawned processes die complaining that they can't find the correct 
> libraries to start the executable.
> 
> Has there been a change in the way exported variables are passed to spawned processes 
> between OpenMPI 1.6 and 1.7?

Not intentionally, though it is possible that some bug crept into the code. If 
I understand correctly, the -x values are being seen by the original procs, but 
not by the comm_spawned ones?


> Is there something else I'm doing wrong here?
> 
> Best Regards,
> Tim
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



[OMPI users] MPI_Comm_spawn and exported variables

2013-12-19 Thread Tim Miller
Hi All,

I have a question similar (but not identical to) the one asked by Tom Fogel
a week or so back...

I have a code that uses MPI_Comm_spawn to launch different processes. The
executables for these use libraries in non-standard locations, so what I've
done is add the directories containing them to my LD_LIBRARY_PATH
environment variable, and then call mpirun with "-x LD_LIBRARY_PATH".
This works well for me on OpenMPI 1.6.3 and earlier. However, I've been
playing with OpenMPI 1.7.3 and this no longer seems to work. As soon as my
code MPI_Comm_spawns, all the spawned processes die complaining that they
can't find the correct libraries to start the executable.

Has there been a change in the way exported variables are passed to spawned
processes between OpenMPI 1.6 and 1.7? Is there something else I'm doing
wrong here?

Best Regards,
Tim
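
(For context, the spawning side of the scenario described above boils down to a
call like the sketch below. This is a generic illustration, not Tim's actual
code; "./worker" is a placeholder executable name.)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm;
    int errcodes[2];

    MPI_Init(&argc, &argv);

    /* Spawn two copies of a separate executable.  Whether variables
     * exported with "mpirun -x ..." reach these children is the
     * behavior that appears to have changed between 1.6.x and 1.7.x. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &intercomm, errcodes);

    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}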


Re: [OMPI users] environment variables and MPI_Comm_spawn

2013-12-19 Thread Ralph Castain
In trunk, cmr'd for 1.7.4 - copied you on ticket

Thanks!
Ralph

On Dec 19, 2013, at 12:37 PM, tom fogal  wrote:

> Okay, no worries on the delay, and thanks!  -tom
> 
> On 12/19/2013 04:32 PM, Ralph Castain wrote:
>> Sorry for delay - buried in my "day job". Adding values to the env array is 
>> fine, but this isn't how we would normally do it. I've got it noted on my 
>> "to-do" list and will try to get to it in time for 1.7.5
>> 
>> Thanks
>> Ralph
>> 
>> On Dec 13, 2013, at 4:42 PM, Jeff Squyres (jsquyres)  
>> wrote:
>> 
>>> Thanks for the first 2 patches, Tom -- I applied them to the SVN trunk and 
>>> scheduled them to go into the v1.7 series.  I don't know if they'll make 
>>> 1.7.4 or be pushed to 1.7.5, but they'll get there.
>>> 
>>> I'll defer to Ralph for the rest of the discussion about info keys.
>>> 
>>> 
>>> On Dec 13, 2013, at 9:16 AM, tom fogal  wrote:
>>> 
 Hi Ralph, thanks for your help!
 
 Ralph Castain writes:
> It would have to be done via MPI_Info arguments, and we never had a
> request to do so (and hence, don't define such an argument). It would
> be easy enough to do so (look in the ompi/mca/dpm/orte/dpm_orte.c
> code).
 
 Well, I wanted to just report success, but I've only got the easy
 side of it: saving the arguments from the MPI_Info arguments into
 the orte_job_t struct.  See attached "0003" patch (against trunk).
 However, I couldn't figure out how to get the other side: reading out
 the environment variables and setting them at fork.  Maybe you could
 help with (or do :-) that?
 
 Or just guide me as to where again: I threw abort()s in 'spawn'
 functions I found under plm/, but my programs didn't abort and so I'm
 not sure where they went.
 
> MPI implementations generally don't forcibly propagate envars because
> it is so hard to know which ones to handle - it is easy to propagate
> a system envar that causes bad things to happen on the remote end.
 
 I understand.  Though in this case, I'm /trying/ to make Bad Things
 (tm) happen ;-).
 
> One thing you could do, of course, is add that envar to your default
> shell setup (.bashrc or whatever). This would set the variable by
> default on your remote locations (assuming you are using rsh/ssh
> for your launcher), and then any process you start would get
> it. However, that won't help if this is an envar intended only for
> the comm_spawned process.
 
 Unfortunately what I want to play with at the moment are LD_*
 variables, and fiddling with these in my .bashrc will mess up a lot
 more than just the simulation I am presently hacking.
 
> I can add this capability to the OMPI trunk, and port it to the 1.7
> release - but we don't go all the way back to the 1.4 series any
> more.
 
 Yes, having this in a 1.7 release would be great!
 
 
 BTW, I encountered a couple other small things while grepping through
 source/waiting for trunk to build, so there are two other small patches
 attached.  One gets rid of warnings about unused functions in generated
 lexing code.  I believe the second fixes resource leaks on error paths.
 However, it turned out none of my user-level code hit that function at
 all, so I haven't been able to test it.  Take from it what you will...
 
 -tom
 
> On Wed, Dec 11, 2013 at 2:10 PM, tom fogal  wrote:
> 
>> Hi all,
>> 
>> I'm developing on Open MPI 1.4.5-ubuntu2 on Ubuntu 13.10 (so, Ubuntu's
>> packaged Open MPI) at the moment.
>> 
>> I'd like to pass environment variables to processes started via
>> MPI_Comm_spawn.  Unfortunately, the MPI 3.0 standard (at least) does
>> not seem to specify a way to do this; thus I have been searching for
>> implementation-specific ways to accomplish my task.
>> 
>> I have tried setting the environment variable using the POSIX setenv(3)
>> call, but it seems that Open MPI comm-spawn'd processes do not inherit
>> environment variables.  See the attached 2 C99 programs; one prints
>> out the environment it receives, and one sets the MEANING_OF_LIFE
>> environment variable, spawns the previous 'env printing' program, and
>> exits.  I run via:
>> 
>> $ env -i HOME=/home/tfogal \
>> PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin \
>> mpirun -x TJFVAR=testing -n 5 ./mpienv ./envpar
>> 
>> and expect (well, hope) to find the MEANING_OF_LIFE in 'envpar's
>> output.  I do see TJFVAR, but the MEANING_OF_LIFE sadly does not
>> propagate.  Perhaps I am asking the wrong question...
>> 
>> I found another MPI implementation which allowed passing such
>> information via the MPI_Info argument, however I could find no
>> documentation of similar functionality in Open MPI.
>> 
>> Is there a way to accomplish what I'm looking for?  I could even be
>>>

Re: [OMPI users] Error: Unable to create the sub-directory (/tmp/openmpi etc...)

2013-12-19 Thread Brandon Turner
Thanks a lot! Indeed, it was an issue of permissions. I did not realize the
difference in the /tmp directories, and it seems that the /tmp directory
for the node in question was "read-only". This has since been switched, and
presumably everything else will run smoothly now. My fingers are crossed.

-Brandon
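
(For anyone hitting the same symptom without a read-only /tmp, the advice below
can be combined into a single invocation that points Open MPI's session
directories at the per-job scratch area Torque creates and removes itself.
This is a sketch; the program and arguments are placeholders:

  mpiexec --mca orte_tmpdir_base "$TMPDIR" <program> <args>
)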


On Tue, Dec 17, 2013 at 2:26 PM, Reuti  wrote:

> Hi,
>
> Am 17.12.2013 um 22:32 schrieb Brandon Turner:
>
> > I've been struggling with this problem for a few days now and am out of
> ideas. I am submitting a job using TORQUE on a beowulf cluster. One step
> involves running mpiexec, and that is where this error occurs. I've found
> some other similar queries in the past:
> >
> > http://www.open-mpi.org/community/lists/users/att-11378/attachment
> >
> > http://www.open-mpi.org/community/lists/users/2013/09/22608.php
> >
> > http://www.open-mpi.org/community/lists/users/2009/11/11129.php
> >
> > I'm new to using open-mpi so much of this is very new to me. However, it
> does not seem that my /tmp folder is full as far as I can tell. I've tried
> reassigning the temporary directory using the MCA attribute (i.e. mpiexec
> --mca orte_tmpdir_base /home/pathA/pathB process argument1 argument2
> argument3), but that was unsuccessful as well. Similarly, if thousands of
> sub-directories are being created, I have no idea where those would be if
> this is some ext3 violation issue. It's worth noting that when I submit
> this job, it works on some occasions and not on others. I suspect it has
> something to do with the nodes that I am assigned and some property of
> certain nodes.
> >
> > It never used to have this problem until a few days ago, and now I
> mostly can't get it to work except on a few occasions, which makes me think
> that perhaps it is a node-specific issue. Any thoughts or suggestions would
> be much appreciated!
>
> a) As it's not your personal /tmp but a machine-wide one, it might be full on
> this particular node.
>
> b) Or the admin changed the permissions on /tmp so that only Torque can
> generate any temporary directory therein, and any additional one created by
> a batch job should go to $TMPDIR which is created and removed by Torque for
> your particular job. It might be that Open MPI is not tightly integrated
> into your Torque installation. Did you ever have the chance to peek on a
> node to check whether your MPI processes are children of pbs_mom and not of
> any ssh connection?
>
> -- Reuti
>
>
> > Thanks,
> >
> > Brandon
> >
> > PS I've copied the full error output below:
> > [bc11bl08.deac.wfu.edu:31532] opal_os_dirpath_create: Error: Unable to
> create the sub-directory
> (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0) of
> (/tmp/openmpi-sessions-turn...@bc11bl08.deac.wfu.edu_0/2243/0/7), mkdir
> failed [1]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/util/session_dir.c at line 106
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/util/session_dir.c at line 399
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../../../orte/mca/ess/base/ess_base_std_orted.c at line 283
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../../../../orte/mca/ess/tm/ess_tm_module.c at line 112
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: Error in
> file ../../orte/runtime/orte_init.c at line 128
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../../../../orte/mca/rml/oob/rml_oob_send.c at line 104
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] could not get route to
> [[INVALID],INVALID]
> > [bc11bl08.deac.wfu.edu:31532] [[2243,0],7] ORTE_ERROR_LOG: A message is
> attempting to be sent to a process whose contact information is unknown in
> file ../../orte/util/show_help.c at line 627
> > [bc11bl08.deac.wfu.edu:31

Re: [OMPI users] environment variables and MPI_Comm_spawn

2013-12-19 Thread tom fogal

Okay, no worries on the delay, and thanks!  -tom

On 12/19/2013 04:32 PM, Ralph Castain wrote:

Sorry for delay - buried in my "day job". Adding values to the env array is fine, but 
this isn't how we would normally do it. I've got it noted on my "to-do" list and will try 
to get to it in time for 1.7.5

Thanks
Ralph

On Dec 13, 2013, at 4:42 PM, Jeff Squyres (jsquyres)  wrote:


Thanks for the first 2 patches, Tom -- I applied them to the SVN trunk and 
scheduled them to go into the v1.7 series.  I don't know if they'll make 1.7.4 
or be pushed to 1.7.5, but they'll get there.

I'll defer to Ralph for the rest of the discussion about info keys.


On Dec 13, 2013, at 9:16 AM, tom fogal  wrote:


Hi Ralph, thanks for your help!

Ralph Castain writes:

It would have to be done via MPI_Info arguments, and we never had a
request to do so (and hence, don't define such an argument). It would
be easy enough to do so (look in the ompi/mca/dpm/orte/dpm_orte.c
code).
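
(For illustration, a user-side call relying on such an MPI_Info key could look
like the sketch below. The "env" key name is hypothetical here: Open MPI does
not define it at this point, and adding exactly this kind of hook is what the
patch discussed in this thread is about; some other MPI implementations accept
a similar key.)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm child;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    /* Hypothetical key: ask the implementation to set this variable in the
     * environment of the spawned processes. */
    MPI_Info_set(info, "env", "MEANING_OF_LIFE=42");

    MPI_Comm_spawn("./envpar", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}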


Well, I wanted to just report success, but I've only got the easy
side of it: saving the arguments from the MPI_Info arguments into
the orte_job_t struct.  See attached "0003" patch (against trunk).
However, I couldn't figure out how to get the other side: reading out
the environment variables and setting them at fork.  Maybe you could
help with (or do :-) that?

Or just guide me as to where again: I threw abort()s in 'spawn'
functions I found under plm/, but my programs didn't abort and so I'm
not sure where they went.


MPI implementations generally don't forcibly propagate envars because
it is so hard to know which ones to handle - it is easy to propagate
a system envar that causes bad things to happen on the remote end.


I understand.  Though in this case, I'm /trying/ to make Bad Things
(tm) happen ;-).


One thing you could do, of course, is add that envar to your default
shell setup (.bashrc or whatever). This would set the variable by
default on your remote locations (assuming you are using rsh/ssh
for your launcher), and then any process you start would get
it. However, that won't help if this is an envar intended only for
the comm_spawned process.


Unfortunately what I want to play with at the moment are LD_*
variables, and fiddling with these in my .bashrc will mess up a lot
more than just the simulation I am presently hacking.


I can add this capability to the OMPI trunk, and port it to the 1.7
release - but we don't go all the way back to the 1.4 series any
more.


Yes, having this in a 1.7 release would be great!


BTW, I encountered a couple other small things while grepping through
source/waiting for trunk to build, so there are two other small patches
attached.  One gets rid of warnings about unused functions in generated
lexing code.  I believe the second fixes resource leaks on error paths.
However, it turned out none of my user-level code hit that function at
all, so I haven't been able to test it.  Take from it what you will...

-tom


On Wed, Dec 11, 2013 at 2:10 PM, tom fogal  wrote:


Hi all,

I'm developing on Open MPI 1.4.5-ubuntu2 on Ubuntu 13.10 (so, Ubuntu's
packaged Open MPI) at the moment.

I'd like to pass environment variables to processes started via
MPI_Comm_spawn.  Unfortunately, the MPI 3.0 standard (at least) does
not seem to specify a way to do this; thus I have been searching for
implementation-specific ways to accomplish my task.

I have tried setting the environment variable using the POSIX setenv(3)
call, but it seems that Open MPI comm-spawn'd processes do not inherit
environment variables.  See the attached 2 C99 programs; one prints
out the environment it receives, and one sets the MEANING_OF_LIFE
environment variable, spawns the previous 'env printing' program, and
exits.  I run via:

$ env -i HOME=/home/tfogal \
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin \
mpirun -x TJFVAR=testing -n 5 ./mpienv ./envpar

and expect (well, hope) to find the MEANING_OF_LIFE in 'envpar's
output.  I do see TJFVAR, but the MEANING_OF_LIFE sadly does not
propagate.  Perhaps I am asking the wrong question...

I found another MPI implementation which allowed passing such
information via the MPI_Info argument, however I could find no
documentation of similar functionality in Open MPI.

Is there a way to accomplish what I'm looking for?  I could even be
convinced to hack source, but a starting pointer would be appreciated.

Thanks,

-tom
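
(A minimal reconstruction of the test pair described above -- illustrative
only, not the actual attachments -- might look like this in C99:)

/* envpar.c -- print the environment this process received. */
#include <mpi.h>
#include <stdio.h>

extern char **environ;

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    for (char **e = environ; *e != NULL; ++e)
        printf("%s\n", *e);
    MPI_Finalize();
    return 0;
}

/* mpienv.c -- set a variable, then spawn the program named on the command line. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Comm child;

    MPI_Init(&argc, &argv);
    /* Per the report above, this setting does not reach the spawned procs. */
    setenv("MEANING_OF_LIFE", "42", 1);
    MPI_Comm_spawn(argv[1], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &child, MPI_ERRCODES_IGNORE);
    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}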


 From 8285a7625e5ea014b9d4df5dd65a7642fd4bc322 Mon Sep 17 00:00:00 2001
From: Tom Fogal 
Date: Fri, 13 Dec 2013 12:03:56 +0100
Subject: [PATCH 1/3] btl: Remove warnings about unused lexing functions.

---
ompi/mca/btl/openib/btl_openib_lex.l | 2 ++
1 file changed, 2 insertions(+)

diff --git a/ompi/mca/btl/openib/btl_openib_lex.l 
b/ompi/mca/btl/openib/btl_openib_lex.l
index 2aa6059..7455b78 100644
--- a/ompi/mca/btl/openib/btl_openib_lex.l
+++ b/ompi/mca/btl/openib/btl_openib_lex.l
@@ -1,3 +1,5 @@
+%option nounput
+%option noinput
%{ /* -*- C -*- */
/*
* Copyright (c) 2004-2005 The T

Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-19 Thread Ralph Castain
Hmmm...not having any luck tracking this down yet. If anything, based on what I 
saw in the code, I would have expected it to fail when hetero-nodes was false, 
not the other way around.

I'll keep poking around - just wanted to provide an update.

On Dec 19, 2013, at 12:54 AM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph, sorry for the overlapping post.
> 
> Your advice about -hetero-nodes in the other thread gave me a hint.
> 
> I already put "orte_hetero_nodes = 1" in my mca-params.conf, because
> you told me a month ago that my environment would need this option.
> 
> Removing this line from mca-params.conf makes it work.
> In other words, you can replicate the problem by adding -hetero-nodes as
> shown below.
> 
> qsub: job 8364.manage.cluster completed
> [mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
> qsub: waiting for job 8365.manage.cluster to start
> qsub: job 8365.manage.cluster ready
> 
> [mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
>MCA orte: parameter "orte_hetero_nodes" (current value:
> "false", data source: default, level: 9 dev/all,
> type: bool)
> [mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> myprog
> [node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 0 of 4
> Hello world from process 1 of 4
> Hello world from process 2 of 4
> Hello world from process 3 of 4
> [mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> -hetero-nodes myprog
> --
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
> 
>   Bind to: CORE
>   Node:node12
>   #processes:  2
>   #cpus:  1
> 
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
> --
> 
> 
> As far as I checked, data->num_bound seems to become invalid in bind_downwards
> when I add "-hetero-nodes". I hope you can clear up the problem.
> 
> Regards,
> Tetsuya Mishima
> 
> 
>> Yes, it's very strange. But I don't think there's any chance that
>> I have < 8 actual cores on the node. I guess that you can replicate
>> it with SLURM; please try it again.
>> 
>> I changed to use node10 and node11, then I got the warning against
>> node11.
>> 
>> Furthermore, just for your information, I tried adding
>> "-bind-to core:overload-allowed", and then it worked as shown below.
>> But I think node11 is never overloaded because it has 8 cores.
>> 
>> qsub: job 8342.manage.cluster completed
>> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
>> qsub: waiting for job 8343.manage.cluster to start
>> qsub: job 8343.manage.cluster ready
>> 
>> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
>> [mishima@node10 demos]$ cat $PBS_NODEFILE
>> node10
>> node10
>> node10
>> node10
>> node10
>> node10
>> node10
>> node10
>> node11
>> node11
>> node11
>> node11
>> node11
>> node11
>> node11
>> node11
>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> myprog
>> 
> --
>> A request was made to bind to that would result in binding more
>> processes than cpus on a resource:
>> 
>> Bind to: CORE
>> Node:node11
>> #processes:  2
>> #cpus:  1
>> 
>> You can override this protection by adding the "overload-allowed"
>> option to your binding directive.
>> 
> --
>> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
>> -bind-to core:overload-allowed myprog
>> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]],
> socket
>> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
>> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
>> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]],
> socket
>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]],
> socket
>> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
>> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
>> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]],
> s

Re: [OMPI users] environment variables and MPI_Comm_spawn

2013-12-19 Thread Ralph Castain
Sorry for delay - buried in my "day job". Adding values to the env array is 
fine, but this isn't how we would normally do it. I've got it noted on my 
"to-do" list and will try to get to it in time for 1.7.5

Thanks
Ralph

On Dec 13, 2013, at 4:42 PM, Jeff Squyres (jsquyres)  wrote:

> Thanks for the first 2 patches, Tom -- I applied them to the SVN trunk and 
> scheduled them to go into the v1.7 series.  I don't know if they'll make 
> 1.7.4 or be pushed to 1.7.5, but they'll get there.
> 
> I'll defer to Ralph for the rest of the discussion about info keys.
> 
> 
> On Dec 13, 2013, at 9:16 AM, tom fogal  wrote:
> 
>> Hi Ralph, thanks for your help!
>> 
>> Ralph Castain writes:
>>> It would have to be done via MPI_Info arguments, and we never had a
>>> request to do so (and hence, don't define such an argument). It would
>>> be easy enough to do so (look in the ompi/mca/dpm/orte/dpm_orte.c
>>> code).
>> 
>> Well, I wanted to just report success, but I've only got the easy
>> side of it: saving the arguments from the MPI_Info arguments into
>> the orte_job_t struct.  See attached "0003" patch (against trunk).
>> However, I couldn't figure out how to get the other side: reading out
>> the environment variables and setting them at fork.  Maybe you could
>> help with (or do :-) that?
>> 
>> Or just guide me as to where again: I threw abort()s in 'spawn'
>> functions I found under plm/, but my programs didn't abort and so I'm
>> not sure where they went.
>> 
>>> MPI implementations generally don't forcibly propagate envars because
>>> it is so hard to know which ones to handle - it is easy to propagate
>>> a system envar that causes bad things to happen on the remote end.
>> 
>> I understand.  Though in this case, I'm /trying/ to make Bad Things
>> (tm) happen ;-).
>> 
>>> One thing you could do, of course, is add that envar to your default
>>> shell setup (.bashrc or whatever). This would set the variable by
>>> default on your remote locations (assuming you are using rsh/ssh
>>> for your launcher), and then any process you start would get
>>> it. However, that won't help if this is an envar intended only for
>>> the comm_spawned process.
>> 
>> Unfortunately what I want to play with at the moment are LD_*
>> variables, and fiddling with these in my .bashrc will mess up a lot
>> more than just the simulation I am presently hacking.
>> 
>>> I can add this capability to the OMPI trunk, and port it to the 1.7
>>> release - but we don't go all the way back to the 1.4 series any
>>> more.
>> 
>> Yes, having this in a 1.7 release would be great!
>> 
>> 
>> BTW, I encountered a couple other small things while grepping through
>> source/waiting for trunk to build, so there are two other small patches
>> attached.  One gets rid of warnings about unused functions in generated
>> lexing code.  I believe the second fixes resource leaks on error paths.
>> However, it turned out none of my user-level code hit that function at
>> all, so I haven't been able to test it.  Take from it what you will...
>> 
>> -tom
>> 
>>> On Wed, Dec 11, 2013 at 2:10 PM, tom fogal  wrote:
>>> 
 Hi all,
 
 I'm developing on Open MPI 1.4.5-ubuntu2 on Ubuntu 13.10 (so, Ubuntu's
 packaged Open MPI) at the moment.
 
 I'd like to pass environment variables to processes started via
 MPI_Comm_spawn.  Unfortunately, the MPI 3.0 standard (at least) does
 not seem to specify a way to do this; thus I have been searching for
 implementation-specific ways to accomplish my task.
 
 I have tried setting the environment variable using the POSIX setenv(3)
 call, but it seems that Open MPI comm-spawn'd processes do not inherit
 environment variables.  See the attached 2 C99 programs; one prints
 out the environment it receives, and one sets the MEANING_OF_LIFE
 environment variable, spawns the previous 'env printing' program, and
 exits.  I run via:
 
 $ env -i HOME=/home/tfogal \
 PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin \
 mpirun -x TJFVAR=testing -n 5 ./mpienv ./envpar
 
 and expect (well, hope) to find the MEANING_OF_LIFE in 'envpar's
 output.  I do see TJFVAR, but the MEANING_OF_LIFE sadly does not
 propagate.  Perhaps I am asking the wrong question...
 
 I found another MPI implementation which allowed passing such
 information via the MPI_Info argument, however I could find no
 documentation of similar functionality in Open MPI.
 
 Is there a way to accomplish what I'm looking for?  I could even be
 convinced to hack source, but a starting pointer would be appreciated.
 
 Thanks,
 
 -tom
>> 
>> From 8285a7625e5ea014b9d4df5dd65a7642fd4bc322 Mon Sep 17 00:00:00 2001
>> From: Tom Fogal 
>> Date: Fri, 13 Dec 2013 12:03:56 +0100
>> Subject: [PATCH 1/3] btl: Remove warnings about unused lexing functions.
>> 
>> ---
>> ompi/mca/btl/openib/btl_openib_lex.l | 2 ++
>> 1 file changed, 2 insertions(+)
>> 
>> diff --git 

Re: [OMPI users] EXTERNAL: Re: What's the status of OpenMPI and thread safety?

2013-12-19 Thread Blosch, Edwin L
Thanks Ralph,

We are attempting to use 1.6.4 with an application that requires 
multi-threading, and it is hanging most of the time; it is using openib.  They 
steered us to try Intel MPI for now.  If you lack drivers/testers for improved 
thread safety on openib, let me know and I'll encourage the developers of the 
application to support you.

Ed

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Wednesday, December 18, 2013 6:50 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] What's the status of OpenMPI and thread 
safety?

This was, in fact, a primary point of discussion at last week's OMPI 
developer's conference. Bottom line is that we are only a little further along 
than we used to be, but are focusing on improving it. You'll find good thread 
support for some transports (some of the MTLs and at least the TCP BTL), not so 
good for others (e.g., openib is flat-out not thread safe).


On Dec 18, 2013, at 3:57 PM, Blosch, Edwin L  wrote:


I was wondering if the FAQ entry below is considered current opinion or perhaps 
a little stale.  Is multi-threading still considered to be 'lightly tested'?  
Are there known open bugs?

Thank you,

Ed


7. Is Open MPI thread safe?

Support for MPI_THREAD_MULTIPLE (i.e., multiple threads executing within the 
MPI library) and asynchronous message passing progress (i.e., continuing 
message passing operations even while no user threads are in the MPI library) 
has been designed into Open MPI from its first planning meetings.

Support for MPI_THREAD_MULTIPLE is included in the first version of Open MPI, 
but it is only lightly tested and likely still has some bugs. Support for 
asynchronous progress is included in the TCP point-to-point device, but it, 
too, has only had light testing and likely still has bugs.

Completing the testing for full support of MPI_THREAD_MULTIPLE and asynchronous 
progress is planned in the near future.

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] [EXTERNAL] Re: What's the status of OpenMPI and thread safety?

2013-12-19 Thread Barrett, Brian W
Pablo -

As Ralph mentioned, it will be different, possibly not for the better, in
1.7.  This is an area of active work, so any help would be appreciated.

However, the one issue you brought up is going to be problematic, even
with threads.  Our design essentially makes it such that blocking MPI
calls never block internally (for any thread level).  It's one of the
trade-offs in our multi-device design.  Good: multi-device just works
without any complicated state sharing between devices.  Bad: it's hard for
us to block.  We've talked about making a blocking (or slow polling,
semi-blocking) option for when we can detect we only have one device, but
it hasn't been a high priority.

Like Ralph said, if you're interested in working on the threading or
blocking issues, please join the devel list and let us know.  We're always
willing to take new patches.

Thanks,

Brian
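
(For reference, the user-side request for full thread support that this
discussion assumes looks like the generic sketch below; it is not tied to any
particular Open MPI version:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not granted (got level %d)\n",
                provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Application threads may now call MPI concurrently, subject to the
     * transport-level caveats discussed in this thread (e.g. openib). */

    MPI_Finalize();
    return 0;
}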

On 12/19/13 5:34 AM, "Pablo Barrio"  wrote:

>Hi all, this is the first time I post to the list (although I have read
>it for a while now). I hope this helps.
>
>I'm heavily using MPI_THREAD_MULTIPLE on multicores (sm BTL) and my
>programs work fine from a CORRECTNESS point of view. I use OpenMPI 1.6
>(SVN rev. 26429) and pthreads on Linux.
>
>This said, the performance is still very poor. Some of my programs become
>a thousand times slower. After some profiling/tracing, I found out that
>the Linux scheduler gave CPU time to threads stuck in blocking calls
>(Ssend, Recv, Wait, etcetera). It seems to
> me that the MPI implementation can be improved to avoid spending CPU
>time in threads waiting for messages.
>
>In short, my experience is that the implementation is correct but not
>very efficient so far.
>
>I have a few questions:
>
>1. My OpenMPI version is more than a year old. Have these performance
>issues been fixed in the latest versions?
>
>2. If not, perhaps I could contribute to OpenMPI multithreading
>support. Who takes care of this? How can I help?
>
>Thanks ahead.
>-- 
>Pablo Barrio
>Dpt. Electrical Engineering - Technical University of Madrid
>Office C-203
>Avda. Complutense s/n, 28040 Madrid
>Tel. (+34) 915495700 ext. 4234
>@: pbar...@die.upm.es
>On 19/12/13 01:49, Ralph Castain wrote:
>
>
>This was, in fact, a primary point of discussion at last week's OMPI
>developer's conference. Bottom line is that we are only a little further
>along than we used to be, but are focusing on improving it. You'll find
>good thread support for some transports (some
> of the MTLs and at least the TCP BTL), not so good for others (e.g.,
>openib is flat-out not thread safe).
>
>
>
>On Dec 18, 2013, at 3:57 PM, Blosch, Edwin L 
>wrote:
>
>
>I was wondering if the FAQ entry below is considered current opinion or
>perhaps a little stale.  Is multi-threading still considered to be
>'lightly tested'?  Are there known open bugs?
> 
>Thank you,
> 
>Ed
> 
> 
>7. Is Open MPI thread safe?
> 
>Support for MPI_THREAD_MULTIPLE (i.e., multiple threads executing within
>the MPI library) and asynchronous message passing progress (i.e.,
>continuing message passing operations even while no user threads are in
>the MPI library) has been designed into Open MPI
> from its first planning meetings.
> 
>Support for MPI_THREAD_MULTIPLE is included in the first version of Open
>MPI, but it is only lightly tested and likely still has some bugs.
>Support for asynchronous progress is included in the TCP point-to-point
>device, but it, too, has only had light testing
> and likely still has bugs.
> 
>Completing the testing for full support of MPI_THREAD_MULTIPLE and
>asynchronous progress is planned in the near future.
> 
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
>
>
> 
>___
>users mailing list
>users@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>


--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories





Re: [OMPI users] What's the status of OpenMPI and thread safety?

2013-12-19 Thread Ralph Castain
Just answered a similar question yesterday:

This was, in fact, a primary point of discussion at last week's OMPI 
developer's conference. Bottom line is that we are only a little further along 
than we used to be, but are focusing on improving it. You'll find good thread 
support for some transports (some of the MTLs and at least the TCP BTL), not so 
good for others (e.g., openib is flat-out not thread safe).

We welcome contributors! I'd suggest inquiring on the devel mailing list, 
though, as discussion of what needs to be done can get rather detailed.

Ralph


On Dec 19, 2013, at 4:34 AM, Pablo Barrio  wrote:

> Hi all, this is the first time I post to the list (although I have read it 
> for a while now). I hope this helps.
> 
> I'm heavily using MPI_THREAD_MULTIPLE on multicores (sm BTL) and my programs 
> work fine from a CORRECTNESS point of view. I use OpenMPI 1.6 (SVN rev. 
> 26429) and pthreads on Linux.
> 
> This said, the performance is still very poor. Some of my programs become a 
> thousand times slower. After some profiling/tracing, I found out that the 
> Linux scheduler gave CPU time to threads stuck in blocking calls (Ssend, 
> Recv, Wait, etcetera). It seems to me that the MPI implementation can be 
> improved to avoid spending CPU time in threads waiting for messages.
> 
> In short, my experience is that the implementation is correct but not very 
> efficient so far.
> 
> I have a few questions:
> 
> 1. My OpenMPI version is more than a year old. Have these performance 
> issues been fixed in the latest versions?
> 
> 2. If not, perhaps I could contribute to OpenMPI multithreading support. 
> Who takes care of this? How can I help?
> 
> Thanks ahead.
> -- 
> Pablo Barrio
> Dpt. Electrical Engineering - Technical University of Madrid
> Office C-203
> Avda. Complutense s/n, 28040 Madrid
> Tel. (+34) 915495700 ext. 4234
> @: pbar...@die.upm.es
> 
> On 19/12/13 01:49, Ralph Castain wrote:
>> This was, in fact, a primary point of discussion at last week's OMPI 
>> developer's conference. Bottom line is that we are only a little further 
>> along than we used to be, but are focusing on improving it. You'll find good 
>> thread support for some transports (some of the MTLs and at least the TCP 
>> BTL), not so good for others (e.g., openib is flat-out not thread safe).
>> 
>> 
>> On Dec 18, 2013, at 3:57 PM, Blosch, Edwin L  wrote:
>> 
>>> I was wondering if the FAQ entry below is considered current opinion or 
>>> perhaps a little stale.  Is multi-threading still considered to be ‘lightly 
>>> tested’?  Are there known open bugs?
>>>  
>>> Thank you,
>>>  
>>> Ed
>>>  
>>>  
>>> 7. Is Open MPI thread safe?
>>>  
>>> Support for MPI_THREAD_MULTIPLE (i.e., multiple threads executing within 
>>> the MPI library) and asynchronous message passing progress (i.e., 
>>> continuing message passing operations even while no user threads are in the 
>>> MPI library) has been designed into Open MPI from its first planning 
>>> meetings.
>>>  
>>> Support for MPI_THREAD_MULTIPLE is included in the first version of Open 
>>> MPI, but it is only lightly tested and likely still has some bugs. Support 
>>> for asynchronous progress is included in the TCP point-to-point device, but 
>>> it, too, has only had light testing and likely still has bugs.
>>>  
>>> Completing the testing for full support of MPI_THREAD_MULTIPLE and 
>>> asynchronous progress is planned in the near future.
>>>  
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Noam Bernstein
On Dec 18, 2013, at 5:19 PM, Martin Siegert  wrote:
> 
> Thanks for figuring this out. Does this work for 1.6.x as well?
> The FAQ http://www.open-mpi.org/faq/?category=tuning#using-paffinity
> covers versions 1.2.x to 1.5.x. 
> Does 1.6.x support mpi_paffinity_alone = 1 ?
> I set this in openmpi-mca-params.conf but
> 
> # ompi_info | grep affinity
>  MPI extensions: affinity example
>   MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
>   MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.6.4)
>   MCA maffinity: hwloc (MCA v2.0, API v2.0, Component v1.6.4)
> 
> does not give any indication that this is actually used.

I never checked actual bindings with hwloc-ps or anything like that,
but as far as I can tell, 1.6.4 had consistently high performance when I
used mpi_paffinity_alone=1, and slowdowns of up to a factor of ~2
when I didn't.  1.7.3 with the old kernel never showed extreme slowdowns,
but we didn't benchmark it carefully, so it's conceivable it had minor
(same factor of 2) slowdowns.  With the new kernel 1.7.3 would
show slowdowns between a factor of 2 and maybe 20 (paffinity definitely
did nothing), and "--bind-to core" restored consistent performance.


Noam
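
(For reference, the two settings compared above are set in different places --
roughly:

  # 1.6.x style, e.g. in openmpi-mca-params.conf
  mpi_paffinity_alone = 1

  # 1.7.x style, on the mpirun command line
  mpirun --bind-to core --report-bindings ...
)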



Re: [OMPI users] What's the status of OpenMPI and thread safety?

2013-12-19 Thread Pablo Barrio
Hi all, this is the first time I post to the list (although I have read 
it for a while now). I hope this helps.


I'm heavily using MPI_THREAD_MULTIPLE on multicores (sm BTL) and my 
programs work fine from a CORRECTNESS point of view. I use OpenMPI 1.6 
(SVN rev. 26429) and pthreads on Linux.


This said, the performance is still very poor. Some of my programs 
become a thousand times slower. After some profiling/tracing, I found 
out that the Linux scheduler gave CPU time to threads stuck in blocking 
calls (Ssend, Recv, Wait, etcetera). It seems to me that the MPI 
implementation can be improved to avoid spending CPU time in threads 
waiting for messages.


In short, my experience is that the implementation is correct but not 
very efficient so far.


I have a few questions:

1. My OpenMPI version is more than a year old. Have these 
performance issues been fixed in the latest versions?


2. If not, perhaps I could contribute to OpenMPI multithreading 
support. Who takes care of this? How can I help?


Thanks ahead.

--
Pablo Barrio
Dpt. Electrical Engineering - Technical University of Madrid
Office C-203
Avda. Complutense s/n, 28040 Madrid
Tel. (+34) 915495700 ext. 4234
@: pbar...@die.upm.es


On 19/12/13 01:49, Ralph Castain wrote:
This was, in fact, a primary point of discussion at last week's OMPI 
developer's conference. Bottom line is that we are only a little 
further along than we used to be, but are focusing on improving it. 
You'll find good thread support for some transports (some of the MTLs 
and at least the TCP BTL), not so good for others (e.g., openib is 
flat-out not thread safe).



On Dec 18, 2013, at 3:57 PM, Blosch, Edwin L  wrote:


I was wondering if the FAQ entry below is considered current opinion 
or perhaps a little stale.  Is multi-threading still considered to be 
'lightly tested'?  Are there known open bugs?

Thank you,
Ed
7. Is Open MPI thread safe?
Support for MPI_THREAD_MULTIPLE (i.e., multiple threads executing 
within the MPI library) and asynchronous message passing progress 
(i.e., continuing message passing operations even while no user 
threads are in the MPI library) has been designed into Open MPI from 
its first planning meetings.
Support for MPI_THREAD_MULTIPLE is included in the first version of 
Open MPI, but it is only lightly tested and likely still has some 
bugs. Support for asynchronous progress is included in the TCP 
point-to-point device, but it, too, has only had light testing and 
likely still has bugs.
Completing the testing for full support of MPI_THREAD_MULTIPLE and 
asynchronous progress is planned in the near future.

___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Dave Love
Brice Goglin  writes:

> hwloc-ps (and lstopo --top) are better at showing process binding but
> they lack a nice pseudographical interface with dynamic refresh.

That seems like an advantage when you want to check on a cluster!

> htop uses hwloc internally iirc, so there's hope we'll have everything needed 
> in htop one day ;)

Apparently not in RH EPEL, for what it's worth, and I don't understand
how to get bindings out of it.
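
(For completeness, the two hwloc commands mentioned above can be run directly
on a node to inspect bindings:

  hwloc-ps       # list bound processes and their bindings
  lstopo --top   # show the topology with running tasks overlaid
)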



Re: [OMPI users] slowdown with infiniband and latest CentOS kernel

2013-12-19 Thread Dave Love
Noam Bernstein  writes:

> On Dec 18, 2013, at 10:32 AM, Dave Love  wrote:
>
>> Noam Bernstein  writes:
>> 
>>> We specifically switched to 1.7.3 because of a bug in 1.6.4 (lock up in 
>>> some 
>>> collective communication), but now I'm wondering whether I should just test
>>> 1.6.5.
>> 
>> What bug, exactly?  As you mentioned vasp, is it specifically affecting
>> that?
>
> Yes - I never characterized it fully, but we attached gdb to every
> single running vasp process, and all were stuck in the same
> call to MPI_allreduce() every time. It's only happening on rather large
> jobs, so it's not the easiest setup to debug.

Maybe that's a different problem.  I know they tried multiple versions
of vasp, which had different failures.  Actually, I just remembered that
the version I examined with padb was built with the intel compiler and
run with gcc openmpi (I know...), but builds with gcc failed too.  I
don't know if that was taken up with the developers.

I guess this isn't the place to discuss vasp, unless it's helping to pin
down an ompi problem, but people might benefit from notes of problems in
the archive.

> If I can reproduce the problem with 1.6.5, and I can confirm that it's always 
> locking up in the same call to mpi_allreduce, and all processes are stuck 
> in the same call, is there interest in looking into a possible mpi issue?  

I'd have thought so from the point of view of those of us running 1.6
for compatibility with the RHEL6 openmpi.

Thanks for the info, anyhow.

Incidentally, if vasp is built with ompi's alltoallv -- I understand it
has its own implementation of that or something similar --
 may be
relevant, if you haven't seen it.
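
(The kind of evidence described above -- every rank stuck in the same
MPI_Allreduce -- is usually gathered by attaching to each suspect rank and
taking a backtrace, e.g.

  gdb -p <pid> -batch -ex 'thread apply all bt'

or with a parallel stack-trace tool such as padb, mentioned earlier.)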



Re: [OMPI users] openmpi-1.7.4a1r29646 with -hostfile option under Torque manager

2013-12-19 Thread tmishima


Hi Ralph, sorry for the overlapping post.

Your advice about -hetero-nodes in the other thread gave me a hint.

I already put "orte_hetero_nodes = 1" in my mca-params.conf, because
you told me a month ago that my environment would need this option.

Removing this line from mca-params.conf makes it work.
In other words, you can replicate the problem by adding -hetero-nodes as
shown below.

qsub: job 8364.manage.cluster completed
[mishima@manage mpi]$ qsub -I -l nodes=2:ppn=8
qsub: waiting for job 8365.manage.cluster to start
qsub: job 8365.manage.cluster ready

[mishima@node11 ~]$ ompi_info --all | grep orte_hetero_nodes
MCA orte: parameter "orte_hetero_nodes" (current value:
"false", data source: default, level: 9 dev/all,
 type: bool)
[mishima@node11 ~]$ cd ~/Desktop/openmpi-1.7/demos/
[mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
myprog
[node11.cluster:27895] MCW rank 0 bound to socket 0[core 0[hwt 0]], socket
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
[node11.cluster:27895] MCW rank 1 bound to socket 1[core 4[hwt 0]], socket
1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 3 bound to socket 1[core 4[hwt 0]], socket
1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
[node12.cluster:24891] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
Hello world from process 0 of 4
Hello world from process 1 of 4
Hello world from process 2 of 4
Hello world from process 3 of 4
[mishima@node11 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
-hetero-nodes myprog
--
A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to: CORE
   Node:node12
   #processes:  2
   #cpus:  1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--


As far as I checked, data->num_bound seems to become invalid in bind_downwards
when I add "-hetero-nodes". I hope you can clear up the problem.

Regards,
Tetsuya Mishima
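
(For reference, the setting discussed above can be expressed either in the MCA
parameter file or on the mpirun command line; based on this thread, the
following should be equivalent ways to enable it:

  # in mca-params.conf
  orte_hetero_nodes = 1

  # on the command line
  mpirun --mca orte_hetero_nodes 1 ...
  mpirun -hetero-nodes ...
)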


> Yes, it's very strange. But I don't think there's any chance that
> I have < 8 actual cores on the node. I guess that you can replicate
> it with SLURM; please try it again.
>
> I changed to use node10 and node11, then I got the warning against
> node11.
>
> Furthermore, just for your information, I tried adding
> "-bind-to core:overload-allowed", and then it worked as shown below.
> But I think node11 is never overloaded because it has 8 cores.
>
> qsub: job 8342.manage.cluster completed
> [mishima@manage ~]$ qsub -I -l nodes=node10:ppn=8+node11:ppn=8
> qsub: waiting for job 8343.manage.cluster to start
> qsub: job 8343.manage.cluster ready
>
> [mishima@node10 ~]$ cd ~/Desktop/openmpi-1.7/demos/
> [mishima@node10 demos]$ cat $PBS_NODEFILE
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node10
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> node11
> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> myprog
>
--
> A request was made to bind to that would result in binding more
> processes than cpus on a resource:
>
> Bind to: CORE
> Node:node11
> #processes:  2
> #cpus:  1
>
> You can override this protection by adding the "overload-allowed"
> option to your binding directive.
>
--
> [mishima@node10 demos]$ mpirun -np 4 -cpus-per-proc 4 -report-bindings
> -bind-to core:overload-allowed myprog
> [node10.cluster:27020] MCW rank 0 bound to socket 0[core 0[hwt 0]],
socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> [node10.cluster:27020] MCW rank 1 bound to socket 1[core 4[hwt 0]],
socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 3 bound to socket 1[core 4[hwt 0]],
socket
> 1[core 5[hwt 0]], socket 1[core 6[hwt 0]], so
> cket 1[core 7[hwt 0]]: [./././.][B/B/B/B]
> [node11.cluster:26597] MCW rank 2 bound to socket 0[core 0[hwt 0]],
socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]: [B/B/B/B][./././.]
> Hello world from process 1 of 4
> Hello world from process 0 of 4
> Hello world from process 3 of 4
> Hello world from process 2 of 4
>
> Regards,
> Tetsuya Mishima
>
>
> > Very strange - I can't seem to replicate it. Is there any chance that
you
> have < 8 actual cores on node12?
> >
> >
> > On Dec 18, 2013, at 4:53 PM, tmish...@jcity.maeda.co.jp wrote:
> >
> > >
> > >
> > > Hi Ralph, sorry for confusing you.
> 

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 dosen't work for our magny cours based 32 core node

2013-12-19 Thread tmishima


I can wait until it's fixed in 1.7.5 or later, because putting "-bind-to numa"
and "-map-by numa" on the command line at the same time works as a workaround.

Thanks,
Tetsuya Mishima
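
(Concretely, the workaround described above amounts to an invocation along
these lines, using the same demo program that appears later in this thread:

  mpirun -np 8 -map-by numa -bind-to numa -report-bindings myprog

plus --hetero-nodes where, as Ralph notes below, the node topologies differ.)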

> Yeah, it will impact everything that uses hwloc topology maps, I fear.
>
> One side note: you'll need to add --hetero-nodes to your cmd line. If we
don't see that, we assume that all the node topologies are identical -
which clearly isn't true here.
>
> I'll try to resolve the hier inversion over the holiday - won't be for
1.7.4, but hopefully for 1.7.5
>
> Thanks
> Ralph
>
> On Dec 18, 2013, at 9:44 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > I think it's normal for AMD Opterons with 8/16 cores, such as
> > Magny-Cours or Interlagos. Because they usually have 2 NUMA nodes
> > in a CPU (socket), a NUMA node cannot include a socket. This type
> > of hierarchy would be natural.
> >
> > (node03 is Dell PowerEdge R815 and maybe quite common, I guess)
> >
> > By the way, I think this inversion should affect rmaps_lama mapping.
> >
> > Tetsuya Mishima
> >
> >> Ick - yeah, that would be a problem. I haven't seen that type of
> > hierarchical inversion before - is node03 a different type of chip?
> >>
> >> Might take awhile for me to adjust the code to handle hier
> > inversion... :-(
> >>
> >> On Dec 18, 2013, at 9:05 PM, tmish...@jcity.maeda.co.jp wrote:
> >>
> >>>
> >>>
> >>> Hi Ralph,
> >>>
> >>> I found the reason. I attached the main part of the output for the
> >>> 32-core node (node03) and the 8-core node (node05) at the bottom.
> >>>
> >>> From this information, a socket of node03 includes a numa-node.
> >>> On the other hand, a numa-node of node05 includes a socket.
> >>> The direction of the object tree is opposite.
> >>>
> >>> Since "-map-by socket" may be assumed as default,
> >>> for node05, "-bind-to numa and -map-by socket" means
> >>> upward search. For node03, this should be downward.
> >>>
> >>> I guess that openmpi-1.7.4rc1 will always assume that a numa-node
> >>> includes a socket. Is that right? Then an upward search is assumed
> >>> in orte_rmaps_base_compute_bindings even for node03 when I
> >>> put the "-bind-to numa and -map-by socket" option.
> >>>
> >>> [node03.cluster:15508] [[38286,0],0] rmaps:base:compute_usage
> >>> [node03.cluster:15508] mca:rmaps: compute bindings for job [38286,1]
> > with
> >>> policy NUMA
> >>> [node03.cluster:15508] mca:rmaps: bind upwards for job [38286,1] with
> >>> bindings NUMA
> >>> [node03.cluster:15508] [[38286,0],0] bind:upward target NUMANode type
> >>> Machine
> >>>
> >>> That's the reason of this trouble. Therefore, adding "-map-by core"
> > works.
> >>> (mapping pattern seems to be strange ...)
> >>>
> >>> [mishima@node03 demos]$ mpirun -np 8 -bind-to numa -map-by core
> >>> -report-bindings myprog
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> >>> NUMANode
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> >>> NUMANode
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> >>> NUMANode
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> >>> NUMANode
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> >>> NUMANode
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> >>> NUMANode
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> >>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > Cache
> 

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 dosen't work for our magny cours based 32 core node

2013-12-19 Thread Ralph Castain
Yeah, it will impact everything that uses hwloc topology maps, I fear.

One side note: you'll need to add --hetero-nodes to your cmd line. If we don't 
see that, we assume that all the node topologies are identical - which clearly 
isn't true here.

I'll try to resolve the hier inversion over the holiday - won't be for 1.7.4, 
but hopefully for 1.7.5

Thanks
Ralph

On Dec 18, 2013, at 9:44 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> I think it's normal for AMD opteron having 8/16 cores such as
> magny cours or interlagos. Because it usually has 2 numa nodes
> in a cpu(socket), numa-node can not include a socket. This type
> of hierarchy would be natural.
> 
> (node03 is Dell PowerEdge R815 and maybe quite common, I guess)
> 
> By the way, I think this inversion should affect rmaps_lama mapping.
> 
> Tetsuya Mishima
> 
>> Ick - yeah, that would be a problem. I haven't seen that type of
> hierarchical inversion before - is node03 a different type of chip?
>> 
>> Might take awhile for me to adjust the code to handle hier
> inversion... :-(
>> 
>> On Dec 18, 2013, at 9:05 PM, tmish...@jcity.maeda.co.jp wrote:
>> 
>>> 
>>> 
>>> Hi Ralph,
>>> 
>>> I found the reason. I attached the main part of output with 32
>>> core node(node03) and 8 core node(node05) at the bottom.
>>> 
>>> From this information, socket of node03 includes numa-node.
>>> On the other hand, numa-node of node05 includes socket.
>>> The direction of object tree is opposite.
>>> 
>>> Since "-map-by socket" may be assumed as default,
>>> for node05, "-bind-to numa and -map-by socket" means
>>> upward search. For node03, this should be downward.
>>> 
>>> I guess that openmpi-1.7.4rc1 will always assume numa-node
>>> includes socket. Is it right? Then, upward search is assumed
>>> in orte_rmaps_base_compute_bindings even for node03 when I
>>> put "-bind-to numa and -map-by socket" option.
>>> 
>>> [node03.cluster:15508] [[38286,0],0] rmaps:base:compute_usage
>>> [node03.cluster:15508] mca:rmaps: compute bindings for job [38286,1]
> with
>>> policy NUMA
>>> [node03.cluster:15508] mca:rmaps: bind upwards for job [38286,1] with
>>> bindings NUMA
>>> [node03.cluster:15508] [[38286,0],0] bind:upward target NUMANode type
>>> Machine
>>> 
>>> That's the reason of this trouble. Therefore, adding "-map-by core"
> works.
>>> (mapping pattern seems to be strange ...)
>>> 
>>> [mishima@node03 demos]$ mpirun -np 8 -bind-to numa -map-by core
>>> -report-bindings myprog
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
>>> NUMANode
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> Cache
>>> [node03.cluster:15885] [[38679,0],0] bind

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 dosen't work for our magny cours based 32 core node

2013-12-19 Thread tmishima


I think it's normal for AMD Opterons with 8/16 cores, such as
Magny-Cours or Interlagos. Because such a CPU (socket) usually contains
2 NUMA nodes, a NUMA node cannot include a socket. This type of
hierarchy would be natural.

(node03 is Dell PowerEdge R815 and maybe quite common, I guess)
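A quick way to confirm which layering a given node has (a side note,
not from the original mail) is hwloc's lstopo tool, which prints the
object tree as text when run in a terminal, e.g.:

lstopo --of console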

By the way, I think this inversion should affect rmaps_lama mapping.

Tetsuya Mishima

> Ick - yeah, that would be a problem. I haven't seen that type of
hierarchical inversion before - is node03 a different type of chip?
>
> Might take awhile for me to adjust the code to handle hier
inversion... :-(
>
> On Dec 18, 2013, at 9:05 PM, tmish...@jcity.maeda.co.jp wrote:
>
> >
> >
> > Hi Ralph,
> >
> > I found the reason. I attached the main part of output with 32
> > core node(node03) and 8 core node(node05) at the bottom.
> >
> > From this information, socket of node03 includes numa-node.
> > On the other hand, numa-node of node05 includes socket.
> > The direction of object tree is opposite.
> >
> > Since "-map-by socket" may be assumed as default,
> > for node05, "-bind-to numa and -map-by socket" means
> > upward search. For node03, this should be downward.
> >
> > I guess that openmpi-1.7.4rc1 will always assume numa-node
> > includes socket. Is it right? Then, upward search is assumed
> > in orte_rmaps_base_compute_bindings even for node03 when I
> > put "-bind-to numa and -map-by socket" option.
> >
> > [node03.cluster:15508] [[38286,0],0] rmaps:base:compute_usage
> > [node03.cluster:15508] mca:rmaps: compute bindings for job [38286,1]
with
> > policy NUMA
> > [node03.cluster:15508] mca:rmaps: bind upwards for job [38286,1] with
> > bindings NUMA
> > [node03.cluster:15508] [[38286,0],0] bind:upward target NUMANode type
> > Machine
> >
> > That's the reason of this trouble. Therefore, adding "-map-by core"
works.
> > (mapping pattern seems to be strange ...)
> >
> > [mishima@node03 demos]$ mpirun -np 8 -bind-to numa -map-by core
> > -report-bindings myprog
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
Cache
> > [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> > NUMANode
> > [node03.cluster:15885] MCW rank 2 bound to socket 0[core 0[hwt 0]],
socket
> > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > cket 0[core 3[hwt 0]]:
> > [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> > [node03.cluster:15885] MCW rank 3 bound to socket 0[core 0[hwt 0]],
socket
> > 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> > cket 0[core 3[hwt 0]]:
> > [B/B/B/B/./././.][.

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 dosen't work for our magny cours based 32 core node

2013-12-19 Thread Ralph Castain
Ick - yeah, that would be a problem. I haven't seen that type of hierarchical 
inversion before - is node03 a different type of chip?

Might take a while for me to adjust the code to handle the hierarchy inversion... :-(

On Dec 18, 2013, at 9:05 PM, tmish...@jcity.maeda.co.jp wrote:

> 
> 
> Hi Ralph,
> 
> I found the reason. I attached the main part of output with 32
> core node(node03) and 8 core node(node05) at the bottom.
> 
> From this information, socket of node03 includes numa-node.
> On the other hand, numa-node of node05 includes socket.
> The direction of object tree is opposite.
> 
> Since "-map-by socket" may be assumed as default,
> for node05, "-bind-to numa and -map-by socket" means
> upward search. For node03, this should be downward.
> 
> I guess that openmpi-1.7.4rc1 will always assume numa-node
> includes socket. Is it right? Then, upward search is assumed
> in orte_rmaps_base_compute_bindings even for node03 when I
> put "-bind-to numa and -map-by socket" option.
> 
> [node03.cluster:15508] [[38286,0],0] rmaps:base:compute_usage
> [node03.cluster:15508] mca:rmaps: compute bindings for job [38286,1] with
> policy NUMA
> [node03.cluster:15508] mca:rmaps: bind upwards for job [38286,1] with
> bindings NUMA
> [node03.cluster:15508] [[38286,0],0] bind:upward target NUMANode type
> Machine
> 
> That's the reason of this trouble. Therefore, adding "-map-by core" works.
> (mapping pattern seems to be strange ...)
> 
> [mishima@node03 demos]$ mpirun -np 8 -bind-to numa -map-by core
> -report-bindings myprog
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
> [node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
> NUMANode
> [node03.cluster:15885] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:15885] MCW rank 3 bound to socket 0[core 0[hwt 0]], socket
> 0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
> cket 0[core 3[hwt 0]]:
> [B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
> [node03.cluster:15885] MCW rank 4 bound to socket 0[core 4[hwt 0]], socket
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> cket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> [node03.cluster:15885] MCW rank 5 bound to socket 0[core 4[hwt 0]], socket
> 0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
> cket 0[core 7[hwt 0]]:
> [././././B/B/B/B][./././././././.][./././././././.][./././././././.]
> [node03.cluster:15885] MCW rank 6 bound to

Re: [OMPI users] "-bind-to numa" of openmpi-1.7.4rc1 dosen't work for our magny cours based 32 core node

2013-12-19 Thread tmishima


Hi Ralph,

I found the reason. I attached the main part of the output from the
32-core node (node03) and the 8-core node (node05) at the bottom.

From this information, a socket on node03 includes NUMA nodes.
On the other hand, a NUMA node on node05 includes sockets.
The directions of the two object trees are opposite.

Since "-map-by socket" may be assumed as default,
for node05, "-bind-to numa and -map-by socket" means
upward search. For node03, this should be downward.

I guess that openmpi-1.7.4rc1 always assumes that a NUMA node includes
sockets. Is that right? If so, an upward search is assumed in
orte_rmaps_base_compute_bindings even for node03 when I use the
"-bind-to numa" and "-map-by socket" options.

[node03.cluster:15508] [[38286,0],0] rmaps:base:compute_usage
[node03.cluster:15508] mca:rmaps: compute bindings for job [38286,1] with
policy NUMA
[node03.cluster:15508] mca:rmaps: bind upwards for job [38286,1] with
bindings NUMA
[node03.cluster:15508] [[38286,0],0] bind:upward target NUMANode type
Machine

That's the reason for this trouble. Therefore, adding "-map-by core" works
(although the mapping pattern seems strange ...).

[mishima@node03 demos]$ mpirun -np 8 -bind-to numa -map-by core
-report-bindings myprog
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type Cache
[node03.cluster:15885] [[38679,0],0] bind:upward target NUMANode type
NUMANode
[node03.cluster:15885] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
cket 0[core 3[hwt 0]]:
[B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:15885] MCW rank 3 bound to socket 0[core 0[hwt 0]], socket
0[core 1[hwt 0]], socket 0[core 2[hwt 0]], so
cket 0[core 3[hwt 0]]:
[B/B/B/B/./././.][./././././././.][./././././././.][./././././././.]
[node03.cluster:15885] MCW rank 4 bound to socket 0[core 4[hwt 0]], socket
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
cket 0[core 7[hwt 0]]:
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:15885] MCW rank 5 bound to socket 0[core 4[hwt 0]], socket
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
cket 0[core 7[hwt 0]]:
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:15885] MCW rank 6 bound to socket 0[core 4[hwt 0]], socket
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
cket 0[core 7[hwt 0]]:
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:15885] MCW rank 7 bound to socket 0[core 4[hwt 0]], socket
0[core 5[hwt 0]], socket 0[core 6[hwt 0]], so
cket 0[core 7[hwt 0]]:
[././././B/B/B/B][./././././././.][./././././././.][./././././././.]
[node03.cluster:15885] MCW rank 0 bound to socket 0[core 0[hwt 0]], soc