Re: [OMPI devel] autodetect broken

2009-07-22 Thread Jeff Squyres

Heh.  The list overrides the reply-to.  :-)

Agreed -- let's you, me, and Brian discuss off-list and let anyone who  
cares know what the final solution is that we come up with.


FWIW, we've become quite fond of developing in mercurial branches and  
pushing them somewhere to share when you want to get others opinions  
before committing to the trunk.  bitbucket.org, for example, offers  
free mercurial hosting.  I use it quite heavily.



On Jul 22, 2009, at 1:46 PM, Iain Bason wrote:



On Jul 22, 2009, at 10:55 AM, Brian W. Barrett wrote:

> The current autodetect implementation seems like the wrong approach
> to me. I'm rather unhappy the base functionality was hacked up like
> it was without any advance notice or questions about original
> design intent. We seem to have a set of base functions which are now
> more unreadable than before, overly complex, and which leak memory.

First, I should apologize for my procedural misstep.  I had asked my
group here at Sun whether I should do an RFC or something, and I guess
I didn't make my question clear enough.  I was under the impression
that I should check something in and let people comment on it.

That being said, I would argue that there are good reasons for adding
some complexity to the base component:

1. The pre-existing implementation of expansion is broken (although
the cases for which it is broken are somewhat obscure).

2. The autodetect component cannot set more than one directory without
some knowledge of the relationships amongst the various directories.
Giving it that knowledge would violate the independence of the
components.

You can see #1 by doing "OPAL_PREFIX='${datarootdir}/..'
OPAL_DATAROOTDIR='/opt/SUNWhpc/HPC8.2/share' mpirun hostname" (if you
have installed in /opt/SUNWhpc/HPC8.2).  Yes, I know, "Why would
anyone do that?"  Nonetheless, I find that a poor excuse for having a
bug in the code.

To expand on #2 a little, consider the case where OpenMPI is
configured with "--prefix=/path/one --libdir=/path/two".  We can tell
that libopen-pal is in /path/two, but it is not correct to assume that
the prefix is therefore /path.  (Unfortunately, there is code in
OpenMPI that does make that sort of assumption -- see orterun.c.)  We
need information from the config component to avoid making incorrect
assumptions.

There are, of course, alternate ways of getting to the same point.
But it is not feasible simply to leave the design of the base
component unchanged.  (More on that below.)

As for readability, I am always open to constructive suggestions as to
how to make code more readable.  I didn't fix the memory leak because
I hadn't yet found a way to do that without decreasing readability.

> The intent of the installdirs framework was to allow this type of
> behavior, but without rehacking all this infer crap into base.  The
> autodetect component should just set $prefix in the set of functions
> it returns (and possibly libdir and bindir if you really want, which
> might actually make sense if you guess wrong), and let the expansion
> code take over from there.  The general thought on how this would
> work went something like:
>
> - Run after config
> - If determine you have a special $prefix, set the
>   opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and
>   set your special prefix.
> - Same with bindir and libdir if needed
> - Let expansion (which runs after all components have had the
>   chance to fill in their fields) expand out with your special
>   data

If we run the autodetect component after config, and allow it to
override values that are already in opal_install_dirs, then there will
be no way for users to have environment variables take precedence.
(Let's say someone runs with OPAL_LIBDIR=/foo.  The autodetect
component will not know whether opal_install_dirs.libdir has been set
by the env component or by the config component.)

Moreover, if the user has used an environment variable to override one
of the paths in the config component, then the autodetect component
may make the wrong inference.  For example, let's say someone runs
with OPAL_LIBDIR=/foo.  The autodetect component finds libopen-pal in
/usr/renamed/lib, and sets opal_install_dirs.libdir to /usr/renamed/lib.
However, it has to use the config component's idea of libdir (e.g.,
${exec_prefix}/lib) to correctly infer that prefix should be
/usr/renamed.  Since it only has /foo from the environment variable, it
will have to decide that it cannot infer the prefix.

All of this will lead to behavior that users will have trouble
diagnosing.  While I appreciate simple code, I think that a simple
user interface is more important.

We could add some infrastructure so that the autodetect component can
figure out the provenance of each field in opal_install_dirs, but that
would make the boundary between the base component and the autodetect
component unclear.

> And the base stays simple, the components do all the heavy lifting,
> and life is happy.

Except in the cases where it doesn't work.

Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Ralph Castain
I apologize for coming to this late - IU's email forwarding jammed up  
yesterday, so I'm only now getting around to reading the backlog.


There has been some off-list discussion about advanced paffinity  
mappings/bindings. As I noted there, I am in the latter stages of  
completing a new mapper that allows users to easily specify #cpus to
"bind" to each process.


As part of that effort, we have interfaced to the slurm cpus_per_task  
and cpuset envars. So we should (once this gets done) pretty much  
handle the slurm environment in that regard.


Having worked on the paffinity issue for some time, I am somewhat  
strongly opinionated that PLPA is doing exactly what it should do. It  
is up to OMPI/ORTE to identify what cpusets were assigned to the job  
and figure out the mappings - the PLPA is there to tell us how many  
processors are available, how many are in each socket, etc., and to do  
the mechanics of binding the specified process to the specified cpus.


I would tend to oppose any change in the relative responsibilities of  
OMPI/ORTE and PLPA. It is a good division as it stands, and is working  
well. I haven't read anything in this chain that would change my  
opinion.


Just my $0.0002
Ralph

On Jul 22, 2009, at 11:22 AM, Jeff Squyres wrote:


On Jul 22, 2009, at 11:17 AM, Sylvain Jeaugey wrote:

I'm interested in joining the effort, since we will likely have the
same problem with SLURM's cpuset support.



Ok.

> But as to why it's getting EINVAL, that could be wonky.  We might
> want to take this to the PLPA list and have you run some small,
> non-MPI examples to ensure that PLPA is parsing your /sys tree
> properly, etc.
I don't see the /sys implication here. Can you be more precise on
which files are read to determine placement ?



Check in opal/mca/paffinity/linux/plpa/src/libplpa/plpa_map.c:load_cache().


IIRC, when you are inside a cpuset, you can see all cpus (/sys should
be unmodified) but calling sched_setaffinity with a mask containing a
cpu outside the cpuset will return EINVAL.



Ah, that could be the issue.


The only solution I see to solve
this would be to get the "allowed" cpus with sched_getaffinity,
which should be set according to the cpuset mask.




There are two issues here:

- what should OMPI do
- what should PLPA do

PLPA currently does two things:

1. provide a portable set/get affinity API (to isolate you from  
whatever version you have in your linux install)

2. provide topology mapping information (sockets, cores)

PLPA does not currently deal with cpusets.  If we want to expand  
PLPA to somehow interact with cpusets, that should probably be  
brought up on the PLPA mailing lists (someone made this suggestion  
to me about a month or two ago and I haven't had a chance to follow  
up on it :-( ).


OMPI (as a whole -- meaning: including the ORTE layer) does the  
following:


1. decide whether to bind MPI processes or not
2. if we do bind, use the paffinity module to bind processes to  
specific processors (the linux paffinity module uses PLPA to do the  
actual binding -- PLPA is wholly embedded inside OMPI's linux  
paffinity module)


And there's two layers involved here:

- the main ORTE logic saying both "yes, bind" and making the  
decision as to which processors to bind to
- the linux paffinity component does a thin layer of translation  
between ORTE's/OMPI's requests and calling the back-end PLPA library


As Ralph described, OMPI is currently fairly "dumb" about how it  
chooses which processors it uses -- 0 to N-1.  I think the issue  
here is to make OMPI smarter about how it chooses which processors  
to use.  It could be in ORTE itself, or it could be in the linux  
paffinity translation layer (e.g., linux paffinity component could  
report only as many processors as are available in the cpuset...?   
And binding could be relative to the cpuset...?).


--
Jeff Squyres
jsquy...@cisco.com

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] BTL receive callback

2009-07-22 Thread Don Kerr

Hello Sebastian,

Sounds like you are using the openib btl as a starting point, which is a 
good choice. I am curious whether you are indeed using a new 
interconnect (new hardware and protocol), or whether it is requirements 
of the 3D-torus network that are not addressed by the openib btl that 
are driving the need for a new btl?


-DON

On 07/21/09 11:55, Sebastian Rinke wrote:

Hello,
I am developing a new BTL component (Open MPI v1.3.2) for a new 
3D-torus interconnect. During a simple message transfer of 16362 B 
between two nodes with MPI_Send(), MPI_Recv() I encounter the following:


The sender:
---

1. prepare_src() size: 16304 reserve: 32
-> alloc() size: 16336
-> ompi_convertor_pack(): 16304
2. send()
3. component_progress()
-> send cb ()
-> free()
4. component_progress()
-> recv cb ()
-> prepare_src() size: 58 reserve: 32
-> alloc() size: 90
-> ompi_convertor_pack(): 58
-> free() size: 90 Send is missing !!!
5. NO PROGRESS

The receiver:
-

1. component_progress()
-> recv cb ()
-> alloc() size: 32
-> send()
2. component_progress()
-> send cb ()
-> free() size: 32
3. component_progress() for ever !!!

The problem is that after prepare_src() for the 2nd fragment, the
sender calls free() instead of send() in its recv cb. Thus, the 2nd
fragment is not being transmitted.
As a consequence, the receiver waits for the 2nd fragment.

I have found that mca_pml_ob1_recv_frag_callback_ack() is the
corresponding recv cb. Before diving into the ob1 code,
could you tell me under which conditions this cb calls free() instead 
of send()
so that I can get an idea of where to look for errors in my BTL 
component.


Thank you very much in advance.

Sebastian Rinke

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Bert Wesarg
On Wed, Jul 22, 2009 at 19:24, Jeff Squyres wrote:
> Bert -- is this functionality something we'd want to incorporate into PLPA?
What functionality? The complete libcpuset, or just 'get me the
cpuset mask of this task'? I don't think it's good if we duplicate the
whole functionality of libcpuset, and taking libcpuset as a
dependency of PLPA sounds too heavy. Actually, I really don't know yet.

That the cpuset should be taken into account in the decision process
of ORTE/OMPI is beyond question. But I think it suffices if ORTE/OMPI
query the current affinity mask and take only those processors into
account. Whoever sets this to something smaller than the cpuset mask,
or the online mask (in the absence of cpusets), either needs to live
with that a-priori decision or knows what he is doing (maybe a batch
system without cpuset support).

Bert


Re: [OMPI devel] autodetect broken

2009-07-22 Thread Iain Bason


On Jul 22, 2009, at 10:55 AM, Brian W. Barrett wrote:

The current autodetect implementation seems like the wrong approach  
to me. I'm rather unhappy the base functionality was hacked up like  
it was without any advance notice or questions about original
design intent. We seem to have a set of base functions which are now  
more unreadable than before, overly complex, and which leak memory.


First, I should apologize for my procedural misstep.  I had asked my  
group here at Sun whether I should do an RFC or something, and I guess  
I didn't make my question clear enough.  I was under the impression  
that I should check something in and let people comment on it.


That being said, I would argue that there are good reasons for adding  
some complexity to the base component:


1. The pre-existing implementation of expansion is broken (although  
the cases for which it is broken are somewhat obscure).


2. The autodetect component cannot set more than one directory without  
some knowledge of the relationships amongst the various directories.   
Giving it that knowledge would violate the independence of the  
components.


You can see #1 by doing "OPAL_PREFIX='${datarootdir}/..'  
OPAL_DATAROOTDIR='/opt/SUNWhpc/HPC8.2/share' mpirun hostname" (if you  
have installed in /opt/SUNWhpc/HPC8.2).  Yes, I know, "Why would  
anyone do that?"  Nonetheless, I find that a poor excuse for having a  
bug in the code.


To expand on #2 a little, consider the case where OpenMPI is  
configured with "--prefix=/path/one --libdir=/path/two".  We can tell  
that libopen-pal is in /path/two, but it is not correct to assume that  
the prefix is therefore /path.  (Unfortunately, there is code in  
OpenMPI that does make that sort of assumption -- see orterun.c.)  We  
need information from the config component to avoid making incorrect  
assumptions.


There are, of course, alternate ways of getting to the same point.   
But it is not feasible simply to leave the design of the base  
component unchanged.  (More on that below.)


As for readability, I am always open to constructive suggestions as to  
how to make code more readable.  I didn't fix the memory leak because  
I hadn't yet found a way to do that without decreasing readability.


The intent of the installdirs framework was to allow this type of  
behavior, but without rehacking all this infer crap into base.  The  
autodetect component should just set $prefix in the set of functions  
it returns (and possibly libdir and bindir if you really want, which  
might actually make sense if you guess wrong), and let the expansion  
code take over from there.  The general thought on how this would  
work went something like:


- Run after config
- If determine you have a special $prefix, set the
  opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and
  set your special prefix.
- Same with bindir and libdir if needed
- Let expansion (which runs after all components have had the
  chance to fill in their fields) expand out with your special
  data


If we run the autodetect component after config, and allow it to  
override values that are already in opal_install_dirs, then there will  
be no way for users to have environment variables take precedence.   
(Let's say someone runs with OPAL_LIBDIR=/foo.  The autodetect  
component will not know whether opal_install_dirs.libdir has been set  
by the env component or by the config component.)


Moreover, if the user has used an environment variable to override one  
of the paths in the config component, then the autodetect component  
may make the wrong inference.  For example, let's say someone runs  
with OPAL_LIBDIR=/foo.  The autodetect component finds libopen-pal in
/usr/renamed/lib, and sets opal_install_dirs.libdir to /usr/renamed/lib.
However, it has to use the config component's idea of libdir (e.g.,
${exec_prefix}/lib) to correctly infer that prefix should be
/usr/renamed.  Since it only has /foo from the environment variable, it
will have to decide that it cannot infer the prefix.


All of this will lead to behavior that users will have trouble  
diagnosing.  While I appreciate simple code, I think that a simple  
user interface is more important.


We could add some infrastructure so that the autodetect component can  
figure out the provenance of each field in opal_install_dirs, but that  
would make the boundary between the base component and the autodetect  
component unclear.


And the base stays simple, the components do all the heavy lifting,  
and life is happy.


Except in the cases where it doesn't work.

 I would not be opposed to putting in a "find expanded part" type
function that takes two strings like "${prefix}/lib" and
"/usr/local/lib" and returns "/usr/local" being added to the base so
that other autodetect-style components don't need to handle such a
case, but that's about the extent of the base changes I think are
appropriate.


Finally, a first quick code review reveals a couple of problems.

Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Jeff Squyres
Bert -- is this functionality something we'd want to incorporate into  
PLPA?


On Jul 22, 2009, at 1:13 PM, Bert Wesarg wrote:

On Wed, Jul 22, 2009 at 18:55, Bert Wesarg wrote:

> I don't know of any C interface to get a task's cpuset mask (ok,
> libcpuset
Just an amendment to give the url to the libcpuset homepage:

http://oss.sgi.com/projects/cpusets/

>
> Bert
>>
>> Sylvain
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Jeff Squyres

On Jul 22, 2009, at 11:17 AM, Sylvain Jeaugey wrote:

I'm interested in joining the effort, since we will likely have the
same problem with SLURM's cpuset support.



Ok.

> But as to why it's getting EINVAL, that could be wonky.  We might
> want to take this to the PLPA list and have you run some small,
> non-MPI examples to ensure that PLPA is parsing your /sys tree
> properly, etc.
I don't see the /sys implication here. Can you be more precise on
which files are read to determine placement ?



Check in opal/mca/paffinity/linux/plpa/src/libplpa/plpa_map.c:load_cache().


IIRC, when you are inside a cpuset, you can see all cpus (/sys should
be unmodified) but calling sched_setaffinity with a mask containing a
cpu outside the cpuset will return EINVAL.



Ah, that could be the issue.


The only solution I see to solve
this would be to get the "allowed" cpus with sched_getaffinity,
which should be set according to the cpuset mask.




There are two issues here:

- what should OMPI do
- what should PLPA do

PLPA currently does two things:

1. provide a portable set/get affinity API (to isolate you from  
whatever version you have in your linux install)

2. provide topology mapping information (sockets, cores)

PLPA does not currently deal with cpusets.  If we want to expand PLPA  
to somehow interact with cpusets, that should probably be brought up  
on the PLPA mailing lists (someone made this suggestion to me about a  
month or two ago and I haven't had a chance to follow up on it :-( ).


OMPI (as a whole -- meaning: including the ORTE layer) does the  
following:


1. decide whether to bind MPI processes or not
2. if we do bind, use the paffinity module to bind processes to  
specific processors (the linux paffinity module uses PLPA to do the  
actual binding -- PLPA is wholly embedded inside OMPI's linux  
paffinity module)


And there's two layers involved here:

- the main ORTE logic saying both "yes, bind" and making the decision  
as to which processors to bind to
- the linux paffinity component does a thin layer of translation  
between ORTE's/OMPI's requests and calling the back-end PLPA library


As Ralph described, OMPI is currently fairly "dumb" about how it  
chooses which processors it uses -- 0 to N-1.  I think the issue here  
is to make OMPI smarter about how it chooses which processors to use.   
It could be in ORTE itself, or it could be in the linux paffinity  
translation layer (e.g., linux paffinity component could report only  
as many processors as are available in the cpuset...?  And binding  
could be relative to the cpuset...?).


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Bert Wesarg
On Wed, Jul 22, 2009 at 18:55, Bert Wesarg wrote:
> I don't know of any C interface to get a task's cpuset mask (ok,
> libcpuset
Just an amendment to give the url to the libcpuset homepage:

http://oss.sgi.com/projects/cpusets/

>
> Bert
>>
>> Sylvain
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>


Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Bert Wesarg
On Wed, Jul 22, 2009 at 17:17, Sylvain Jeaugey wrote:
> Hi Jeff,
>
> I'm interested in joining the effort, since we will likely have the same
> problem with SLURM's cpuset support.
>
> On Wed, 22 Jul 2009, Jeff Squyres wrote:
>
>> But as to why it's getting EINVAL, that could be wonky.  We might want to
>> take this to the PLPA list and have you run some small, non-MPI examples to
>> ensure that PLPA is parsing your /sys tree properly, etc.
>
> I don't see the /sys implication here. Can you be more precise on which
> files are read to determine placement ?
Most files under /sys/devices/system/cpu/cpu*/topology/*

>
> IIRC, when you are inside a cpuset, you can see all cpus (/sys should be
> unmodified) but calling set_schedaffinity with a mask containing a cpu
> outside the cpuset will return EINVAL.
No. The Linux kernel ANDs the given affinity mask with the cpumask of
the task's cpuset; if no cpuset is involved it uses the online mask.
If the resulting mask is empty, it returns -EINVAL.

> The only solution I see to solve this
> would be to get the "allowed" cpus with sched_getaffinity, which should be
> set according to the cpuset mask.
sched_getaffinity() doesn't return the cpuset mask. It returns the
mask on which the task can run, which is a subset of the cpuset. Also,
the initial mask of the task (right after exec) need not be the cpuset
mask, because the affinity mask is inherited from the parent.

I don't know of any C interface to get a task's cpuset mask (ok,
libcpuset; it looks like this lib is in Debian now, note to myself:
check this). The Cpus_allowed* fields in /proc/<pid>/status are the
same as what sched_getaffinity returns, and /proc/<pid>/cpuset needs
to be resolved, i.e. where is the cpuset fs mounted?

Bert
>
> Sylvain
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Sylvain Jeaugey

Hi Jeff,

I'm interested in joining the effort, since we will likely have the same 
problem with SLURM's cpuset support.


On Wed, 22 Jul 2009, Jeff Squyres wrote:

But as to why it's getting EINVAL, that could be wonky.  We might want to 
take this to the PLPA list and have you run some small, non-MPI examples to 
ensure that PLPA is parsing your /sys tree properly, etc.

I don't see the /sys implication here. Can you be more precise on which 
files are read to determine placement ?


IIRC, when you are inside a cpuset, you can see all cpus (/sys should be 
unmodified) but calling sched_setaffinity with a mask containing a cpu 
outside the cpuset will return EINVAL. The only solution I see to solve 
this would be to get the "allowed" cpus with sched_getaffinity, 
which should be set according to the cpuset mask.


Sylvain


Re: [OMPI devel] pb with --enable-mpi-threads --enable-progress-threads options

2009-07-22 Thread Ralph Castain
Progress thread support currently does not work, and may never be fully
implemented. If you remove that configure option, it should work.

I'm pretty sure we only left that option so developers could play at fixing
it, though I don't know of anyone actually making the attempt at the moment
(certainly, it would require significant changes to ORTE).

From: Bernard Secher - SFME/LGLS (bernard.secher_at_[hidden])
Date: 2009-07-22 06:29:32


 Hi,

I have added the two following options: --enable-mpi-threads
--enable-progress-threads in configure step of openmpi-1.3.3.

After install, mpirun command doesn't work on a very simple mpi program.
There is a dead lock and program is not executed.
Bernard


Re: [OMPI devel] OpenMPI, PLPA and Linux cpuset/cgroup support

2009-07-22 Thread Jeff Squyres
I'm the "primary PLPA" guy that Ralph referred to, and I was on  
vacation last week -- sorry for missing all the chatter.


Based on your mails, it looks like you're out this week -- so little  
will likely occur.  I'm at the MPI Forum standards meeting next week,  
so my replies to email will be sporadic.


OMPI is pretty much directly calling PLPA to set affinity for  
"processors" 0, 1, 2, 3 (which PLPA translates into Linux virtual  
processor IDs, and then invokes sched_setaffinity with each of those  
IDs).


Note that the EFAULT errors you're seeing in the output are  
deliberate.  PLPA has to "probe" the kernel to see what flavor of API  
it uses.  Based on the error codes that comes back, it knows which  
flavor to use when actually invoking the syscall for  
sched_setaffinity.  So you can ignore those EFAULT's.


But as to why it's getting EINVAL, that could be wonky.  We might want  
to take this to the PLPA list and have you run some small, non-MPI  
examples to ensure that PLPA is parsing your /sys tree properly, etc.


Ping when you get back from vacation.



On Jul 19, 2009, at 8:14 PM, Chris Samuel wrote:



- "Ralph Castain"  wrote:

> Should just be
>
> -mca paffinity_base_verbose 5
>
> Any value greater than 4 should turn it "on"

Yup, that's what I was trying, but couldn't get any output.

> Something I should have mentioned. The paffinity_base_service.c file
> is solely used by the rank_file syntax. It has nothing to do with
> setting mpi_paffinity_alone and letting OMPI self-determine the
> process-to-core binding.

That would explain why I'm not seeing any output from it then -- it
and the solaris module are the only ones containing any opal_output()
statements in the paffinity section of MCA.

I'll try scattering some opal_output()'s into the linux module
instead along the same lines as the base module.

> You want to dig into the linux module code that calls down
> into the plpa. The same mca param should give you messages
> from the module, and -might- give you messages from inside
> plpa (not sure of the latter).

The PLPA output is not run time selectable:

#if defined(PLPA_DEBUG) && PLPA_DEBUG && 0

:-)

cheers,
Chris
--
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] autodetect broken

2009-07-22 Thread Brian W. Barrett
The current autodetect implementation seems like the wrong approach to me. 
I'm rather unhappy the base functionality was hacked up like it was 
without any advance notice or questions about original design intent. 
We seem to have a set of base functions which are now more unreadable than 
before, overly complex, and which leak memory.


The intent of the installdirs framework was to allow this type of 
behavior, but without rehacking all this infer crap into base.  The 
autodetect component should just set $prefix in the set of functions it 
returns (and possibly libdir and bindir if you really want, which might 
actually make sense if you guess wrong), and let the expansion code take 
over from there.  The general thought on how this would work went 
something like:


 - Run after config
 - If determine you have a special $prefix, set the
   opal_install_dirs.prefix to NULL (yes, it's a bit of a hack) and
   set your special prefix.
 - Same with bindir and libdir if needed
 - Let expansion (which runs after all components have had the
   chance to fill in their fields) expand out with your special
   data

And the base stays simple, the components do all the heavy lifting, and 
life is happy.  I would not be opposed to putting in a "find expanded 
part" type function that takes two strings like "${prefix}/lib" and 
"/usr/local/lib" and returns "/usr/local" being added to the base so that 
other autodetect-style components don't need to handle such a case, but 
that's about the extent of the base changes I think are appropriate.


Finally, a first quick code review reveals a couple of problems:

 - We don't AC_SUBST variables adding .lo files to build sources in
   OMPI.  Instead, we use AM_CONDITIONALS to add sources as needed.
 - Obviously, there's a problem with the _happy variable name
   consistency in configure.m4
 - There's a naming convention issue - files should all start with
   opal_installdirs_autodetect_, and a number of the added files
   do not.
 - From a finding code standpoint, I'd rather walkcontext.c and
   backtrace.c be one file with #ifs - for such short functions,
   it makes it more obvious that it's just portability implementations
   of the same function.

I'd be happy to discuss the issues further or review any code before it 
gets committed.  But I think the changes as they exist today (even with 
bugs fixed) aren't consistent with what the installdirs framework was 
trying to accomplish and should be reworked.


Brian


Re: [OMPI devel] default btl eager_limit

2009-07-22 Thread Terry Dontje

Jeff Squyres wrote:
Just to follow up for the web archives -- we discussed this on the 
teleconf yesterday and decided that the assert()'s were not the way to 
go.  Brian was going to hack up a quick check at the end of OB1 
add_procs that checks each btl's eager_limit, etc.  Terry would expand 
this to cover dr and csum.


I've received the change from Brian and am working on porting it across 
all the other PMLs.


--td


On Jul 16, 2009, at 10:10 AM, Terry Dontje wrote:


Another way to do this, which I am not sure makes sense, is to just add
sizeof(mca_pml_ob1_hdr_t) to the btl_eager_limit passed in by the user,
thus defining the limit to apply specifically to the user data and not
the internal headers, which the user may not have any inkling about.
However, that may lead the user to not realize there is a man behind
the curtain bumping up the limit for the internal headers.

--td

Terry Dontje wrote:
> I was playing around with some really silly fragment sizes (sub 72
> bytes) when I ran into some asserts in the btl_openib_sendi.  I traced
> the assert to be caused by mca_pml_ob1_send_request_start_btl()
> calculating the true eager_limit with the following line:
>
>   size_t eager_limit = btl->btl_eager_limit - sizeof(mca_pml_ob1_hdr_t);

>
> If btl_eager_limit ends up being less than the
> sizeof(mca_pml_ob1_hdr_t) the eager_limit calculated results in a very
> large number and an assert later on in the stack.
>
> It seems to me that it would be nice to insert some checks in
> mca_btl_base_param_register() to make sure btl_eager_limit is >
> sizeof(mca_pml_ob1_hdr_t).  Am I missing a reason why this was not
> done in the first place?
>
> --td
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel








Re: [OMPI devel] default btl eager_limit

2009-07-22 Thread Jeff Squyres
Just to follow up for the web archives -- we discussed this on the  
teleconf yesterday and decided that the assert()'s were not the way to  
go.  Brian was going to hack up a quick check at the end of OB1  
add_procs that checks each btl's eager_limit, etc.  Terry would expand  
this to cover dr and csum.



On Jul 16, 2009, at 10:10 AM, Terry Dontje wrote:


Another way to do this, which I am not sure makes sense, is to just add
sizeof(mca_pml_ob1_hdr_t) to the btl_eager_limit passed in by the
user, thus defining the limit to apply specifically to the user data
and not to the internal headers, which the user may not have any
inkling about.  However, that may lead the user to not realize there
is a man behind the curtain bumping up the limit for the internal
headers.

--td

Terry Dontje wrote:
> I was playing around with some really silly fragment sizes (sub 72
> bytes) when I ran into some asserts in the btl_openib_sendi.  I traced

> the assert to be caused by mca_pml_ob1_send_request_start_btl()
> calculating the true eager_limit with the following line:
>
>   size_t eager_limit = btl->btl_eager_limit - sizeof(mca_pml_ob1_hdr_t);

>
> If btl_eager_limit ends up being less than
> sizeof(mca_pml_ob1_hdr_t), the calculated eager_limit wraps around to a
> very large number (size_t is unsigned) and triggers an assert later in
> the stack.
>
> It seems to me that it would be nice to insert some checks in
> mca_btl_base_param_register() to make sure btl_eager_limit is >
> sizeof(mca_pml_ob1_hdr_t).  Am I missing a reason why this was not
> done in the first place?
>
> --td





--
Jeff Squyres
jsquy...@cisco.com



[OMPI devel] pb with --enable-mpi-threads --enable-progress-threads options

2009-07-22 Thread Bernard Secher - SFME/LGLS

Hi,

I have added the following two options in the configure step of
openmpi-1.3.3: --enable-mpi-threads --enable-progress-threads.


After installing, the mpirun command doesn't work on a very simple MPI
program.  There is a deadlock and the program is not executed.

Bernard
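
[Editor's note] For reference, the reported build configuration would look roughly like this; the prefix path and program name are hypothetical, and this is a sketch of the reproduction, not a verified recipe:

```shell
# Hypothetical reproduction against an openmpi-1.3.3 source tree:
./configure --prefix=/opt/openmpi-1.3.3 \
            --enable-mpi-threads \
            --enable-progress-threads
make all install

# Reported symptom: even a trivial MPI program hangs under mpirun.
mpirun -np 2 ./hello_mpi
```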



Re: [OMPI devel] fortran MPI_COMPLEX datatype broken

2009-07-22 Thread Jeff Squyres

A little more data...

ompi_datatype_module.c:442 says

#if 0 /* XXX TODO The following may be deleted, both CXX and F77/F90 complex types are statically set up */

...followed by code to initialize ompi_mpi_cplx (i.e., MPI_COMPLEX).

(another TODO!!)

But ompi_mpi_cplex is set up with:

ompi_predefined_datatype_t ompi_mpi_cplex =
    OMPI_DATATYPE_INIT_DEFER(COMPLEX, OMPI_DATATYPE_FLAG_DATA_FORTRAN |
                             OMPI_DATATYPE_FLAG_DATA_COMPLEX);


and OMPI_DATATYPE_INIT_DEFER has a comment above it saying:

/*
 * Initilization for these types is deferred until runtime.
 *
 * Using this macro implies that at this point not all informations needed
 * to fill up the datatype are known. We fill them with zeros and then later
 * when the datatype engine will be initialized we complete with the
 * correct information. This macro should be used for all composed types.
 */

So the first comment (claiming these types are statically set up) is clearly wrong.

Presumably, ompi_mpi_cplx (and friends) *do* need to be set up
dynamically at runtime, and the code must be fixed to do so.





On Jul 21, 2009, at 8:51 PM, Jeff Squyres (jsquyres) wrote:


On Jul 21, 2009, at 8:44 PM, Jeff Squyres (jsquyres) wrote:

> The extent for MPI_COMPLEX is returning 0.
>


Sorry -- I accidentally hit "send" way before I finished typing.  :-\

You can reproduce the problem with a trivial program:

-
#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
 MPI_Aint extent;
 MPI_Init(NULL, NULL);
 MPI_Type_extent(MPI_COMPLEX, &extent);
 printf("Got extent: %d\n", (int) extent);
 MPI_Finalize();
 return 0;
}
-

This is an OMPI that was compiled with Fortran support.  If I break at
MPI_Type_extent in gdb, here's what *type is:

-
(gdb) p *type
$1 = {super = {super = {obj_magic_id = 16046253926196952813,
   obj_class = 0x2a95aa0520, obj_reference_count = 1,
   cls_init_file_name = 0x2a95626ce0 "ompi_datatype_module.c",
   cls_init_lineno = 134}, flags = 63011, id = 0, bdt_used = 0, size = 0,
 true_lb = 0, true_ub = 0, lb = 0, ub = 0, align = 0, nbElems = 1,
 name = "OPAL_UNAVAILABLE", '\0' , desc = {length = 1,
   used = 1, desc = 0x2a95ac4640}, opt_desc = {length = 1, used = 1,
   desc = 0x2a95ac4640}, btypes = {0 }}, id = 25,
   d_f_to_c_index = 18, d_keyhash = 0x0, args = 0x0,
   packed_description = 0x0,
   name = "MPI_COMPLEX", '\0' }
-

The OPAL_UNAVAILABLE looks ominous...?  When I do the same thing with
MPI_INTEGER, it doesn't say OPAL_UNAVAILABLE:

-
(gdb) p *type
$2 = {super = {super = {obj_magic_id = 16046253926196952813,
   obj_class = 0x2a95aa0520, obj_reference_count = 1,
   cls_init_file_name = 0x2a95626ce0 "ompi_datatype_module.c",
   cls_init_lineno = 131}, flags = 55094, id = 6, bdt_used = 64, size = 4,
 true_lb = 0, true_ub = 4, lb = 0, ub = 4, align = 4, nbElems = 1,
 name = "OPAL_INT4", '\0' , desc = {length = 1,
   used = 1, desc = 0x2a95777920}, opt_desc = {length = 1, used = 1,
   desc = 0x2a95777920}, btypes = {0, 0, 0, 0, 0, 0, 1,
   0 }}, id = 22, d_f_to_c_index = 7, d_keyhash = 0x0,
   args = 0x0, packed_description = 0x0,
   name = "MPI_INTEGER", '\0' }
-

Note that configure was happy with all the COMPLEX datatypes;
config.out and config.log attached.  This was with gcc 3.4 on RHEL4.

--
Jeff Squyres
jsquy...@cisco.com





--
Jeff Squyres
jsquy...@cisco.com