Re: [OMPI devel] [patch] Bugs in mpi-f90-interfaces.h and its bridge implementation

2012-04-04 Thread Kawashima
Hi Jeff,

Jeffrey Squyres wrote:
> 
> On Apr 3, 2012, at 10:56 PM, Kawashima wrote:
> 
> > I and my coworkers checked mpi-f90-interfaces.h against MPI 2.2 standard
> > and found many bugs in it. Attached patches fix them for trunk.
> > Though some of them are trivial, others are not so trivial.
> > So I'll explain them below.
> 
> Excellent -- many thanks for these!
> 
> I have some notes on the specific patches, below, but first note the 
> following: Craig Rasmussen and I have been working on revamping the Open MPI 
> Fortran bindings for quite a while.  Here's a quick summary of the changes:
> 
> 1. two prototypes of the new MPI-3 mpi_f08 module
>a. Full implementation of all functions, not using F08 descriptors
>b. 6-function MPI using F08 descriptors (showing that it can be done) for 
> ifort
> 2. for the existing mpi module:
>a. New implementation using "ignore TKR" directives
>b. If your fortran compiler doesn't support "ignore TKR" directives (e.g., 
> gfortran), fall back and use the old mpi module implementation
> 3. wrapper compiler changes
>a. New "mpifort" wrapper compiler; all Fortran interfaces are available
>b. mpif77 and mpif90 are sym links to mpifort (and may disappear someday)
> 
> All of this work is available in a public bitbucket here:
> 
>https://bitbucket.org/jsquyres/mpi3-fortran

Great. I'll see it later.

> There's a linker error in there at the tip at the moment; Craig is working on 
> fixing it.  When that's done, we'll likely put out a final public test 
> tarball, and assuming that goes well, merge all this stuff into the OMPI SVN 
> trunk.  The SVN merge will likely be a little disruptive because the 
> directory structure of the Fortran bindings changed a bit (e.g., all 5 
> implementations are now under ompi/mpi/fortran).
> 
> The point is that I plan to bring in all your fixes to this bitbucket branch 
> so that all the new stuff and all your fixes come in to the trunk at the same 
> time.
> 
> 1.4 is dead; I doubt we'll be applying your fixes there.
> 
> 1.5 has transitioned to 1.6 (yesterday); I can look into making a patch for 
> the v1.6 series.  The tricky part is preserving the mpi module ABI between 
> 1.5.5 and 1.6.  We've done this before, though, so I think it'll be do-able.

Thanks for your explanation.
I know we are preparing v1.6 and v1.4 is not active.

> > 1. incorrect parameter types
> > 
> >  Two trivial parameter type mismatches.
> >  Fixed in my mpi-f90-interface.type-mismatch.patch.
> > 
> >  MPI_Cart_map periods: integer -> logical
> >  MPI_Reduce_scatter recvcounts: missing "dimension(*)"
> 
> Applied, and also made corresponding fixes to my ignore TKR mpi module (the 
> f08 module didn't have these issues).
> 
> > 2. incorrect intent against MPI 2.2 standard
> > 
> >  This is a somewhat complex issue.
> >  First, I'll cite MPI 2.2 standard below.
> > 
> >  2.3 in MPI 2.2 standard says:
> >There is one special case - if an argument is a handle to an opaque
> >object (these terms are defined in Section 2.5.1), and the object is
> >updated by the procedure call, then the argument is marked INOUT or OUT.
> >It is marked this way even though the handle itself is not modified -
> >we use the INOUT or OUT attribute to denote that what the handle
> >references is updated. Thus, in C++, IN arguments are usually either
> >references or pointers to const objects.
> > 
> >  2.3 in MPI 2.2 standard also says:
> >MPI's use of IN, OUT and INOUT is intended to indicate to the user
> >how an argument is to be used, but does not provide a rigorous
> >classification that can be translated directly into all language
> >bindings (e.g., INTENT in Fortran 90 bindings or const in C bindings).
> >For instance, the "constant" MPI_BOTTOM can usually be passed to
> >OUT buffer arguments. Similarly, MPI_STATUS_IGNORE can be passed as
> >the OUT status argument.
> > 
> >  16.2.4 in MPI 2.2 standard says:
> >Advice to implementors.
> >The appropriate INTENT may be different from what is given in the
> >MPI generic interface. Implementations must choose INTENT so that
> >the function adheres to the MPI standard.
> > 
> >  Hmm. intent in mpi-f90-interfaces.h does not necessarily match
> >  IN/OUT/INOUT in MPI 2.2, especially regarding opaque objects.
> >  mpi-f90-interfaces.h seems to have consideration of opaque objects
> >  partially, which is handled in r9922 and r23098.
> > 
> >  I compared MPI 2.2 ("standard") and mpi-f90-interfaces.h ("header")
> >  and classified problematic parameters into four types.
> > 
> >  Type A. Trivial error.
> > 
> >  These are fixed in my mpi-f90-interface.intent-error-a.patch to match
> >  the standard.
> > 
> > subroutine|  parameter  | standard | header
> >  -+-+--+
> >  MPI_Bcast| buffer  | inout| in
> 
> Bcast is a little tricky.  I think th

Re: [OMPI devel] [patch] Bugs in mpi-f90-interfaces.h and its bridge implementation

2012-04-04 Thread Jeffrey Squyres
On Apr 3, 2012, at 10:56 PM, Kawashima wrote:

> I and my coworkers checked mpi-f90-interfaces.h against MPI 2.2 standard
> and found many bugs in it. Attached patches fix them for trunk.
> Though some of them are trivial, others are not so trivial.
> So I'll explain them below.

Excellent -- many thanks for these!

I have some notes on the specific patches, below, but first note the following: 
Craig Rasmussen and I have been working on revamping the Open MPI Fortran 
bindings for quite a while.  Here's a quick summary of the changes:

1. two prototypes of the new MPI-3 mpi_f08 module
   a. Full implementation of all functions, not using F08 descriptors
   b. 6-function MPI using F08 descriptors (showing that it can be done) for 
ifort
2. for the existing mpi module:
   a. New implementation using "ignore TKR" directives
   b. If your fortran compiler doesn't support "ignore TKR" directives (e.g., 
gfortran), fall back and use the old mpi module implementation
3. wrapper compiler changes
   a. New "mpifort" wrapper compiler; all Fortran interfaces are available
   b. mpif77 and mpif90 are sym links to mpifort (and may disappear someday)

All of this work is available in a public bitbucket here:

   https://bitbucket.org/jsquyres/mpi3-fortran

There's a linker error in there at the tip at the moment; Craig is working on 
fixing it.  When that's done, we'll likely put out a final public test tarball, 
and assuming that goes well, merge all this stuff into the OMPI SVN trunk.  The 
SVN merge will likely be a little disruptive because the directory structure of 
the Fortran bindings changed a bit (e.g., all 5 implementations are now under 
ompi/mpi/fortran).

The point is that I plan to bring in all your fixes to this bitbucket branch so 
that all the new stuff and all your fixes come in to the trunk at the same time.

1.4 is dead; I doubt we'll be applying your fixes there.

1.5 has transitioned to 1.6 (yesterday); I can look into making a patch for the 
v1.6 series.  The tricky part is preserving the mpi module ABI between 1.5.5 
and 1.6.  We've done this before, though, so I think it'll be do-able.

> 1. incorrect parameter types
> 
>  Two trivial parameter type mismatches.
>  Fixed in my mpi-f90-interface.type-mismatch.patch.
> 
>  MPI_Cart_map periods: integer -> logical
>  MPI_Reduce_scatter recvcounts: missing "dimension(*)"

Applied, and also made corresponding fixes to my ignore TKR mpi module (the f08 
module didn't have these issues).

> 2. incorrect intent against MPI 2.2 standard
> 
>  This is a somewhat complex issue.
>  First, I'll cite MPI 2.2 standard below.
> 
>  2.3 in MPI 2.2 standard says:
>There is one special case - if an argument is a handle to an opaque
>object (these terms are defined in Section 2.5.1), and the object is
>updated by the procedure call, then the argument is marked INOUT or OUT.
>It is marked this way even though the handle itself is not modified -
>we use the INOUT or OUT attribute to denote that what the handle
>references is updated. Thus, in C++, IN arguments are usually either
>references or pointers to const objects.
> 
>  2.3 in MPI 2.2 standard also says:
>MPI's use of IN, OUT and INOUT is intended to indicate to the user
>how an argument is to be used, but does not provide a rigorous
>classification that can be translated directly into all language
>bindings (e.g., INTENT in Fortran 90 bindings or const in C bindings).
>For instance, the "constant" MPI_BOTTOM can usually be passed to
>OUT buffer arguments. Similarly, MPI_STATUS_IGNORE can be passed as
>the OUT status argument.
> 
>  16.2.4 in MPI 2.2 standard says:
>Advice to implementors.
>The appropriate INTENT may be different from what is given in the
>MPI generic interface. Implementations must choose INTENT so that
>the function adheres to the MPI standard.
> 
>  Hmm. intent in mpi-f90-interfaces.h does not necessarily match
>  IN/OUT/INOUT in MPI 2.2, especially regarding opaque objects.
>  mpi-f90-interfaces.h seems to have consideration of opaque objects
>  partially, which is handled in r9922 and r23098.
> 
>  I compared MPI 2.2 ("standard") and mpi-f90-interfaces.h ("header")
>  and classified problematic parameters into four types.
> 
>  Type A. Trivial error.
> 
>  These are fixed in my mpi-f90-interface.intent-error-a.patch to match
>  the standard.
> 
> subroutine|  parameter  | standard | header
>  -+-+--+
>  MPI_Bcast| buffer  | inout| in

Bcast is a little tricky.  I think the real solution is to remove the intent 
altogether from the buffer argument.  We talked about this issue a LOT in the 
MPI-3 Fortran WG.  

Here's the short version:

- intent(in): obvious
- intent(out): the compiler is actually allowed to zero the buffer (!), which 
we clearly don't want
- intent(inout): ...there is a reason that this is not desir

Re: [OMPI devel] mca_btl_tcp_alloc

2012-04-04 Thread Shamis, Pavel
> In mca_btl_tcp_alloc (openmpi-trunk/ompi/mca/btl/tcp/btl_tcp.c:188) the 
> first segment is initialized to point to "frag + 1".
> I don't get it... how/when is this location allocated? Isn't it just 
> after the mca_btl_tcp_frag_t structure ends?

Alex,
The frag allocation macros take the fragments from the free lists.
The free lists are created in function mca_btl_tcp_component_init().
As you will see there fragment size is mca_btl_tcp_frag_t + some_size.
frag + 1 , means skip the frag structure and jump to payload.

Bahazlaha ;-)

Pasha.

> 
> Thanks,
> Alex
> 
> mca_btl_base_descriptor_t* mca_btl_tcp_alloc(
> struct mca_btl_base_module_t* btl,
> struct mca_btl_base_endpoint_t* endpoint,
> uint8_t order,
> size_t size,
> uint32_t flags)
> {
> mca_btl_tcp_frag_t* frag = NULL;
> int rc;
> 
> if(size <= btl->btl_eager_limit) {
> MCA_BTL_TCP_FRAG_ALLOC_EAGER(frag, rc);
> } else if (size <= btl->btl_max_send_size) {
> MCA_BTL_TCP_FRAG_ALLOC_MAX(frag, rc);
> }
> if( OPAL_UNLIKELY(NULL == frag) ) {
> return NULL;
> }
> 
> frag->segments[0].seg_len = size;
> frag->segments[0].seg_addr.pval = frag+1;
> 
> frag->base.des_src = frag->segments;
> frag->base.des_src_cnt = 1;
> frag->base.des_dst = NULL;
> frag->base.des_dst_cnt = 0;
> frag->base.des_flags = flags;
> frag->base.order = MCA_BTL_NO_ORDER;
> frag->btl = (mca_btl_tcp_module_t*)btl;
> return (mca_btl_base_descriptor_t*)frag;
> }
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] mca_btl_tcp_alloc

2012-04-04 Thread Jeffrey Squyres
+1 on Rolf's explanation.

Additionally, note that you don't have to do it this way.  You can implement 
yours in whatever style you want; this is just the style we used for the TCP 
BTL.


On Apr 4, 2012, at 10:18 AM, Rolf vandeVaart wrote:

> Here is my explanation.  The call to MCA_BTL_TCP_FRAG_ALLOC_EAGER or 
> MCA_BTL_TCP_FRAG_ALLOC_MAX allocate a chunk of memory that has space for both 
> the fragment as well as any payload.  So, when we do the frag+1, we are 
> setting the pointer in the frag to point where the payload of the message 
> lives.  This payload contains the PML header information and potentially the 
> user's buffer.   So, that allocation is actually returning something like 64K 
> for the eager allocation and 128K for the max allocation.  If you look at 
> btl_tcp_component.c in the function mca_btl_tcp_component_init() you can see 
> where the eager and max free lists are initialized.
> 
> In the case of TCP, there are two segments.  The first segment will contain 
> the PML header information.   If the buffer being sent (or received) is 
> contiguous, then the rest of the space allocated is not used.  Rather, the 
> second segment will point to the user's buffer as there is no need to first 
> copy it into a buffer.  If the buffer being sent (or received) is 
> non-contiguous, then the data is first copied into the allocated space as it 
> needs to be packed.
> 
> Does that make sense?
> 
> 
>> -Original Message-
>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>> On Behalf Of Alex Margolin
>> Sent: Wednesday, April 04, 2012 9:23 AM
>> To: Open MPI Developers
>> Subject: [OMPI devel] mca_btl_tcp_alloc
>> 
>> Hi,
>> 
>> As I'm working out the bugs in my component I used TCP as reference and
>> came across the following:
>> In mca_btl_tcp_alloc (openmpi-trunk/ompi/mca/btl/tcp/btl_tcp.c:188) the
>> first segment is initialized to point to "frag + 1".
>> I don't get it... how/when is this location allocated? Isn't it just after 
>> the
>> mca_btl_tcp_frag_t structure ends?
>> 
>> Thanks,
>> Alex
>> 
>> mca_btl_base_descriptor_t* mca_btl_tcp_alloc(
>>struct mca_btl_base_module_t* btl,
>>struct mca_btl_base_endpoint_t* endpoint,
>>uint8_t order,
>>size_t size,
>>uint32_t flags)
>> {
>>mca_btl_tcp_frag_t* frag = NULL;
>>int rc;
>> 
>>if(size <= btl->btl_eager_limit) {
>>MCA_BTL_TCP_FRAG_ALLOC_EAGER(frag, rc);
>>} else if (size <= btl->btl_max_send_size) {
>>MCA_BTL_TCP_FRAG_ALLOC_MAX(frag, rc);
>>}
>>if( OPAL_UNLIKELY(NULL == frag) ) {
>>return NULL;
>>}
>> 
>>frag->segments[0].seg_len = size;
>>frag->segments[0].seg_addr.pval = frag+1;
>> 
>>frag->base.des_src = frag->segments;
>>frag->base.des_src_cnt = 1;
>>frag->base.des_dst = NULL;
>>frag->base.des_dst_cnt = 0;
>>frag->base.des_flags = flags;
>>frag->base.order = MCA_BTL_NO_ORDER;
>>frag->btl = (mca_btl_tcp_module_t*)btl;
>>return (mca_btl_base_descriptor_t*)frag; }
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ---
> This email message is for the sole use of the intended recipient(s) and may 
> contain
> confidential information.  Any unauthorized review, use, disclosure or 
> distribution
> is prohibited.  If you are not the intended recipient, please contact the 
> sender by
> reply email and destroy all copies of the original message.
> ---
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] mca_btl_tcp_alloc

2012-04-04 Thread Rolf vandeVaart
Here is my explanation.  The call to MCA_BTL_TCP_FRAG_ALLOC_EAGER or 
MCA_BTL_TCP_FRAG_ALLOC_MAX allocate a chunk of memory that has space for both 
the fragment as well as any payload.  So, when we do the frag+1, we are setting 
the pointer in the frag to point where the payload of the message lives.  This 
payload contains the PML header information and potentially the user's buffer.  
 So, that allocation is actually returning something like 64K for the eager 
allocation and 128K for the max allocation.  If you look at btl_tcp_component.c 
in the function mca_btl_tcp_component_init() you can see where the eager and 
max free lists are initialized.

In the case of TCP, there are two segments.  The first segment will contain the 
PML header information.   If the buffer being sent (or received) is contiguous, 
then the rest of the space allocated is not used.  Rather, the second segment 
will point to the user's buffer as there is no need to first copy it into a 
buffer.  If the buffer being sent (or received) is non-contiguous, then the 
data is first copied into the allocated space as it needs to be packed.

Does that make sense?


>-Original Message-
>From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
>On Behalf Of Alex Margolin
>Sent: Wednesday, April 04, 2012 9:23 AM
>To: Open MPI Developers
>Subject: [OMPI devel] mca_btl_tcp_alloc
>
>Hi,
>
>As I'm working out the bugs in my component I used TCP as reference and
>came across the following:
>In mca_btl_tcp_alloc (openmpi-trunk/ompi/mca/btl/tcp/btl_tcp.c:188) the
>first segment is initialized to point to "frag + 1".
>I don't get it... how/when is this location allocated? Isn't it just after the
>mca_btl_tcp_frag_t structure ends?
>
>Thanks,
>Alex
>
>mca_btl_base_descriptor_t* mca_btl_tcp_alloc(
> struct mca_btl_base_module_t* btl,
> struct mca_btl_base_endpoint_t* endpoint,
> uint8_t order,
> size_t size,
> uint32_t flags)
>{
> mca_btl_tcp_frag_t* frag = NULL;
> int rc;
>
> if(size <= btl->btl_eager_limit) {
> MCA_BTL_TCP_FRAG_ALLOC_EAGER(frag, rc);
> } else if (size <= btl->btl_max_send_size) {
> MCA_BTL_TCP_FRAG_ALLOC_MAX(frag, rc);
> }
> if( OPAL_UNLIKELY(NULL == frag) ) {
> return NULL;
> }
>
> frag->segments[0].seg_len = size;
> frag->segments[0].seg_addr.pval = frag+1;
>
> frag->base.des_src = frag->segments;
> frag->base.des_src_cnt = 1;
> frag->base.des_dst = NULL;
> frag->base.des_dst_cnt = 0;
> frag->base.des_flags = flags;
> frag->base.order = MCA_BTL_NO_ORDER;
> frag->btl = (mca_btl_tcp_module_t*)btl;
> return (mca_btl_base_descriptor_t*)frag; }
>___
>devel mailing list
>de...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/devel
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---



[OMPI devel] mca_btl_tcp_alloc

2012-04-04 Thread Alex Margolin

Hi,

As I'm working out the bugs in my component I used TCP as reference and 
came across the following:
In mca_btl_tcp_alloc (openmpi-trunk/ompi/mca/btl/tcp/btl_tcp.c:188) the 
first segment is initialized to point to "frag + 1".
I don't get it... how/when is this location allocated? Isn't it just 
after the mca_btl_tcp_frag_t structure ends?


Thanks,
Alex

mca_btl_base_descriptor_t* mca_btl_tcp_alloc(
struct mca_btl_base_module_t* btl,
struct mca_btl_base_endpoint_t* endpoint,
uint8_t order,
size_t size,
uint32_t flags)
{
mca_btl_tcp_frag_t* frag = NULL;
int rc;

if(size <= btl->btl_eager_limit) {
MCA_BTL_TCP_FRAG_ALLOC_EAGER(frag, rc);
} else if (size <= btl->btl_max_send_size) {
MCA_BTL_TCP_FRAG_ALLOC_MAX(frag, rc);
}
if( OPAL_UNLIKELY(NULL == frag) ) {
return NULL;
}

frag->segments[0].seg_len = size;
frag->segments[0].seg_addr.pval = frag+1;

frag->base.des_src = frag->segments;
frag->base.des_src_cnt = 1;
frag->base.des_dst = NULL;
frag->base.des_dst_cnt = 0;
frag->base.des_flags = flags;
frag->base.order = MCA_BTL_NO_ORDER;
frag->btl = (mca_btl_tcp_module_t*)btl;
return (mca_btl_base_descriptor_t*)frag;
}


Re: [OMPI devel] [EXTERNAL] Re: Developers Meeting

2012-04-04 Thread Josh Hursey
I second Oak Ridge (or even UTK) sometime in June.

-- Josh

On Tue, Apr 3, 2012 at 3:07 PM, Barrett, Brian W  wrote:
> On 4/3/12 11:08 AM, "Jeffrey Squyres"  wrote:
>
>>On Apr 3, 2012, at 11:44 AM, Barrett, Brian W wrote:
>>
>>> There is discussion of attempting to have a developers meeting this
>>> summer.  We haven't had one in a while and people thought it would be
>>>good
>>> to work through some of the ideas on how to implement features for 1.7.
>>> We don't have a location yet, but possibilities include Los Alamos and
>>>San
>>> Jose.  To help us get an idea of who can attend, please add your
>>> information to the doodle poll below.
>>>
>>>  http://www.doodle.com/cei3ve3qyeer9bv9
>>
>>
>>Since the meeting is likely to take a whole week, might I suggest making
>>each Doodle entry represent an entire week?  E.g., June 4-11, June 11-15,
>>etc.
>
> We talked about 3 days, so I was thinking that perhaps there were half
> weeks that worked well for people.
>
> Brian
>
> --
>  Brian W. Barrett
>  Dept. 1423: Scalable System Software
>  Sandia National Laboratories
>
>
>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



Re: [OMPI devel] Set alignment for openib internal buffers

2012-04-04 Thread Jeffrey Squyres
Mellanox is re-working the patch; the original commit violated several 
abstractions.  I hope they'll have a new patch soon, but I don't know the exact 
timeframe.

On Apr 4, 2012, at 4:19 AM, ludovic.hab...@ext.bull.net wrote:

> Hi everybody,
> 
> I've seen that some changes have been committed for buffer alignment in 
> openib. But that changeset have been revert.
> 
> Does somebody know if/when it will be added in the trunk ?
> 
> Thanks in advance,
> 
> Ludovic
> 
> changeset:   20220:ddbb0f344524
> user:miked
> date:Thu Mar 29 14:07:13 2012 +
> summary: revert previous commit
> 
> changeset:   20219:a5f81a77acc2
> parent:  20217:8dea4399f751
> user:miked
> date:Thu Mar 29 14:00:08 2012 +
> summary: performance fix: set alignment for openib internal buffers
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI devel] Set alignment for openib internal buffers

2012-04-04 Thread Ludovic . Hablot
Hi everybody,

I've seen that some changes have been committed for buffer alignment in openib. 
But that changeset have been revert.

Does somebody know if/when it will be added in the trunk ?

Thanks in advance,

Ludovic

changeset:   20220:ddbb0f344524
user:        miked
date:        Thu Mar 29 14:07:13 2012 +
summary:     revert previous commit

changeset:   20219:a5f81a77acc2
parent:      20217:8dea4399f751
user:        miked
date:        Thu Mar 29 14:00:08 2012 +
summary:     performance fix: set alignment for openib internal buffers