Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread Paul Hargrove
Well, the contents of opal/asm/asm-data.txt and the arch-specific subdirs
below opal/include/opal/sys have served me as a list of the atomics
implementations.  If those include architectures no longer officially
supported, then some cleanup may be in order (as SPARC_v8 was recently
removed from asm-data.txt).

-Paul


On Mon, Aug 11, 2014 at 11:44 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> I think the closest thing we have to a supported architecture list is in
> the README.
>
>
> On Aug 11, 2014, at 2:42 PM, Nathan Hjelm  wrote:
>
> >
> > Which brings us back to Dave's question. Is there a list of supported
> > architectures? I don't want to bother with DEC Alpha if we no longer
> > support it.
> >
> > BTW, so far I have converted: AMD64, IA32, ARM. Working on IA64 now.
> >
> > -Nathan
> >
> > On Mon, Aug 11, 2014 at 01:57:21PM -0400, George Bosilca wrote:
> >>   Dave,
> >>   We all understand your concerns. However, the current issue has nothing
> >>   to do with Nathan, the code for supporting ARMv5 is already in the patch
> >>   I submitted and that Paul validated.
> >>   What Nathan said he might take a look at is a different method for
> >>   generating assembly code, one that only supports ARMv7 and later.
> >> George.
> >>
> >>   On Mon, Aug 11, 2014 at 1:51 PM, Dave Goodell (dgoodell)
> >>    wrote:
> >>
> >> On Aug 11, 2014, at 11:54 AM, Paul Hargrove wrote:
> >>
> >>> I am on the same page with George here - if it's on the list then
> >>> support it until it's been removed.
> >>>
> >>> I happen to have systems to test, I believe, every supported atomics
> >>> implementation except for DEC Alpha, and so I did test them all.
> >>
> >> My comment was not intended to indicate that I don't value your testing
> >> contributions, Paul.  I am more concerned that Nathan is wasting time
> >> fixing support for an effectively useless platform.  It's not like this
> >> is a case where making the more portable change improves our general
> >> correctness on other platforms; it's a very (<= ARMv5)-specific
> >> situation.
> >>
> >> If there's actually an official list of supported platforms somewhere,
> >> then I'll let Nathan decide whether he wants to submit an RFC to drop
> >> ARMv5 support.  I know I'd support it, but I don't care enough to write
> >> an RFC of my own right now.
> >> -Dave
> >>
> >> ___
> >> devel mailing list
> >> de...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >> Link to this post:
> >> http://www.open-mpi.org/community/lists/devel/2014/08/15618.php
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread Jeff Squyres (jsquyres)
I think the closest thing we have to a supported architecture list is in the 
README.


On Aug 11, 2014, at 2:42 PM, Nathan Hjelm  wrote:

> 
> Which brings us back to Dave's question. Is there a list of supported
> architectures? I don't want to bother with DEC Alpha if we no longer
> support it.
> 
> BTW, so far I have converted: AMD64, IA32, ARM. Working on IA64 now.
> 
> -Nathan
> 
> On Mon, Aug 11, 2014 at 01:57:21PM -0400, George Bosilca wrote:
>>   Dave,
>>   We all understand your concerns. However, the current issue has nothing to
>>   do with Nathan, the code for supporting ARMv5 is already in the patch I
>>   submitted and that Paul validated.
>>   What Nathan said he might take a look at is a different method for
>>   generating assembly code, one that only supports ARMv7 and later.
>> George.
>> 
>>   On Mon, Aug 11, 2014 at 1:51 PM, Dave Goodell (dgoodell)
>>    wrote:
>> 
>> On Aug 11, 2014, at 11:54 AM, Paul Hargrove  wrote:
>> 
>>> I am on the same page with George here - if it's on the list then
>>> support it until it's been removed.
>>>
>>> I happen to have systems to test, I believe, every supported atomics
>>> implementation except for DEC Alpha, and so I did test them all.
>> 
>> My comment was not intended to indicate that I don't value your testing
>> contributions, Paul.  I am more concerned that Nathan is wasting time
>> fixing support for an effectively useless platform.  It's not like this
>> is a case where making the more portable change improves our general
>> correctness on other platforms; it's a very (<= ARMv5)-specific
>> situation.
>> 
>> If there's actually an official list of supported platforms somewhere,
>> then I'll let Nathan decide whether he wants to submit an RFC to drop
>> ARMv5 support.  I know I'd support it, but I don't care enough to write
>> an RFC of my own right now.
>> -Dave
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/08/15618.php
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread Nathan Hjelm

Which brings us back to Dave's question. Is there a list of supported
architectures? I don't want to bother with DEC Alpha if we no longer
support it.

BTW, so far I have converted: AMD64, IA32, ARM. Working on IA64 now.

-Nathan
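
For context on the operation named in the subject line, a compare-and-swap that returns the old value can be sketched with C11 atomics as below. This is only an illustration and not code from the patch under discussion: the names cas_return_old_32 and fetch_add_via_cas_32 are hypothetical and are not Open MPI symbols, and the real implementations are per-architecture.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not an Open MPI function): atomically replace
 * *addr with `desired` if it currently equals `expected`, and return
 * the value that was in *addr before the operation. */
static inline int32_t cas_return_old_32(_Atomic int32_t *addr,
                                        int32_t expected,
                                        int32_t desired)
{
    /* On failure, atomic_compare_exchange_strong stores the value it
     * observed into `expected`; on success `expected` already equals
     * the old value. Either way `expected` now holds the old value. */
    atomic_compare_exchange_strong(addr, &expected, desired);
    return expected;
}

/* Hypothetical example of why returning the old value is convenient:
 * a fetch-and-add retry loop that never has to reload *addr itself. */
static inline int32_t fetch_add_via_cas_32(_Atomic int32_t *addr,
                                           int32_t delta)
{
    int32_t old = atomic_load(addr);
    for (;;) {
        int32_t seen = cas_return_old_32(addr, old, old + delta);
        if (seen == old) {
            return old;   /* swap succeeded */
        }
        old = seen;       /* retry with the value another thread wrote */
    }
}
```

A caller detects success by comparing the returned value with the expected one; a boolean-returning compare-and-swap would need a separate load to learn the current value after a failure.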

On Mon, Aug 11, 2014 at 01:57:21PM -0400, George Bosilca wrote:
>Dave,
>We all understand your concerns. However, the current issue has nothing to
>do with Nathan, the code for supporting ARMv5 is already in the patch I
>submitted and that Paul validated.
>What Nathan said he might take a look at is a different method for
>generating assembly code, one that only supports ARMv7 and later.
>  George.
> 
>On Mon, Aug 11, 2014 at 1:51 PM, Dave Goodell (dgoodell)
> wrote:
> 
>  On Aug 11, 2014, at 11:54 AM, Paul Hargrove  wrote:
> 
>  > I am on the same page with George here - if it's on the list then
>  > support it until it's been removed.
>  >
>  > I happen to have systems to test, I believe, every supported atomics
>  > implementation except for DEC Alpha, and so I did test them all.
> 
>  My comment was not intended to indicate that I don't value your testing
>  contributions, Paul.  I am more concerned that Nathan is wasting time
>  fixing support for an effectively useless platform.  It's not like this
>  is a case where making the more portable change improves our general
>  correctness on other platforms; it's a very (<= ARMv5)-specific
>  situation.
> 
>  If there's actually an official list of supported platforms somewhere,
>  then I'll let Nathan decide whether he wants to submit an RFC to drop
>  ARMv5 support.  I know I'd support it, but I don't care enough to write
>  an RFC of my own right now.
>  -Dave
> 
>  ___
>  devel mailing list
>  de...@open-mpi.org
>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>  Link to this post:
>  http://www.open-mpi.org/community/lists/devel/2014/08/15618.php






Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread George Bosilca
Dave,

We all understand your concerns. However, the current issue has nothing to
do with Nathan, the code for supporting ARMv5 is already in the patch I
submitted and that Paul validated.

What Nathan said he might take a look at is a different method for
generating assembly code, one that only supports ARMv7 and later.

  George.



On Mon, Aug 11, 2014 at 1:51 PM, Dave Goodell (dgoodell)  wrote:

> On Aug 11, 2014, at 11:54 AM, Paul Hargrove  wrote:
>
> > I am on the same page with George here - if it's on the list then
> > support it until it's been removed.
> >
> > I happen to have systems to test, I believe, every supported atomics
> > implementation except for DEC Alpha, and so I did test them all.
>
> My comment was not intended to indicate that I don't value your testing
> contributions, Paul.  I am more concerned that Nathan is wasting time
> fixing support for an effectively useless platform.  It's not like this is
> a case where making the more portable change improves our general
> correctness on other platforms; it's a very (<= ARMv5)-specific situation.
>
> If there's actually an official list of supported platforms somewhere,
> then I'll let Nathan decide whether he wants to submit an RFC to drop ARMv5
> support.  I know I'd support it, but I don't care enough to write an RFC of
> my own right now.
>
> -Dave
>
>
>


Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread Dave Goodell (dgoodell)
On Aug 11, 2014, at 11:54 AM, Paul Hargrove  wrote:

> I am on the same page with George here - if it's on the list then support it
> until it's been removed.
> 
> I happen to have systems to test, I believe, every supported atomics 
> implementation except for DEC Alpha, and so I did test them all.

My comment was not intended to indicate that I don't value your testing 
contributions, Paul.  I am more concerned that Nathan is wasting time fixing 
support for an effectively useless platform.  It's not like this is a case 
where making the more portable change improves our general correctness on other 
platforms; it's a very (<= ARMv5)-specific situation.

If there's actually an official list of supported platforms somewhere, then 
I'll let Nathan decide whether he wants to submit an RFC to drop ARMv5 support. 
 I know I'd support it, but I don't care enough to write an RFC of my own right 
now.

-Dave




[OMPI devel] btl thread safety question

2014-08-11 Thread Pritchard Jr., Howard
Hi Folks,

Has anyone checked about ompi thread safety support since the BTL move?

I can only get the osu latency mt test to work using sm/shmem/vader.  With
TCP I see it hang after 32KB messages.

Howard


-
Howard Pritchard
HPC-5
Los Alamos National Laboratory




Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread Paul Hargrove
I am on the same page with George here - if it's on the list then support
it until it's been removed.

I happen to have systems to test, I believe, every supported atomics
implementation except for DEC Alpha, and so I did test them all.

AFAIK ARMv5 is even out-dated as a smartphone platform.

-Paul


On Mon, Aug 11, 2014 at 9:46 AM, George Bosilca  wrote:

> It is not that I care, but it was one of our supported platforms and we
> don't usually drop support for anything without a proper RFC.
>
>   George.
>
>
>
>
> On Mon, Aug 11, 2014 at 12:09 PM, Dave Goodell (dgoodell) <
> dgood...@cisco.com> wrote:
>
>> On Aug 7, 2014, at 11:37 PM, George Bosilca  wrote:
>>
>> > Paul's tests identified a small issue with the previous patch (a real
>> > corner-case for ARM v5). The patch below fixes all known issues.
>>
>> Wait, why do we care about ARMv5?  It's certainly not a serious HPC
>> platform, nor is it even a relevant laptop platform at this point (AFAIK).
>>
>> -Dave
>>
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread George Bosilca
It is not that I care, but it was one of our supported platforms and we
don't usually drop support for anything without a proper RFC.

  George.




On Mon, Aug 11, 2014 at 12:09 PM, Dave Goodell (dgoodell) <
dgood...@cisco.com> wrote:

> On Aug 7, 2014, at 11:37 PM, George Bosilca  wrote:
>
> > Paul's tests identified a small issue with the previous patch (a real
> > corner-case for ARM v5). The patch below fixes all known issues.
>
> Wait, why do we care about ARMv5?  It's certainly not a serious HPC
> platform, nor is it even a relevant laptop platform at this point (AFAIK).
>
> -Dave
>
>


Re: [OMPI devel] RFC: add atomic compare-and-swap that returns old value

2014-08-11 Thread Dave Goodell (dgoodell)
On Aug 7, 2014, at 11:37 PM, George Bosilca  wrote:

> Paul's tests identified a small issue with the previous patch (a real
> corner-case for ARM v5). The patch below fixes all known issues.

Wait, why do we care about ARMv5?  It's certainly not a serious HPC platform, 
nor is it even a relevant laptop platform at this point (AFAIK).

-Dave



Re: [OMPI devel] errors and warnings with show_help() usage

2014-08-11 Thread Jeff Squyres (jsquyres)
Sweet -- thanks!


On Aug 11, 2014, at 2:07 AM, Gilles Gouaillardet 
 wrote:

> Jeff and all,
> 
> i fixed the trivial errors in the trunk, there are now 11 non trivial
> errors.
> (commits r32490 to r32497)
> 
> i ran the script vs the v1.8 branch and found 54 errors
> (first, you need to
> touch Makefile.ompi-rules
> in the top-level Open MPI directory in order to make the script happy)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 22:43, Jeff Squyres (jsquyres) wrote:
>> SHORT VERSION
>> =
>> 
>> The ./contrib/check-help-strings.pl script is showing ***47 coding errors*** 
>> with regards to using show_help() in components.  Here's a summary of the 
>> offenders:
>> 
>> - ORTE (lumped together because there's a single maintainer :-) )
>> - smcuda and cuda
>> - common/verbs
>> - bcol
>> - mxm
>> - openib
>> - oshmem
>> 
>> Could the owners of these portions of the code base please run 
>> ./contrib/check-help-strings.pl and fix the ERRORs that are shown?
>> 
>> Thanks!
>> 
>> MORE DETAIL
>> ===
>> 
>> The first part of ./contrib/check-help-strings.pl's output shows ERRORs -- 
>> referring to help files that do not exist, or referring to help topics that 
>> do not exist.
>> 
>> I'm only calling out the ERRORs in this mail -- but the second part of the 
>> output shows a bazillion WARNINGs, too.  These are help topics that are 
>> probably unused -- they don't seem to be referenced by the code anywhere.  
>> 
>> It would be good to clean up all the WARNINGs, too, but the ERRORs are more 
>> worrisome.
>> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] errors and warnings with show_help() usage

2014-08-11 Thread Ralph Castain
I'm not worrying about 1.8.2, but we can take a look at this for 1.8.3 or 
beyond.

Thanks for working on the trunk!


On Aug 10, 2014, at 11:07 PM, Gilles Gouaillardet 
 wrote:

> Jeff and all,
> 
> i fixed the trivial errors in the trunk, there are now 11 non trivial
> errors.
> (commits r32490 to r32497)
> 
> i ran the script vs the v1.8 branch and found 54 errors
> (first, you need to
> touch Makefile.ompi-rules
> in the top-level Open MPI directory in order to make the script happy)
> 
> Cheers,
> 
> Gilles
> 
> On 2014/08/08 22:43, Jeff Squyres (jsquyres) wrote:
>> SHORT VERSION
>> =
>> 
>> The ./contrib/check-help-strings.pl script is showing ***47 coding errors*** 
>> with regards to using show_help() in components.  Here's a summary of the 
>> offenders:
>> 
>> - ORTE (lumped together because there's a single maintainer :-) )
>> - smcuda and cuda
>> - common/verbs
>> - bcol
>> - mxm
>> - openib
>> - oshmem
>> 
>> Could the owners of these portions of the code base please run 
>> ./contrib/check-help-strings.pl and fix the ERRORs that are shown?
>> 
>> Thanks!
>> 
>> MORE DETAIL
>> ===
>> 
>> The first part of ./contrib/check-help-strings.pl's output shows ERRORs -- 
>> referring to help files that do not exist, or referring to help topics that 
>> do not exist.
>> 
>> I'm only calling out the ERRORs in this mail -- but the second part of the 
>> output shows a bazillion WARNINGs, too.  These are help topics that are 
>> probably unused -- they don't seem to be referenced by the code anywhere.  
>> 
>> It would be good to clean up all the WARNINGs, too, but the ERRORs are more 
>> worrisome.
>> 
> 



Re: [OMPI devel] ORTE headers in OPAL source

2014-08-11 Thread Adrian Reber
I have seen it. I am still waiting for things to settle down before I
start fixing the FT code ( again ;-)

Adrian

On Mon, Aug 11, 2014 at 01:40:33PM +, Jeff Squyres (jsquyres) wrote:
> Ah, I see.
> 
> Ok -- add it to the list of 
> FT-things-to-be-fixed-before-FT-can-be-supported-again (which I think Josh 
> just did :-) ).
> 
> Also: Adrian -- FYI.  :-)
> 
> 
> On Aug 11, 2014, at 9:05 AM, George Bosilca  wrote:
> 
> > I just checked the code and noticed that all the usages of the sstore are 
> > protected by an OPAL_ENABLE_FT_CR define. As we are not supporting FT, I 
> > don't think this is something we should spend time fixing right now.
> > 
> >   George.
> > 
> > 
> > 
> > On Sat, Aug 9, 2014 at 8:06 AM, Jeff Squyres (jsquyres) 
> >  wrote:
> > I think you're making a joke, right...?
> > 
> > I see direct calls to ORTE sstore functionality in all three.
> > 
> > 
> > 
> > 
> > On Aug 8, 2014, at 5:42 PM, George Bosilca  wrote:
> > 
> > > These are harmless. They are only used when FT is enabled which should 
> > > rarely be the case.
> > >
> > >   George.
> > >
> > >
> > >
> > > On Fri, Aug 8, 2014 at 4:36 PM, Jeff Squyres (jsquyres) 
> > >  wrote:
> > > Here are a few ORTE headers in OPAL source -- can the respective owners
> > > clean these up?  Thanks.
> > >
> > > -
> > > mca/btl/smcuda/btl_smcuda.c
> > > 63:#include "orte/mca/sstore/sstore.h"
> > >
> > > mca/btl/sm/btl_sm.c
> > > 62:#include "orte/mca/sstore/sstore.h"
> > >
> > > mca/mpool/sm/mpool_sm_module.c
> > > 34:#include "orte/mca/sstore/sstore.h"
> > > -
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to: 
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
> > >
> > 
> > 
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: 
> > http://www.cisco.com/web/about/doing_business/legal/cri/
> > 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/

Adrian

-- 
Adrian Reber http://lisas.de/~adrian/
Authentic:
Indubitably true, in somebody's opinion.


Re: [OMPI devel] cosmetic configure nit

2014-08-11 Thread Jeff Squyres (jsquyres)
On Aug 9, 2014, at 4:24 PM, Paul Hargrove  wrote:

> One too many 's' characters in the following:
> 
> checking for asssembly architecture...

Fixed; thanks.


> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] bus error with openmpi-1.8.2rc4r32485 and gcc-4.9.0

2014-08-11 Thread Kawashima, Takahiro
Hi Ralph,

Your commit r32459 fixed the bus error by correcting
opal/dss/dss_copy.c. It's OK for trunk because mca_dstore_hash
calls dss to copy data. But it's insufficient for v1.8 because
mca_db_hash doesn't call dss and copies data itself.

The attached patch is the minimum patch to fix it in v1.8.
My fix doesn't call dss but uses memcpy. I have confirmed it on
SPARC64/Linux.
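
The general technique mentioned above, copying through memcpy instead of dereferencing a pointer that may be misaligned, can be sketched as follows. This is a generic illustration and not the attached patch, which is not reproduced here; load_u64_unaligned and store_u64_unaligned are hypothetical names, and the 64-bit width is an assumption chosen for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 64-bit value from a location that may not
 * be 8-byte aligned. Casting `src` to (const uint64_t *) and
 * dereferencing it can raise SIGBUS on strict-alignment CPUs such as
 * SPARC; memcpy makes no alignment assumption about its arguments. */
static inline uint64_t load_u64_unaligned(const void *src)
{
    uint64_t value;
    memcpy(&value, src, sizeof(value));
    return value;
}

/* Hypothetical helper: write a 64-bit value to a possibly unaligned
 * location, again going through memcpy rather than a pointer cast. */
static inline void store_u64_unaligned(void *dst, uint64_t value)
{
    memcpy(dst, &value, sizeof(value));
}
```

On x86_64 a misaligned access is tolerated by the hardware, which is consistent with the report that the same program runs there but fails on SPARC.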

Sorry to respond so late.

Regards,
Takahiro Kawashima,
MPI development team,
Fujitsu

> Siegmar, Ralph,
> 
> I'm sorry to have taken since last week to respond.
> 
> Ralph fixed the problem in r32459 and it was merged to v1.8
> in r32474. But in v1.8 an additional custom patch is needed
> because the db/dstore source code differs between trunk
> and v1.8.
> 
> I'm preparing and testing the custom patch just now.
> Please wait a moment.
> 
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
> 
> > Hi,
> > 
> > thank you very much to everybody who tried to solve my bus
> > error problem on Solaris 10 Sparc. I thought that you found
> > and fixed it, so that I installed openmpi-1.8.2rc4r32485 on
> > my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> > openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> > program works on my x86_64 architectures, but still breaks
> > with a bus error on my Sparc system.
> > 
> > linpc1 fd1026 106 mpiexec -np 1 init_finalize
> > Hello!
> > linpc1 fd1026 106 exit
> > logout
> > tyr small_prog 113 ssh sunpc1
> > sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> > Hello!
> > sunpc1 fd1026 102 exit
> > logout
> > tyr small_prog 114 mpiexec -np 1 init_finalize
> > [tyr:21109] *** Process received signal ***
> > [tyr:21109] Signal: Bus Error (10)
> > ...
> > 
> > 
> > gdb shows the following backtrace.
> > 
> > tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> > /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> > GNU gdb (GDB) 7.6.1
> > ...
> > (gdb) run -np 1 init_finalize
> > Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 
> > init_finalize
> > [Thread debugging using libthread_db enabled]
> > [New Thread 1 (LWP 1)]
> > [New LWP2]
> > [tyr:21158] *** Process received signal ***
> > [tyr:21158] Signal: Bus Error (10)
> > [tyr:21158] Signal code: Invalid address alignment (1)
> > [tyr:21158] Failing at address: 7fffd224
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> > /lib/sparcv9/libc.so.1:0xd8b98
> > /lib/sparcv9/libc.so.1:0xcc70c
> > /lib/sparcv9/libc.so.1:0xcc918
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
> >  [ Signal 10 (BUS)]
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> > /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> > /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> > /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> > [tyr:21158] *** End of error message ***
> > --
> > mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on 
> > signal 10 (Bus Error).
> > --
> > [LWP2 exited]
> > [New Thread 2]
> > [Switching to Thread 1 (LWP 1)]
> > sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to 
> > satisfy query
> > (gdb) bt
> > #0  0x7f6173d0 in rtld_db_dlactivity () from 
> > /usr/lib/sparcv9/ld.so.1
> > #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> > #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> > #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> > #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> > #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> > #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> > #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> > #8  0x7ec7748c in vm_close () from 
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #9  0x7ec74a6c in lt_dlclose () from 
> > /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> > #10 0x7ec99b90 in ri_destructor (obj=0x1001ead30)
> > at 
> > 

Re: [OMPI devel] [vt] --with-openmpi-inside configure argument

2014-08-11 Thread Matthias Jurenz

Hello Paul,

the only possible values for --with-openmpi-inside are "yes" and "1.7",
where the latter value is interpreted as "1.7 or later". Prior to version
1.7, the Open MPI configure provided both F77 and FC for specifying Fortran
compilers. The VT configure only provides FC, so it sets FC (if not set)
to F77.


Kind regards,
Matthias Jurenz


On 05.08.2014 02:40, Paul Hargrove wrote:

I noticed that Open MPI is passing
--with-openmpi-inside=1.7
in the arguments passed to
ompi/contrib/vt/vt/configure
and
ompi/contrib/vt/vt/extlib/otf/configure

The extlib/otf case just tests if the value is set, but the top-level 
vt/configure is checking for the specific string "1.7":


# Check whether we are inside Open MPI package
inside_openmpi="no"
AC_ARG_WITH(openmpi-inside, [],
[
    AS_IF([test x"$withval" = "xyes" -o x"$withval" = "x1.7"],
    [
        inside_openmpi="$withval"
        CPPFLAGS="-DINSIDE_OPENMPI $CPPFLAGS"

        # Set FC to F77 if Open MPI version < 1.7
        AS_IF([test x"$withval" = "xyes" -a x"$FC" = x -a x"$F77" != x],
              [FC="$F77"])
    ])
])

That logic looks a bit fragile with respect to any future changes.
Specifically the inner AS_IF is true for the desired condition 
"version < 1.7" only because the outer AS_IF currently ensures the 
only possible values of "$withval" are "yes" and "1.7".


-Paul

--
Paul H. Hargrove phhargr...@lbl.gov 
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900




--
Matthias Jurenz

Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
01062 Dresden, Germany
Phone: +49 (351) 463-31945
Fax: +49 (351) 463-37773
E-Mail: matthias.jur...@tu-dresden.de





Re: [OMPI devel] [OMPI users] bus error with openmpi-1.8.2rc4r32485 and gcc-4.9.0

2014-08-11 Thread Kawashima, Takahiro
Siegmar, Ralph,

I'm sorry to have taken since last week to respond.

Ralph fixed the problem in r32459 and it was merged to v1.8
in r32474. But in v1.8 an additional custom patch is needed
because the db/dstore source code differs between trunk
and v1.8.

I'm preparing and testing the custom patch just now.
Please wait a moment.

Takahiro Kawashima,
MPI development team,
Fujitsu

> Hi,
> 
> thank you very much to everybody who tried to solve my bus
> error problem on Solaris 10 Sparc. I thought that you found
> and fixed it, so that I installed openmpi-1.8.2rc4r32485 on
> my machines (Solaris 10 Sparc (tyr), Solaris 10 x86_64 (sunpc1),
> openSUSE Linux 12.1 x86_64 (linpc1)) with gcc-4.9.0. A small
> program works on my x86_64 architectures, but still breaks
> with a bus error on my Sparc system.
> 
> linpc1 fd1026 106 mpiexec -np 1 init_finalize
> Hello!
> linpc1 fd1026 106 exit
> logout
> tyr small_prog 113 ssh sunpc1
> sunpc1 fd1026 101 mpiexec -np 1 init_finalize
> Hello!
> sunpc1 fd1026 102 exit
> logout
> tyr small_prog 114 mpiexec -np 1 init_finalize
> [tyr:21109] *** Process received signal ***
> [tyr:21109] Signal: Bus Error (10)
> ...
> 
> 
> gdb shows the following backtrace.
> 
> tyr small_prog 122 /usr/local/gdb-7.6.1_64_gcc/bin/gdb 
> /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec
> GNU gdb (GDB) 7.6.1
> ...
> (gdb) run -np 1 init_finalize
> Starting program: /usr/local/openmpi-1.8.2_64_gcc/bin/mpiexec -np 1 
> init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP2]
> [tyr:21158] *** Process received signal ***
> [tyr:21158] Signal: Bus Error (10)
> [tyr:21158] Signal code: Invalid address alignment (1)
> [tyr:21158] Failing at address: 7fffd224
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_backtrace_print+0x2c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:0xcd130
> /lib/sparcv9/libc.so.1:0xd8b98
> /lib/sparcv9/libc.so.1:0xcc70c
> /lib/sparcv9/libc.so.1:0xcc918
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_db_hash.so:0x3ee8
>  [ Signal 10 (BUS)]
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6.2.0:opal_db_base_store+0xc8
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_decode_pidmap+0x798
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_util_nidmap_init+0x3cc
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/openmpi/mca_ess_env.so:0x226c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libopen-rte.so.7.0.4:orte_init+0x308
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:ompi_mpi_init+0x31c
> /export2/prog/SunOS_sparc/openmpi-1.8.2_64_gcc/lib64/libmpi.so.1.5.2:MPI_Init+0x2a8
> /home/fd1026/SunOS/sparc/bin/init_finalize:main+0x10
> /home/fd1026/SunOS/sparc/bin/init_finalize:_start+0x7c
> [tyr:21158] *** End of error message ***
> --
> mpiexec noticed that process rank 0 with PID 21158 on node tyr exited on signal 10 (Bus Error).
> --
> [LWP2 exited]
> [New Thread 2]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
> (gdb) bt
> #0  0x7f6173d0 in rtld_db_dlactivity () from /usr/lib/sparcv9/ld.so.1
> #1  0x7f6175a8 in rd_event () from /usr/lib/sparcv9/ld.so.1
> #2  0x7f618950 in lm_delete () from /usr/lib/sparcv9/ld.so.1
> #3  0x7f6226bc in remove_so () from /usr/lib/sparcv9/ld.so.1
> #4  0x7f624574 in remove_hdl () from /usr/lib/sparcv9/ld.so.1
> #5  0x7f61d97c in dlclose_core () from /usr/lib/sparcv9/ld.so.1
> #6  0x7f61d9d4 in dlclose_intn () from /usr/lib/sparcv9/ld.so.1
> #7  0x7f61db0c in dlclose () from /usr/lib/sparcv9/ld.so.1
> #8  0x7ec7748c in vm_close () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> #9  0x7ec74a6c in lt_dlclose () from /usr/local/openmpi-1.8.2_64_gcc/lib64/libopen-pal.so.6
> #10 0x7ec99b90 in ri_destructor (obj=0x1001ead30)
>     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:391
> #11 0x7ec984a8 in opal_obj_run_destructors (object=0x1001ead30)
>     at ../../../../openmpi-1.8.2rc4r32485/opal/class/opal_object.h:446
> #12 0x7ec9940c in mca_base_component_repository_release (
>     component=0x7b023df0 )
>     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_component_repository.c:244
> #13 0x7ec9b754 in mca_base_component_unload (
>     component=0x7b023df0 , output_id=-1)
>     at ../../../../openmpi-1.8.2rc4r32485/opal/mca/base/mca_base_components_close.c:47
> #14 0x7ec9b7e8 in mca_base_component_close (
> component=0x7b023df0 , 
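
The trace above reports "Invalid address alignment" at address 7fffd224, which is a multiple of 4 but not of 8. That would be consistent with a 64-bit load or store through a misaligned pointer, which SPARC traps on and x86_64 tolerates, but the trace alone does not identify the faulting statement in mca_db_hash, so treat this as an assumption. A minimal sketch of the usual portable pattern follows; the helper names are hypothetical and are not Open MPI functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read a 64-bit value from a position that may not
 * be 8-byte aligned. memcpy lets the compiler emit loads that are legal
 * on the target, instead of a single 64-bit load that would trap with
 * SIGBUS on strict-alignment CPUs such as SPARC. */
static uint64_t load_u64_unaligned(const void *p)
{
    uint64_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Returns nonzero when p is naturally aligned for a direct uint64_t
 * access (address is a multiple of sizeof(uint64_t)). */
static int is_aligned_u64(const void *p)
{
    return ((uintptr_t)p % sizeof(uint64_t)) == 0;
}
```

Casting a byte pointer such as `buf + 4` to `uint64_t *` and dereferencing it is the pattern this replaces; `is_aligned_u64` can be used in a debug assertion to catch such pointers on x86_64 before they reach a SPARC machine.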

Re: [OMPI devel] errors and warnings with show_help() usage

2014-08-11 Thread Gilles Gouaillardet
Jeff and all,

I fixed the trivial errors in the trunk (commits r32490 to r32497);
11 non-trivial errors remain.

I ran the script against the v1.8 branch and found 54 errors.
(First, you need to
touch Makefile.ompi-rules
in the top-level Open MPI directory to make the script happy.)

Cheers,

Gilles

On 2014/08/08 22:43, Jeff Squyres (jsquyres) wrote:
> SHORT VERSION
> =============
>
> The ./contrib/check-help-strings.pl script is showing ***47 coding errors***
> with regard to using show_help() in components.  Here's a summary of the
> offenders:
>
> - ORTE (lumped together because there's a single maintainer :-) )
> - smcuda and cuda
> - common/verbs
> - bcol
> - mxm
> - openib
> - oshmem
>
> Could the owners of these portions of the code base please run 
> ./contrib/check-help-strings.pl and fix the ERRORs that are shown?
>
> Thanks!
>
> MORE DETAIL
> ===========
>
> The first part of ./contrib/check-help-strings.pl's output shows ERRORs -- 
> referring to help files that do not exist, or referring to help topics that 
> do not exist.
>
> I'm only calling out the ERRORs in this mail -- but the second part of the 
> output shows a bazillion WARNINGs, too.  These are help topics that are 
> probably unused -- they don't seem to be referenced by the code anywhere.  
>
> It would be good to clean up all the WARNINGs, too, but the ERRORs are more 
> worrisome.
>



Re: [OMPI devel] ibm abort test hangs on one node

2014-08-11 Thread Gilles Gouaillardet
Thanks Ralph!

This was necessary but not sufficient:

orte_errmgr_base_abort calls orte_session_dir_finalize at
errmgr_base_fns.c:219, which removes the proc session dir.
Then orte_errmgr_base_abort (indirectly) calls orte_ess_base_app_abort
at line 227.

So the proc session dir is removed first, and the attempt to create the
empty "aborted" file in the just-removed directory then fails
(there is no error check, so the failure goes unnoticed).
As a consequence, the code you added in r32460 does not get executed.

I committed r32498 to fix this.
It simply does not call orte_session_dir_finalize in the first place
(which is sufficient but might not be necessary ...)
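
For illustration only, here is a minimal sketch of what checking that file creation could look like. `touch_marker` is a hypothetical helper, not the actual ORTE code, and the directory and file names passed to it are up to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not an Open MPI function): create an empty marker
 * file `name` inside `dir`. Returns 0 on success, or -1 with errno set,
 * e.g. ENOENT when `dir` has already been removed. */
static int touch_marker(const char *dir, const char *name)
{
    char path[4096];
    int fd, n, saved;

    assert(dir != NULL && name != NULL);

    n = snprintf(path, sizeof(path), "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) {
        /* Report the failure instead of silently dropping it, and keep
         * errno intact for the caller. */
        saved = errno;
        fprintf(stderr, "cannot create %s: %s\n", path, strerror(saved));
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

With a check like this, creating the marker after the directory is gone returns -1 with errno set to ENOENT rather than failing silently.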

Cheers,

Gilles

On 2014/08/09 1:27, Ralph Castain wrote:
> Committed a fix for this in r32460 - see if I got it!
>
> On Aug 8, 2014, at 4:02 AM, Gilles Gouaillardet 
>  wrote:
>
>> Folks,
>>
>> here is the description of a hang i briefly mentioned a few days ago.
>>
>> with the trunk (i did not check 1.8 ...) simply run on one node :
>> mpirun -np 2 --mca btl sm,self ./abort
>>
>> (the abort test is taken from the ibm test suite : process 0 calls
>> MPI_Abort while process 1 enters an infinite loop)
>>
>> there is a race condition : sometimes it hangs, sometimes it aborts
>> nicely as expected.
>> when the hang occurs, both abort processes have exited and mpirun waits
>> forever
>>
>> i did some investigating and now have a better idea of what happens
>> (but i am still clueless about how to fix this)
>>
>> when process 0 aborts, it :
>> - closes the tcp socket connected to mpirun
>> - closes the pipe connected to mpirun
>> - sends SIGCHLD to mpirun
>>
>> then on mpirun :
>> when SIGCHLD is received, the handler basically writes 17 (the signal
>> number) to a socketpair.
>> then libevent will return from a poll and here is the race condition,
>> basically :
>> if revents is non zero for the three fds (socket, pipe and socketpair)
>> then the program will abort nicely
>> if revents is non zero for both socket and pipe but is zero for the
>> socketpair, then the mpirun will hang
>>
>> i dug a bit deeper and found that when the event on the socketpair is
>> processed, it will end up calling
>> odls_base_default_wait_local_proc.
>> if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program
>> will abort nicely
>> *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the
>> program will hang
>>
>> another way to put this is that
>> when the program aborts nicely, the call sequence is
>> odls_base_default_wait_local_proc
>> proc_errors(vpid=0)
>> proc_errors(vpid=0)
>> proc_errors(vpid=1)
>> proc_errors(vpid=1)
>>
>> when the program hangs, the call sequence is
>> proc_errors(vpid=0)
>> odls_base_default_wait_local_proc
>> proc_errors(vpid=0)
>> proc_errors(vpid=1)
>> proc_errors(vpid=1)
>>
>> i will resume this on Monday unless someone can fix this in the
>> meantime :-)
>>
>> Cheers,
>>
>> Gilles
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15552.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15560.php