Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Paul Hargrove
My testing on 1.8.4rc4 is not quite done, but is getting close.
With two exceptions, so far all looks good to me on almost 60 different
platforms.

I've retested on my Solaris systems and saw none of the issues I had with
rc3.
The x86-64/Linux system with mtl:psm is no longer giving a SEGV at exit.

My QEMU-based Linux/ARM and Linux/MIPS testers were OK with rc3, but I've
not yet completed testing rc4 (too slow).

The "two exceptions":

#1
I *am* still manually passing --without-xpmem on the SGI UV.
If I don't do so then the build fails as described in
http://www.open-mpi.org/community/lists/devel/2014/12/16520.php

#2
Solaris-10/SPARC and "--enable-static --disable-shared" appears broken for
C++ apps (but OK for C).
I will report in more detail when I have more information.

-Paul

On Sat, Dec 13, 2014 at 3:06 PM, Ralph Castain  wrote:
>
> Hi folks
>
> I've rolled up the bug fixes so far, including the thread-multiple
> performance fix. So please give this one a whirl
>
> http://www.open-mpi.org/software/ompi/v1.8/
>
> Ralph
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16586.php
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-15 Thread Pascal Deveze
George,

Thanks for the patch. That was the solution.

Pascal

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
Sent: Saturday, December 13, 2014 08:38
To: Open MPI Developers
Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in 
ompi/runtime/ompi_mpi_init.c is called to late

The source of this annoyance is the widespread use of 
OMPI_ENABLE_THREAD_MULTIPLE as an argument to all of the component init calls. 
This is obviously wrong, as OMPI_ENABLE_THREAD_MULTIPLE is not about the level 
of thread support requested but about the least restrictive thread level 
supported by the library. Luckily the solution is simple: replace 
OMPI_ENABLE_THREAD_MULTIPLE with the variable ompi_mpi_thread_multiple, and 
there should be no need to check opal_using_threads in the initializers 
(open-mpi/ompi@343071498965a8f73d5f2b0c27a7ef404caf286c).

  George.


On Fri, Dec 12, 2014 at 2:58 AM, Pascal Deveze <pascal.dev...@bull.net> wrote:
George,

My initial problem is that when MPI is compiled with 
“--enable-mpi-thread-multiple”, the variable enable_mpi_threads is set to 1 
even if MPI_Init() is called instead of MPI_Init_thread().
I also saw that opal_using_threads() exists and is used by other BTLs.

Maybe the solution is to find a way to set enable_mpi_threads to 0 when 
MPI_Init() is called.


From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of George Bosilca
Sent: Friday, December 12, 2014 07:03
To: Open MPI Developers
Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in 
ompi/runtime/ompi_mpi_init.c is called to late

On Thu, Dec 11, 2014 at 8:30 PM, Ralph Castain <r...@open-mpi.org> wrote:
Just to help me understand: I don’t think this change actually changed any 
behavior. However, it certainly *allows* a different behavior. Isn’t that true?

It depends on how you look at it. To be extremely clear, it prevents the 
modules from using anything other than their arguments to decide the provided 
threading model. With the current change, it is possible that some of the 
modules will continue to follow this "old" behavior, while others might switch 
to checking opal_using_threads to decide how to behave.

My point here is not that one is better than the other, just that we 
inadvertently introduced a possibility for non-consistent behavior.

Let me take an example. In the old scheme, the PML was allowed to run each BTL 
in a separate thread, with absolutely no BTL support for thread safety. Thus, 
the PML could have managed all the interactions between BTLs and requests in an 
atomic way, without the BTL knowing about it. Now, if the BTL makes its 
decision based on the value returned by opal_using_threads, this approach is no 
longer possible.

If so, I guess the real question is for Pascal at Bull: why do you feel this 
earlier setting is required?

Perhaps to determine whether functions that require protection, such as 
opal_lifo_push, will work by default, or whether one should call their atomic 
versions directly?

  George.



On Dec 11, 2014, at 4:21 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

The overall design in OMPI was that no OMPI module should be allowed to decide 
whether threads are on (thus it should not rely on the value returned by 
opal_using_threads during its initialization stage). Instead, modules should 
respect the level of thread support requested as an argument during the 
initialization step.

And this is true even for the BTLs. The PML component init function propagates 
enable_progress_threads and enable_mpi_threads down to the BML, and then to the 
BTL. These two variables, enable_progress_threads and enable_mpi_threads, are 
exactly what ompi_mpi_init uses to compute the value of opal_using_threads (the 
setting that this patch moved).

The setting of opal_using_threads was delayed during initialization to ensure 
that its value was not used to select a specific thread level in any module, a 
behavior that is now allowed with the new setting.

A drastic change in behavior...

  George.


On Tue, Dec 9, 2014 at 3:33 AM, Ralph Castain <r...@open-mpi.org> wrote:
Kewl - I’ll fix. Thanks!

On Dec 9, 2014, at 12:32 AM, Pascal Deveze <pascal.dev...@bull.net> wrote:

Hi Ralph,

This is in the trunk.

From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Tuesday, December 9, 2014 09:32
To: Open MPI Developers
Subject: Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in 
ompi/runtime/ompi_mpi_init.c is called to late

Hi Pascal

Is this in the trunk or in the 1.8 series (or both)?


On Dec 9, 2014, at 12:28 AM, Pascal Deveze <pascal.dev...@bull.net> wrote:


In the case where MPI is compiled with --enable-mpi-thread-multiple, a call to 
opal_using_threads() always returns 0 in the btl_xxx_component_init() routine 
of the BTLs, even if the application calls MPI_Init_thread() with 
MPI_

Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Paul Hargrove
On Sun, Dec 14, 2014 at 10:52 PM, Paul Hargrove  wrote:
>
> Solaris-10/SPARC and "--enable-static --disable-shared" appears broken for
> C++ apps (but OK for C).
> I will report in more detail when I have more information.
>

First the good news:

The problem I was experiencing (with the Solaris Studio compilers) turned
out to be "pilot error".
I had added "-library=stlport4" to CXXFLAGS but neglected to add the same
in --with-wrapper-cxxflags.
Adding to both has always sort of bothered me, and this time it bit me.
Oddly, the problem didn't appear until I forced static libs.

Now the bad news:

By trying more variants on my Solaris platforms I was able to get TWO new
failure modes.
However, I have a fix for one.

1)
Still Solaris-10/SPARC and "--enable-static --disable-shared"  but this
time with gcc-3.4.6.
With this configuration I get Bus Errors from "make check" that do not
occur without these configure options:

bash: line 5:  3141 Bus Error   (core dumped) ${dir}$tst
FAIL: position
bash: line 5:  3221 Bus Error   (core dumped) ${dir}$tst
FAIL: position_noncontig


Examining the core from the second failure:

t@1 (l@1) program terminated by signal BUS (invalid address alignment)
Current function is main
  208   opal_pack_debug = 0;
(dbx) print &opal_pack_debug
&opal_pack_debug = 0x10092e169


The problem seems to be that the tests declare this (and others) as an int,
but the opal headers say bool:

$ gegrep  -r '^extern .* opal_(pack|unpack|position)_debug' .
./test/datatype/position.c:extern int opal_unpack_debug;
./test/datatype/position.c:extern int opal_pack_debug;
./test/datatype/position.c:extern int opal_position_debug ;
./test/datatype/position_noncontig.c:extern int opal_unpack_debug;
./test/datatype/position_noncontig.c:extern int opal_pack_debug;
./test/datatype/position_noncontig.c:extern int opal_position_debug ;
./opal/datatype/opal_convertor_internal.h:extern bool opal_pack_debug;
./opal/datatype/opal_datatype_position.c:extern bool opal_position_debug;

Defn of opal_unpack_debug is well hidden, but is also "bool".

Correcting "int" to "bool" for those 3 vars in the two tests resolved this
problem for me.



2)
Now on my Solaris-11/x86-64 system with both GigE and IPoIB interfaces.
I am seeing the following when using the Solaris Studio compilers (Gnu
compilers were fine):

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
[pcp-j-20:16239] mca_oob_tcp_accept: accept() failed: Error 0 (0).

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    pcp-j-20
  Remote host:   172.18.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.



Notice the "Error 0 (0)" which means errno=0 and suggests that we've not
properly linked the thread-safe C libraries (recall that there is one
thread per interface and these hosts have two).
I see "-D_REENTRANT" in the output of "make".
However, the man pages suggest that one also needs "-mt=yes" in *both* the
compile and link steps (it defines _REENTRANT and links the proper libs).

I hoped that I could resolve this failure by adding LDFLAGS=-mt=yes to the
configure command.
However, that didn't work.


-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found

2014-12-15 Thread Eric Chamberland

Hi,

I first saw this message using 1.8.4rc3:

--
WARNING: No loopback interface was found. This can cause problems
when we spawn processes as they are likely to be unable to connect
back to their host daemon. Sadly, it may take awhile for the connect
attempt to fail, so you may experience a significant hang time.

You may wish to ctrl-c out of your job and activate loopback support
on at least one interface before trying again.

--

I have compiled it in "debug" mode... is it the problem?

...but I think I do have a loopback on my host:

ifconfig -a

eth0  Link encap:Ethernet  HWaddr 00:25:90:0D:A5:38
  inet addr:132.203.7.22  Bcast:132.203.7.255  Mask:255.255.255.0
  inet6 addr: fe80::225:90ff:fe0d:a538/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:49080380 errors:0 dropped:0 overruns:0 frame:0
  TX packets:67526463 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:35710440484 (34056.1 Mb)  TX bytes:64050625687 (61083.4 Mb)
  Interrupt:16 Memory:faee-faf0

eth1  Link encap:Ethernet  HWaddr 00:25:90:0D:A5:39
  BROADCAST MULTICAST  MTU:1500  Metric:1
  RX packets:0 errors:0 dropped:0 overruns:0 frame:0
  TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
  Interrupt:17 Memory:fafe-fb00

lo    Link encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:65536  Metric:1
  RX packets:3089696 errors:0 dropped:0 overruns:0 frame:0
  TX packets:3089696 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:8421008033 (8030.8 Mb)  TX bytes:8421008033 (8030.8 Mb)

Is that message erroneous?

Thanks,

Eric


Re: [OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found

2014-12-15 Thread Eric Chamberland

Forgot this:

ompi_info -all : 
http://www.giref.ulaval.ca/~ericc/ompi_bug/ompi_info.all.184rc3.txt.gz

config.log: http://www.giref.ulaval.ca/~ericc/ompi_bug/config.184rc3.log.gz

Eric



Re: [OMPI devel] 1.8.4rc3: WARNING: No loopback interface was found

2014-12-15 Thread Ralph Castain
Yes - it's been fixed in rc4


On Mon, Dec 15, 2014 at 5:16 AM, Eric Chamberland <
eric.chamberl...@giref.ulaval.ca> wrote:
>
> Hi,
>
> I first saw this message using 1.8.4rc3:
>
> --
> WARNING: No loopback interface was found. This can cause problems
> when we spawn processes as they are likely to be unable to connect
> back to their host daemon. Sadly, it may take awhile for the connect
> attempt to fail, so you may experience a significant hang time.
>
> You may wish to ctrl-c out of your job and activate loopback support
> on at least one interface before trying again.
>
> --
>
> I have compiled it in "debug" mode... is it the problem?
>
> ...but I think I do have a loopback on my host:
>
> ifconfig -a
>
> eth0  Link encap:Ethernet  HWaddr 00:25:90:0D:A5:38
>   inet addr:132.203.7.22  Bcast:132.203.7.255  Mask:255.255.255.0
>   inet6 addr: fe80::225:90ff:fe0d:a538/64 Scope:Link
>   UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>   RX packets:49080380 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:67526463 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:35710440484 (34056.1 Mb)  TX bytes:64050625687 (61083.4 Mb)
>   Interrupt:16 Memory:faee-faf0
>
> eth1  Link encap:Ethernet  HWaddr 00:25:90:0D:A5:39
>   BROADCAST MULTICAST  MTU:1500  Metric:1
>   RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:1000
>   RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
>   Interrupt:17 Memory:fafe-fb00
>
> lo    Link encap:Local Loopback
>   inet addr:127.0.0.1  Mask:255.0.0.0
>   inet6 addr: ::1/128 Scope:Host
>   UP LOOPBACK RUNNING  MTU:65536  Metric:1
>   RX packets:3089696 errors:0 dropped:0 overruns:0 frame:0
>   TX packets:3089696 errors:0 dropped:0 overruns:0 carrier:0
>   collisions:0 txqueuelen:0
>   RX bytes:8421008033 (8030.8 Mb)  TX bytes:8421008033 (8030.8 Mb)
>
> Is that message erroneous?
>
> Thanks,
>
> Eric
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16591.php
>


[OMPI devel] 1.8.4rc Status

2014-12-15 Thread Ralph Castain
Hi folks

Trying to summarize the current situation on releasing 1.8.4. Remaining
identified issues:

1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into it.

2. hwloc updates required. Brice committed them to the hwloc 1.7 repo.
Gilles volunteered to create the PR from there.

3. Fortran f08 binding disable for compilers not meeting certain
conditions. PR from Gilles awaiting review by Jeff

4. Topo signature issue reported by IBM. Ralph is waiting for more debug.

5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.

6. make check issue on SPARC. Problem and fix reported by Paul Hargrove,
Ralph will commit

7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
multi-threaded C libraries, apparently need "-mt=yes" in both compile and
link. Need someone to investigate.

Please let me know if I've missed anything.
Ralph


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Tom Wurgler
Forgive me if I've missed it, but I believe using physical OR logical core 
numbering was going to be reimplemented in the 1.8.4 series.

I've checked out rc2 and as far as I can tell, it isn't there as yet. Is this 
correct?

thanks!



From: devel  on behalf of Ralph Castain 

Sent: Monday, December 15, 2014 8:35 AM
To: Open MPI Developers
Subject: [OMPI devel] 1.8.4rc Status

Hi folks

Trying to summarize the current situation on releasing 1.8.4. Remaining 
identified issues:

1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into it.

2. hwloc updates required. Brice committed them to the hwloc 1.7 repo. Gilles 
volunteered to create the PR from there.

3. Fortran f08 binding disable for compilers not meeting certain conditions. PR 
from Gilles awaiting review by Jeff

4. Topo signature issue reported by IBM. Ralph is waiting for more debug.

5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.

6. make check issue on SPARC. Problem and fix reported by Paul Hargrove, Ralph 
will commit

7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the 
multi-threaded C libraries, apparently need "-mt=yes" in both compile and link. 
Need someone to investigate.

Please let me know if I've missed anything.
Ralph



Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Ralph Castain
Should be there in rc4, and I thought it made it to rc2 for that matter.
I'll take a gander.

FWIW: I'm working off-list with IBM to tighten the LSF integration so we
correctly read and follow their binding directives. This will also be in
1.8.4 as we are in final test with it now.

Ralph


On Mon, Dec 15, 2014 at 5:40 AM, Tom Wurgler  wrote:
>
>  Forgive me if I've missed it, but I believe using physical OR logical
> core numbering was going to be
>
> reimplemented in the 1.8.4 series.
>
>
>  I've checked out rc2 and as far as I can tell, it isn't there as yet.
> Is this correct?
>
>
>  thanks!
>
>
>  --
> *From:* devel  on behalf of Ralph Castain <
> r...@open-mpi.org>
> *Sent:* Monday, December 15, 2014 8:35 AM
> *To:* Open MPI Developers
> *Subject:* [OMPI devel] 1.8.4rc Status
>
>  Hi folks
>
>  Trying to summarize the current situation on releasing 1.8.4. Remaining
> identified issues:
>
>  1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into it.
>
>  2. hwloc updates required. Brice committed them to the hwloc 1.7 repo.
> Gilles volunteered to create the PR from there.
>
>  3. Fortran f08 binding disable for compilers not meeting certain
> conditions. PR from Gilles awaiting review by Jeff
>
>  4. Topo signature issue reported by IBM. Ralph is waiting for more debug.
>
>  5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.
>
>  6. make check issue on SPARC. Problem and fix reported by Paul Hargrove,
> Ralph will commit
>
>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
> link. Need someone to investigate.
>
>  Please let me know if I've missed anything.
> Ralph
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16595.php
>


Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Adrian Reber
1.8.4rc4 works without errors on my PSM based systems.

Adrian

On Sat, Dec 13, 2014 at 03:06:07PM -0800, Ralph Castain wrote:
> Hi folks
> 
> I’ve rolled up the bug fixes so far, including the thread-multiple 
> performance fix. So please give this one a whirl
> 
> http://www.open-mpi.org/software/ompi/v1.8/ 
> 
> 
> Ralph
> 



Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Marco Atzeri

On 12/14/2014 12:06 AM, Ralph Castain wrote:

Hi folks

I’ve rolled up the bug fixes so far, including the thread-multiple
performance fix. So please give this one a whirl

http://www.open-mpi.org/software/ompi/v1.8/

Ralph



No regression on Cygwin 64 bit

The only (and usual) failure: atomic_cmpset_noinline.exe

I also tested the OSU benchmarks 4.4.1.
The only failing tests (as already seen) are
  mpi/pt2pt/osu_latency_mt.exe
  mpi/pt2pt/osu_multi_lat.exe

and I am not sure that I am running them correctly.
All the other tests passed:

./mpi/collective/osu_allgather.exe
./mpi/collective/osu_allgatherv.exe
./mpi/collective/osu_allreduce.exe
./mpi/collective/osu_alltoall.exe
./mpi/collective/osu_alltoallv.exe
./mpi/collective/osu_barrier.exe
./mpi/collective/osu_bcast.exe
./mpi/collective/osu_gather.exe
./mpi/collective/osu_gatherv.exe
./mpi/collective/osu_reduce.exe
./mpi/collective/osu_reduce_scatter.exe
./mpi/collective/osu_scatter.exe
./mpi/collective/osu_scatterv.exe
./mpi/one-sided/osu_acc_latency.exe
./mpi/one-sided/osu_cas_latency.exe
./mpi/one-sided/osu_fop_latency.exe
./mpi/one-sided/osu_get_acc_latency.exe
./mpi/one-sided/osu_get_bw.exe
./mpi/one-sided/osu_get_latency.exe
./mpi/one-sided/osu_put_bibw.exe
./mpi/one-sided/osu_put_bw.exe
./mpi/one-sided/osu_put_latency.exe
./mpi/pt2pt/osu_bibw.exe
./mpi/pt2pt/osu_bw.exe
./mpi/pt2pt/osu_latency.exe
./mpi/pt2pt/osu_mbw_mr.exe




Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-509-g38d6627

2014-12-15 Thread Nathan Hjelm

Not yet. I am still trying to pinpoint the problem. From what I can tell
the SGI version of XPMEM should be nearly identical to the Cray
version. I should have this figured out this week. If I don't get it
fixed by Wed I will open a pull request to remove the check for
sn/xpmem.h.

-Nathan

On Fri, Dec 12, 2014 at 07:50:11PM -0800, Ralph Castain wrote:
> Nathan - does this need to come to 1.8.4? Or do you want to go with Paul’s 
> suggested fix?
> 
> > On Dec 12, 2014, at 8:09 AM, git...@crest.iu.edu wrote:
> > 
> > This is an automated email from the git hooks/post-receive script. It was
> > generated because a ref change was pushed to the repository containing
> > the project "open-mpi/ompi".
> > 
> > The branch, master has been updated
> >   via  38d66272c51fd531181d9dc282a7260f40270f64 (commit)
> >  from  f4aecdbfd22a74feadab5566d2d595b65be4a8cb (commit)
> > 
> > Those revisions listed above that are new to this repository have
> > not appeared on any other notification email; so we list those
> > revisions in full, below.
> > 
> > - Log -
> > https://github.com/open-mpi/ompi/commit/38d66272c51fd531181d9dc282a7260f40270f64
> > 
> > commit 38d66272c51fd531181d9dc282a7260f40270f64
> > Author: Nathan Hjelm 
> > Date:   Fri Dec 12 09:09:01 2014 -0700
> > 
> >btl/vader: fix compile on SGI UV
> > 
> > diff --git a/opal/mca/btl/vader/btl_vader_component.c 
> > b/opal/mca/btl/vader/btl_vader_component.c
> > index 7061612..aabf03d 100644
> > --- a/opal/mca/btl/vader/btl_vader_component.c
> > +++ b/opal/mca/btl/vader/btl_vader_component.c
> > @@ -354,9 +354,8 @@ static void mca_btl_vader_check_single_copy (void)
> > #if OPAL_BTL_VADER_HAVE_XPMEM
> > if (MCA_BTL_VADER_XPMEM == 
> > mca_btl_vader_component.single_copy_mechanism) {
> > /* try to create an xpmem segment for the entire address space */
> > -mca_btl_vader_component.my_seg_id = xpmem_make (0, 
> > VADER_MAX_ADDRESS, XPMEM_PERMIT_MODE, (void *)0666);
> > -
> > -if (-1 == mca_btl_vader_component.my_seg_id) {
> > +rc = mca_btl_vader_xpmem_init ();
> > +if (OPAL_SUCCESS != rc) {
> > if (MCA_BTL_VADER_XPMEM == initial_mechanism) {
> > opal_show_help("help-btl-vader.txt", "xpmem-make-failed",
> >true, opal_process_info.nodename, errno,
> > @@ -364,11 +363,7 @@ static void mca_btl_vader_check_single_copy (void)
> > }
> > 
> > mca_btl_vader_select_next_single_copy_mechanism ();
> > -} else {
> > -mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> > -mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> > }
> > -
> > }
> > #endif
> > 
> > diff --git a/opal/mca/btl/vader/btl_vader_xpmem.c 
> > b/opal/mca/btl/vader/btl_vader_xpmem.c
> > index 7e362ea..4bb9a3b 100644
> > --- a/opal/mca/btl/vader/btl_vader_xpmem.c
> > +++ b/opal/mca/btl/vader/btl_vader_xpmem.c
> > @@ -19,6 +19,19 @@
> > 
> > #if OPAL_BTL_VADER_HAVE_XPMEM
> > 
> > +int mca_btl_vader_xpmem_init (void)
> > +{
> > +mca_btl_vader_component.my_seg_id = xpmem_make (0, VADER_MAX_ADDRESS, 
> > XPMEM_PERMIT_MODE, (void *)0666);
> > +if (-1 == mca_btl_vader_component.my_seg_id) {
> > +return OPAL_ERR_NOT_AVAILABLE;
> > +}
> > +
> > +mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> > +mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> > +
> > +return OPAL_SUCCESS;
> > +}
> > +
> > /* look up the remote pointer in the peer rcache and attach if
> >  * necessary */
> > mca_mpool_base_registration_t *vader_get_registation (struct 
> > mca_btl_base_endpoint_t *ep, void *rem_ptr,
> > diff --git a/opal/mca/btl/vader/btl_vader_xpmem.h 
> > b/opal/mca/btl/vader/btl_vader_xpmem.h
> > index 1be188a..e040e26 100644
> > --- a/opal/mca/btl/vader/btl_vader_xpmem.h
> > +++ b/opal/mca/btl/vader/btl_vader_xpmem.h
> > @@ -22,6 +22,7 @@
> >   #include 
> > 
> >   typedef int64_t xpmem_segid_t;
> > +  typedef int64_t xpmem_apid_t;
> > #endif
> > 
> > /* look up the remote pointer in the peer rcache and attach if
> > @@ -30,6 +31,8 @@
> > /* largest address we can attach to using xpmem */
> > #define VADER_MAX_ADDRESS ((uintptr_t)0x7000ul)
> > 
> > +int mca_btl_vader_xpmem_init (void);
> > +
> > mca_mpool_base_registration_t *vader_get_registation (struct 
> > mca_btl_base_endpoint_t *endpoint, void *rem_ptr,
> >   size_t size, int flags, 
> > void **local_ptr);
> > 
> > 
> > 
> > ---
> > 
> > Summary of changes:
> > opal/mca/btl/vader/btl_vader_component.c |  9 ++---
> > opal/mca/btl/vader/btl_vader_xpmem.c | 13 +
> > opal/mca/btl/vader/btl_vader_xpmem.h |  3 +++
> > 3 files changed, 18 insertions(+), 7 deletions(-)
> > 
> > 
> > hooks/post-receive
> > -- 
> > open-mpi/ompi
> > _

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-509-g38d6627

2014-12-15 Thread Howard Pritchard
I'd prefer Paul's suggestion to disable xpmem for sgi/uv for 1.8.X
Is anyone actually supporting this?

Howard

2014-12-15 8:56 GMT-07:00 Nathan Hjelm :
>
>
> Not yet. I am still trying to pinpoint the problem. From what I can tell
> the SGI version of XPMEM should be nearly identical to the Cray
> version. I should have this figured out this week. If I don't get it
> fixed by Wed I will open a pull request to remove the check for
> sn/xpmem.h.
>
> -Nathan
>
> On Fri, Dec 12, 2014 at 07:50:11PM -0800, Ralph Castain wrote:
> > Nathan - does this need to come to 1.8.4? Or do you want to go with
> Paul’s suggested fix?
> >
> > > On Dec 12, 2014, at 8:09 AM, git...@crest.iu.edu wrote:
> > >
> > > This is an automated email from the git hooks/post-receive script. It
> was
> > > generated because a ref change was pushed to the repository containing
> > > the project "open-mpi/ompi".
> > >
> > > The branch, master has been updated
> > >   via  38d66272c51fd531181d9dc282a7260f40270f64 (commit)
> > >  from  f4aecdbfd22a74feadab5566d2d595b65be4a8cb (commit)
> > >
> > > Those revisions listed above that are new to this repository have
> > > not appeared on any other notification email; so we list those
> > > revisions in full, below.
> > >
> > > - Log -
> > >
> https://github.com/open-mpi/ompi/commit/38d66272c51fd531181d9dc282a7260f40270f64
> > >
> > > commit 38d66272c51fd531181d9dc282a7260f40270f64
> > > Author: Nathan Hjelm 
> > > Date:   Fri Dec 12 09:09:01 2014 -0700
> > >
> > >btl/vader: fix compile on SGI UV
> > >
> > > diff --git a/opal/mca/btl/vader/btl_vader_component.c
> b/opal/mca/btl/vader/btl_vader_component.c
> > > index 7061612..aabf03d 100644
> > > --- a/opal/mca/btl/vader/btl_vader_component.c
> > > +++ b/opal/mca/btl/vader/btl_vader_component.c
> > > @@ -354,9 +354,8 @@ static void mca_btl_vader_check_single_copy (void)
> > > #if OPAL_BTL_VADER_HAVE_XPMEM
> > > if (MCA_BTL_VADER_XPMEM ==
> mca_btl_vader_component.single_copy_mechanism) {
> > > /* try to create an xpmem segment for the entire address space
> */
> > > -mca_btl_vader_component.my_seg_id = xpmem_make (0,
> VADER_MAX_ADDRESS, XPMEM_PERMIT_MODE, (void *)0666);
> > > -
> > > -if (-1 == mca_btl_vader_component.my_seg_id) {
> > > +rc = mca_btl_vader_xpmem_init ();
> > > +if (OPAL_SUCCESS != rc) {
> > > if (MCA_BTL_VADER_XPMEM == initial_mechanism) {
> > > opal_show_help("help-btl-vader.txt",
> "xpmem-make-failed",
> > >true, opal_process_info.nodename, errno,
> > > @@ -364,11 +363,7 @@ static void mca_btl_vader_check_single_copy (void)
> > > }
> > >
> > > mca_btl_vader_select_next_single_copy_mechanism ();
> > > -} else {
> > > -mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> > > -mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> > > }
> > > -
> > > }
> > > #endif
> > >
> > > diff --git a/opal/mca/btl/vader/btl_vader_xpmem.c
> b/opal/mca/btl/vader/btl_vader_xpmem.c
> > > index 7e362ea..4bb9a3b 100644
> > > --- a/opal/mca/btl/vader/btl_vader_xpmem.c
> > > +++ b/opal/mca/btl/vader/btl_vader_xpmem.c
> > > @@ -19,6 +19,19 @@
> > >
> > > #if OPAL_BTL_VADER_HAVE_XPMEM
> > >
> > > +int mca_btl_vader_xpmem_init (void)
> > > +{
> > > +mca_btl_vader_component.my_seg_id = xpmem_make (0,
> VADER_MAX_ADDRESS, XPMEM_PERMIT_MODE, (void *)0666);
> > > +if (-1 == mca_btl_vader_component.my_seg_id) {
> > > +return OPAL_ERR_NOT_AVAILABLE;
> > > +}
> > > +
> > > +mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> > > +mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> > > +
> > > +return OPAL_SUCCESS;
> > > +}
> > > +
> > > /* look up the remote pointer in the peer rcache and attach if
> > >  * necessary */
> > > mca_mpool_base_registration_t *vader_get_registation (struct
> mca_btl_base_endpoint_t *ep, void *rem_ptr,
> > > diff --git a/opal/mca/btl/vader/btl_vader_xpmem.h
> b/opal/mca/btl/vader/btl_vader_xpmem.h
> > > index 1be188a..e040e26 100644
> > > --- a/opal/mca/btl/vader/btl_vader_xpmem.h
> > > +++ b/opal/mca/btl/vader/btl_vader_xpmem.h
> > > @@ -22,6 +22,7 @@
> > >   #include 
> > >
> > >   typedef int64_t xpmem_segid_t;
> > > +  typedef int64_t xpmem_apid_t;
> > > #endif
> > >
> > > /* look up the remote pointer in the peer rcache and attach if
> > > @@ -30,6 +31,8 @@
> > > /* largest address we can attach to using xpmem */
> > > #define VADER_MAX_ADDRESS ((uintptr_t)0x7000ul)
> > >
> > > +int mca_btl_vader_xpmem_init (void);
> > > +
> > > mca_mpool_base_registration_t *vader_get_registation (struct
> mca_btl_base_endpoint_t *endpoint, void *rem_ptr,
> > >   size_t size, int
> flags, void **local_ptr);
> > >
> > >
> > >
> > > -

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-509-g38d6627

2014-12-15 Thread Hjelm, Nathan Thomas
It will take about 5 mins to either fix or determine if more work is needed.



From: devel on behalf of Howard Pritchard
Sent: Monday, December 15, 2014 10:05:24 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master 
updated. dev-509-g38d6627

I'd prefer Paul's suggestion to disable xpmem for sgi/uv for 1.8.X
Is anyone actually supporting this?

Howard

2014-12-15 8:56 GMT-07:00 Nathan Hjelm <hje...@lanl.gov>:

Not yet. I am still trying to pinpoint the problem. From what I can tell
the SGI version of XPMEM should be nearly identical to the Cray
version. I should have this figured out this week. If I don't get it
fixed by Wed I will open a pull request to remove the check for
sn/xpmem.h.

-Nathan

On Fri, Dec 12, 2014 at 07:50:11PM -0800, Ralph Castain wrote:
> Nathan - does this need to come to 1.8.4? Or do you want to go with Paul’s 
> suggested fix?
>
> > On Dec 12, 2014, at 8:09 AM, git...@crest.iu.edu wrote:
> >
> > This is an automated email from the git hooks/post-receive script. It was
> > generated because a ref change was pushed to the repository containing
> > the project "open-mpi/ompi".
> >
> > The branch, master has been updated
> >   via  38d66272c51fd531181d9dc282a7260f40270f64 (commit)
> >  from  f4aecdbfd22a74feadab5566d2d595b65be4a8cb (commit)
> >
> > Those revisions listed above that are new to this repository have
> > not appeared on any other notification email; so we list those
> > revisions in full, below.
> >
> > - Log -
> > https://github.com/open-mpi/ompi/commit/38d66272c51fd531181d9dc282a7260f40270f64
> >
> > commit 38d66272c51fd531181d9dc282a7260f40270f64
> > Author: Nathan Hjelm <hje...@lanl.gov>
> > Date:   Fri Dec 12 09:09:01 2014 -0700
> >
> >btl/vader: fix compile on SGI UV
> >
> > diff --git a/opal/mca/btl/vader/btl_vader_component.c 
> > b/opal/mca/btl/vader/btl_vader_component.c
> > index 7061612..aabf03d 100644
> > --- a/opal/mca/btl/vader/btl_vader_component.c
> > +++ b/opal/mca/btl/vader/btl_vader_component.c
> > @@ -354,9 +354,8 @@ static void mca_btl_vader_check_single_copy (void)
> > #if OPAL_BTL_VADER_HAVE_XPMEM
> > if (MCA_BTL_VADER_XPMEM == 
> > mca_btl_vader_component.single_copy_mechanism) {
> > /* try to create an xpmem segment for the entire address space */
> > -mca_btl_vader_component.my_seg_id = xpmem_make (0, 
> > VADER_MAX_ADDRESS, XPMEM_PERMIT_MODE, (void *)0666);
> > -
> > -if (-1 == mca_btl_vader_component.my_seg_id) {
> > +rc = mca_btl_vader_xpmem_init ();
> > +if (OPAL_SUCCESS != rc) {
> > if (MCA_BTL_VADER_XPMEM == initial_mechanism) {
> > opal_show_help("help-btl-vader.txt", "xpmem-make-failed",
> >true, opal_process_info.nodename, errno,
> > @@ -364,11 +363,7 @@ static void mca_btl_vader_check_single_copy (void)
> > }
> >
> > mca_btl_vader_select_next_single_copy_mechanism ();
> > -} else {
> > -mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> > -mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> > }
> > -
> > }
> > #endif
> >
> > diff --git a/opal/mca/btl/vader/btl_vader_xpmem.c 
> > b/opal/mca/btl/vader/btl_vader_xpmem.c
> > index 7e362ea..4bb9a3b 100644
> > --- a/opal/mca/btl/vader/btl_vader_xpmem.c
> > +++ b/opal/mca/btl/vader/btl_vader_xpmem.c
> > @@ -19,6 +19,19 @@
> >
> > #if OPAL_BTL_VADER_HAVE_XPMEM
> >
> > +int mca_btl_vader_xpmem_init (void)
> > +{
> > +mca_btl_vader_component.my_seg_id = xpmem_make (0, VADER_MAX_ADDRESS, 
> > XPMEM_PERMIT_MODE, (void *)0666);
> > +if (-1 == mca_btl_vader_component.my_seg_id) {
> > +return OPAL_ERR_NOT_AVAILABLE;
> > +}
> > +
> > +mca_btl_vader.super.btl_get = mca_btl_vader_get_xpmem;
> > +mca_btl_vader.super.btl_put = mca_btl_vader_get_xpmem;
> > +
> > +return OPAL_SUCCESS;
> > +}
> > +
> > /* look up the remote pointer in the peer rcache and attach if
> >  * necessary */
> > mca_mpool_base_registration_t *vader_get_registation (struct 
> > mca_btl_base_endpoint_t *ep, void *rem_ptr,
> > diff --git a/opal/mca/btl/vader/btl_vader_xpmem.h 
> > b/opal/mca/btl/vader/btl_vader_xpmem.h
> > index 1be188a..e040e26 100644
> > --- a/opal/mca/btl/vader/btl_vader_xpmem.h
> > +++ b/opal/mca/btl/vader/btl_vader_xpmem.h
> > @@ -22,6 +22,7 @@
> >   #include 
> >
> >   typedef int64_t xpmem_segid_t;
> > +  typedef int64_t xpmem_apid_t;
> > #endif
> >
> > /* look up the remote pointer in the peer rcache and attach if
> > @@ -30,6 +31,8 @@
> > /* largest address we can attach to using xpmem */
> > #define VADER_MAX_ADDRESS ((uintptr_t)0x7000ul)
> >
> > +int mca_btl_vader_xpmem_init (void);
> > +
> > mca_mpool_base_registration_t *vader_get_registation (struct 
> > mca_btl_base_endpoint_t 

Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Tom Wurgler
It seems to be working in rc2 after all.

I was still trying to use a rankfile, but it appears that is no longer needed.

Thanks!



From: devel  on behalf of Ralph Castain 

Sent: Monday, December 15, 2014 8:45 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.8.4rc Status

Should be there in rc4, and I thought it made it to rc2 for that matter. I'll 
take a gander.

FWIW: I'm working off-list with IBM to tighten the LSF integration so we 
correctly read and follow their binding directives. This will also be in 1.8.4 
as we are in final test with it now.

Ralph


On Mon, Dec 15, 2014 at 5:40 AM, Tom Wurgler <twu...@goodyear.com> wrote:
Forgive me if I've missed it, but I believe using physical OR logical core 
numbering was going to be

reimplemented in the 1.8.4 series.


I've checked out rc2 and as far as I can tell, it isn't there as yet.   Is this 
correct?


thanks!



From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
Sent: Monday, December 15, 2014 8:35 AM
To: Open MPI Developers
Subject: [OMPI devel] 1.8.4rc Status

Hi folks

Trying to summarize the current situation on releasing 1.8.4. Remaining 
identified issues:

1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into it.

2. hwloc updates required. Brice committed them to the hwloc 1.7 repo. Gilles 
volunteered to create the PR from there.

3. Fortran f08 binding disable for compilers not meeting certain conditions. PR 
from Gilles awaiting review by Jeff

4. Topo signature issue reported by IBM. Ralph is waiting for more debug.

5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.

6. make check issue on SPARC. Problem and fix reported by Paul Hargrove, Ralph 
will commit

7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the 
multi-threaded C libraries, apparently need "-mt=yes" in both compile and link. 
Need someone to investigate.

Please let me know if I've missed anything.
Ralph


___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/12/16595.php


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Paul Hargrove
On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain  wrote:
>
> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
> link. Need someone to investigate.


The lack of multi-thread libraries is my SPECULATION.

The fact that configuring with LDFLAGS=-mt=yes did not help may or may not
prove anything.
I didn't see them in "mpicc -show" and so maybe they needed to be in
wrapper-ldflags instead.
My time this week is quite limited, but I can "fire and forget" tests of any
tarballs you provide.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Tom Wurgler
I have to take it back.  While the first job was less than a node's worth of 
cores and ran properly on the cores I wanted, more testing is revealing other 
problems.

Anything that spans more than one node crashes and burns, with a core dump, and 
nothing in the files to indicate why.

Note this is still rc2

More testing on-going



From: devel  on behalf of Tom Wurgler 

Sent: Monday, December 15, 2014 1:23 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.8.4rc Status


It seems to be working in rc2 after all.

I was still trying to use a rankfile, but it appears that is no longer needed.

Thanks!



From: devel  on behalf of Ralph Castain 

Sent: Monday, December 15, 2014 8:45 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] 1.8.4rc Status

Should be there in rc4, and I thought it made it to rc2 for that matter. I'll 
take a gander.

FWIW: I'm working off-list with IBM to tighten the LSF integration so we 
correctly read and follow their binding directives. This will also be in 1.8.4 
as we are in final test with it now.

Ralph


On Mon, Dec 15, 2014 at 5:40 AM, Tom Wurgler <twu...@goodyear.com> wrote:
Forgive me if I've missed it, but I believe using physical OR logical core 
numbering was going to be

reimplemented in the 1.8.4 series.


I've checked out rc2 and as far as I can tell, it isn't there as yet.   Is this 
correct?


thanks!



From: devel <devel-boun...@open-mpi.org> on behalf of Ralph Castain <r...@open-mpi.org>
Sent: Monday, December 15, 2014 8:35 AM
To: Open MPI Developers
Subject: [OMPI devel] 1.8.4rc Status

Hi folks

Trying to summarize the current situation on releasing 1.8.4. Remaining 
identified issues:

1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into it.

2. hwloc updates required. Brice committed them to the hwloc 1.7 repo. Gilles 
volunteered to create the PR from there.

3. Fortran f08 binding disable for compilers not meeting certain conditions. PR 
from Gilles awaiting review by Jeff

4. Topo signature issue reported by IBM. Ralph is waiting for more debug.

5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.

6. make check issue on SPARC. Problem and fix reported by Paul Hargrove, Ralph 
will commit

7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the 
multi-threaded C libraries, apparently need "-mt=yes" in both compile and link. 
Need someone to investigate.

Please let me know if I've missed anything.
Ralph


___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2014/12/16595.php


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Paul Hargrove
A little more reading finds that...

Docs say that one needs "-mt" without the "=yes".
That will work for both old and new compilers, where "-mt=yes" chokes older
ones.

Also, man pages say "-mt" must come before "-lpthread" in the link command.

-Paul

On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove  wrote:
>
>
> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain  wrote:
>>
>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>> link. Need someone to investigate.
>
>
> The lack of multi-thread libraries is my SPECULATION.
>
> The fact that configuring with LDFLAGS=-mt=yes did not help may or may not
> prove anything.
> I didn't see them in "mpicc -show" and so maybe they needed to be in
> wrapper-ldflags instead.
> My time this week is quite limited, but I can "fire and forget" tests of
> any tarballs you provide.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Ralph Castain
Hey Tom

Note that rc2 had a bug in the out-of-band messaging system - might be what
you are hitting. I'd suggest working with rc4.


On Mon, Dec 15, 2014 at 12:57 PM, Tom Wurgler  wrote:
>
>  I have to take it back.  While the first job was less than a node's
> worth of cores and ran properly on the cores I wanted, more testing is
> revealing other problems.
>
> Anything that spans more than one node crashes and burns, with a core
> dump, and nothing in the files to indicate why.
>
> Note this is still rc2
>
> More testing on-going
>
>
>  --
> *From:* devel  on behalf of Tom Wurgler <
> twu...@goodyear.com>
> *Sent:* Monday, December 15, 2014 1:23 PM
>
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] 1.8.4rc Status
>
>
> It seems to be working in rc2 after all.
>
> I was still trying to use a rankfile, but it appears that is no longer
> needed.
>
> Thanks!
>
>
>  --
> *From:* devel  on behalf of Ralph Castain <
> r...@open-mpi.org>
> *Sent:* Monday, December 15, 2014 8:45 AM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] 1.8.4rc Status
>
>  Should be there in rc4, and I thought it made it to rc2 for that matter.
> I'll take a gander.
>
>  FWIW: I'm working off-list with IBM to tighten the LSF integration so we
> correctly read and follow their binding directives. This will also be in
> 1.8.4 as we are in final test with it now.
>
>  Ralph
>
>
> On Mon, Dec 15, 2014 at 5:40 AM, Tom Wurgler  wrote:
>>
>>  Forgive me if I've missed it, but I believe using physical OR logical
>> core numbering was going to be
>>
>> reimplemented in the 1.8.4 series.
>>
>>
>>  I've checked out rc2 and as far as I can tell, it isn't there as yet.
>> Is this correct?
>>
>>
>>  thanks!
>>
>>
>>  --
>> *From:* devel  on behalf of Ralph Castain <
>> r...@open-mpi.org>
>> *Sent:* Monday, December 15, 2014 8:35 AM
>> *To:* Open MPI Developers
>> *Subject:* [OMPI devel] 1.8.4rc Status
>>
>>   Hi folks
>>
>>  Trying to summarize the current situation on releasing 1.8.4. Remaining
>> identified issues:
>>
>>  1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into it.
>>
>>  2. hwloc updates required. Brice committed them to the hwloc 1.7 repo.
>> Gilles volunteered to create the PR from there.
>>
>>  3. Fortran f08 binding disable for compilers not meeting certain
>> conditions. PR from Gilles awaiting review by Jeff
>>
>>  4. Topo signature issue reported by IBM. Ralph is waiting for more
>> debug.
>>
>>  5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.
>>
>>  6. make check issue on SPARC. Problem and fix reported by Paul
>> Hargrove, Ralph will commit
>>
>>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>> link. Need someone to investigate.
>>
>>  Please let me know if I've missed anything.
>> Ralph
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16595.php
>>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16604.php
>


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Paul Hargrove
I have tried with an oob_tcp_if_include setting so that there is now only 1
interface.
Even with just one interface and -mt=yes in both LDFLAGS and
wrapper-ldflags I am *still* getting messages like

[pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:pcp-j-20
  Remote host:   172.16.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.



I am getting less certain that my speculation about thread-safe libs is
correct.

-Paul

On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove  wrote:
>
> A little more reading finds that...
>
> Docs say that one needs "-mt" without the "=yes".
> That will work for both old and new compilers, where "-mt=yes" chokes
> older ones.
>
> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove 
> wrote:
>>
>>
>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain  wrote:
>>>
>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>>> link. Need someone to investigate.
>>
>>
>> The lack of multi-thread libraries is my SPECULATION.
>>
>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>> not prove anything.
>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>> wrapper-ldflags instead.
>> My time this week is quite limited, but I can "fire and forget" tests of
>> any tarballs you provide.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Gilles Gouaillardet
Paul,

could you please make sure configure added "-D_REENTRANT" to the CFLAGS?
/* otherwise, errno is a global variable instead of a per-thread variable,
which can explain some weird behaviour. note this should already have been
fixed */

assuming -D_REENTRANT is set, could you please give the attached patch a
try ?

i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the
confusing error message
e.g. failed: Error 0 (0)

FWIW, master is also affected.

Cheers,

Gilles

On 2014/12/16 10:47, Paul Hargrove wrote:
> I have tried with an oob_tcp_if_include setting so that there is now only 1
> interface.
> Even with just one interface and -mt=yes in both LDFLAGS and
> wrapper-ldflags I am *still* getting messages like
>
> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> 
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:pcp-j-20
>   Remote host:   172.16.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> 
>
>
> I am getting less certain that my speculation about thread-safe libs is
> correct.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove  wrote:
>> A little more reading finds that...
>>
>> Docs say that one needs "-mt" without the "=yes".
>> That will work for both old and new compilers, where "-mt=yes" chokes
>> older ones.
>>
>> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove 
>> wrote:
>>>
>>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain  wrote:
>>>> 7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>>>> link. Need someone to investigate.
>>>
>>> The lack of multi-thread libraries is my SPECULATION.
>>>
>>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>>> not prove anything.
>>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>>> wrapper-ldflags instead.
>>> My time this week is quite limited, but I can "fire and forget" tests of
>>> any tarballs you provide.
>>>
>>> -Paul
>>>
>>> --
>>> Paul H. Hargrove  phhargr...@lbl.gov
>>> Computer Languages & Systems Software (CLaSS) Group
>>> Computer Science Department   Tel: +1-510-495-2352
>>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php

diff --git a/orte/mca/oob/tcp/oob_tcp_listener.c b/orte/mca/oob/tcp/oob_tcp_listener.c
index b6d2ad8..87ff08d 100644
--- a/orte/mca/oob/tcp/oob_tcp_listener.c
+++ b/orte/mca/oob/tcp/oob_tcp_listener.c
@@ -14,6 +14,8 @@
  * Copyright (c) 2009-2014 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011  Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -729,7 +731,6 @@ static void* listen_thread(opal_object_t *obj)
 if (pending_connection->fd < 0) {
 if (opal_socket_errno != EAGAIN || 
 opal_socket_errno != EWOULDBLOCK) {
-CLOSE_THE_SOCKET(pending_connection->fd);
 if (EMFILE == opal_socket_errno) {
 ORTE_ERROR_LOG(ORTE_ERR_SYS_LIMITS_SOCKETS);
 orte_show_help("help-orterun.txt", "orterun:sys-limit-sockets", true);
@@ -737,6 +738,7 @@ static void* listen_thread(opal_object_t *obj)
 opal_output(0, "mca_oob_tcp_accept: accept() failed: %s (%d).",
 strerror(opal_socket_errno), opal_socket_errno);
 }
+CLOSE_THE_SOCKET(pending_connection->fd);
 OBJ_RELEASE(pending_connection);
 goto done;
 }
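
For readers following the thread, here is a minimal sketch of the ordering problem the patch above addresses: if cleanup runs between a failed call and the error report, the report may show whatever errno the cleanup left behind. This is not Open MPI code. The helper names (cleanup_that_clobbers_errno, accept_and_report) are hypothetical, and the cleanup helper only simulates the side effect Gilles suspects of CLOSE_THE_SOCKET by setting errno to 0 explicitly.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for a cleanup macro such as CLOSE_THE_SOCKET.
 * It only models "cleanup that may overwrite errno". */
static void cleanup_that_clobbers_errno(int fd)
{
    if (fd >= 0) {
        close(fd);
    }
    errno = 0;  /* simulate the suspected side effect (an assumption) */
}

/* Calls accept() on listen_fd. On failure, runs the cleanup and returns the
 * errno value that ends up in the error message. With save_errno_first != 0
 * the value captured right after accept() is reported; otherwise errno is
 * read only after the cleanup has run. Returns 0 if accept() succeeded. */
int accept_and_report(int listen_fd, int save_errno_first)
{
    int fd = accept(listen_fd, NULL, NULL);
    if (fd < 0) {
        int saved = errno;  /* capture immediately after the failing call */
        cleanup_that_clobbers_errno(fd);
        int reported = save_errno_first ? saved : errno;
        fprintf(stderr, "accept() failed: %s (%d)\n",
                strerror(reported), reported);
        return reported;
    }
    close(fd);
    return 0;
}
```

Calling accept_and_report(-1, 0) reports error code 0 even though accept() failed with EBADF on the invalid descriptor, while accept_and_report(-1, 1) reports EBADF. The patch gets the same effect by moving the cleanup after the message is printed rather than by saving errno in a local variable.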


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Ralph Castain
My correction - the fix is in the nightly tarball from tonight. You can get
it here:

openmpi-v1.8.3-272-g4e4f997.tar.bz2




On Mon, Dec 15, 2014 at 2:40 PM, Ralph Castain  wrote:
>
> Hey Tom
>
> Note that rc2 had a bug in the out-of-band messaging system - might be
> what you are hitting. I'd suggest working with rc4.
>
>
> On Mon, Dec 15, 2014 at 12:57 PM, Tom Wurgler  wrote:
>
>>  I have to take it back.  While the first job was less than a node's
>> worth of cores and ran properly on the cores I wanted, more testing is
>> revealing other problems.
>>
>> Anything that spans more than one node crashes and burns, with a core
>> dump, and nothing in the files to indicate why.
>>
>> Note this is still rc2
>>
>> More testing on-going
>>
>>
>>  --
>> *From:* devel  on behalf of Tom Wurgler <
>> twu...@goodyear.com>
>> *Sent:* Monday, December 15, 2014 1:23 PM
>>
>> *To:* Open MPI Developers
>> *Subject:* Re: [OMPI devel] 1.8.4rc Status
>>
>>
>> It seems to be working in rc2 after all.
>>
>> I was still trying to use a rankfile, but it appears that is no longer
>> needed.
>>
>> Thanks!
>>
>>
>>  --
>> *From:* devel  on behalf of Ralph Castain <
>> r...@open-mpi.org>
>> *Sent:* Monday, December 15, 2014 8:45 AM
>> *To:* Open MPI Developers
>> *Subject:* Re: [OMPI devel] 1.8.4rc Status
>>
>>  Should be there in rc4, and I thought it made it to rc2 for that
>> matter. I'll take a gander.
>>
>>  FWIW: I'm working off-list with IBM to tighten the LSF integration so
>> we correctly read and follow their binding directives. This will also be in
>> 1.8.4 as we are in final test with it now.
>>
>>  Ralph
>>
>>
>> On Mon, Dec 15, 2014 at 5:40 AM, Tom Wurgler 
>> wrote:
>>>
>>>  Forgive me if I've missed it, but I believe using physical OR logical
>>> core numbering was going to be
>>>
>>> reimplemented in the 1.8.4 series.
>>>
>>>
>>>  I've checked out rc2 and as far as I can tell, it isn't there as yet.
>>>   Is this correct?
>>>
>>>
>>>  thanks!
>>>
>>>
>>>  --
>>> *From:* devel  on behalf of Ralph Castain <
>>> r...@open-mpi.org>
>>> *Sent:* Monday, December 15, 2014 8:35 AM
>>> *To:* Open MPI Developers
>>> *Subject:* [OMPI devel] 1.8.4rc Status
>>>
>>>   Hi folks
>>>
>>>  Trying to summarize the current situation on releasing 1.8.4.
>>> Remaining identified issues:
>>>
>>>  1. TCP/BTL hang under mpi-thread-multiple. Asked George to look into
>>> it.
>>>
>>>  2. hwloc updates required. Brice committed them to the hwloc 1.7 repo.
>>> Gilles volunteered to create the PR from there.
>>>
>>>  3. Fortran f08 binding disable for compilers not meeting certain
>>> conditions. PR from Gilles awaiting review by Jeff
>>>
>>>  4. Topo signature issue reported by IBM. Ralph is waiting for more
>>> debug.
>>>
>>>  5. MPI/IO issue reported by Eric Chamberland. Gilles investigating.
>>>
>>>  6. make check issue on SPARC. Problem and fix reported by Paul
>>> Hargrove, Ralph will commit
>>>
>>>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>>> link. Need someone to investigate.
>>>
>>>  Please let me know if I've missed anything.
>>> Ralph
>>>
>>>
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/12/16595.php
>>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16604.php
>>
>


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Paul Hargrove
Gilles,

I will try the patch when I can.
However, our network is undergoing network maintenance right now, leaving
me unable to reach the necessary hosts.

As for -D_REENTRANT, I had already reported having verified in the "make"
output that it had been added automatically.

Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
preprocessor.
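
One way to double-check this independently of the "make" output is to compile a tiny probe with the same compiler flags and see whether the macro is visible. The following is only a sketch; the function names are illustrative and not part of Open MPI or the Studio toolchain.

```c
#include <stdio.h>

/* Returns 1 if _REENTRANT was defined when this file was compiled, else 0. */
int reentrant_defined(void)
{
#ifdef _REENTRANT
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line report; the name is illustrative only. */
void report_reentrant(void)
{
    printf("_REENTRANT is %s\n",
           reentrant_defined() ? "defined" : "NOT defined");
}
```

Building this once with and once without the flag under test (for example "-mt" or "-D_REENTRANT") and calling report_reentrant() from main shows whether that flag actually reaches the preprocessor.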

-Paul

On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
>
>  Paul,
>
> could you please make sure configure added "-D_REENTRANT" to the CFLAGS?
> /* otherwise, errno is a global variable instead of a per-thread variable,
> which can explain some weird behaviour. note this should already have been
> fixed */
>
> assuming -D_REENTRANT is set, could you please give the attached patch a
> try ?
>
> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
> error message
> e.g. failed: Error 0 (0)
>
> FWIW, master is also affected.
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/16 10:47, Paul Hargrove wrote:
>
> I have tried with an oob_tcp_if_include setting so that there is now only 1
> interface.
> Even with just one interface and -mt=yes in both LDFLAGS and
> wrapper-ldflags I am *still* getting messages like
>
> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> 
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:pcp-j-20
>   Remote host:   172.16.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> 
>
>
> I am getting less certain that my speculation about thread-safe libs is
> correct.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove  
>  wrote:
>
>  A little more reading finds that...
>
> Docs say that one needs "-mt" without the "=yes".
> That will work for both old and new compilers, where "-mt=yes" chokes
> older ones.
>
> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove  
> 
> wrote:
>
>
> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain  
>  wrote:
>
>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
> link. Need someone to investigate.
>
>
> The lack of multi-thread libraries is my SPECULATION.
>
> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
> not prove anything.
> I didn't see them in "mpicc -show" and so maybe they needed to be in
> wrapper-ldflags instead.
> My time this week is quite limited, but I can "fire and forget" tests of
> any tarballs you provide.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/12/16608.php
>


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Gilles Gouaillardet
Paul,

did you manually set -mt ?

if i remember correctly, solaris 11 (at least with gcc compilers) does not
need any flags
(except the -D_REENTRANT that is added automatically)

Cheers,

Gilles

On 2014/12/16 12:10, Paul Hargrove wrote:
> Gilles,
>
> I will try the patch when I can.
> However, our network is undergoing network maintenance right now, leaving
> me unable to reach the necessary hosts.
>
> As for -D_REENTRANT, I had already reported having verified in the "make"
> output that it had been added automatically.
>
> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
> preprocessor.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>>  Paul,
>>
>> could you please make sure configure added "-D_REENTRANT" to the CFLAGS?
>> /* otherwise, errno is a global variable instead of a per-thread variable,
>> which can explain some weird behaviour. note this should already have been
>> fixed */
>>
>> assuming -D_REENTRANT is set, could you please give the attached patch a
>> try ?
>>
>> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
>> error message
>> e.g. failed: Error 0 (0)
>>
>> FWIW, master is also affected.
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On 2014/12/16 10:47, Paul Hargrove wrote:
>>
>> I have tried with an oob_tcp_if_include setting so that there is now only 1
>> interface.
>> Even with just one interface and -mt=yes in both LDFLAGS and
>> wrapper-ldflags I am *still* getting messages like
>>
>> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
>> 
>> A process or daemon was unable to complete a TCP connection
>> to another process:
>>   Local host:pcp-j-20
>>   Remote host:   172.16.0.120
>> This is usually caused by a firewall on the remote host. Please
>> check that any firewall (e.g., iptables) has been disabled and
>> try again.
>> 
>>
>>
>> I am getting less certain that my speculation about thread-safe libs is
>> correct.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove  
>>  wrote:
>>
>>  A little more reading finds that...
>>
>> Docs say that one needs "-mt" without the "=yes".
>> That will work for both old and new compilers, where "-mt=yes" chokes
>> older ones.
>>
>> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>>
>> -Paul
>>
>> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove  
>> 
>> wrote:
>>
>>
>> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain  
>>  wrote:
>>
>>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
>> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
>> link. Need someone to investigate.
>>
>>
>> The lack of multi-thread libraries is my SPECULATION.
>>
>> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
>> not prove anything.
>> I didn't see them in "mpicc -show" and so maybe they needed to be in
>> wrapper-ldflags instead.
>> My time this week is quite limited, but I can "fire and forget" tests of
>> any tarballs you provide.
>>
>> -Paul
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Computer Languages & Systems Software (CLaSS) Group
>> Computer Science Department   Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/12/16607.php
>>
>>
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/12/16608.php
>>
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16610.php



Re: [OMPI devel] 1.8.4rc Status

2014-12-15 Thread Paul Hargrove
Gilles,

I am NOT seeing the problem with gcc.
It is only occurring with the Studio compilers.

As I've already reported, I have tried adding either "-mt" or "-mt=yes" to
both LDFLAGS and --with-wrapper-ldflags.

The "cc" manpage (on the Solaris-10 system I can get to right now) says:

 -mt  Compile and link for multithreaded code.

  This option passes -D_REENTRANT to the preprocessor and
  passes -lthread in the correct order to ld.

  The -mt option is required if the application or
  libraries are multithreaded.

  To ensure proper library linking order, you must use
  this option, rather than -lthread, to link with
  libthread.

  If you are using POSIX threads, you must link with the
  options -mt -lpthread.  The -mt option is necessary
  because libC and libCrun need libthread for a
  multithreaded application.

  If you compile and link in separate steps and you
  compile with -mt, you might get unexpected results. If you
  compile one translation unit with -mt, compile all
  units of the program with -mt.

I cannot connect to my Solaris-11 system right now, but I recall the text
to be quite similar.

-Paul

On Mon, Dec 15, 2014 at 7:12 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:
>
>  Paul,
>
> did you manually set -mt ?
>
> if I remember correctly, Solaris 11 (at least with gcc compilers) does not
> need any flags
> (except the -D_REENTRANT that is added automatically)
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/16 12:10, Paul Hargrove wrote:
>
> Gilles,
>
> I will try the patch when I can.
> However, our network is undergoing maintenance right now, leaving
> me unable to reach the necessary hosts.
>
> As for -D_REENTRANT, I had already reported having verified in the "make"
> output that it had been added automatically.
>
> Additionally, the docs say that "-mt" *also* passes -D_REENTRANT to the
> preprocessor.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 6:07 PM, Gilles Gouaillardet wrote:
>
>
>  Paul,
>
> could you please make sure configure added "-D_REENTRANT" to the CFLAGS?
> /* otherwise, errno is a global variable instead of a per-thread variable,
> which can explain some weird behaviour. Note this should have been fixed
> already */
>
> assuming -D_REENTRANT is set, could you please give the attached patch a
> try ?
>
> i suspect the CLOSE_THE_SOCKET macro resets errno, and hence the confusing
> error message
> e.g. failed: Error 0 (0)
>
> FWIW, master is also affected.
>
> Cheers,
>
> Gilles
>
>
> On 2014/12/16 10:47, Paul Hargrove wrote:
>
> I have tried with an oob_tcp_if_include setting so that there is now only 1
> interface.
> Even with just one interface and -mt=yes in both LDFLAGS and
> wrapper-ldflags I am *still* getting messages like
>
> [pcp-j-20:11470] mca_oob_tcp_accept: accept() failed: Error 0 (0).
> 
> A process or daemon was unable to complete a TCP connection
> to another process:
>   Local host:pcp-j-20
>   Remote host:   172.16.0.120
> This is usually caused by a firewall on the remote host. Please
> check that any firewall (e.g., iptables) has been disabled and
> try again.
> --
> --
>
>
> I am getting less certain that my speculation about thread-safe libs is
> correct.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 1:24 PM, Paul Hargrove wrote:
>
>  A little more reading finds that...
>
> Docs say that one needs "-mt" without the "=yes".
> That will work for both old and new compilers, where "-mt=yes" chokes
> older ones.
>
> Also, man pages say "-mt" must come before "-lpthread" in the link command.
>
> -Paul
>
> On Mon, Dec 15, 2014 at 12:52 PM, Paul Hargrove wrote:
>
>
> On Mon, Dec 15, 2014 at 5:35 AM, Ralph Castain wrote:
>
>  7. Linkage issue on Solaris-11 reported by Paul Hargrove. Missing the
> multi-threaded C libraries, apparently need "-mt=yes" in both compile and
> link. Need someone to investigate.
>
>
> The lack of multi-thread libraries is my SPECULATION.
>
> The fact that configuring with LDFLAGS=-mt=yes did not help may or may
> not prove anything.
> I didn't see them in "mpicc -show" and so maybe they needed to be in
> wrapper-ldflags instead.
> My time this week is quite limited, but I can "fire and forget" tests of
> any tarballs you provide.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
>
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
>
>
>
> ___