Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-16 Thread Chris Samuel
On Sat, 16 May 2015 02:59:35 PM Paul Hargrove wrote:
> I didn't find OpenBSD or Solaris docs ("grep -rl TCP_KEEP /usr/share/man"
> didn't find any matches).
This seems to document it for an unspecified version of Solaris: http://docs.oracle.com/cd/E19120-01/open.solaris/819-2724/fsvdg/index.html

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-16 Thread Paul Hargrove
AIX, Solaris and {Free,Open,Net}BSD results are also not consistent with regard to the units used for reporting:
AIX$ no -o tcp_keepidle -o tcp_keepintvl
tcp_keepidle = 14400
tcp_keepintvl = 150
{phargrov@solaris11-amd64 ~}$ ndd -get /dev/tcp tcp_keepalive_interval
720
[phargrov@freebsd10-amd64

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-16 Thread Chris Samuel
On Sat, 16 May 2015 12:49:51 PM Jeff Squyres wrote:
> Linux / RHEL 6.5 / 2.6.32 kernel (this is clearly in seconds):
>
> $ sysctl net.ipv4.tcp_keepalive_time
> net.ipv4.tcp_keepalive_time = 1800
I suspect that's a local customisation; all Linux systems I've got access to (including RHEL 6.4/6.5/

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-16 Thread Jeff Squyres (jsquyres)
I looked at this in a bit more detail this morning. SHORT VERSION - I think that the real issue is that we shouldn't be setting KEEPALIVE on the listening sockets (we should only be setting these values on accepted/connected sockets). I submitted a PR for this: https://github.com/o
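The distinction drawn in this message (keepalive belongs on accepted/connected sockets, not on the listening socket) can be sketched roughly as follows. This is only an illustration using the standard setsockopt() call; enable_keepalive is a hypothetical helper name and is not taken from the Open MPI source or from the PR mentioned above.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not an Open MPI function): turn on TCP keepalive
 * for one socket. The intent is to call this on the descriptor returned
 * by accept() or used with connect(), and never on the listening socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int enable_keepalive(int sd)
{
    int on = 1;
    return setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}
```

A caller would invoke enable_keepalive(sd) right after a successful accept(), leaving the listening descriptor untouched.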

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-15 Thread Jeff Squyres (jsquyres)
Good catch. I'd vote for the same behavior on OS X even if it's somewhat unnecessary. I.e., use keep alive, but do 1000x the value. Sent from my phone. No type good. On May 15, 2015, at 5:42 AM, Ralph Castain <r...@open-mpi.org> wrote: Did some more digging, and it turns out that Linux

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-15 Thread Ralph Castain
Did some more digging, and it turns out that Linux specifies the keep alive time interval in seconds - and Mac (for some strange reason) uses milliseconds. Hence the difference in behavior. So I could replace the current commit with one that multiplies the keep alive interval by 1000 if we are on

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-15 Thread George Bosilca
In the worst case, i.e. no other solution is possible, OS X can be identified by the existence of the macro __APPLE__. There is no need to have OPAL_HAVE_MAC.
George.
On Thu, May 14, 2015 at 11:12 PM, Ralph Castain wrote:
> Interesting - as I said, I'll take a look. In either case, the keep a
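A minimal sketch of the __APPLE__ check suggested here, combined with the 1000x scaling proposed elsewhere in this thread. The function name keepalive_time_for_platform is hypothetical, and the assumption that OS X wants milliseconds is only what the thread reports; other messages in the thread show that the units vary between systems, so each platform's documentation should be checked before relying on this.

```c
/* Hypothetical helper (not an Open MPI function): convert a keepalive time
 * given in seconds into the unit this thread reports for the platform.
 * ASSUMPTION from the thread: OS X expects milliseconds, Linux seconds. */
long keepalive_time_for_platform(long seconds)
{
#if defined(__APPLE__)
    return seconds * 1000L;   /* milliseconds, as reported in this thread */
#else
    return seconds;           /* seconds */
#endif
}
```

Keying on the compiler-provided __APPLE__ macro avoids adding a separate configure-time symbol such as OPAL_HAVE_MAC.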

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-14 Thread Ralph Castain
Interesting - as I said, I'll take a look. In either case, the keep alive on the Mac is unnecessary as it is always a standalone scenario - no value in running it. So the "fix" does no harm and just saves some useless overhead.
On Thu, May 14, 2015 at 9:00 PM, George Bosilca wrote:
> I'm sorry

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-14 Thread George Bosilca
I'm sorry Ralph, but what you proposed is not really a fix. My comment is based on a real execution of exactly the command you provided with lldb attached to the process. What I see is millions of OBJ_NEW(mca_oob_tcp_pending_connection_t) because the EAGAIN is not correctly handled.
George.
On Thu,

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-14 Thread Ralph Castain
Yes - this is the fix for that issue
On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard wrote:
> Is this by any chance associated with issue 579?
>
> 2015-05-14 20:49 GMT-06:00 Ralph Castain:
>> I'll look at the lines you cite, but that clearly isn't the problem we
>> are seeing here. I can

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-14 Thread Howard Pritchard
Is this by any chance associated with issue 579?
2015-05-14 20:49 GMT-06:00 Ralph Castain:
> I'll look at the lines you cite, but that clearly isn't the problem we are
> seeing here. I can verify that because the test case:
>
> mpirun -n 1 sleep 1000
>
> does not open up any connections at all.

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-14 Thread Ralph Castain
I'll look at the lines you cite, but that clearly isn't the problem we are seeing here. I can verify that because the test case:
    mpirun -n 1 sleep 1000
does not open up any connections at all. Thus, the use-case you describe never occurs - yet we still blow up in memory. If I simply tell the OOB

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-1731-g8e30579

2015-05-14 Thread George Bosilca
Ralph, The code pushed in g8e30579 is clearly not the right solution. The problem starts in oob_tcp_listener.c line 742. A new mca_oob_tcp_pending_connection_t object is allocated to store the incoming connection. The accept a few lines below fails with an error code of 0x23 which means "resource t