On Sat, 16 May 2015 02:59:35 PM Paul Hargrove wrote:
> I didn't find OpenBSD or Solaris docs ("grep -rl TCP_KEEP /usr/share/man"
> didn't find any matches).
This seems to document it for an unspecified version of Solaris:
http://docs.oracle.com/cd/E19120-01/open.solaris/819-2724/fsvdg/index.html
AIX, Solaris and {Free,Open,Net}BSD results are also not consistent with
regard to the units used for reporting:
AIX$ no -o tcp_keepidle -o tcp_keepintvl
tcp_keepidle = 14400
tcp_keepintvl = 150
{phargrov@solaris11-amd64 ~}$ ndd -get /dev/tcp tcp_keepalive_interval
720
[phargrov@freebsd10-amd64
On Sat, 16 May 2015 12:49:51 PM Jeff Squyres wrote:
> Linux / RHEL 6.5 / 2.6.32 kernel (this is clearly in seconds):
>
> $ sysctl net.ipv4.tcp_keepalive_time
> net.ipv4.tcp_keepalive_time = 1800
I suspect that's a local customisation; all Linux systems I've got access to
(including RHEL 6.4/6.5/
I looked at this in a bit more detail this morning.
SHORT VERSION
-
I think that the real issue is that we shouldn't be setting KEEPALIVE on the
listening sockets (we should only be setting these values on accepted/connected
sockets).
I submitted a PR for this: https://github.com/o
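
As a rough illustration of the point above (this is not the code from the PR), keepalive can be enabled only once a socket has actually been accepted, leaving the listening socket untouched. The helper names enable_keepalive and accept_with_keepalive are hypothetical; only the standard accept(), setsockopt() and SO_KEEPALIVE are assumed.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: turn on SO_KEEPALIVE for an accepted or connected
 * socket. Returns 0 on success, -1 on failure (errno set by setsockopt). */
int enable_keepalive(int connected_fd)
{
    int on = 1;
    assert(connected_fd >= 0);
    return setsockopt(connected_fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Hypothetical helper: accept one connection, then enable keepalive on the
 * new descriptor only. The listening socket is never modified.
 * Returns the connected descriptor, or -1 on failure. */
int accept_with_keepalive(int listen_fd)
{
    int fd = accept(listen_fd, NULL, NULL);
    if (fd < 0) {
        return -1;              /* caller inspects errno (e.g. EAGAIN) */
    }
    if (enable_keepalive(fd) != 0) {
        close(fd);              /* do not leak the descriptor on failure */
        return -1;
    }
    return fd;
}
```

Platform-specific tuning of the idle time and probe interval is deliberately left out of this sketch, since the option names and units differ between systems.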
Good catch.
I'd vote for the same behavior on OS X even if it's somewhat unnecessary. I.e.,
use keep alive, but do 1000x the value.
Sent from my phone. No type good.
On May 15, 2015, at 5:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
> Did some more digging, and it turns out that Linux
Did some more digging, and it turns out that Linux specifies the keep alive
time interval in seconds - and Mac (for some strange reason) uses
milliseconds. Hence the difference in behavior.
So I could replace the current commit with one that multiplies the keep
alive interval by 1000x if we are on
In the worst case, i.e. no other solution is possible, OS X can be
identified by the existence of the macro __APPLE__. There is no need to
have OPAL_HAVE_MAC.
George.
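
A minimal sketch of the __APPLE__ approach, assuming (as reported earlier in this thread, and not verified here) that OS X takes the keepalive interval in milliseconds while Linux takes seconds. The macro KEEPALIVE_UNITS_PER_SECOND and the function keepalive_in_platform_units are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Assumption from the thread, not verified here: OS X expects milliseconds,
 * other platforms expect seconds. __APPLE__ is predefined by the compiler
 * on OS X, so no configure-time macro is needed. */
#if defined(__APPLE__)
#define KEEPALIVE_UNITS_PER_SECOND 1000
#else
#define KEEPALIVE_UNITS_PER_SECOND 1
#endif

/* Hypothetical helper: convert a keepalive interval given in seconds into
 * the unit assumed for the current platform. */
int keepalive_in_platform_units(int seconds)
{
    assert(seconds >= 0);
    return seconds * KEEPALIVE_UNITS_PER_SECOND;
}
```

Keeping the conversion in one place means the 1000x factor can be dropped or changed without touching the callers if the unit assumption turns out to be wrong.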
On Thu, May 14, 2015 at 11:12 PM, Ralph Castain wrote:
> Interesting - as I said, I'll take a look. In either case, the keep a
Interesting - as I said, I'll take a look. In either case, the keep alive
on the Mac is unnecessary as it is always a standalone scenario - no value
in running it. So the "fix" does no harm and just saves some useless
overhead.
On Thu, May 14, 2015 at 9:00 PM, George Bosilca wrote:
> I'm sorry
I'm sorry, Ralph, but what you proposed is not really a fix. My comment is based
on a real execution of exactly the command you provided with lldb attached
to the process. What I see is millions of
OBJ_NEW(mca_oob_tcp_pending_connection_t)
because the EAGAIN is not correctly handled.
George.
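
To make the EAGAIN point concrete, here is a hedged sketch of an accept loop on a non-blocking listening socket that allocates its bookkeeping object only after accept() succeeds and stops when the queue is empty. It is not the Open MPI code: struct pending_conn stands in for mca_oob_tcp_pending_connection_t, and classify_accept_errno and drain_accept_queue are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for the pending-connection object. */
struct pending_conn { int fd; };

enum accept_action { ACCEPT_RETRY, ACCEPT_QUEUE_EMPTY, ACCEPT_FAILED };

/* Decide what to do after accept() returned -1 with the given errno. */
enum accept_action classify_accept_errno(int err)
{
    if (err == EINTR) {
        return ACCEPT_RETRY;            /* interrupted: call accept() again */
    }
    if (err == EAGAIN || err == EWOULDBLOCK) {
        return ACCEPT_QUEUE_EMPTY;      /* nothing pending: stop quietly */
    }
    return ACCEPT_FAILED;               /* genuine error */
}

/* Accept every pending connection on a non-blocking listening socket.
 * 'handle' takes ownership of each allocated object. Returns the number
 * of connections accepted, or -1 on a genuine error. */
int drain_accept_queue(int listen_fd, void (*handle)(struct pending_conn *))
{
    int accepted = 0;
    assert(listen_fd >= 0);
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd < 0) {
            enum accept_action action = classify_accept_errno(errno);
            if (action == ACCEPT_RETRY) continue;
            return action == ACCEPT_QUEUE_EMPTY ? accepted : -1;
        }
        /* Allocate only once a connection really exists, so an empty
         * queue never produces an orphaned object. */
        struct pending_conn *pc = malloc(sizeof(*pc));
        if (pc == NULL) {
            close(fd);
            return -1;
        }
        pc->fd = fd;
        handle(pc);
        ++accepted;
    }
}
```

The design choice here is ordering: calling accept() before allocating means the EAGAIN path has nothing to clean up, which avoids the unbounded allocation described above.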
On Thu,
Yes - this is the fix for that issue
On Thu, May 14, 2015 at 8:54 PM, Howard Pritchard
wrote:
> Is this by any chance associated with issue 579?
>
>
> 2015-05-14 20:49 GMT-06:00 Ralph Castain :
>
>> I'll look at the lines you cite, but that clearly isn't the problem we
>> are seeing here. I can
Is this by any chance associated with issue 579?
2015-05-14 20:49 GMT-06:00 Ralph Castain :
> I'll look at the lines you cite, but that clearly isn't the problem we are
> seeing here. I can verify that because the test case:
>
> mpirun -n 1 sleep 1000
>
> does not open up any connections at all.
I'll look at the lines you cite, but that clearly isn't the problem we are
seeing here. I can verify that because the test case:
mpirun -n 1 sleep 1000
does not open up any connections at all. Thus, the use-case you describe
never occurs - yet we still blow up in memory. If I simply tell the OOB
Ralph,
The code pushed in g8e30579 is clearly not the right solution.
The problem starts in oob_tcp_listener.c line 742. A new
mca_oob_tcp_pending_connection_t object is allocated to store the incoming
connection. The accept a few lines below fails with an error code of 0x23
which means "resource t