Is this by any chance associated with issue 579?

2015-05-14 20:49 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

> I'll look at the lines you cite, but that clearly isn't the problem we are
> seeing here. I can verify that because the test case:
>
> mpirun -n 1 sleep 1000
>
> does not open up any connections at all. Thus, the use-case you describe
> never occurs - yet we still blow up in memory. If I simply tell the OOB not
> to set keep alive, the problem goes away.
>
> It only happens on Mac, and we never see Mac-based clusters, so turning
> off keep alive on the Mac seems a pretty simple solution.
>
>
> On Thu, May 14, 2015 at 8:43 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
>> Ralph,
>>
>> The code pushed in g8e30579 is clearly not the right solution.
>>
>> The problem starts in oob_tcp_listener.c line 742. A new
>> mca_oob_tcp_pending_connection_t object is allocated to store the incoming
>> connection. The accept a few lines below fails with an error code of 0x23,
>> which means "resource temporarily unavailable" on OS X (i.e. EAGAIN). Thus,
>> the if at line 750 is skipped, and we reach line 763 (a "continue") with 1)
>> a connection not accepted, and 2) an allocated object not released. Voila!
>>
>> Freeing the pending_connection object is not the right approach either,
>> as it will only remove the memory leak but the process will become a CPU
>> hog.
>>
>>   Thanks,
>>     George.
>>
>>
>>
>>
>> On Thu, May 14, 2015 at 8:10 PM, <git...@crest.iu.edu> wrote:
>>
>>> This is an automated email from the git hooks/post-receive script. It was
>>> generated because a ref change was pushed to the repository containing
>>> the project "open-mpi/ompi".
>>>
>>> The branch, master has been updated
>>>        via  8e30579e6efab580cf9cf1bec8f8df1376b7e9ef (commit)
>>>       from  1488e82efd1d09c30ba46dfa00b89e623623272f (commit)
>>>
>>> Those revisions listed above that are new to this repository have
>>> not appeared on any other notification email; so we list those
>>> revisions in full, below.
>>>
>>> - Log -----------------------------------------------------------------
>>>
>>> https://github.com/open-mpi/ompi/commit/8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>>
>>> commit 8e30579e6efab580cf9cf1bec8f8df1376b7e9ef
>>> Author: Ralph Castain <r...@open-mpi.org>
>>> Date:   Thu May 14 18:09:13 2015 -0600
>>>
>>>     The Mac appears to have problems with the keepalive support - once
>>> keepalive starts, the memory footprint soars. So disable keepalive on the
>>> Mac
>>>
>>> diff --git a/config/opal_check_os_flavors.m4
>>> b/config/opal_check_os_flavors.m4
>>> index d1d124d..4939560 100644
>>> --- a/config/opal_check_os_flavors.m4
>>> +++ b/config/opal_check_os_flavors.m4
>>> @@ -57,6 +57,12 @@ AC_DEFUN([OPAL_CHECK_OS_FLAVORS],
>>>                         [$opal_have_solaris],
>>>                         [Whether or not we have solaris])
>>>
>>> +    AS_IF([test "$opal_found_apple" = "yes"],
>>> +          [opal_have_mac=1], [opal_have_mac=0])
>>> +    AC_DEFINE_UNQUOTED([OPAL_HAVE_MAC],
>>> +                       [$opal_have_mac],
>>> +                       [Whether or not we are on a Mac])
>>> +
>>>      # check for sockaddr_in (a good sign we have TCP)
>>>      AC_CHECK_HEADERS([netdb.h netinet/in.h netinet/tcp.h])
>>>      AC_CHECK_TYPES([struct sockaddr_in],
>>> diff --git a/orte/mca/oob/tcp/oob_tcp_common.c
>>> b/orte/mca/oob/tcp/oob_tcp_common.c
>>> index a768472..e3decf2 100644
>>> --- a/orte/mca/oob/tcp/oob_tcp_common.c
>>> +++ b/orte/mca/oob/tcp/oob_tcp_common.c
>>> @@ -72,7 +72,7 @@
>>>  /**
>>>   * Set socket buffering
>>>   */
>>> -
>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>  static void set_keepalive(int sd)
>>>  {
>>>      int option;
>>> @@ -146,6 +146,7 @@ static void set_keepalive(int sd)
>>>      }
>>>  #endif  // TCP_KEEPCNT
>>>  }
>>> +#endif //SO_KEEPALIVE
>>>
>>>  void orte_oob_tcp_set_socket_options(int sd)
>>>  {
>>> @@ -181,7 +182,7 @@ void orte_oob_tcp_set_socket_options(int sd)
>>>                              opal_socket_errno);
>>>      }
>>>  #endif
>>> -#if defined(SO_KEEPALIVE)
>>> +#if defined(SO_KEEPALIVE) && !OPAL_HAVE_MAC
>>>      if (0 < mca_oob_tcp_component.keepalive_time) {
>>>          set_keepalive(sd);
>>>      }
>>> diff --git a/orte/mca/oob/tcp/oob_tcp_component.c
>>> b/orte/mca/oob/tcp/oob_tcp_component.c
>>> index dd1af2a..372ed4c 100644
>>> --- a/orte/mca/oob/tcp/oob_tcp_component.c
>>> +++ b/orte/mca/oob/tcp/oob_tcp_component.c
>>> @@ -404,7 +404,7 @@ static int tcp_component_register(void)
>>>
>>>  &mca_oob_tcp_component.disable_ipv6_family);
>>>  #endif
>>>
>>> -
>>> +#if !OPAL_HAVE_MAC
>>>      mca_oob_tcp_component.keepalive_time = 10;
>>>      (void)mca_base_component_var_register(component, "keepalive_time",
>>>                                            "Idle time in seconds before
>>> starting to send keepalives (num <= 0 ----> disable keepalive)",
>>> @@ -427,7 +427,8 @@ static int tcp_component_register(void)
>>>                                            OPAL_INFO_LVL_9,
>>>                                            MCA_BASE_VAR_SCOPE_READONLY,
>>>
>>>  &mca_oob_tcp_component.keepalive_probes);
>>> -
>>> +#endif
>>> +
>>>      mca_oob_tcp_component.retry_delay = 0;
>>>      (void)mca_base_component_var_register(component, "retry_delay",
>>>                                            "Time (in sec) to wait before
>>> trying to connect to peer again",
>>>
>>>
>>> -----------------------------------------------------------------------
>>>
>>> Summary of changes:
>>>  config/opal_check_os_flavors.m4      | 6 ++++++
>>>  orte/mca/oob/tcp/oob_tcp_common.c    | 5 +++--
>>>  orte/mca/oob/tcp/oob_tcp_component.c | 5 +++--
>>>  3 files changed, 12 insertions(+), 4 deletions(-)
>>>
>>>
>>> hooks/post-receive
>>> --
>>> open-mpi/ompi
>>> _______________________________________________
>>> ompi-commits mailing list
>>> ompi-comm...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/ompi-commits
>>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/05/17401.php
>>
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/05/17402.php
>
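
The failure mode George describes in the thread above (an object allocated
before accept(), accept() failing with EAGAIN, and a "continue" that skips
the release) can be sketched in isolation. This is not the Open MPI code and
not the fix that was eventually applied; pending_conn_t, pending_new,
pending_free, accept_once, wait_readable and the live_objects counter are
hypothetical names used only for illustration. The sketch releases the
object on a failed accept() and, to avoid the busy loop George warns about,
assumes the caller waits in poll() until the listening socket is readable.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical stand-in for mca_oob_tcp_pending_connection_t. */
typedef struct {
    int fd;
    struct sockaddr_storage addr;
} pending_conn_t;

/* Number of live pending_conn_t objects, so a leak would be observable. */
int live_objects = 0;

pending_conn_t *pending_new(void)
{
    pending_conn_t *p = calloc(1, sizeof(*p));
    if (p != NULL) {
        p->fd = -1;
        live_objects++;
    }
    return p;
}

void pending_free(pending_conn_t *p)
{
    if (p != NULL) {
        assert(live_objects > 0);
        free(p);
        live_objects--;
    }
}

/* Create a non-blocking listening socket on an ephemeral loopback port. */
int make_listener(void)
{
    struct sockaddr_in sin;
    int flags;
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    if (sd < 0) {
        return -1;
    }
    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    sin.sin_port = 0;
    flags = fcntl(sd, F_GETFL, 0);
    if (flags < 0 ||
        fcntl(sd, F_SETFL, flags | O_NONBLOCK) < 0 ||
        bind(sd, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
        listen(sd, 8) < 0) {
        close(sd);
        return -1;
    }
    return sd;
}

/* Wait until the listener is readable instead of spinning on accept().
 * Returns >0 if readable, 0 on timeout, <0 on error. */
int wait_readable(int listen_sd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = listen_sd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    return poll(&pfd, 1, timeout_ms);
}

/* One accept attempt. On failure (including EAGAIN) the object is released
 * before returning, so the error path does not leak. errno is preserved. */
pending_conn_t *accept_once(int listen_sd)
{
    socklen_t len;
    pending_conn_t *p = pending_new();
    if (p == NULL) {
        return NULL;
    }
    len = sizeof(p->addr);
    p->fd = accept(listen_sd, (struct sockaddr *)&p->addr, &len);
    if (p->fd < 0) {
        int saved = errno;
        pending_free(p);   /* the release that the "continue" path skipped */
        errno = saved;
        return NULL;
    }
    return p;
}
```

A caller would loop on wait_readable() and only then call accept_once();
calling accept_once() repeatedly without waiting still frees every object
but spins, which is the CPU-hog behaviour mentioned in the thread.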
