[jira] [Comment Edited] (MESOS-4646) PortMappingIsolatorTests get kernel stuck.

2018-11-09 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682116#comment-16682116
 ] 

Till Toenshoff edited comment on MESOS-4646 at 11/10/18 1:58 AM:
-

Just observed this again on Ubuntu 16.04 (4.4 kernel) in our internal CI.

The problem consistently manifests as follows:
{noformat}
01:45:21 [ RUN  ] PortMappingIsolatorTest.ROOT_NC_ContainerToContainerTCP
01:45:21 I1110 01:45:21.625305 32279 port_mapping_tests.cpp:238] Using ens3 as 
the public interface
01:45:21 I1110 01:45:21.62 32279 port_mapping_tests.cpp:246] Using lo as 
the loopback interface
01:45:21 I1110 01:45:21.641434 32279 port_mapping.cpp:1570] Using ens3 as the 
public interface
01:45:21 I1110 01:45:21.641729 32279 port_mapping.cpp:1596] Using lo as the 
loopback interface
01:45:21 I1110 01:45:21.642727 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/neigh/default/gc_thresh3 = '1024'
01:45:21 I1110 01:45:21.642773 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/neigh/default/gc_thresh1 = '128'
01:45:21 I1110 01:45:21.642808 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_wmem = '4096 16384   4194304'
01:45:21 I1110 01:45:21.642840 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_synack_retries = '5'
01:45:21 I1110 01:45:21.642879 32279 port_mapping.cpp:1895] 
/proc/sys/net/core/rmem_max = '212992'
01:45:21 I1110 01:45:21.642911 32279 port_mapping.cpp:1895] 
/proc/sys/net/core/somaxconn = '128'
01:45:21 I1110 01:45:21.642933 32279 port_mapping.cpp:1895] 
/proc/sys/net/core/wmem_max = '212992'
01:45:21 I1110 01:45:21.642961 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_rmem = '4096 87380   6291456'
01:45:21 I1110 01:45:21.642983 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_keepalive_time = '7200'
01:45:21 I1110 01:45:21.643013 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/neigh/default/gc_thresh2 = '512'
01:45:21 I1110 01:45:21.643033 32279 port_mapping.cpp:1895] 
/proc/sys/net/core/netdev_max_backlog = '1000'
01:45:21 I1110 01:45:21.643064 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_keepalive_intvl = '75'
01:45:21 I1110 01:45:21.643085 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_keepalive_probes = '9'
01:45:21 I1110 01:45:21.643112 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_max_syn_backlog = '512'
01:45:21 I1110 01:45:21.643139 32279 port_mapping.cpp:1895] 
/proc/sys/net/ipv4/tcp_retries2 = '15'
01:45:21 I1110 01:45:21.646045 32279 linux_launcher.cpp:144] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
01:45:21 I1110 01:45:21.646473 27877 port_mapping.cpp:2537] Using non-ephemeral 
ports {} and ephemeral ports [30016,30032) for container 
0cb621ab-78d9-419a-9189-ff2856d4e559 of executor ''
01:45:21 I1110 01:45:21.647001 27878 linux_launcher.cpp:492] Launching 
container 0cb621ab-78d9-419a-9189-ff2856d4e559 and cloning with namespaces 
CLONE_NEWNS | CLONE_NEWNET
01:45:21 I1110 01:45:21.672200 27881 port_mapping.cpp:2602] Bind mounted 
'/proc/10192/ns/net' to '/run/netns/10192' for container 
0cb621ab-78d9-419a-9189-ff2856d4e559
01:45:21 I1110 01:45:21.672251 27881 port_mapping.cpp:2618] Created network 
namespace handle symlink 
'/var/run/mesos/netns/0cb621ab-78d9-419a-9189-ff2856d4e559' -> 
'/run/netns/10192'
01:45:21 I1110 01:45:21.672857 27881 port_mapping.cpp:2678] Adding IP packet 
filters with ports [30016,30031] for container 
0cb621ab-78d9-419a-9189-ff2856d4e559
01:45:21 Executing pre-exec command '{"shell":true,"value":"#!/bin/sh\nset 
-xe\nmount --make-rslave /run/netns\ntest -f 
/proc/sys/net/ipv6/conf/all/disable_ipv6 && echo 1 > 
/proc/sys/net/ipv6/conf/all/disable_ipv6\nip link set lo address 
12:3b:cf:2b:ed:2a mtu 9001 up\nethtool -K ens3 rx off\nip link set ens3 address 
12:3b:cf:2b:ed:2a mtu 9001 up\nip addr add 172.16.10.54/24 dev ens3\nip route 
add default via 172.16.10.1\necho 30016 30031 > 
/proc/sys/net/ipv4/ip_local_port_range\necho 1 > 
/proc/sys/net/ipv4/conf/ens3/accept_local\necho 1 > 
/proc/sys/net/ipv4/conf/lo/accept_local\necho 1 > 
/proc/sys/net/ipv4/conf/lo/route_localnet\nif [ -f 
\"/proc/sys/net/ipv4/tcp_retries2\" ]; then\n echo '15' > 
/proc/sys/net/ipv4/tcp_retries2\nfi\nif [ -f 
\"/proc/sys/net/ipv4/tcp_max_syn_backlog\" ]; then\n echo '512' > 
/proc/sys/net/ipv4/tcp_max_syn_backlog\nfi\nif [ -f 
\"/proc/sys/net/ipv4/tcp_keepalive_probes\" ]; then\n echo '9' > 
/proc/sys/net/ipv4/tcp_keepalive_probes\nfi\nif [ -f 
\"/proc/sys/net/ipv4/tcp_synack_retries\" ]; then\n echo '5' > 
/proc/sys/net/ipv4/tcp_synack_retries\nfi\nif [ -f 
\"/proc/sys/net/ipv4/tcp_wmem\" ]; then\n echo '4096\t16384\t4194304' > 
/proc/sys/net/ipv4/tcp_wmem\nfi\nif [ -f 
\"/proc/sys/net/ipv4/neigh/default/gc_thresh1\" ]; then\n echo '128' > 
/proc/sys/net/ipv4/neigh/default/gc_thresh1\nfi\nif [ -f 
\"/proc/sys/net/ipv4/tcp_keepalive_intvl\" ]; then\n echo '75' > 

[jira] [Commented] (MESOS-4646) PortMappingIsolatorTests get kernel stuck.

2018-11-09 Thread Till Toenshoff (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682116#comment-16682116
 ] 

Till Toenshoff commented on MESOS-4646:
---

Just observed this again on Ubuntu 16.04 (4.4 kernel) in our internal CI.

> PortMappingIsolatorTests get kernel stuck.
> --
>
> Key: MESOS-4646
> URL: https://issues.apache.org/jira/browse/MESOS-4646
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 1.8.0
> Environment: Linux Kernel 3.19.9-49-generic, libnl-3.2.27
> Reporter: Till Toenshoff
> Priority: Major
>
> {noformat}
> $ sudo ./bin/mesos-tests.sh --gtest_filter="*PortMappingIsolatorTest*"
> Source directory: /home/till/scratchpad/mesos
> Build directory: /home/till/scratchpad/mesos/build
> -
> We cannot run any cgroups tests that require mounting
> hierarchies because you have the following hierarchies mounted:
> /sys/fs/cgroup/blkio, /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, 
> /sys/fs/cgroup/cpuset, /sys/fs/cgroup/devices, /sys/fs/cgroup/freezer, 
> /sys/fs/cgroup/hugetlb, /sys/fs/cgroup/memory, /sys/fs/cgroup/net_cls, 
> /sys/fs/cgroup/net_prio, /sys/fs/cgroup/perf_event, /sys/fs/cgroup/systemd
> We'll disable the CgroupsNoHierarchyTest test fixture for now.
> -
> WARNING: perf not found for kernel 3.19.0-49
>   You may need to install the following packages for this specific kernel:
> linux-tools-3.19.0-49-generic
> linux-cloud-tools-3.19.0-49-generic
>   You may also want to install one of the following packages to keep up to 
> date:
> linux-tools-generic-lts-
> linux-cloud-tools-generic-lts-
> -
> No 'perf' command found so no 'perf' tests will be run
> -
> WARNING: perf not found for kernel 3.19.0-49
>   You may need to install the following packages for this specific kernel:
> linux-tools-3.19.0-49-generic
> linux-cloud-tools-3.19.0-49-generic
>   You may also want to install one of the following packages to keep up to 
> date:
> linux-tools-generic-lts-
> linux-cloud-tools-generic-lts-
> -
> The 'perf' command wasn't found so tests using it
> to sample the 'cycles' hardware event will not be run.
> -
> /bin/nc
> /usr/local/bin/curl
> Note: Google Test filter = 
> 

[jira] [Assigned] (MESOS-4646) PortMappingIsolatorTests get kernel stuck.

2018-11-09 Thread Till Toenshoff (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-4646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff reassigned MESOS-4646:
-

Assignee: (was: Cong Wang)


[jira] [Commented] (MESOS-9258) Consider making Mesos subscribers send heartbeats

2018-11-09 Thread Joseph Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682078#comment-16682078
 ] 

Joseph Wu commented on MESOS-9258:
--

Alternative proposal for bounding the max number of subscribers:
https://reviews.apache.org/r/69307/

This one requires almost no client-side changes (as long as clients already 
retry when disconnected), and the code changes are small, which should make 
backporting easier.
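
For illustration only, here is one way a bound on subscribers could behave. 
This is a minimal sketch and is not taken from the linked review: the class 
name {{BoundedSubscribers}}, the {{subscribe}} method, and the 
"evict the oldest subscriber" policy are all assumptions. It relies on the 
point above that evicted clients reconnect on their own.

```python
from collections import OrderedDict


class BoundedSubscribers:
    """Hypothetical registry that keeps at most `max_subscribers` entries."""

    def __init__(self, max_subscribers):
        self.max_subscribers = max_subscribers
        # Maps subscriber id -> connection, ordered oldest first.
        self._subscribers = OrderedDict()

    def subscribe(self, subscriber_id, connection):
        """Register a subscriber and return the (id, connection) pairs evicted."""
        # A re-subscribing client replaces its old entry and becomes the newest.
        self._subscribers.pop(subscriber_id, None)
        self._subscribers[subscriber_id] = connection

        evicted = []
        while len(self._subscribers) > self.max_subscribers:
            # Assumed policy: drop the oldest subscriber first.
            evicted.append(self._subscribers.popitem(last=False))
        return evicted
```

The caller would close the evicted connections; a client that is still alive 
would then retry and re-subscribe, while a leaked one would simply disappear.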

> Consider making Mesos subscribers send heartbeats
> -
>
> Key: MESOS-9258
> URL: https://issues.apache.org/jira/browse/MESOS-9258
> Project: Mesos
> Issue Type: Improvement
> Components: HTTP API
> Reporter: Gastón Kleiman
> Assignee: Joseph Wu
> Priority: Critical
> Labels: mesosphere
>
> Some reverse proxies (e.g., ELB using an HTTP listener) won't close the 
> upstream connection to Mesos when they detect that their client is 
> disconnected.
> This can make Mesos leak subscribers, which generates unnecessary 
> authorization requests and affects performance.
> We should evaluate methods (e.g., heartbeats) to enable Mesos to detect that 
> a subscriber is gone, even if the TCP connection is still open.
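
To make the heartbeat idea in the description concrete, here is a minimal 
sketch of timeout-based detection. It is an assumption about how such a 
mechanism could look, not Mesos code: {{HeartbeatTracker}}, its methods, and 
the timeout parameter are hypothetical names chosen for illustration.

```python
import time


class HeartbeatTracker:
    """Hypothetical tracker that expires subscribers that stop heartbeating."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # Injectable so the logic can be tested.
        self._last_seen = {}  # Maps subscriber id -> time of last heartbeat.

    def heartbeat(self, subscriber_id):
        """Record that a subscriber is still alive."""
        self._last_seen[subscriber_id] = self._clock()

    def expire(self):
        """Remove and return the subscribers whose last heartbeat is too old."""
        now = self._clock()
        gone = [
            subscriber_id
            for subscriber_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        ]
        for subscriber_id in gone:
            del self._last_seen[subscriber_id]
        return gone
```

Because expiry depends only on the heartbeat timestamps, a subscriber can be 
dropped even while a proxy keeps the TCP connection open.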



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-9382) mesos-gtest-runner doesn't work on systems without ulimit binary

2018-11-09 Thread Benjamin Bannier (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-9382:
---

Assignee: Benjamin Bannier

> mesos-gtest-runner doesn't work on systems without ulimit binary
> 
>
> Key: MESOS-9382
> URL: https://issues.apache.org/jira/browse/MESOS-9382
> Project: Mesos
> Issue Type: Bug
> Components: test
> Reporter: Ilya Pronin
> Assignee: Benjamin Bannier
> Priority: Major
>
> {{mesos-gtest-runner.py}} fails on systems without a separate ulimit binary 
> (e.g., CentOS 7).
> {noformat}
> /home/ipronin/mesos/build/../support/mesos-gtest-runner.py 
> --sequential=*ROOT_* ./mesos-tests
> Could not check compatibility of ulimit settings: [Errno 2] No such file or 
> directory: 'ulimit'
> {noformat}
> The problem arises in [this 
> call|https://github.com/apache/mesos/blob/630d8938462381e8e7b0f44fa6434e47460fb178/support/mesos-gtest-runner.py#L209].
> It seems this can be fixed by passing a {{shell=True}} argument to 
> {{subprocess.check_output()}}.
> Another problem is the {{ROOT_*}} tests, which should be run as root. For 
> root, {{ulimit -u}} will most likely return "unlimited", which will again 
> crash the runner.
> {noformat}
> Could not check compatibility of ulimit settings: invalid literal for int() 
> with base 10: b'unlimited\n'
> {noformat}





[jira] [Commented] (MESOS-9382) mesos-gtest-runner doesn't work on systems without ulimit binary

2018-11-09 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682021#comment-16682021
 ] 

Benjamin Bannier commented on MESOS-9382:
-

{noformat}
commit 5367851e0e399d718df8f529ad977b6afd74bf99
Author: Benjamin Bannier 
Date:   Fri Nov 9 21:50:42 2018 +0100

Fixed parallel test runner ulimit check for setups without limits.

This patch adds special handling for the case where no process rlimit
is in effect. For that we explicitly handle an output `unlimited`
which can be present in e.g., `C` locales.

Review: https://reviews.apache.org/r/69306

commit 9df27e3657aa44677668baaee70e46b7963b06e4
Author: Chun-Hung Hsiao 
Date:   Fri Nov 9 11:47:01 2018 -0800

Fixed the ulimit validation in the parallel test runner.

This patch fixes the following error:
```
$ support/mesos-gtest-runner.py build/bin/mesos-tests.sh
Could not check compatibility of ulimit settings: \
[Errno 2] No such file or directory: 'ulimit': 'ulimit'
```

Review: https://reviews.apache.org/r/69301/
{noformat}






[jira] [Created] (MESOS-9382) mesos-gtest-runner doesn't work on systems without ulimit binary

2018-11-09 Thread Ilya Pronin (JIRA)
Ilya Pronin created MESOS-9382:
--

Summary: mesos-gtest-runner doesn't work on systems without ulimit binary
Key: MESOS-9382
URL: https://issues.apache.org/jira/browse/MESOS-9382
Project: Mesos
Issue Type: Bug
Components: test
Reporter: Ilya Pronin


{{mesos-gtest-runner.py}} fails on systems without a separate ulimit binary 
(e.g., CentOS 7).
{noformat}
/home/ipronin/mesos/build/../support/mesos-gtest-runner.py --sequential=*ROOT_* 
./mesos-tests
Could not check compatibility of ulimit settings: [Errno 2] No such file or 
directory: 'ulimit'
{noformat}
The problem arises in [this 
call|https://github.com/apache/mesos/blob/630d8938462381e8e7b0f44fa6434e47460fb178/support/mesos-gtest-runner.py#L209].
It seems this can be fixed by passing a {{shell=True}} argument to 
{{subprocess.check_output()}}.

Another problem is the {{ROOT_*}} tests, which should be run as root. For 
root, {{ulimit -u}} will most likely return "unlimited", which will again 
crash the runner.
{noformat}
Could not check compatibility of ulimit settings: invalid literal for int() 
with base 10: b'unlimited\n'
{noformat}
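
Both problems can be sketched together as below. This is a minimal 
illustration, not the runner's actual code: the helper names 
{{parse_ulimit}} and {{max_user_processes}} are hypothetical, and the sketch 
assumes the system shell's {{ulimit}} builtin accepts {{-u}}.

```python
import subprocess


def parse_ulimit(output):
    """Parse `ulimit -u` output; return None when no limit is in effect."""
    text = output.decode() if isinstance(output, bytes) else output
    text = text.strip()
    # Without a process limit the shell prints "unlimited", which int()
    # cannot parse, so treat it as a separate case.
    if text == "unlimited":
        return None
    return int(text)


def max_user_processes():
    """Return the process limit as an int, or None if unlimited."""
    # `ulimit` is a shell builtin, so it is run through a shell here instead
    # of being executed as a standalone binary.
    # Assumption: the shell used by `shell=True` supports `ulimit -u`.
    return parse_ulimit(subprocess.check_output("ulimit -u", shell=True))
```

A caller would then compare the result against the number of parallel jobs 
only when it is not {{None}}.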


