[jira] [Commented] (MESOS-8994) Ensure that the cmake build knows about all source files in the autotools build

2018-07-02 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529490#comment-16529490
 ] 

Benjamin Bannier commented on MESOS-8994:
-

{noformat}
commit c0488797eaacbf6c07cb79235d1174718d933a2c
Author: Benjamin Bannier 
Date:   Tue Jun 26 15:10:02 2018 -0700

Added a support script to check for files missing in CMake.

This compares the sources listed in the Autotools and CMake build
files, and emits the difference. We use this to check if the builds
have diverged, and how to reconcile that divergence.

Review: https://reviews.apache.org/r/67707/
{noformat}
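
In essence, the check boils down to a set difference between the source lists
of the two build systems. A minimal sketch of the idea (the file names and
output format here are hypothetical, not the actual support script's
interface):
{code}
#!/usr/bin/env python
# Minimal sketch only; the real support script's name, inputs, and output
# format may differ.
import re
import sys


def extract_sources(build_file):
    """Return the set of .cpp/.hpp paths mentioned in a build file."""
    with open(build_file) as f:
        return set(re.findall(r'[\w./-]+\.(?:cpp|hpp)', f.read()))


if __name__ == '__main__':
    # Hypothetical inputs: the Autotools and CMake build files to compare.
    autotools_sources = extract_sources('src/Makefile.am')
    cmake_sources = extract_sources('src/CMakeLists.txt')

    # Files known to the Autotools build but missing from the CMake build.
    missing = sorted(autotools_sources - cmake_sources)
    for path in missing:
        print(path)

    # A non-empty difference means the two builds have diverged.
    sys.exit(1 if missing else 0)
{code}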

> Ensure that the cmake build knows about all source files in the autotools 
> build
> ---
>
> Key: MESOS-8994
> URL: https://issues.apache.org/jira/browse/MESOS-8994
> Project: Mesos
>  Issue Type: Improvement
>  Components: build, cmake
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> We currently maintain two build systems in parallel, with autotools still 
> being used by the larger part of contributors and cmake catching up in terms 
> of coverage and features.
>  
> This has led to situations where certain features were added only to the 
> autotools build while updating the cmake build was either implicitly (without 
> creating a ticket) deferred or forgotten. Such undetected gaps make it harder 
> to gauge where the two build systems stand in terms of feature parity and how 
> much work is left before autotools can be retired.
> We should update the cmake build setup to explicitly check whether any 
> source files (headers and sources) unknown to it exist in the tree. Until 
> full parity is reached we would likely need to maintain a whitelist of files 
> known to be missing in the cmake build (this whitelist would at the same time 
> serve as a {{TODO}} list). The LLVM project uses the following function to 
> perform closely related work: 
> https://github.com/llvm-mirror/llvm/blob/master/cmake/modules/LLVMProcessSources.cmake#L70-L111.





[jira] [Created] (MESOS-9044) DefaultExecutorTest.ROOT_ContainerStatusForTask can segfault

2018-07-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9044:
---

 Summary: DefaultExecutorTest.ROOT_ContainerStatusForTask can 
segfault
 Key: MESOS-9044
 URL: https://issues.apache.org/jira/browse/MESOS-9044
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 1.5.1
 Environment: Ubuntu 16.04
Reporter: Jan Schlicht


The following segfault occurred when testing the {{1.5.x}} branch (SHA 
{{64341865d}}) on Ubuntu 16.04:
{noformat}
[ RUN  ] 
MesosContainerizer/DefaultExecutorTest.ROOT_ContainerStatusForTask/0
I0702 08:32:25.241318 17172 cluster.cpp:172] Creating default 'local' authorizer
I0702 08:32:25.242328  6510 master.cpp:457] Master 
be25b90e-f63d-4935-aaf3-cacfc7faacbf (ip-172-16-10-86.ec2.internal) started on 
172.16.10.86:32891
I0702 08:32:25.242413  6510 master.cpp:459] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/I9TI6h/credentials" 
--filter_gpu_resources="true" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" 
--max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/I9TI6h/master" --zk_session_timeout="10secs"
I0702 08:32:25.242554  6510 master.cpp:508] Master only allowing authenticated 
frameworks to register
I0702 08:32:25.242564  6510 master.cpp:514] Master only allowing authenticated 
agents to register
I0702 08:32:25.242570  6510 master.cpp:520] Master only allowing authenticated 
HTTP frameworks to register
I0702 08:32:25.242575  6510 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/I9TI6h/credentials'
I0702 08:32:25.242677  6510 master.cpp:564] Using default 'crammd5' 
authenticator
I0702 08:32:25.242728  6510 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0702 08:32:25.242780  6510 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0702 08:32:25.242830  6510 http.cpp:1045] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0702 08:32:25.242864  6510 master.cpp:643] Authorization enabled
I0702 08:32:25.243048  6507 hierarchical.cpp:175] Initialized hierarchical 
allocator process
I0702 08:32:25.243223  6507 whitelist_watcher.cpp:77] No whitelist given
I0702 08:32:25.243743  6510 master.cpp:2210] Elected as the leading master!
I0702 08:32:25.243768  6510 master.cpp:1690] Recovering from registrar
I0702 08:32:25.243832  6511 registrar.cpp:347] Recovering registrar
I0702 08:32:25.244055  6511 registrar.cpp:391] Successfully fetched the 
registry (0B) in 124928ns
I0702 08:32:25.244096  6511 registrar.cpp:495] Applied 1 operations in 8690ns; 
attempting to update the registry
I0702 08:32:25.244261  6511 registrar.cpp:552] Successfully updated the 
registry in 146944ns
I0702 08:32:25.244302  6511 registrar.cpp:424] Successfully recovered registrar
I0702 08:32:25.244416  6511 master.cpp:1803] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to re-register
I0702 08:32:25.244556  6505 hierarchical.cpp:213] Skipping recovery of 
hierarchical allocator: nothing to recover
W0702 08:32:25.246150 17172 process.cpp:2759] Attempted to spawn already 
running process files@172.16.10.86:32891
I0702 08:32:25.246560 17172 containerizer.cpp:304] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0702 08:32:25.250222 17172 linux_launcher.cpp:146] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0702 08:32:25.250689 17172 provisioner.cpp:299] Using default backend 'overlay'
I0702 08:32:25.251200 17172 cluster.cpp:460] Creating default 'local' authorizer
I0702 08:32:25.251788  6509 slave.cpp:262] Mesos agent started on 
(996)@172.16.10.86:32891
I0702 08:32:25.251878  6509 slave.cpp:263] Flags at startup: --acls="" 
--appc_simple_discovery_uri_prefix="http://"; 
--appc_store_dir="/tmp/

[jira] [Created] (MESOS-9045) LogZooKeeperTest.WriteRead can segfault

2018-07-02 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-9045:
---

 Summary: LogZooKeeperTest.WriteRead can segfault
 Key: MESOS-9045
 URL: https://issues.apache.org/jira/browse/MESOS-9045
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.1
 Environment: macOS
Reporter: Jan Schlicht


The following segfault occurred when testing the {{1.5.x}} branch (SHA 
{{64341865d}}) on macOS:
{noformat}
[ RUN  ] LogZooKeeperTest.WriteRead
I0702 00:49:46.259831 2560127808 jvm.cpp:590] Looking up method 
(Ljava/lang/String;)V
I0702 00:49:46.260002 2560127808 jvm.cpp:590] Looking up method deleteOnExit()V
I0702 00:49:46.260550 2560127808 jvm.cpp:590] Looking up method 
(Ljava/io/File;Ljava/io/File;)V
log4j:WARN No appenders could be found for logger 
(org.apache.zookeeper.server.persistence.FileTxnSnapLog).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
I0702 00:49:46.305560 2560127808 jvm.cpp:590] Looking up method ()V
I0702 00:49:46.306149 2560127808 jvm.cpp:590] Looking up method 
(Lorg/apache/zookeeper/server/persistence/FileTxnSnapLog;Lorg/apache/zookeeper/server/ZooKeeperServer$DataTreeBuilder;)V
I0702 00:49:46.07 2560127808 jvm.cpp:590] Looking up method ()V
I0702 00:49:46.343977 2560127808 jvm.cpp:590] Looking up method (I)V
I0702 00:49:46.344200 2560127808 jvm.cpp:590] Looking up method 
configure(Ljava/net/InetSocketAddress;I)V
I0702 00:49:46.357642 2560127808 jvm.cpp:590] Looking up method 
startup(Lorg/apache/zookeeper/server/ZooKeeperServer;)V
I0702 00:49:46.437831 2560127808 jvm.cpp:590] Looking up method getClientPort()I
I0702 00:49:46.437893 2560127808 zookeeper_test_server.cpp:156] Started 
ZooKeeperTestServer on port 54057
I0702 00:49:46.438153 2560127808 log_tests.cpp:2468] Using temporary directory 
'/var/folders/6w/rw03zh013y38ys6cyn8qppf8gn/T/LogZooKeeperTest_WriteRead_AKZArL'
I0702 00:49:46.440680 2560127808 leveldb.cpp:174] Opened db in 2.415822ms
I0702 00:49:46.441301 2560127808 leveldb.cpp:181] Compacted db in 584251ns
I0702 00:49:46.441349 2560127808 leveldb.cpp:196] Created db iterator in 20482ns
I0702 00:49:46.441380 2560127808 leveldb.cpp:202] Seeked to beginning of db in 
14577ns
I0702 00:49:46.441407 2560127808 leveldb.cpp:277] Iterated through 0 keys in 
the db in 16622ns
I0702 00:49:46.441447 2560127808 replica.cpp:795] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0702 00:49:46.441737 207974400 leveldb.cpp:310] Persisting metadata (8 bytes) 
to leveldb took 157037ns
I0702 00:49:46.441764 207974400 replica.cpp:322] Persisted replica status to 
VOTING
I0702 00:49:46.443361 2560127808 leveldb.cpp:174] Opened db in 1.305425ms
I0702 00:49:46.443821 2560127808 leveldb.cpp:181] Compacted db in 448477ns
I0702 00:49:46.443871 2560127808 leveldb.cpp:196] Created db iterator in 12681ns
I0702 00:49:46.443889 2560127808 leveldb.cpp:202] Seeked to beginning of db in 
13291ns
I0702 00:49:46.443914 2560127808 leveldb.cpp:277] Iterated through 0 keys in 
the db in 14460ns
I0702 00:49:46.443944 2560127808 replica.cpp:795] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
I0702 00:49:46.444277 206901248 leveldb.cpp:310] Persisting metadata (8 bytes) 
to leveldb took 234740ns
I0702 00:49:46.444317 206901248 replica.cpp:322] Persisted replica status to 
VOTING
I0702 00:49:46.445854 2560127808 leveldb.cpp:174] Opened db in 1.253613ms
I0702 00:49:46.446967 2560127808 leveldb.cpp:181] Compacted db in 1.096521ms
I0702 00:49:46.447022 2560127808 leveldb.cpp:196] Created db iterator in 14312ns
I0702 00:49:46.447048 2560127808 leveldb.cpp:202] Seeked to beginning of db in 
16620ns
I0702 00:49:46.447077 2560127808 leveldb.cpp:277] Iterated through 1 keys in 
the db in 21267ns
I0702 00:49:46.447113 2560127808 replica.cpp:795] Replica recovered with log 
positions 0 -> 0 with 1 holes and 0 unlearned
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@753: Client 
environment:zookeeper.version=zookeeper C client 3.4.8
2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@757: Client 
environment:host.name=Jenkinss-Mac-mini.local
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@765: Client 
environment:os.arch=17.4.0
2018-07-02 00:49:46,447:85946(0x7c657000):ZOO_INFO@log_env@764: Client 
environment:os.name=Darwin
I0702 00:49:46.447453 206901248 log.cpp:108] Attempting to join replica to 
ZooKeeper group
2018-07-02 00:49:46,447:85946(0x7c6da000):ZOO_INFO@log_env@766: Client 
envi

[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529915#comment-16529915
 ] 

Qian Zhang commented on MESOS-9031:
---

[~Kirill P] Can you please elaborate a bit more on the reproduction steps?
{quote}2 services running on the same mesos-slave using unified containerizer 
in different tasks and communicating via host ip and host port
{quote}
Did you mean that you launched two Mesos tasks via the unified containerizer, 
one task listened and served on a host IP & port, and the other task failed to 
communicate with that IP & port due to a timeout? Did these two tasks join the 
bridge network {{mesos-cni0}} or the host network? Can you provide the JSON of 
these two tasks?

On the other hand, the {{CNI-XXX}} chain is not created by the 
{{mesos-cni-port-mapper}} plugin; it is actually created by the CNI bridge 
plugin, see 
[here|https://github.com/containernetworking/plugins/blob/v0.2.0/plugins/main/bridge/bridge.go#L223:L229]
 for details.

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - simplest and fastest one is not to add that ACCEPT
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Kirill Plyashkevich (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530017#comment-16530017
 ] 

Kirill Plyashkevich commented on MESOS-9031:


[~qianzhang], yes, of course:
2 standalone containers/tasks are launched on the same slave, joining the same 
`mesos-cni0` bridge network.

the tasks themselves are services of an akka cluster, so they communicate with 
other service nodes of the cluster and with each other using the host's ips and 
ports.

both tasks fail to communicate with each other due to a timeout (with 
`excludeDevices` set to a list including `mesos-cni0`, the connection will just 
get refused).

unfortunately the stripped json won't be a lot of help here; a short interaction 
can be described as e.g.: node1@172.26.0.2:2552 tries to reach 
node2@192.168.1.123:31303 (host ip and the other service's port, which is 
effectively node2@172.26.0.3:2552)

 

I've been digging into `bridge` recently as well, and the ACCEPT rule is added 
[here|https://github.com/containernetworking/plugins/blob/master/pkg/ip/ipmasq_linux.go#L63].
that said, it's related to the `cni/bridge` plugin.

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - simplest and fastest one is not to add that ACCEPT
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530038#comment-16530038
 ] 

Qian Zhang commented on MESOS-9031:
---

[~Kirill P] For the two tasks, do they have port mapping enabled (i.e., do they 
specify port mapping info in their {{ContainerInfo.network_infos}})? Or did 
they just join the {{mesos-cni0}} bridge network and cannot communicate with 
the other akka service nodes running on other Mesos agent hosts via the Mesos 
agent host IP & port? And what about the other akka service nodes? Do they join 
the {{mesos-cni0}} bridge network with port mapping enabled, or just join the 
host network?

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - simplest and fastest one is not to add that ACCEPT
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Kirill Plyashkevich (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530058#comment-16530058
 ] 

Kirill Plyashkevich commented on MESOS-9031:


[~qianzhang], the services are part of a marathon pod and get their ports 
mapped properly (port mapping is enabled).

communication with external nodes that run on other agents and are launched in 
the host network is ok.

the problem only occurs for services using port mapping on the same bridge 
network `mesos-cni0`.

the more I think about it, the more it looks like a problem in 
`mesos-cni-port-mapper`.

if you take a look at [cni 
portmap|https://github.com/containernetworking/plugins/tree/master/plugins/meta/portmap],
 it does MASQUERADE/SNAT on its own. so regardless of the rules set by the 
`bridge` plugin, traffic still gets masqueraded, which is exactly what is 
needed.

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - simplest and fastest one is not to add that ACCEPT
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530120#comment-16530120
 ] 

Qian Zhang commented on MESOS-9031:
---

[~Kirill P] So there are two service nodes (i.e., two Mesos tasks) joining the 
bridge network {{mesos-cni0}} on the same Mesos agent host, both of them have 
port mapping enabled, but they cannot communicate with each other via the 
Mesos agent host IP & mapped port, right?
{quote}if you take a look at [cni 
portmap|https://github.com/containernetworking/plugins/tree/master/plugins/meta/portmap],
 it does MASQUERADE/SNAT on its own. so regardless of the rules set by the 
`bridge` plugin, traffic still gets masqueraded, which is exactly what is 
needed.
{quote}
So you think the timeout issue is not caused by rule #1 in the chain CNI-XXX 
set by the {{bridge}} plugin? But one of your proposed solutions is not to add 
that rule.

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - simplest and fastest one is not to add that ACCEPT
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Kirill Plyashkevich (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530135#comment-16530135
 ] 

Kirill Plyashkevich commented on MESOS-9031:


[~qianzhang], 
{quote}So there are two service nodes (i.e., two Mesos tasks) joining the 
bridge network mesos-cni0 on the same Mesos agent host, both of them have port 
mapping enabled, but they cannot communicate with each other via the Mesos 
agent host IP & mapped port, right?{quote}
yes, that's correct

{quote}
So you think the timeout issue is not caused by rule #1 in the chain CNI-XXX 
set by the bridge plugin? But one of your proposed solutions is not to add 
that rule.
{quote}
that was my initial assumption, but deeper investigation shows that my proposal 
#1 is not actually a solution here.
the timeout is caused by the missing snat/masquerade.
`cni/portmap` has a proper implementation with snat/masquerade. so, if 
`mesos-cni-port-mapper` does something alike and does the snat/masquerade, the 
issue will be solved.
that said, IMHO, solutions #2 (adding logic like `cni/portmap`'s) and #3 are 
the only ones left.
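
for illustration, the missing piece is essentially a hairpin masquerade rule of 
the kind below (the subnet is taken from the config above; the rule placement 
is only a placeholder, and the real `cni/portmap` plugin does this via its own 
chains rather than a bare POSTROUTING rule):
{noformat}
# masquerade traffic that originates in the bridge subnet and, after the
# port-mapping DNAT, is destined back into the same subnet via the host ip.
# inserted at the top so it runs before the bridge plugin's CNI-XXX chain.
iptables -t nat -I POSTROUTING 1 -s 172.26.0.0/16 -d 172.26.0.0/16 -j MASQUERADE
{noformat}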

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - simplest and fastest one is not to add that ACCEPT
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-8935) Quota limit "chopping" can lead to cpu-only and memory-only offers.

2018-07-02 Thread Greg Mann (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530369#comment-16530369
 ] 

Greg Mann commented on MESOS-8935:
--

Backports:

1.6.x:
{code}
commit 0587245b66ad3f2209c66a67211d987d2abdd371
Author: Meng Zhu 
Date:   Wed Jun 20 17:00:03 2018 -0700

Added a master flag to configure minimum allocatable resources.

This patch adds a new master flag `min_allocatable_resources`.
It specifies one or more resource quantities that define the
minimum allocatable resources for the allocator. The allocator
will only offer resources that contain at least one of the
specified resource quantities.

For example, the setting `disk:1000|cpus:1;mem:32` means that
the allocator will only allocate resources when they contain
1000MB of disk, or when they contain both 1 cpu and 32MB of
memory.

The default value for this new flag is such that it maintains
previous default behavior.

Also fixed all related tests and updated documentation.

Review: https://reviews.apache.org/r/67513/

commit ccb24bf3f7e098723179ee4b595aa99e3a0869e4
Author: Meng Zhu 
Date:   Thu Jun 28 08:33:00 2018 -0700

Added a resource utility `isScalarQuantity`.

`isScalarQuantity()` checks if a `Resources` object
is "pure" scalar quantity i.e. its `Resource`(s) only has
name, type (set to scalar) and scalar fields set.

Also added tests.

Review: https://reviews.apache.org/r/67516/

commit a615f36d9f10c92eaa4be95978987976dfc085e8
Author: Meng Zhu 
Date:   Thu Jun 28 08:32:47 2018 -0700

Fixed a bug in `createStrippedScalarQuantity()`.

This patch fixes `createStrippedScalarQuantity()` by
stripping the revocable field in resources.

Also added a test.

Review: https://reviews.apache.org/r/67510/
{code}

1.5.x:
{code}
commit 2e16cdb16ee9cc4162fad8b3957d69b9af7dbd8b
Author: Meng Zhu 
Date:   Wed Jun 20 17:00:03 2018 -0700

Added a master flag to configure minimum allocatable resources.

This patch adds a new master flag `min_allocatable_resources`.
It specifies one or more resource quantities that define the
minimum allocatable resources for the allocator. The allocator
will only offer resources that contain at least one of the
specified resource quantities.

For example, the setting `disk:1000|cpus:1;mem:32` means that
the allocator will only allocate resources when they contain
1000MB of disk, or when they contain both 1 cpu and 32MB of
memory.

The default value for this new flag is such that it maintains
previous default behavior.

Also fixed all related tests and updated documentation.

Review: https://reviews.apache.org/r/67513/

commit be077099b4dfcc1f82fe7f5ed222567eeb0c082c
Author: Meng Zhu 
Date:   Wed Jun 20 16:59:58 2018 -0700

Added a resource utility `isScalarQuantity`.

`isScalarQuantity()` checks if a `Resources` object
is a "pure" scalar quantity; i.e., its resources only have
name, type (set to scalar) and scalar fields set.

Also added tests.

Review: https://reviews.apache.org/r/67516/

commit 7a19d085c8aead7693c5d6212dbba7db771e60f6
Author: Meng Zhu 
Date:   Wed Jun 20 16:59:54 2018 -0700

Fixed a bug in `createStrippedScalarQuantity()`.

This patch fixes `createStrippedScalarQuantity()` by
stripping the revocable field in resources.

Also added a test.

Review: https://reviews.apache.org/r/67510/

commit c9efa4048be540f6cee47c012a7637ffed5e203e
Author: Benjamin Mahler 
Date:   Mon Feb 5 13:32:37 2018 -0800

Introduced a CHECK_NOTERROR macro.

Review: https://reviews.apache.org/r/65514
{code}

1.4.x:
{code}
commit 9cba3aa6dd571f8b92b46261d2e1256b0c47e338 (1.4.x-allocatable)
Author: Meng Zhu 
Date:   Wed Jun 20 17:00:03 2018 -0700

Added a master flag to configure minimum allocatable resources.

This patch adds a new master flag `min_allocatable_resources`.
It specifies one or more resource quantities that define the
minimum allocatable resources for the allocator. The allocator
will only offer resources that contain at least one of the
specified resource quantities.

For example, the setting `disk:1000|cpus:1;mem:32` means that
the allocator will only allocate resources when they contain
1000MB of disk, or when they contain both 1 cpu and 32MB of
memory.

The default value for this new flag is such that it maintains
previous default behavior.

Also fixed all related tests and updated documentation.

Review: https://reviews.apache.org/r/67513/

commit c45b4bdd50ee2fc5db7e3c2274ef2938a8999c22
Author: Meng Zhu 
Date:   Wed Jun 20 16:59:58 2018 -0700

Added a resource utility `isScalarQuantity`.

`isScalarQuantity()` checks if a `Resources` object
is a "pure" scalar quantity; i.e., its resources only have
name, type (set to scalar) and scalar

[jira] [Created] (MESOS-9046) Agent restart may fail on checkpointed resources.

2018-07-02 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-9046:
-

 Summary: Agent restart may fail on checkpointed resources.
 Key: MESOS-9046
 URL: https://issues.apache.org/jira/browse/MESOS-9046
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Till Toenshoff


When the user changes the agent resources, the resulting error message does not 
help in getting the problem resolved.

Consider a user who has added or changed a mounted volume and then restarts the 
agent while only having erased {{${MESOS_WORK_DIR}/meta/slaves/latest}} - the 
result may look as follows:

{noformat}
E0702 11:44:53.00  2278 slave.cpp:7305] EXIT with status 1: Failed to 
perform recovery:
Checkpointed resources
[...]
 [MOUNT:/dcos/volume1,5b0ca558-7e1f-463a-87ab-4c52899c4727:name-data]:5851
are incompatible with agent resources
[...]
{noformat}

This error message, while certainly correct, may not be as helpful as it 
could be. We should consider offering advice on how to work around or fix this 
very common issue.


We may want to tell the user to:
1. {{rm -rf ${MESOS_WORK_DIR}/meta/slaves/latest}}
2. {{rm -rf ${MESOS_WORK_DIR}/meta/resources}}





[jira] [Created] (MESOS-9047) ProvisionerDockerLocalStoreTest.MissingLayer is flaky.

2018-07-02 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-9047:
---

 Summary: ProvisionerDockerLocalStoreTest.MissingLayer is flaky.
 Key: MESOS-9047
 URL: https://issues.apache.org/jira/browse/MESOS-9047
 Project: Mesos
  Issue Type: Bug
  Components: containerization
 Environment: mesos-ec2-ubuntu-14.04-SSL
Reporter: Gilbert Song


{noformat}
../../src/tests/containerizer/provisioner_docker_tests.cpp:284
(imageInfo).failure(): Collect failed: Subprocess 'tar, tar, -x, -f, 
/tmp/nQt3Eu/store/staging/D1LuiF/123/layer.tar, -C, 
/tmp/nQt3Eu/store/staging/D1LuiF/123/rootfs' failed: tar: This does not look 
like a tar archive
tar: Exiting with failure status due to previous errors
{noformat}

{noformat}
agent log to be added ...
{noformat}





[jira] [Created] (MESOS-9048) Build and persist quota headroom info across allocation cycle.

2018-07-02 Thread Meng Zhu (JIRA)
Meng Zhu created MESOS-9048:
---

 Summary: Build and persist quota headroom info across allocation 
cycle.
 Key: MESOS-9048
 URL: https://issues.apache.org/jira/browse/MESOS-9048
 Project: Mesos
  Issue Type: Improvement
Reporter: Meng Zhu
Assignee: Meng Zhu


Currently, in the allocator, quota headroom info is built up from scratch at 
the beginning of each allocation iteration. This hurts performance and 
increases code complexity. We should be able to track and persist this info as 
we make new allocations.
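
A rough sketch of the intended bookkeeping (illustrative only; the names and 
structure below are not the allocator's actual interfaces):
{code}
class HeadroomTracker(object):
    """Illustrative sketch only; does not match the actual allocator code."""

    def __init__(self, guarantees):
        # Per-role quota guarantee and running allocation (e.g. cpus).
        self.guarantees = dict(guarantees)
        self.allocated = {role: 0.0 for role in guarantees}

    def headroom(self, role):
        # Headroom is the unsatisfied part of the role's guarantee.
        return max(0.0, self.guarantees[role] - self.allocated[role])

    def on_allocate(self, role, amount):
        # Updated on every allocation, so the next allocation cycle does not
        # need to rebuild the headroom from scratch.
        self.allocated[role] += amount

    def on_recover(self, role, amount):
        # Recovered (unallocated) resources restore headroom accordingly.
        self.allocated[role] = max(0.0, self.allocated[role] - amount)
{code}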





[jira] [Assigned] (MESOS-9031) Mesos CNI portmap plugins' iptables rules doesn't allow connections via host ip and port from the same bridge container network

2018-07-02 Thread Qian Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qian Zhang reassigned MESOS-9031:
-

Assignee: Qian Zhang
  Sprint: Mesosphere Sprint 2018-23

> Mesos CNI portmap plugins' iptables rules doesn't allow connections via host 
> ip and port from the same bridge container network
> ---
>
> Key: MESOS-9031
> URL: https://issues.apache.org/jira/browse/MESOS-9031
> Project: Mesos
>  Issue Type: Bug
>  Components: cni, containerization
>Affects Versions: 1.6.0
>Reporter: Kirill Plyashkevich
>Assignee: Qian Zhang
>Priority: Major
>
> using `mesos-cni-port-mapper` with the following config:
> {noformat}
> { 
>    "name" : "dcos", 
>    "type" : "mesos-cni-port-mapper", 
>    "excludeDevices" : [], 
>    "chain": "MESOS-CNI0-PORT-MAPPER", 
>    "delegate": { 
>    "type": "bridge", 
>    "bridge": "mesos-cni0", 
>    "isGateway": true, 
>    "ipMasq": true, 
>    "hairpinMode": true, 
>    "ipam": { 
>    "type": "host-local", 
>    "ranges": [ 
>    [{"subnet": "172.26.0.0/16"}] 
>    ], 
>    "routes": [ 
>    {"dst": "0.0.0.0/0"} 
>    ] 
>    } 
>    } 
> }
> {noformat}
>  - 2 services running on the same mesos-slave using unified containerizer in 
> different tasks and communicating via host ip and host port
>  - connection timeouts due to iptables rules per container CNI-XXX chain
>  - actually timeouts are caused by
> {noformat}
> Chain CNI-XXX (1 references)
> num  target prot opt source   destination 
> 1ACCEPT all  --  anywhere 172.26.0.0/16/* name: 
> "dcos" id: "" */
> 2MASQUERADE  all  --  anywhere!base-address.mcast.net/4  /* 
> name: "dcos" id: "" */
> {noformat}
> rule #1 is executed and no masquerading happens.
> there are multiple solutions:
>  - -simplest and fastest one is not to add that ACCEPT- - NOT A SOLUTION. 
> it's happening in the `bridge` plugin and `cni/portmap` shows that 
> snat/masquerade should be done during portmapping as well.
>  - perhaps, there's a better change in iptables rules that can fix it
>  - proper one (imho) is to finally implement cni spec 0.3.x in order to be 
> able to use chaining of plugins and use cni's `bridge` and `portmap` plugins 
> in chain (and get rid of mesos-cni-port-mapper completely eventually).





[jira] [Commented] (MESOS-9039) CNI isolator recovery should wait until unknown orphan cleanup is done

2018-07-02 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16530649#comment-16530649
 ] 

Qian Zhang commented on MESOS-9039:
---

The main purpose of this fix is to ensure that the test 
{{CniIsolatorTest.ROOT_SlaveRecovery}} that we updated in 
[https://reviews.apache.org/r/67737/] can catch the regression described in 
MESOS-9025. I think this ticket will not cause any actual issues in a real 
environment, so we do not need to backport the fix.
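
For reference, the recovery change amounts to awaiting the orphan cleanups 
before recovery completes; a rough sketch of that pattern (a Python stand-in 
for the libprocess futures used in the actual code, names hypothetical):
{code}
import asyncio


# Illustrative sketch only; names do not match the CNI isolator code.
async def cleanup_unknown_orphan(container_id):
    # Stand-in for detaching the container from its CNI networks and removing
    # its checkpointed network state.
    await asyncio.sleep(0)


async def recover(known_container_ids, containers_on_disk):
    unknown_orphans = set(containers_on_disk) - set(known_container_ids)

    # Block recovery until every unknown-orphan cleanup has finished, instead
    # of firing the cleanups off asynchronously and returning early.
    await asyncio.gather(
        *(cleanup_unknown_orphan(cid) for cid in unknown_orphans))
{code}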

> CNI isolator recovery should wait until unknown orphan cleanup is done
> --
>
> Key: MESOS-9039
> URL: https://issues.apache.org/jira/browse/MESOS-9039
> Project: Mesos
>  Issue Type: Bug
>  Components: cni
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
> Fix For: 1.7.0
>
>
> Currently, the CNI isolator will clean up unknown orphaned containers in an 
> asynchronous way (see 
> [here|https://github.com/apache/mesos/blob/1.6.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L439]
>  for details) during recovery. That means agent recovery can finish while the 
> cleanup of unknown orphaned containers is still ongoing, which is not ideal. 
> So we need to make CNI isolator recovery wait until unknown orphan cleanup 
> is done.


