[jira] [Commented] (MESOS-5061) process.cpp:1966] Failed to shutdown socket with fd x: Transport endpoint is not connected

2017-05-25 Thread Hao Yixin (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024590#comment-16024590
 ] 

Hao Yixin commented on MESOS-5061:
--

I got a similar issue with the Contiv netplugin and the Mesos containerizer.

I0525 18:52:36.583499  4041 exec.cpp:162] Version: 1.3.0
E0525 18:52:39.593489  4050 process.cpp:2450] Failed to shutdown socket with fd 
6, address 192.168.110.2:34176: Transport endpoint is not connected
I0525 18:52:39.593582  4048 exec.cpp:497] Agent exited ... shutting down

> process.cpp:1966] Failed to shutdown socket with fd x: Transport endpoint is 
> not connected
> --
>
> Key: MESOS-5061
> URL: https://issues.apache.org/jira/browse/MESOS-5061
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, modules
>Affects Versions: 0.27.0, 0.27.1, 0.27.2, 0.28.0
> Environment: Centos 7.1
>Reporter: Zogg
> Fix For: 1.0.0
>
>
> When launching a task through Marathon and asking for an IP to be assigned to 
> the task (using Calico networking):
> {noformat}
> {
>   "id": "/calico-apps",
>   "apps": [
>     {
>       "id": "hello-world-1",
>       "cmd": "ip addr && sleep 3",
>       "cpus": 0.1,
>       "mem": 64.0,
>       "ipAddress": {
>         "groups": ["calico-k8s-network"]
>       }
>     }
>   ]
> }
> {noformat}
> The Mesos slave fails to launch the task, which stays stuck in the STAGING 
> state forever, with the error:
> {noformat}
> [centos@rtmi-worker-001 mesos]$ tail mesos-slave.INFO
> I0325 20:35:43.420171 13495 slave.cpp:2642] Got registration for executor 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' of framework 
> 23b404e4-700a-4348-a7c0-226239348981- from executor(1)@10.0.0.10:33443
> I0325 20:35:43.422652 13495 slave.cpp:1862] Sending queued task 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' to executor 
> 'calico-apps_hello-world-1.23ff72e9-f2c9-11e5-bb22-be052ff413d3' of framework 
> 23b404e4-700a-4348-a7c0-226239348981- at executor(1)@10.0.0.10:33443
> E0325 20:35:43.423159 13502 process.cpp:1966] Failed to shutdown socket with 
> fd 22: Transport endpoint is not connected
> I0325 20:35:43.423316 13501 slave.cpp:3481] executor(1)@10.0.0.10:33443 exited
> {noformat}
> However, when deploying a task without the ipAddress field, the Mesos slave 
> launches the task successfully. 
> Tested with various Mesos/Marathon/Calico versions. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7565) process.cpp:2450] Failed to shutdown socket with fd 6, address 192.168.110.2:34176: Transport endpoint is not connected

2017-05-25 Thread Hao Yixin (JIRA)
Hao Yixin created MESOS-7565:


 Summary: process.cpp:2450] Failed to shutdown socket with fd 6, 
address 192.168.110.2:34176: Transport endpoint is not connected
 Key: MESOS-7565
 URL: https://issues.apache.org/jira/browse/MESOS-7565
 Project: Mesos
  Issue Type: Bug
  Components: containerization, network
Affects Versions: 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1
 Environment: centos 7.3
Reporter: Hao Yixin
Priority: Critical


When launching a task through Marathon and asking for an IP to be assigned to 
the task (using Contiv networking):
```
I0525 18:52:15.898908  1210 linux_launcher.cpp:429] Launching container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a and cloning with namespaces CLONE_NEWNS | 
CLONE_NEWUTS | CLONE_NEWNET
I0525 18:52:15.900668  1210 systemd.cpp:96] Assigned child process '3985' to 
'mesos_executors.slice'
I0525 18:52:15.902612  1206 containerizer.cpp:1623] Checkpointing container's 
forked pid 3985 to 
'/var/lib/mesos/meta/slaves/00e6894c-d896-4a3d-8e79-679077f2af81-S4/frameworks/00e6894c-d896-4a3d-8e79-679077f2af81-/executors/container.1467.373c1d9b-4138-11e7-9117-024221dd5669/runs/c4b299e6-629a-4a99-bd88-cfbca0262b1a/pids/forked.pid'
I0525 18:52:15.903939  1206 cni.cpp:888] Bind mounted '/proc/3985/ns/net' to 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347486  1206 cni.cpp:1301] Got assigned IPv4 address 
'192.168.110.2/24' from CNI network 'netcontiv' for container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347533  1206 cni.cpp:1307] Got assigned IPv6 address '' from CNI 
network 'netcontiv' for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347687  1206 cni.cpp:1010] Unable to find DNS nameservers for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a, using host '/etc/resolv.conf'
I0525 18:52:24.579439  1206 containerizer.cpp:2508] Container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a has exited
I0525 18:52:24.579493  1206 containerizer.cpp:2102] Destroying container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a in RUNNING state
I0525 18:52:24.579560  1206 linux_launcher.cpp:505] Asked to destroy container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.580025  1206 linux_launcher.cpp:548] Using freezer to destroy 
cgroup mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.580930  1206 cgroups.cpp:2692] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.582156  1206 cgroups.cpp:1405] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
1.18784ms
I0525 18:52:24.583359  1206 cgroups.cpp:2710] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.584491  1206 cgroups.cpp:1434] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
1.093888ms
I0525 18:52:24.681495  1203 cni.cpp:1479] Unmounted the network namespace 
handle 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.681591  1203 cni.cpp:1490] Removed the container directory 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a'
I0525 18:52:24.691004  1203 slave.cpp:5168] Executor 
'container.1467.373c1d9b-4138-11e7-9117-024221dd5669' of framework 
00e6894c-d896-4a3d-8e79-679077f2af81- terminated with signal Killed
I0525 18:52:24.691063  1203 slave.cpp:4215] Handling status update TASK_FAILED 
(UUID: e90f3161-d136-4607-a67c-a621df9e82e4) for task 
container.1467.373c1d9b-4138-11e7-9117-024221dd5669 of framework 
00e6894c-d896-4a3d-8e79-679077f2af81- from @0.0.0.0:0
```

```
I0525 18:52:36.583499  4041 exec.cpp:162] Version: 1.3.0
E0525 18:52:39.593489  4050 process.cpp:2450] Failed to shutdown socket with fd 
6, address 192.168.110.2:34176: Transport endpoint is not connected
I0525 18:52:39.593582  4048 exec.cpp:497] Agent exited ... shutting down
```

However, when deploying a task without the ipAddress field, the Mesos slave 
launches the task successfully.
Tested with various Mesos/Marathon/Contiv versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7565) process.cpp:2450] Failed to shutdown socket with fd 6, address 192.168.110.2:34176: Transport endpoint is not connected

2017-05-25 Thread Hao Yixin (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Yixin updated MESOS-7565:
-
Description: 
When launching a task through Marathon and asking for an IP to be assigned to 
the task (using Contiv networking):

Log from mesos-slave:

I0525 18:52:15.898908  1210 linux_launcher.cpp:429] Launching container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a and cloning with namespaces CLONE_NEWNS | 
CLONE_NEWUTS | CLONE_NEWNET
I0525 18:52:15.900668  1210 systemd.cpp:96] Assigned child process '3985' to 
'mesos_executors.slice'
I0525 18:52:15.902612  1206 containerizer.cpp:1623] Checkpointing container's 
forked pid 3985 to 
'/var/lib/mesos/meta/slaves/00e6894c-d896-4a3d-8e79-679077f2af81-S4/frameworks/00e6894c-d896-4a3d-8e79-679077f2af81-/executors/container.1467.373c1d9b-4138-11e7-9117-024221dd5669/runs/c4b299e6-629a-4a99-bd88-cfbca0262b1a/pids/forked.pid'
I0525 18:52:15.903939  1206 cni.cpp:888] Bind mounted '/proc/3985/ns/net' to 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347486  1206 cni.cpp:1301] Got assigned IPv4 address 
'192.168.110.2/24' from CNI network 'netcontiv' for container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347533  1206 cni.cpp:1307] Got assigned IPv6 address '' from CNI 
network 'netcontiv' for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347687  1206 cni.cpp:1010] Unable to find DNS nameservers for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a, using host '/etc/resolv.conf'
I0525 18:52:24.579439  1206 containerizer.cpp:2508] Container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a has exited
I0525 18:52:24.579493  1206 containerizer.cpp:2102] Destroying container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a in RUNNING state
I0525 18:52:24.579560  1206 linux_launcher.cpp:505] Asked to destroy container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.580025  1206 linux_launcher.cpp:548] Using freezer to destroy 
cgroup mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.580930  1206 cgroups.cpp:2692] Freezing cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.582156  1206 cgroups.cpp:1405] Successfully froze cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
1.18784ms
I0525 18:52:24.583359  1206 cgroups.cpp:2710] Thawing cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.584491  1206 cgroups.cpp:1434] Successfully thawed cgroup 
/sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
1.093888ms
I0525 18:52:24.681495  1203 cni.cpp:1479] Unmounted the network namespace 
handle 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:24.681591  1203 cni.cpp:1490] Removed the container directory 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a'
I0525 18:52:24.691004  1203 slave.cpp:5168] Executor 
'container.1467.373c1d9b-4138-11e7-9117-024221dd5669' of framework 
00e6894c-d896-4a3d-8e79-679077f2af81- terminated with signal Killed
I0525 18:52:24.691063  1203 slave.cpp:4215] Handling status update TASK_FAILED 
(UUID: e90f3161-d136-4607-a67c-a621df9e82e4) for task 
container.1467.373c1d9b-4138-11e7-9117-024221dd5669 of framework 
00e6894c-d896-4a3d-8e79-679077f2af81- from @0.0.0.0:0


Log from sandbox:

I0525 18:52:36.583499  4041 exec.cpp:162] Version: 1.3.0
E0525 18:52:39.593489  4050 process.cpp:2450] Failed to shutdown socket with fd 
6, address 192.168.110.2:34176: Transport endpoint is not connected
I0525 18:52:39.593582  4048 exec.cpp:497] Agent exited ... shutting down


However, when deploying a task without the ipAddress field, the Mesos slave 
launches the task successfully.
Tested with various Mesos/Marathon/Contiv versions.

  was:
When launching a task through Marathon and asking for an IP to be assigned to 
the task (using Contiv networking):
```
I0525 18:52:15.898908  1210 linux_launcher.cpp:429] Launching container 
c4b299e6-629a-4a99-bd88-cfbca0262b1a and cloning with namespaces CLONE_NEWNS | 
CLONE_NEWUTS | CLONE_NEWNET
I0525 18:52:15.900668  1210 systemd.cpp:96] Assigned child process '3985' to 
'mesos_executors.slice'
I0525 18:52:15.902612  1206 containerizer.cpp:1623] Checkpointing container's 
forked pid 3985 to 
'/var/lib/mesos/meta/slaves/00e6894c-d896-4a3d-8e79-679077f2af81-S4/frameworks/00e6894c-d896-4a3d-8e79-679077f2af81-/executors/container.1467.373c1d9b-4138-11e7-9117-024221dd5669/runs/c4b299e6-629a-4a99-bd88-cfbca0262b1a/pids/forked.pid'
I0525 18:52:15.903939  1206 cni.cpp:888] Bind mounted '/proc/3985/ns/net' to 
'/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' for 
container c4b299e6-629a-4a99-bd88-cfbca0262b1a
I0525 18:52:16.347486  1206 cni.cpp:1301] Got assigned IPv4 address 
'192.168.110.2/24' from CNI n

[jira] [Updated] (MESOS-7542) Add executor reconnection retry logic to the agent

2017-05-25 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-7542:
-
Shepherd: Vinod Kone  (was: Benjamin Mahler)

> Add executor reconnection retry logic to the agent
> --
>
> Key: MESOS-7542
> URL: https://issues.apache.org/jira/browse/MESOS-7542
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> Currently, the agent sends a single {{ReconnectExecutorMessage}} to PID-based 
> executors during recovery. It would be more robust to have the agent retry 
> these messages until {{executor_reregister_timeout}} has elapsed.
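
For illustration only, here is a minimal sketch of the retry idea described above, assuming a generic send callback and a plain steady-clock deadline rather than libprocess's actual timer primitives:

{code}
#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not Mesos code): keep resending a reconnect request
// until the executor has re-registered or the re-register timeout elapses.
void retryReconnect(const std::function<void()>& sendReconnect,
                    const std::atomic<bool>& reregistered,
                    std::chrono::seconds timeout)
{
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + timeout;
  auto backoff = std::chrono::milliseconds(500);

  while (!reregistered && clock::now() < deadline) {
    sendReconnect();                       // e.g. a ReconnectExecutorMessage
    std::this_thread::sleep_for(backoff);  // wait before the next attempt
    backoff = std::min(backoff * 2, std::chrono::milliseconds(8000));
  }
}
{code}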



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7542) Add executor reconnection retry logic to the agent

2017-05-25 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-7542:


Assignee: Benjamin Mahler  (was: Greg Mann)

> Add executor reconnection retry logic to the agent
> --
>
> Key: MESOS-7542
> URL: https://issues.apache.org/jira/browse/MESOS-7542
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> Currently, the agent sends a single {{ReconnectExecutorMessage}} to PID-based 
> executors during recovery. It would be more robust to have the agent retry 
> these messages until {{executor_reregister_timeout}} has elapsed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7007) filesystem/shared and --default_container_info broken since 1.1

2017-05-25 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-7007:
--
Shepherd: Gilbert Song

> filesystem/shared and --default_container_info broken since 1.1
> ---
>
> Key: MESOS-7007
> URL: https://issues.apache.org/jira/browse/MESOS-7007
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Pierre Cheynier
>Assignee: Chun-Hung Hsiao
>
> I face this issue, which prevents me from upgrading to 1.1.0 (the change that 
> causes it was consequently introduced in this version):
> I'm using default_container_info to mount a /tmp volume in the container's 
> mount namespace from its current sandbox, meaning that each container has a 
> dedicated /tmp, thanks to the {{filesystem/shared}} isolator.
> I noticed through our automation pipeline that integration tests were failing 
> and found that this is because the contents of /tmp (the one from the host!) 
> are trashed each time a container is created.
> Here is my setup: 
> * 
> {{--isolation='cgroups/cpu,cgroups/mem,namespaces/pid,*disk/du,filesystem/shared,filesystem/linux*,docker/runtime'}}
> * 
> {{--default_container_info='\{"type":"MESOS","volumes":\[\{"host_path":"tmp","container_path":"/tmp","mode":"RW"\}\]\}'}}
> I discovered this issue in the early days of 1.1 (end of Nov, spoke with 
> someone on Slack), but unfortunately had no time to dig into the symptoms 
> further.
> I found nothing interesting even using GLOGv=3.
> Maybe it's a bad usage of isolators that triggers this issue? If that's the 
> case, then at least a documentation update should be done.
> Let me know if more information is needed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7544) Incremental builds on Windows are broken

2017-05-25 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025182#comment-16025182
 ] 

Andrew Schwartzmeyer commented on MESOS-7544:
-

This was a problem with the machine's Visual Studio installation. Uninstalling 
it completely and reinstalling it resolved the problem.

> Incremental builds on Windows are broken
> 
>
> Key: MESOS-7544
> URL: https://issues.apache.org/jira/browse/MESOS-7544
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: Andrew Schwartzmeyer
>
> Cannot incrementally rebuild, say, mesos-tests after changing a test source 
> file. Everything rebuilds. At least as recent as 0edb2ee96.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-5994) Add Windows support for modules

2017-05-25 Thread Andrew Schwartzmeyer (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Schwartzmeyer reassigned MESOS-5994:
---

Assignee: Andrew Schwartzmeyer  (was: Jeff Coffler)

> Add Windows support for modules 
> 
>
> Key: MESOS-5994
> URL: https://issues.apache.org/jira/browse/MESOS-5994
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
> Environment: Windows
>Reporter: Joseph Wu
>Assignee: Andrew Schwartzmeyer
>Priority: Minor
>  Labels: agent, master, mesos, mesosphere, windows
>
> Modules are currently not supported on Windows due to a couple limitations:
> * GCC and Clang export all symbols to shared libraries by default.  MSVC has 
> the opposite behavior and does not export any symbols by default.  To 
> properly create a shared library on Windows, one must 
> {{__declspec(dllexport)}} every single exposed function/class.
> * CMake 3.4+ has utilities for auto-generating exports, but upgrading the 
> CMake requirement has other version incompatibilities.
> * We can't load a statically linked module due to a runtime check in the 
> protobuf library.
> For now, module-related code is not compiled on Windows.
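
As an illustration of the export-annotation burden described above, here is the common cross-platform macro pattern; the macro and function names are hypothetical, not the ones Mesos would use:

{code}
// Hypothetical export macro: MSVC exports nothing by default, so every
// exposed symbol needs an explicit __declspec(dllexport); GCC/Clang
// export visible symbols by default.
#ifdef _WIN32
#  define EXAMPLE_MODULE_EXPORT __declspec(dllexport)
#else
#  define EXAMPLE_MODULE_EXPORT __attribute__((visibility("default")))
#endif

// Each function/class exposed by the module library would need the
// annotation when built with MSVC:
extern "C" EXAMPLE_MODULE_EXPORT const char* exampleModuleName()
{
  return "org_apache_mesos_ExampleModule";  // illustrative name only
}
{code}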



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-05-25 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-7566:


 Summary: Master crash due to failed check in DRFSorter::remove
 Key: MESOS-7566
 URL: https://issues.apache.org/jira/browse/MESOS-7566
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.1.1, 1.1.2
Reporter: Zhitao Li
Priority: Critical


A check in 
[https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355
 DRFSorter] is triggered occasionally in our cluster and crashes the master 
leader.

I manually modified that check to print out the related variables, and the 
following is a master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it seems the check was using a stale value of 
{{cpus(*){REV}:26}} while the new value had been updated to {{cpus(*){REV}:25}}, 
thus it crashed.

So far, both verified occurrences of this bug were observed near an 
{{UNRESERVE}} operation (see the lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7566) Master crash due to failed check in DRFSorter::remove

2017-05-25 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-7566:
-
Description: 
A check in [sorter.cpp#L355 in 1.1.2 | 
https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
 is triggered occasionally in our cluster and crashes the master leader.

I manually modified that check to print out the related variables, and the 
following is a master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it seems the check was using a stale revocable CPU value of 
{{26}} while the new value had been updated to {{25}}, thus the check crashed.

So far, both verified occurrences of this bug were observed near an 
{{UNRESERVE}} operation (see the lines above in the log).

  was:
A check in 
[https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355
 DRFSorter] is triggered occasionally in our cluster and crashes the master 
leader.

I manually modified that check to print out the related variables, and the 
following is a master log.

https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt

From the log, it seems like the check was using a stale value of 
{{cpus(*){REV}:26}} while the new value was updated to {{cpus(*){REV}:25}}, 
thus it crashed.

So far, both verified occurrences of this bug were observed near an 
{{UNRESERVE}} operation (see the lines above in the log).


> Master crash due to failed check in DRFSorter::remove
> -
>
> Key: MESOS-7566
> URL: https://issues.apache.org/jira/browse/MESOS-7566
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.1.1, 1.1.2
>Reporter: Zhitao Li
>Priority: Critical
>
> A check in [sorter.cpp#L355 in 1.1.2 | 
> https://github.com/apache/mesos/blob/1.1.2/src/master/allocator/sorter/drf/sorter.cpp#L355]
>  is triggered occasionally in our cluster and crashes the master leader.
> I manually modified that check to print out the related variables, and the 
> following is a master log.
> https://gist.github.com/zhitaoli/0662d9fe1f6d57de344951c05b536bad#file-gistfile1-txt
> From the log, it seems the check was using a stale revocable CPU value of 
> {{26}} while the new value had been updated to {{25}}, thus the check crashed.
> So far, both verified occurrences of this bug were observed near an 
> {{UNRESERVE}} operation (see the lines above in the log).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7565) Container with "Contiv" networking fails upon startup

2017-05-25 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-7565:
-
Affects Version/s: (was: 1.2.1)
   (was: 1.3.0)
   (was: 1.2.0)
 Priority: Major  (was: Critical)
  Component/s: (was: containerization)
  Summary: Container with "Contiv" networking fails upon startup  
(was: process.cpp:2450] Failed to shutdown socket with fd 6, address 
192.168.110.2:34176: Transport endpoint is not connected)

> Container with "Contiv" networking fails upon startup
> -
>
> Key: MESOS-7565
> URL: https://issues.apache.org/jira/browse/MESOS-7565
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.2.2, 1.3.1
> Environment: centos 7.3
>Reporter: Hao Yixin
>
> When launching a task through Marathon and asking for an IP to be assigned to 
> the task (using Contiv networking):
> Log from mesos-slave:
> I0525 18:52:15.898908  1210 linux_launcher.cpp:429] Launching container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWUTS | CLONE_NEWNET
> I0525 18:52:15.900668  1210 systemd.cpp:96] Assigned child process '3985' to 
> 'mesos_executors.slice'
> I0525 18:52:15.902612  1206 containerizer.cpp:1623] Checkpointing container's 
> forked pid 3985 to 
> '/var/lib/mesos/meta/slaves/00e6894c-d896-4a3d-8e79-679077f2af81-S4/frameworks/00e6894c-d896-4a3d-8e79-679077f2af81-/executors/container.1467.373c1d9b-4138-11e7-9117-024221dd5669/runs/c4b299e6-629a-4a99-bd88-cfbca0262b1a/pids/forked.pid'
> I0525 18:52:15.903939  1206 cni.cpp:888] Bind mounted '/proc/3985/ns/net' to 
> '/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' 
> for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:16.347486  1206 cni.cpp:1301] Got assigned IPv4 address 
> '192.168.110.2/24' from CNI network 'netcontiv' for container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:16.347533  1206 cni.cpp:1307] Got assigned IPv6 address '' from 
> CNI network 'netcontiv' for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:16.347687  1206 cni.cpp:1010] Unable to find DNS nameservers for 
> container c4b299e6-629a-4a99-bd88-cfbca0262b1a, using host '/etc/resolv.conf'
> I0525 18:52:24.579439  1206 containerizer.cpp:2508] Container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a has exited
> I0525 18:52:24.579493  1206 containerizer.cpp:2102] Destroying container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a in RUNNING state
> I0525 18:52:24.579560  1206 linux_launcher.cpp:505] Asked to destroy 
> container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.580025  1206 linux_launcher.cpp:548] Using freezer to destroy 
> cgroup mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.580930  1206 cgroups.cpp:2692] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.582156  1206 cgroups.cpp:1405] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
> 1.18784ms
> I0525 18:52:24.583359  1206 cgroups.cpp:2710] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.584491  1206 cgroups.cpp:1434] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
> 1.093888ms
> I0525 18:52:24.681495  1203 cni.cpp:1479] Unmounted the network namespace 
> handle 
> '/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' 
> for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.681591  1203 cni.cpp:1490] Removed the container directory 
> '/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a'
> I0525 18:52:24.691004  1203 slave.cpp:5168] Executor 
> 'container.1467.373c1d9b-4138-11e7-9117-024221dd5669' of framework 
> 00e6894c-d896-4a3d-8e79-679077f2af81- terminated with signal Killed
> I0525 18:52:24.691063  1203 slave.cpp:4215] Handling status update 
> TASK_FAILED (UUID: e90f3161-d136-4607-a67c-a621df9e82e4) for task 
> container.1467.373c1d9b-4138-11e7-9117-024221dd5669 of framework 
> 00e6894c-d896-4a3d-8e79-679077f2af81- from @0.0.0.0:0
> Log from sandbox:
> I0525 18:52:36.583499  4041 exec.cpp:162] Version: 1.3.0
> E0525 18:52:39.593489  4050 process.cpp:2450] Failed to shutdown socket with 
> fd 6, address 192.168.110.2:34176: Transport endpoint is not connected
> I0525 18:52:39.593582  4048 exec.cpp:497] Agent exited ... shutting down
> However, when deploying a task without the ipAddress field, the Mesos slave 
> launches the task successfully.
> Tested with various Mesos/Marathon/Contiv versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7565) Container with "Contiv" networking fails upon startup

2017-05-25 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025439#comment-16025439
 ] 

Joseph Wu commented on MESOS-7565:
--

It appears that, with your networking setup, the executor cannot open a 
connection to the agent (presumably located at {{192.168.110.2}}).  That's why 
the executor logs {{Agent exited ... shutting down}}.
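
One way to verify that half of the picture, sketched here as a standalone check rather than anything shipped with Mesos, is to attempt a plain TCP connect to the agent's address from inside the task's network namespace (both the address and port below are placeholders):

{code}
#include <arpa/inet.h>
#include <cstdio>
#include <cstring>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical connectivity probe: run it inside the container's network
// namespace and point it at the agent's advertised IP and libprocess port.
int main()
{
  const char* agentIp = "192.168.110.2";  // placeholder agent address
  const int agentPort = 5051;             // placeholder agent port

  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { std::perror("socket"); return 1; }

  sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_port = htons(agentPort);
  inet_pton(AF_INET, agentIp, &addr.sin_addr);

  if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    std::perror("connect to agent failed");  // matches the symptom above
    close(fd);
    return 1;
  }

  std::printf("TCP connect to %s:%d succeeded\n", agentIp, agentPort);
  close(fd);
  return 0;
}
{code}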

> Container with "Contiv" networking fails upon startup
> -
>
> Key: MESOS-7565
> URL: https://issues.apache.org/jira/browse/MESOS-7565
> Project: Mesos
>  Issue Type: Bug
>  Components: network
>Affects Versions: 1.2.2, 1.3.1
> Environment: centos 7.3
>Reporter: Hao Yixin
>
> When launching a task through Marathon and asking for an IP to be assigned to 
> the task (using Contiv networking):
> Log from mesos-slave:
> I0525 18:52:15.898908  1210 linux_launcher.cpp:429] Launching container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a and cloning with namespaces CLONE_NEWNS 
> | CLONE_NEWUTS | CLONE_NEWNET
> I0525 18:52:15.900668  1210 systemd.cpp:96] Assigned child process '3985' to 
> 'mesos_executors.slice'
> I0525 18:52:15.902612  1206 containerizer.cpp:1623] Checkpointing container's 
> forked pid 3985 to 
> '/var/lib/mesos/meta/slaves/00e6894c-d896-4a3d-8e79-679077f2af81-S4/frameworks/00e6894c-d896-4a3d-8e79-679077f2af81-/executors/container.1467.373c1d9b-4138-11e7-9117-024221dd5669/runs/c4b299e6-629a-4a99-bd88-cfbca0262b1a/pids/forked.pid'
> I0525 18:52:15.903939  1206 cni.cpp:888] Bind mounted '/proc/3985/ns/net' to 
> '/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' 
> for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:16.347486  1206 cni.cpp:1301] Got assigned IPv4 address 
> '192.168.110.2/24' from CNI network 'netcontiv' for container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:16.347533  1206 cni.cpp:1307] Got assigned IPv6 address '' from 
> CNI network 'netcontiv' for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:16.347687  1206 cni.cpp:1010] Unable to find DNS nameservers for 
> container c4b299e6-629a-4a99-bd88-cfbca0262b1a, using host '/etc/resolv.conf'
> I0525 18:52:24.579439  1206 containerizer.cpp:2508] Container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a has exited
> I0525 18:52:24.579493  1206 containerizer.cpp:2102] Destroying container 
> c4b299e6-629a-4a99-bd88-cfbca0262b1a in RUNNING state
> I0525 18:52:24.579560  1206 linux_launcher.cpp:505] Asked to destroy 
> container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.580025  1206 linux_launcher.cpp:548] Using freezer to destroy 
> cgroup mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.580930  1206 cgroups.cpp:2692] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.582156  1206 cgroups.cpp:1405] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
> 1.18784ms
> I0525 18:52:24.583359  1206 cgroups.cpp:2710] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.584491  1206 cgroups.cpp:1434] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/c4b299e6-629a-4a99-bd88-cfbca0262b1a after 
> 1.093888ms
> I0525 18:52:24.681495  1203 cni.cpp:1479] Unmounted the network namespace 
> handle 
> '/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a/ns' 
> for container c4b299e6-629a-4a99-bd88-cfbca0262b1a
> I0525 18:52:24.681591  1203 cni.cpp:1490] Removed the container directory 
> '/run/mesos/isolators/network/cni/c4b299e6-629a-4a99-bd88-cfbca0262b1a'
> I0525 18:52:24.691004  1203 slave.cpp:5168] Executor 
> 'container.1467.373c1d9b-4138-11e7-9117-024221dd5669' of framework 
> 00e6894c-d896-4a3d-8e79-679077f2af81- terminated with signal Killed
> I0525 18:52:24.691063  1203 slave.cpp:4215] Handling status update 
> TASK_FAILED (UUID: e90f3161-d136-4607-a67c-a621df9e82e4) for task 
> container.1467.373c1d9b-4138-11e7-9117-024221dd5669 of framework 
> 00e6894c-d896-4a3d-8e79-679077f2af81- from @0.0.0.0:0
> Log from sandbox:
> I0525 18:52:36.583499  4041 exec.cpp:162] Version: 1.3.0
> E0525 18:52:39.593489  4050 process.cpp:2450] Failed to shutdown socket with 
> fd 6, address 192.168.110.2:34176: Transport endpoint is not connected
> I0525 18:52:39.593582  4048 exec.cpp:497] Agent exited ... shutting down
> However, when deploying a task without the ipAddress field, the Mesos slave 
> launches the task successfully.
> Tested with various Mesos/Marathon/Contiv versions.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7492) Introduce a daemon manager in the agent.

2017-05-25 Thread James DeFelice (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025484#comment-16025484
 ] 

James DeFelice commented on MESOS-7492:
---

Aren't {{poll_interval}} and {{initial_delay}} baked into {{CheckInfo}} already?

> Introduce a daemon manager in the agent.
> 
>
> Key: MESOS-7492
> URL: https://issues.apache.org/jira/browse/MESOS-7492
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>
> Once we have standalone container support from the containerizer, we should 
> consider adding a daemon manager inside the agent. It'll be like 'monit', 
> 'upstart' or 'systemd', but with very limited functionalities. For instance, 
> as a start, the manager will simply always restart the daemons if the daemon 
> fails. It'll also try to cleanup unknown daemons.
> This feature will be used to manage CSI plugin containers on the agent.
> The daemon manager should have an interface allowing operators to "register" 
> a daemon with a name and a config of the daemon. The daemon manager is 
> responsible for restarting the daemon if it crashes until some one explicitly 
> "unregister" it. Some simple backoff and health check functionality should be 
> provided.
> We probably need a small design doc for this.
> {code}
> message DaemonConfig {
>   optional ContainerInfo container = 1;
>   optional CommandInfo command = 2;
>   optional uint32 poll_interval = 3;
>   optional uint32 initial_delay = 4;
>   optional CheckInfo check = 5; // For health check.
> }
>
> // NOTE: the template arguments of the futures below were stripped in the
> // original message; `Nothing`, the container list, and `DaemonStatus` are
> // reconstructed placeholders, not the author's exact types.
> class DaemonManager
> {
> public:
>   Future<Nothing> register(
>       const ContainerID& containerId,
>       const DaemonConfig& config);
>
>   Future<Nothing> unregister(const ContainerID& containerId);
>
>   Future<vector<ContainerID>> ps();
>
>   Future<DaemonStatus> status(const ContainerID& containerId);
> };
> {code}
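
For context, a minimal sketch of the restart-on-failure behavior the description asks for, assuming a blocking `runDaemon` callback and a fixed doubling backoff; this is illustrative only, not the proposed manager:

{code}
#include <algorithm>
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical supervisor loop: keep (re)starting a daemon until it is
// explicitly unregistered. `runDaemon` launches the daemon and blocks
// until it exits, returning its exit status.
void superviseDaemon(const std::function<int()>& runDaemon,
                     const std::atomic<bool>& unregistered)
{
  auto backoff = std::chrono::seconds(1);

  while (!unregistered) {
    const int status = runDaemon();

    if (unregistered) {
      break;  // explicit "unregister": do not restart
    }

    // The daemon crashed or exited on its own: back off, then restart.
    (void) status;  // a real manager would log and inspect this
    std::this_thread::sleep_for(backoff);
    backoff = std::min(backoff * 2, std::chrono::seconds(60));
  }
}
{code}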



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7542) Add executor reconnection retry logic to the agent

2017-05-25 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025530#comment-16025530
 ] 

Greg Mann commented on MESOS-7542:
--

Implementation and testing of the executor changes, which cause the executor to 
drop messages when it is not connected to the agent:
https://reviews.apache.org/r/59582/
https://reviews.apache.org/r/59583/

> Add executor reconnection retry logic to the agent
> --
>
> Key: MESOS-7542
> URL: https://issues.apache.org/jira/browse/MESOS-7542
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> Currently, the agent sends a single {{ReconnectExecutorMessage}} to PID-based 
> executors during recovery. It would be more robust to have the agent retry 
> these messages until {{executor_reregister_timeout}} has elapsed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7567) Show summary statistics for benchmarks.

2017-05-25 Thread James Peach (JIRA)
James Peach created MESOS-7567:
--

 Summary: Show summary statistics for benchmarks.
 Key: MESOS-7567
 URL: https://issues.apache.org/jira/browse/MESOS-7567
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: James Peach


Some of the benchmark tests repeat an operation and the user is supposed to 
look at the log output and decide if the operation performance is OK. We should 
improve this by using `process::TimeSeries` and `process::Statistics` to output 
a summary of the benchmark.

For example, https://reviews.apache.org/r/49571/ would benefit from this.
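
As a rough sketch of the kind of summary that could replace eyeballing the log, here is a plain standard-library version (it does not use the libprocess {{TimeSeries}}/{{Statistics}} types, whose interfaces are not shown in this ticket):

{code}
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Summarize per-operation durations collected by a benchmark:
// sample count, min, median, 90th percentile, and max.
void printSummary(std::vector<std::chrono::microseconds> samples)
{
  if (samples.empty()) {
    return;
  }

  std::sort(samples.begin(), samples.end());

  const auto at = [&](double q) {
    const size_t i = static_cast<size_t>(q * (samples.size() - 1));
    return static_cast<long long>(samples[i].count());
  };

  std::printf(
      "N=%zu min=%lldus p50=%lldus p90=%lldus max=%lldus\n",
      samples.size(),
      at(0.0), at(0.5), at(0.9), at(1.0));
}
{code}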



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7564) Introduce a heartbeat mechanism for executor <-> agent communication.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7564:
---
Issue Type: Bug  (was: Task)

> Introduce a heartbeat mechanism for executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios when IPFilters are enabled since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible 
> way for fixing this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7568) Introduce a heartbeat mechanism for v0 executor <-> agent links.

2017-05-25 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7568:
--

 Summary: Introduce a heartbeat mechanism for v0 executor <-> agent 
links.
 Key: MESOS-7568
 URL: https://issues.apache.org/jira/browse/MESOS-7568
 Project: Mesos
  Issue Type: Bug
Reporter: Anand Mazumdar


Currently, we do not have heartbeats for executor <-> agent communication. This 
is especially problematic in scenarios when IPFilters are enabled since the 
default conntrack keep alive timeout is 5 days. When that timeout elapses, the 
executor doesn't get notified via a socket disconnection when the agent process 
restarts. The executor would then get killed if it doesn't re-register when the 
agent recovery process is completed.

Enabling application level heartbeats or TCP KeepAlive's can be a possible way 
for fixing this issue.
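
A hedged illustration of the application-level heartbeat half of that idea, in generic C++ rather than the eventual Mesos protocol: one side records when it last heard from its peer, and a monitor declares the peer gone after a few missed intervals:

{code}
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical monitor: `lastSeenMs` is updated (steady-clock milliseconds)
// whenever any message, including a heartbeat, arrives from the peer.
// If nothing is seen for `interval * maxMissed`, `onPeerLost` fires.
// Regular heartbeats also keep the connection well inside conntrack's
// idle timeout.
void monitorPeer(const std::atomic<long long>& lastSeenMs,
                 std::chrono::milliseconds interval,
                 int maxMissed,
                 const std::function<void()>& onPeerLost)
{
  using clock = std::chrono::steady_clock;

  while (true) {
    std::this_thread::sleep_for(interval);

    const long long nowMs =
      std::chrono::duration_cast<std::chrono::milliseconds>(
          clock::now().time_since_epoch()).count();

    if (nowMs - lastSeenMs > interval.count() * maxMissed) {
      onPeerLost();  // e.g. treat the agent or executor as disconnected
      return;
    }
  }
}
{code}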



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7564) Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7564:
---
Summary: Introduce a heartbeat mechanism for v1 HTTP executor <-> agent 
communication.  (was: Introduce a heartbeat mechanism for executor <-> agent 
communication.)

> Introduce a heartbeat mechanism for v1 HTTP executor <-> agent communication.
> -
>
> Key: MESOS-7564
> URL: https://issues.apache.org/jira/browse/MESOS-7564
> Project: Mesos
>  Issue Type: Bug
>Reporter: Anand Mazumdar
>
> Currently, we do not have heartbeats for executor <-> agent communication. 
> This is especially problematic in scenarios when IPFilters are enabled since 
> the default conntrack keep alive timeout is 5 days. When that timeout 
> elapses, the executor doesn't get notified via a socket disconnection when 
> the agent process restarts. The executor would then get killed if it doesn't 
> re-register when the agent recovery process is completed.
> Enabling application level heartbeats or TCP KeepAlive's can be a possible 
> way for fixing this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7542) Add executor reconnection retry logic to the agent

2017-05-25 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025602#comment-16025602
 ] 

Greg Mann commented on MESOS-7542:
--

Implementation/tests of the agent-side behavior:
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59584/

> Add executor reconnection retry logic to the agent
> --
>
> Key: MESOS-7542
> URL: https://issues.apache.org/jira/browse/MESOS-7542
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> Currently, the agent sends a single {{ReconnectExecutorMessage}} to PID-based 
> executors during recovery. It would be more robust to have the agent retry 
> these messages until {{executor_reregister_timeout}} has elapsed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5361) Consider introducing TCP KeepAlive for Libprocess sockets.

2017-05-25 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025600#comment-16025600
 ] 

Benjamin Mahler commented on MESOS-5361:


Linking in the executor-related tickets that came up due to conntrack 
considering connections stale after 5 days.

> Consider introducing TCP KeepAlive for Libprocess sockets.
> --
>
> Key: MESOS-5361
> URL: https://issues.apache.org/jira/browse/MESOS-5361
> Project: Mesos
>  Issue Type: Improvement
>  Components: libprocess
>Reporter: Anand Mazumdar
>  Labels: mesosphere
>
> We currently don't use TCP KeepAlive's when creating sockets in libprocess. 
> This might benefit master - scheduler, master - agent connections i.e. we can 
> detect if any of them failed faster.
> Currently, if the master process goes down and for some reason the {{RST}} 
> sequence does not reach the scheduler, the scheduler only comes to know 
> about the disconnection when it tries to do a {{send}} itself. 
> The default TCP keep alive values on Linux are of little use in a real world 
> application:
> {code}
> This means that the keepalive routines wait for two hours (7200 secs) 
> before sending the first keepalive probe, and then resend it every 75 
> seconds. If no ACK response is received for nine consecutive times, the 
> connection is marked as broken.
> {code}
> However, for long-running instances of the scheduler/agent this can still be 
> beneficial. Also, operators might start tuning the values for their clusters 
> explicitly once we start supporting it.
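
For reference, enabling and tuning keepalives on a Linux TCP socket looks roughly like this; the specific values are illustrative, not proposed defaults:

{code}
#include <cstdio>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Enable TCP keepalive on an existing socket `fd`, overriding the Linux
// defaults (2 hours idle, 75 s between probes, 9 probes) with tighter values.
bool enableKeepalive(int fd)
{
  const int enable = 1;
  const int idleSecs = 60;      // idle time before the first probe
  const int intervalSecs = 10;  // time between probes
  const int probes = 5;         // unanswered probes before declaring the peer dead

  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &enable, sizeof(enable)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idleSecs, sizeof(idleSecs)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                 &intervalSecs, sizeof(intervalSecs)) != 0 ||
      setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) != 0) {
    std::perror("setsockopt");
    return false;
  }

  return true;
}
{code}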



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7542) Add executor reconnection retry logic to the agent

2017-05-25 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025602#comment-16025602
 ] 

Greg Mann edited comment on MESOS-7542 at 5/26/17 12:48 AM:


Implementation/tests of the agent-side behavior:
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59585/
https://reviews.apache.org/r/59586/
https://reviews.apache.org/r/59587/


was (Author: greggomann):
Implementation/tests of the agent-side behavior:
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59584/
https://reviews.apache.org/r/59584/

> Add executor reconnection retry logic to the agent
> --
>
> Key: MESOS-7542
> URL: https://issues.apache.org/jira/browse/MESOS-7542
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, executor
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>  Labels: mesosphere
>
> Currently, the agent sends a single {{ReconnectExecutorMessage}} to PID-based 
> executors during recovery. It would be more robust to have the agent retry 
> these messages until {{executor_reregister_timeout}} has elapsed.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-25 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-7569:
--

 Summary: Allow "old" executors with half-open connections to be 
preserved during agent upgrade / restart.
 Key: MESOS-7569
 URL: https://issues.apache.org/jira/browse/MESOS-7569
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Benjamin Mahler


Users who have executors in their cluster without the fix to MESOS-7057 will 
experience these executors potentially being destroyed whenever the agent 
restarts (or is upgraded).

This occurs when these old executors have connections idle for > 5 days 
(the default conntrack tcp timeout). At this point, the connection is timed out 
and no longer tracked by conntrack. From what we've seen, if the agent stays up, 
the packets still flow between the executor and agent. However, once the agent 
restarts, in some cases (presence of a DROP rule, or some flavors of NATing), 
the executor does not receive the RST/FIN from the kernel and will hold a 
half-open TCP connection. At this point, when the executor responds to the 
reconnect message from the restarted agent, its half-open TCP connection 
closes, and the executor will be destroyed by the agent.

In order to allow users to preserve the tasks running in these "old" executors 
(i.e. without the MESOS-7057 fix), we can add *optional* retrying of the 
reconnect message in the agent. This allows the old executor to correctly 
establish a link to the agent when the second reconnect message is handled.

Longer term, heartbeating or TCP keepalives will prevent the connections from 
reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7569:
---
Target Version/s: 1.2.2, 1.3.1, 1.4.0, 1.1.3  (was: 1.2.2, 1.3.1, 1.4.0)

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack tcp timeout). At this point, the connection is timed out 
> and no longer tracked by conntrack. From what we've seen, if the agent stays 
> up, the packets still flow between the executor and agent. However, once the 
> agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-7569) Allow "old" executors with half-open connections to be preserved during agent upgrade / restart.

2017-05-25 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-7569:
--

Assignee: Benjamin Mahler

> Allow "old" executors with half-open connections to be preserved during agent 
> upgrade / restart.
> 
>
> Key: MESOS-7569
> URL: https://issues.apache.org/jira/browse/MESOS-7569
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Benjamin Mahler
>Assignee: Benjamin Mahler
>
> Users who have executors in their cluster without the fix to MESOS-7057 will 
> experience these executors potentially being destroyed whenever the agent 
> restarts (or is upgraded).
> This occurs when these old executors have connections idle for > 5 days 
> (default conntrack tcp timeout). At this point, the connection is timed out 
> and no longer tracked by conntrack. From what we've seen, if the agent stays 
> up, the packets still flow between the executor and agent. However, once the 
> agent restarts, in some cases (presence of a DROP rule, or some flavors of 
> NATing), the executor does not receive the RST/FIN from the kernel and will 
> hold a half-open TCP connection. At this point, when the executor responds to 
> the reconnect message from the restarted agent, its half-open TCP connection 
> closes, and the executor will be destroyed by the agent.
> In order to allow users to preserve the tasks running in these "old" 
> executors (i.e. without the MESOS-7057 fix), we can add *optional* retrying 
> of the reconnect message in the agent. This allows the old executor to 
> correctly establish a link to the agent when the second reconnect message is 
> handled.
> Longer term, heartbeating or TCP keepalives will prevent the connections from 
> reaching the conntrack timeout (see MESOS-7568).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5332) TASK_LOST on slave restart potentially due to executor race condition

2017-05-25 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025625#comment-16025625
 ] 

Benjamin Mahler commented on MESOS-5332:


In order to enable users who hit this situation to safely upgrade (without 
destroying all executors whose connections have been idle for more than 5 days), 
we will introduce an optional retry of the reconnect message via MESOS-7569:

https://reviews.apache.org/r/59584/

This will allow the preservation of executors without the relink (MESOS-7057) 
fix when upgrading an agent. Longer term, TCP keepalives or heartbeating will 
be put in place to avoid the connections timing out in conntrack.

> TASK_LOST on slave restart potentially due to executor race condition
> -
>
> Key: MESOS-5332
> URL: https://issues.apache.org/jira/browse/MESOS-5332
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, libprocess
>Affects Versions: 0.26.0
> Environment: Mesos 0.26
> Aurora 0.13
>Reporter: Stephan Erb
>Assignee: Anand Mazumdar
> Attachments: executor-logs.tar.gz, executor-stderr.log, 
> executor-stderrV2.log, mesos-slave.log
>
>
> When restarting the Mesos agent binary, tasks can end up as LOST. We lose 
> from 20% to 50% of all tasks. They are killed by the Mesos agent via:
> {code}
> I0505 08:42:06.781318 21738 slave.cpp:2702] Cleaning up un-reregistered 
> executors
> I0505 08:42:06.781366 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-28854-0-6a88d62e-656
> 4-4e33-b0bb-1d8039d97afc' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:40541
> I0505 08:42:06.781446 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-23839-0-1d2cd0e6-699
> 4-4cba-a9df-3dfc1552667f' of framework 
> 20151001-085346-58917130-5050-37976- at executor(1)@10.X.X.X:35757
> I0505 08:42:06.781466 21738 slave.cpp:2720] Killing un-reregistered executor 
> 'thermos-nobody-devel-service-29970-0-478a7291-d070-4aa8
> -af21-6fda889f750c' of framework 20151001-085346-58917130-5050-37976- at 
> executor(1)@10.X.X.X:51463
> ...
> I0505 08:42:06.781558 21738 slave.cpp:4230] Finished recovery
> {code}
> We have verified that the tasks and their executors are killed by the agent 
> during startup. When stopping the agent using supervisorctl stop, the 
> executors are still running (verified via {{ps aux}}). They are only killed 
> once the agent tries to reregister.
> The issue is hard to reproduce:
> * When restarting the agent binary multiple times, tasks are only lost for 
> the first restart.
> * It is much more likely to occur if the agent binary has been running for a 
> longer period of time (> 7 days)
> Mesos is correctly sticking to the 2-second wait time before killing 
> un-reregistered executors. The failed executors receive the reregistration 
> request, but it seems like they fail to send a reply.
> A successful reregistration (not leading to LOST):
> {code}
> I0505 08:41:59.581231 21664 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.780591 21665 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.785297 21676 exec.cpp:233] Executor re-registered on slave 
> 20160118-141153-92471562-5050-6270-S17
> I0505 08:42:04.788579 21676 exec.cpp:245] Executor::reregistered took 
> 1.492339ms
> {code}
> A failed one:
> {code}
> I0505 08:42:04.779677  2389 exec.cpp:256] Received reconnect request from 
> slave 20160118-141153-92471562-5050-6270-S17
> E0505 08:42:05.481374  2408 process.cpp:1911] Failed to shutdown socket with 
> fd 11: Transport endpoint is not connected
> I0505 08:42:05.481374  2395 exec.cpp:456] Slave exited, but framework has 
> checkpointing enabled. Waiting 15mins to reconnect with slave 
> 20160118-141153-92471562-5050-6270-S17
> {code}
> All tasks ending up in LOST have output similar to the one posted above, 
> i.e. the log messages are in the wrong order.
> Anyone have an idea what might be going on here? 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7570) Add a storage local resource provider.

2017-05-25 Thread Jie Yu (JIRA)
Jie Yu created MESOS-7570:
-

 Summary: Add a storage local resource provider.
 Key: MESOS-7570
 URL: https://issues.apache.org/jira/browse/MESOS-7570
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu
Assignee: Jie Yu






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7570) Add a storage local resource provider.

2017-05-25 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7570:
--
Description: 
This will be a subclass of LocalResourceProvider. It will interact with CSI 
plugins to get capacity information as well as talk to the plugin for 
operations like CREATE/DESTROY.


> Add a storage local resource provider.
> --
>
> Key: MESOS-7570
> URL: https://issues.apache.org/jira/browse/MESOS-7570
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Jie Yu
>
> This will be a subclass of LocalResourceProvider. It will interact with CSI 
> plugins to get capacity information as well as talk to the plugin for 
> operations like CREATE/DESTROY.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7571) Add `--resource_providers` flag to the agent.

2017-05-25 Thread Jie Yu (JIRA)
Jie Yu created MESOS-7571:
-

 Summary: Add `--resource_providers` flag to the agent.
 Key: MESOS-7571
 URL: https://issues.apache.org/jira/browse/MESOS-7571
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu
Assignee: Jie Yu


Add an agent flag `--resource_providers` to allow operators to specify the list 
of local resource providers to register with the master.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7342) Port Docker tests

2017-05-25 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025642#comment-16025642
 ] 

Joseph Wu commented on MESOS-7342:
--

{code}
commit c4df8f7b0e48e190111ef427080bf3ca5019a25a
Author: John Kordich 
Date:   Thu May 25 17:24:10 2017 -0700

Windows: Enabled DOCKER and ROOT test filters.

This flips two test filters on Windows, so that all Docker and Root
tests are now enabled by default on Windows (rather than disabled).

On Windows, everything must be run as Administrator (like root, but
also somewhat different), so all ROOT tests are enabled by default.
The DOCKER filter now has the same behavior on Windows and Linux.

However, since many of the tests affected by these filters are
still not working on Windows, some individual tests have been
disabled.

The main test enabled by this commit is:
`DockerTest.ROOT_DOCKER_Version`.

Review: https://reviews.apache.org/r/59353/
{code}

> Port Docker tests
> -
>
> Key: MESOS-7342
> URL: https://issues.apache.org/jira/browse/MESOS-7342
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
> Environment: Windows 10
>Reporter: Andrew Schwartzmeyer
>Assignee: John Kordich
>  Labels: microsoft, windows
>
> While one of Daniel Pravat's last acts was introducing the Docker 
> containerizer for Windows, we don't have tests. We need to port 
> `docker_tests.cpp` and `docker_containerizer_tests.cpp` to Windows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)