[jira] [Commented] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread mark1982 (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470061#comment-15470061
 ] 

mark1982 commented on MESOS-6125:
-

Scheduler#resourceOffers(SchedulerDriver schedulerDriver, List<Offer> offers) 
may block.

If this method blocks, the task's status can still be sent to the registry 
center, but Mesos' Scheduler#statusUpdate(SchedulerDriver driver, 
Protos.TaskStatus taskStatus) is never called back.

That is why a task whose status is TASK_FINISHED still ends up in the 
'active' task list.
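The failure mode above can be sketched in Java. This is a minimal illustration, not the real Mesos scheduler API: the class name, the String-typed parameters, and the offerWorker field are all hypothetical. The point is that driver callbacks are delivered sequentially on one thread, so resourceOffers() must return quickly; handing slow offer processing to a separate executor keeps statusUpdate() deliverable.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Minimal sketch (NOT the real Mesos API): driver callbacks arrive on a
// single thread, so a blocking resourceOffers() starves statusUpdate().
// Offloading the slow work keeps the callback thread free.
public class NonBlockingScheduler {
    // Worker for slow offer processing (hypothetical field name).
    final ExecutorService offerWorker = Executors.newSingleThreadExecutor();

    // Called by the driver; must return quickly.
    public void resourceOffers(String offers) {
        offerWorker.submit(() -> {
            // Slow work (matching offers to pending tasks, contacting a
            // registry) happens off the callback thread.
        });
    }

    // Delivered promptly because resourceOffers() did not block; the
    // framework can now drop TASK_FINISHED tasks from its 'active' list.
    public boolean statusUpdate(String taskState) {
        return "TASK_FINISHED".equals(taskState);
    }

    public static void main(String[] args) {
        NonBlockingScheduler s = new NonBlockingScheduler();
        s.resourceOffers("offer-1");
        System.out.println(s.statusUpdate("TASK_FINISHED"));
        s.offerWorker.shutdown();
    }
}
```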

> Task still 'active' after TASK_FINISHED status
> --
>
> Key: MESOS-6125
> URL: https://issues.apache.org/jira/browse/MESOS-6125
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.24.2
> Environment: java docker 
>Reporter: xingxingwang
>Priority: Critical
>
> I built my own application on top of Mesos following the guide at 
> "https://github.com/AgilData/mesos-docker-tutorial". However, I encounter a 
> problem: the task is already finished, but it still remains active in the 
> Mesos master UI, and the interface method named "statusUpdate" does not 
> receive any messages, which puzzles me a lot. I followed the reference 
> "https://mail-archives.apache.org/mod_mbox/mesos-user/201511.mbox/%3CCAGzvUEygz--6XxKBxYO_pH=Sj6U=qss_3y3kthk14svmc86...@mail.gmail.com%3E",
>  but I still do not know why.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470078#comment-15470078
 ] 

xingxingwang commented on MESOS-6125:
-

you are right,mark



[jira] [Issue Comment Deleted] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingxingwang updated MESOS-6125:

Comment: was deleted

(was: you are right,mark)



[jira] [Issue Comment Deleted] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingxingwang updated MESOS-6125:

Comment: was deleted

(was: as mark said)



[jira] [Commented] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470084#comment-15470084
 ] 

xingxingwang commented on MESOS-6125:
-

mark1982 is right



[jira] [Issue Comment Deleted] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingxingwang updated MESOS-6125:

Comment: was deleted

(was: mark1982 is right)



[jira] [Commented] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470086#comment-15470086
 ] 

xingxingwang commented on MESOS-6125:
-

mark1982 is right



[jira] [Issue Comment Deleted] (MESOS-6125) Task still 'active' after TASK_FINISHED status

2016-09-07 Thread xingxingwang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xingxingwang updated MESOS-6125:

Comment: was deleted

(was: please give me a help)



[jira] [Commented] (MESOS-4606) Add IPv6 support to net::IP and net::IPNetwork

2016-09-07 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470145#comment-15470145
 ] 

Benno Evers commented on MESOS-4606:


Yes, an implementation is available at 
https://github.com/lava/mesos/commit/8b83489a5cd5e3fe81c98cae3dfe58a7e945376f

So far no shepherd has been willing to take on this task; maybe this will 
change once a design document for the bigger issue (IPv6 support in Mesos) is 
finished, which should be ready in the next few days to weeks.

> Add IPv6 support to net::IP and net::IPNetwork
> --
>
> Key: MESOS-4606
> URL: https://issues.apache.org/jira/browse/MESOS-4606
> Project: Mesos
>  Issue Type: Improvement
>  Components: stout
>Reporter: Benno Evers
>Assignee: Benno Evers
>Priority: Minor
>  Labels: network, stout
>
> The classes net::IP and net::IPNetwork should to be able to store IPv6 
> addresses.





[jira] [Commented] (MESOS-5803) Command health checks do not survive after framework restart

2016-09-07 Thread Alexandr Kuzmitsky (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15470218#comment-15470218
 ] 

Alexandr Kuzmitsky commented on MESOS-5803:
---

This bugfix works together with 
https://github.com/mesosphere/marathon/pull/4094.
Both are our pull requests; we use these patches in our Mesos/Marathon setup.

> Command health checks do not survive after framework restart
> 
>
> Key: MESOS-5803
> URL: https://issues.apache.org/jira/browse/MESOS-5803
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>  Labels: health-check
>
> Reported in https://github.com/mesosphere/marathon/issues/916
> and https://github.com/apache/mesos/pull/118
> So far the health check only sends a healthy status update if the previous 
> status was failed or does not exist, so frameworks cannot learn the health 
> status of tasks after a master restart.





[jira] [Created] (MESOS-6132) Mesos 1.0.1 marathon 1.1.1 deployment fails no such image, container or task

2016-09-07 Thread Arslan Qadeer (JIRA)
Arslan Qadeer created MESOS-6132:


 Summary: Mesos 1.0.1 marathon 1.1.1 deployment fails no such 
image, container or task
 Key: MESOS-6132
 URL: https://issues.apache.org/jira/browse/MESOS-6132
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.1
Reporter: Arslan Qadeer
Priority: Blocker


Trying to launch Docker containers with Marathon on a bridged network, but 
they fail. I have also tried to pull the containers manually, but that did 
not work either.
Here are some logs from the Mesos Slave/Master and Marathon.

Mesos Slave Logs:

"sudo ./bin/mesos-slave.sh --ip= --master=zk://:2181/mesos --advertise_ip= 
--docker_socket=/var/run/docker.sock --work_dir=$/tmp/mesos-slave 
--containerizers=docker --executor_registration_timeout=5mins 
--isolation=docker/runtime"

I0907 06:02:58.206800  5820 slave.cpp:1495] Got assigned task 
helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545 for framework 
d034dd41-d32c-4e81-a168-070c2185eefe- [23/9246]
I0907 06:02:58.208921  5820 slave.cpp:1614] Launching task 
helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545 for framework 
d034dd41-d32c-4e81-a168-070c2185eefe-[22/9246]
I0907 06:02:58.210595  5820 paths.cpp:528] Trying to chown 
'$/tmp/mesos-slave/slaves/9b5505a5-71d9-47cd-a810-898bb24be347-S0/frameworks/d034dd41-d32c-4e81-a168-070c2185ee[21/9246]
xecutors/helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545/runs/0a52cc6d-499d-487d-983a-d3d8e62213e2'
 to user 'aio' [20/9246]
I0907 06:02:58.214004  5820 slave.cpp:5674] Launching executor 
helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545 of framework 
d034dd41-d32c-4e81-a168-070c2185eefe- with[19/9246]
s cpus(*):0.1; mem(*):32 in work directory 
'$/tmp/mesos-slave/slaves/9b5505a5-71d9-47cd-a810-898bb24be347-S0/frameworks/d034dd41-d32c-4e81-a168-070c2185eefe-/executor[18/9246]
rld.43c00fb5-74e2-11e6-80f0-a2e45809a545/runs/0a52cc6d-499d-487d-983a-d3d8e62213e2'

   [17/9246]
I0907 06:02:58.216660  5820 slave.cpp:1840] Queuing task 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545' for executor 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545' [16/9246]
ork d034dd41-d32c-4e81-a168-070c2185eefe-   

  [15/9246]
I0907 06:02:58.217999  5817 docker.cpp:1042] Starting container 
'0a52cc6d-499d-487d-983a-d3d8e62213e2' for task 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545' (and exe[14/9246]
lloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545') of framework 
'd034dd41-d32c-4e81-a168-070c2185eefe-'
I0907 06:03:02.005821  5820 docker.cpp:658] Checkpointing pid 5853 to 
'$/tmp/mesos-slave/meta/slaves/9b5505a5-71d9-47cd-a810-898bb24be347-S0/frameworks/d034dd41-d32c-4e81-a168-07$
c2185eefe-/executors/helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545/runs/0a52cc6d-499d-487d-983a-d3d8e62213e2/pids/forked.pid'
I0907 06:03:02.057451  5814 slave.cpp:2828] Got registration for executor 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545' of framework 
d034dd41-d32c-4e81-a168-070c2185eefe- 
from executor(1)@192.168.50.14:59541
I0907 06:03:02.059079  5814 docker.cpp:1443] Ignoring updating container 
'0a52cc6d-499d-487d-983a-d3d8e62213e2' with resources passed to update is 
identical to existing resources
I0907 06:03:02.059430  5814 slave.cpp:2005] Sending queued task 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545' to executor 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545' of 
framework d034dd41-d32c-4e81-a168-070c2185eefe- at 
executor(1)@192.168.50.14:59541
I0907 06:03:07.500459  5815 slave.cpp:3211] Handling status update TASK_FAILED 
(UUID: ed6f0a0c-0687-4039-93e7-c14be4a9ec8c) for task 
helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a5$
5 of framework d034dd41-d32c-4e81-a168-070c2185eefe- from 
executor(1)@192.168.50.14:59541
E0907 06:03:07.539857  5820 slave.cpp:3456] Failed to update resources for 
container 0a52cc6d-499d-487d-983a-d3d8e62213e2 of executor 
'helloworld.43c00fb5-74e2-11e6-80f0-a2e45809$
545' running task helloworld.43c00fb5-74e2-11e6-80f0-a2e45809a545 on status 
update for terminal task, destroying container: Failed to run 'docker -H 
unix:///var/run/docker.sock i$
spect 
mesos-9b5505a5-71d9-47cd-a810-898bb24be347-S0.0a52cc6d-499d-487d-983a-d3d8e62213e2':
 exited with status 1; stderr='Error: No such image, container or task: 
mesos-9b5505a5-7$d9-47cd-a810-898bb24be347-S0.0a52cc6d-499d-487d-983a-d3d8e62213e2
'
I0907 06:03:07.540410  5820 docker.cpp:1852] Destroying container 
'0a52cc6d-499d-487d-983a-d3d8e62213e2'
I0907 06:03:07.540549  582

[jira] [Created] (MESOS-6133) List of isolators in agent help display is incomplete

2016-09-07 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-6133:
---

 Summary: List of isolators in agent help display is incomplete
 Key: MESOS-6133
 URL: https://issues.apache.org/jira/browse/MESOS-6133
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Benjamin Bannier


The agent help currently contains a hard-coded list of isolators,

{code}

{code}

This list is already incomplete, and it would, for example, never include 
isolators provided by loaded modules. It might make more sense to compute it 
dynamically at runtime.
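A runtime-computed list could look roughly like the following. This is a hedged sketch in Java rather than the agent's actual C++ code; the class name, method names, and the seed isolator names are all invented for illustration. The point is that the help text is derived from whatever is registered, so isolators contributed by loaded modules show up automatically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: build the agent's isolator help text at runtime from
// a registry instead of a hard-coded string (names invented; Mesos is C++).
public class IsolatorHelp {
    private static final List<String> isolators = new ArrayList<>(
        Arrays.asList("posix/cpu", "posix/mem", "docker/runtime"));

    // Modules loaded at startup register their isolators here.
    public static void register(String name) {
        isolators.add(name);
    }

    // The help text always reflects what is actually available.
    public static String helpText() {
        return "Available isolators: " + String.join(", ", isolators);
    }

    public static void main(String[] args) {
        register("com/example/custom");  // hypothetical module isolator
        System.out.println(helpText());
    }
}
```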





[jira] [Created] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor

2016-09-07 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6134:


 Summary: Port CFS quota support to Docker Containerizer using 
command executor
 Key: MESOS-6134
 URL: https://issues.apache.org/jira/browse/MESOS-6134
 Project: Mesos
  Issue Type: Bug
Reporter: Zhitao Li
Assignee: Zhitao Li


MESOS-2154 only partially fixed CFS quota support in the Docker 
Containerizer: that fix only works for custom executors.

This tracks the fix for the command executor so we can declare the support 
complete.





[jira] [Updated] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor

2016-09-07 Thread Zhitao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhitao Li updated MESOS-6134:
-
Fix Version/s: 1.1.0

> Port CFS quota support to Docker Containerizer using command executor
> -
>
> Key: MESOS-6134
> URL: https://issues.apache.org/jira/browse/MESOS-6134
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Assignee: Zhitao Li
> Fix For: 1.1.0
>
>
> MESOS-2154 only partially fixed the CFS quota support in Docker 
> Containerizer: that fix only works for custom executor.
> This tracks the fix for command executor so we can declare this is complete.





[jira] [Updated] (MESOS-6134) Port CFS quota support to Docker Containerizer using command executor

2016-09-07 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-6134:

Component/s: docker
 containerization

> Port CFS quota support to Docker Containerizer using command executor
> -
>
> Key: MESOS-6134
> URL: https://issues.apache.org/jira/browse/MESOS-6134
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Reporter: Zhitao Li
>Assignee: Zhitao Li
> Fix For: 1.1.0
>
>
> MESOS-2154 only partially fixed the CFS quota support in Docker 
> Containerizer: that fix only works for custom executor.
> This tracks the fix for command executor so we can declare this is complete.





[jira] [Commented] (MESOS-6132) Mesos 1.0.1 marathon 1.1.1 deployment fails no such image, container or task

2016-09-07 Thread Arslan Qadeer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471222#comment-15471222
 ] 

Arslan Qadeer commented on MESOS-6132:
--

Using --work_dir=/tmp/mesos_slave instead of --work_dir=$/tmp/mesos-slave 
solved the issue.

> Mesos 1.0.1 marathon 1.1.1 deployment fails no such image, container or task
> 
>
> Key: MESOS-6132
> URL: https://issues.apache.org/jira/browse/MESOS-6132
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Arslan Qadeer
>Priority: Blocker
> Fix For: 1.0.1
>
>

[jira] [Created] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky

2016-09-07 Thread Greg Mann (JIRA)
Greg Mann created MESOS-6135:


 Summary: ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky
 Key: MESOS-6135
 URL: https://issues.apache.org/jira/browse/MESOS-6135
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.1
 Environment: Ubuntu 14, libev, non-SSL
Reporter: Greg Mann


Observed in our internal CI:
{code}
[19:53:51] : [Step 10/10] [ RUN  ] 
ContainerLoggerTest.LOGROTATE_RotateInSandbox
[19:53:51]W: [Step 10/10] I0906 19:53:51.460055 23729 cluster.cpp:157] 
Creating default 'local' authorizer
[19:53:51]W: [Step 10/10] I0906 19:53:51.468907 23729 leveldb.cpp:174] 
Opened db in 8.730166ms
[19:53:51]W: [Step 10/10] I0906 19:53:51.472470 23729 leveldb.cpp:181] 
Compacted db in 3.544028ms
[19:53:51]W: [Step 10/10] I0906 19:53:51.472491 23729 leveldb.cpp:196] 
Created db iterator in 3678ns
[19:53:51]W: [Step 10/10] I0906 19:53:51.472496 23729 leveldb.cpp:202] 
Seeked to beginning of db in 673ns
[19:53:51]W: [Step 10/10] I0906 19:53:51.472499 23729 leveldb.cpp:271] 
Iterated through 0 keys in the db in 256ns
[19:53:51]W: [Step 10/10] I0906 19:53:51.472510 23729 replica.cpp:776] 
Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
[19:53:51]W: [Step 10/10] I0906 19:53:51.472709 23744 recover.cpp:451] 
Starting replica recovery
[19:53:51]W: [Step 10/10] I0906 19:53:51.472820 23748 recover.cpp:477] 
Replica is in EMPTY status
[19:53:51]W: [Step 10/10] I0906 19:53:51.473059 23748 replica.cpp:673] 
Replica in EMPTY status received a broadcasted recover request from 
__req_res__(177)@172.30.2.89:44578
[19:53:51]W: [Step 10/10] I0906 19:53:51.473146 23746 recover.cpp:197] 
Received a recover response from a replica in EMPTY status
[19:53:51]W: [Step 10/10] I0906 19:53:51.473234 23745 recover.cpp:568] 
Updating replica status to STARTING
[19:53:51]W: [Step 10/10] I0906 19:53:51.473629 23747 master.cpp:379] 
Master 6d1b2727-f42d-446b-b2f8-a9f7e7667340 (ip-172-30-2-89.mesosphere.io) 
started on 172.30.2.89:44578
[19:53:51]W: [Step 10/10] I0906 19:53:51.473644 23747 master.cpp:381] Flags 
at startup: --acls="" --agent_ping_timeout="15secs" 
--agent_reregister_timeout="10mins" --allocation_interval="1secs" 
--allocator="HierarchicalDRF" --authenticate_agents="true" 
--authenticate_frameworks="true" --authenticate_http_frameworks="true" 
--authenticate_http_readonly="true" --authenticate_http_readwrite="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/ceLmd7/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_agent_ping_timeouts="5" --max_completed_frameworks="50" 
--max_completed_tasks_per_framework="1000" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" --user_sorter="drf" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" 
--work_dir="/tmp/ceLmd7/master" --zk_session_timeout="10secs"
[19:53:51]W: [Step 10/10] I0906 19:53:51.473832 23747 master.cpp:431] 
Master only allowing authenticated frameworks to register
[19:53:51]W: [Step 10/10] I0906 19:53:51.473844 23747 master.cpp:445] 
Master only allowing authenticated agents to register
[19:53:51]W: [Step 10/10] I0906 19:53:51.473850 23747 master.cpp:458] 
Master only allowing authenticated HTTP frameworks to register
[19:53:51]W: [Step 10/10] I0906 19:53:51.473856 23747 credentials.hpp:37] 
Loading credentials for authentication from '/tmp/ceLmd7/credentials'
[19:53:51]W: [Step 10/10] I0906 19:53:51.473975 23747 master.cpp:503] Using 
default 'crammd5' authenticator
[19:53:51]W: [Step 10/10] I0906 19:53:51.474028 23747 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
[19:53:51]W: [Step 10/10] I0906 19:53:51.474097 23747 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
[19:53:51]W: [Step 10/10] I0906 19:53:51.474161 23747 http.cpp:883] Using 
default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
[19:53:51]W: [Step 10/10] I0906 19:53:51.474242 23747 master.cpp:583] 
Authorization enabled
[19:53:51]W: [Step 10/10] I0906 19:53:51.474308 23744 hierarchical.cpp:149] 
Initialized hierarchical allocator process
[19:53:51]W: [Step 10/10] I0906 19:53:51.474308 23750 
whitelist_watcher.cpp:77] No whitelist given
[19:53:51]W: [Step 10/10] I0906 19:53:51.474840 23745 master.cpp:1850] 
Elected as the leading master!
[19:53:51]W: [Step 10/10] I0906 19:53:51.474850 23745 master.cpp:1551] 
Recovering from registrar
[19:53:51]W:   

[jira] [Commented] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky

2016-09-07 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471271#comment-15471271
 ] 

Greg Mann commented on MESOS-6135:
--

[~kaysoky] FYI

> ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky
> --
>
> Key: MESOS-6135
> URL: https://issues.apache.org/jira/browse/MESOS-6135
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: Ubuntu 14, libev, non-SSL
>Reporter: Greg Mann
>  Labels: logging, mesosphere
>
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474028 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readonly'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474097 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-readwrite'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474161 23747 http.cpp:883] Using 
> default 'basic' HTTP authenticator for realm 'mesos-master-scheduler'
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474242 23747 master.cpp:583] 
> Authorization enabled
> [19:53:51]W:   [Step 10/10] I0906 19:53:51.474308 23744 hierarchical.cpp:14

[jira] [Commented] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-09-07 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471410#comment-15471410
 ] 

Alexander Rukletsov commented on MESOS-5987:


For posterity: the removed message and the changed field were experimental 
features and were not supposed to be part of the stable API. However, we 
decided to restore them to support those who rely on them.

> Update health check protobuf for HTTP and TCP health check
> --
>
> Key: MESOS-5987
> URL: https://issues.apache.org/jira/browse/MESOS-5987
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: haosdent
>  Labels: health-check, mesosphere
> Fix For: 1.1.0
>
>
> To support HTTP and TCP health check, we need to update the existing 
> {{HealthCheck}} protobuf message according to [~alexr] and [~gaston] 
> commented in https://reviews.apache.org/r/36816/ and 
> https://reviews.apache.org/r/49360/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5987) Update health check protobuf for HTTP and TCP health check

2016-09-07 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5987:
---
Sprint: Mesosphere Sprint 40, Mesosphere Sprint 42  (was: Mesosphere Sprint 
40)






[jira] [Comment Edited] (MESOS-6067) Support provisioner to be nested aware for Mesos Pods.

2016-09-07 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15436491#comment-15436491
 ] 

Gilbert Song edited comment on MESOS-6067 at 9/7/16 7:04 PM:
-

https://reviews.apache.org/r/51323/
https://reviews.apache.org/r/51392/
https://reviews.apache.org/r/51393/
https://reviews.apache.org/r/51402/
https://reviews.apache.org/r/51503/
https://reviews.apache.org/r/51420/
https://reviews.apache.org/r/51421/


was (Author: gilbert):
https://reviews.apache.org/r/51323/
https://reviews.apache.org/r/51343/
https://reviews.apache.org/r/51358/
https://reviews.apache.org/r/51359/
https://reviews.apache.org/r/51392/
https://reviews.apache.org/r/51393/
https://reviews.apache.org/r/51402/
https://reviews.apache.org/r/51420/
https://reviews.apache.org/r/51421/

> Support provisioner to be nested aware for Mesos Pods.
> --
>
> Key: MESOS-6067
> URL: https://issues.apache.org/jira/browse/MESOS-6067
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: containerizer, provisioner
>
> The provisioner has to be nested aware for sub-container provisioning, as 
> well as recovery and nested container destroy. Better to support multi-level 
> hierarchy. 





[jira] [Commented] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky

2016-09-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471542#comment-15471542
 ] 

Joseph Wu commented on MESOS-6135:
--

This test is flaky because it reads the contents of the {{stdout}} file 
and expects a certain size (with some amount of leeway).  It looks like it 
can fail when the executor logs slightly more (~1 KB) text than expected.

We can fix this by either: 
1) silencing the executor's logging with an environment variable 
({{MESOS_LOGGING_LEVEL=ERROR}}); or
2) increasing the amount of leeway the test can tolerate 
(https://github.com/apache/mesos/blob/b2101157fd61bbe42c9536935ee9fda44a929ee9/src/tests/container_logger_tests.cpp#L505).


[jira] [Updated] (MESOS-6135) ContainerLoggerTest.LOGROTATE_RotateInSandbox is flaky

2016-09-07 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-6135:
-
Shepherd: Joseph Wu
Story Points: 1


[jira] [Commented] (MESOS-5965) Implement garbage collection for unreachable agent lists in registry

2016-09-07 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471883#comment-15471883
 ] 

Neil Conway commented on MESOS-5965:


https://reviews.apache.org/r/51706/
https://reviews.apache.org/r/51707/

> Implement garbage collection for unreachable agent lists in registry
> 
>
> Key: MESOS-5965
> URL: https://issues.apache.org/jira/browse/MESOS-5965
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere, registry
>
> The list of unreachable agents (and eventually, the list of gone agents) can 
> grow without bound. We should implement a GC scheme to avoid these lists 
> getting too large and hampering the performance of the master.





[jira] [Created] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Christopher Hunt (JIRA)
Christopher Hunt created MESOS-6136:
---

 Summary: Duplicate framework id handling
 Key: MESOS-6136
 URL: https://issues.apache.org/jira/browse/MESOS-6136
 Project: Mesos
  Issue Type: Improvement
  Components: general
Affects Versions: 0.28.1
 Environment: DCOS 1.7 Cloud Formation scripts
Reporter: Christopher Hunt
Priority: Critical


We have observed a situation where Mesos will kill tasks belonging to a 
framework where that framework times out with the Mesos master for some reason, 
perhaps even because of a network partition.

While we can set a long timeout so that, for practical purposes, Mesos will not 
kill a framework's tasks, I'm wondering if there's an improvement whereby a 
framework still isn't permitted to re-register with a given id (as now), but 
Mesos also doesn't kill its tasks. What I'm thinking is that Mesos could be "told" 
by an operator that this condition should be cleared.

IMHO frameworks should be the only entity requesting that tasks be killed 
unless manually overridden by an operator.

I'm flagging this as a critical improvement because a) the focus should be on 
keeping tasks running in a system, and it isn't; and b) Mesos is working as 
designed. 

In summary I feel that Mesos is taking on a responsibility in killing tasks 
where it shouldn't be.





[jira] [Commented] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472146#comment-15472146
 ] 

Joseph Wu commented on MESOS-6136:
--

Sounds like you're asking for:
a) A way to orphan tasks on purpose; or
b) The {{failover_timeout}} that the framework is supposed to set: 
https://github.com/apache/mesos/blob/3e52a107c4073778de9c14bf5fcdeb6e342821aa/include/mesos/mesos.proto#L229-L237






[jira] [Comment Edited] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472146#comment-15472146
 ] 

Joseph Wu edited comment on MESOS-6136 at 9/7/16 11:40 PM:
---

Sounds like you're asking for:
a) A way to orphan tasks on purpose (e.g. [MESOS-4659]); or
b) The {{failover_timeout}} that the framework is supposed to set: 
https://github.com/apache/mesos/blob/3e52a107c4073778de9c14bf5fcdeb6e342821aa/include/mesos/mesos.proto#L229-L237


was (Author: kaysoky):
Sounds like you're asking for:
a) A way to orphan tasks on purpose; or
b) The {{failover_timeout}} that the framework is supposed to set: 
https://github.com/apache/mesos/blob/3e52a107c4073778de9c14bf5fcdeb6e342821aa/include/mesos/mesos.proto#L229-L237






[jira] [Commented] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Christopher Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472313#comment-15472313
 ] 

Christopher Hunt commented on MESOS-6136:
-

Thanks for the reply. However I'm not asking for either of those things.

In the case of MESOS-4659, the suggestion there is that Mesos kills tasks given 
the instruction of a framework. What I'm stating is that Mesos shouldn't be 
responsible for killing tasks unless an operator (human) intervenes. Killing 
tasks is a serious business and can possibly impact a business dramatically.

In the case of the failover timeout, we're already setting our timeout to 1 
week as suggested in order to mitigate this situation. What I'm saying, though, 
is that we shouldn't have to do this.

My point again: only frameworks should instruct Mesos to kill tasks unless 
overridden by an operator. The frameworks and operators know best, not Mesos.






[jira] [Commented] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472393#comment-15472393
 ] 

Joseph Wu commented on MESOS-6136:
--

I'd argue that Mesos is responsible for resource management.  So in a situation 
where a framework no longer exists but its task does exist, Mesos *should* kill 
the task to reclaim resources.

Frameworks are responsible for keeping their tasks alive for as long as needed 
(i.e. short-lived or long-lived).  That responsibility also includes 
re-registering within the timeout.  If one week isn't enough, set it to a 
month, or a century :)






[jira] [Commented] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Christopher Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472407#comment-15472407
 ] 

Christopher Hunt commented on MESOS-6136:
-

> So in a situation where a framework no longer exists...

Mesos can never be sure whether a framework still exists. For example, 
Mesos cannot determine whether the framework has stopped for some reason or 
whether there is just a network partition.






[jira] [Comment Edited] (MESOS-6136) Duplicate framework id handling

2016-09-07 Thread Christopher Hunt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15472407#comment-15472407
 ] 

Christopher Hunt edited comment on MESOS-6136 at 9/8/16 1:56 AM:
-

> So in a situation where a framework no longer exists...

Mesos can never be sure whether a framework still exists. For example, 
Mesos cannot determine whether the framework has stopped for some reason or 
whether there is just a network partition.

By comparison, Akka does not automatically "down" a cluster member when it 
becomes lost. Instead, it quarantines the member, requiring an operator to 
intervene (there is also a product we provide for handling split-brain 
scenarios that will automatically down parts of the cluster, but I 
digress...).

I'm suggesting that Mesos likewise quarantine frameworks without killing 
their tasks. Perhaps this could be considered "opt-in" by a framework.


was (Author: huntc):
> So in a situation where a framework no longer exists...

Mesos can never be sure on whether a framework exists or not. For example, 
Mesos cannot determine if the framework has stopped for some reason, or whether 
it is just a network partition.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6137) Segfault during DiskResource/PersistentVolumeTest.IncompatibleCheckpointedResources/0

2016-09-07 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6137:
-
Description: 
Observed on Jenkins CI:
{code}
I0906 20:01:45.235483 29082 master.cpp:379] Master 
9fd91e5d-4257-427d-a7da-3f18d99c8ffa (0a1dc2da838b) started on 172.17.0.3:60366
I0906 20:01:45.235513 29082 master.cpp:381] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1secs" --allocator="HierarchicalDRF" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authenticators="crammd5" 
--authorizers="local" --credentials="/tmp/ze1TG1/credentials" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--quiet="false" --recovery_agent_removal_limit="100%" 
--registry="replicated_log" --registry_fetch_timeout="1mins" 
--registry_store_timeout="100secs" --registry_strict="true" 
--root_submissions="true" --user_sorter="drf" --version="false" 
--webui_dir="/mesos/mesos-1.1.0/_inst/share/mesos/webui" 
--work_dir="/tmp/ze1TG1/master" --zk_session_timeout="10secs"
I0906 20:01:45.236022 29082 master.cpp:431] Master only allowing authenticated 
frameworks to register
I0906 20:01:45.236037 29082 master.cpp:445] Master only allowing authenticated 
agents to register
I0906 20:01:45.236045 29082 master.cpp:458] Master only allowing authenticated 
HTTP frameworks to register
I0906 20:01:45.236054 29082 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/ze1TG1/credentials'
I0906 20:01:45.236392 29082 master.cpp:503] Using default 'crammd5' 
authenticator
I0906 20:01:45.236654 29079 replica.cpp:673] Replica in STARTING status 
received a broadcasted recover request from __req_res__(6359)@172.17.0.3:60366
I0906 20:01:45.236687 29082 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0906 20:01:45.236927 29082 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0906 20:01:45.237095 29079 recover.cpp:197] Received a recover response from a 
replica in STARTING status
I0906 20:01:45.237117 29082 http.cpp:883] Using default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0906 20:01:45.237340 29082 master.cpp:583] Authorization enabled
I0906 20:01:45.237663 29080 whitelist_watcher.cpp:77] No whitelist given
I0906 20:01:45.237685 29075 hierarchical.cpp:149] Initialized hierarchical 
allocator process
I0906 20:01:45.237835 29085 recover.cpp:568] Updating replica status to VOTING
I0906 20:01:45.238531 29081 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 378674ns
I0906 20:01:45.238560 29081 replica.cpp:320] Persisted replica status to VOTING
I0906 20:01:45.238685 29073 recover.cpp:582] Successfully joined the Paxos group
I0906 20:01:45.238975 29073 recover.cpp:466] Recover process terminated
I0906 20:01:45.240437 29078 master.cpp:1850] Elected as the leading master!
I0906 20:01:45.240468 29078 master.cpp:1551] Recovering from registrar
I0906 20:01:45.240592 29080 registrar.cpp:332] Recovering registrar
I0906 20:01:45.241178 29075 log.cpp:553] Attempting to start the writer
I0906 20:01:45.242928 29072 replica.cpp:493] Replica received implicit promise 
request from __req_res__(6360)@172.17.0.3:60366 with proposal 1
I0906 20:01:45.243324 29072 leveldb.cpp:304] Persisting metadata (8 bytes) to 
leveldb took 335676ns
I0906 20:01:45.243350 29072 replica.cpp:342] Persisted promised to 1
I0906 20:01:45.244056 29081 coordinator.cpp:238] Coordinator attempting to fill 
missing positions
I0906 20:01:45.245538 29078 replica.cpp:388] Replica received explicit promise 
request from __req_res__(6361)@172.17.0.3:60366 for position 0 with proposal 2
I0906 20:01:45.245995 29078 leveldb.cpp:341] Persisting action (8 bytes) to 
leveldb took 412163ns
I0906 20:01:45.246021 29078 replica.cpp:708] Persisted action NOP at position 0
I0906 20:01:45.247329 29082 replica.cpp:537] Replica received write request for 
position 0 from __req_res__(6362)@172.17.0.3:60366
I0906 20:01:45.247406 29082 leveldb.cpp:436] Reading position from leveldb took 
35845ns
I0906 20:01:45.247989 29082 leveldb.cpp:341] Persisting action (14 bytes) to 
leveldb took 541972ns
I0906 20:01:45.248015 29082 replica.cpp:708] Persisted action NOP at position 0
I0906 20:01:45.248556 29084 replica.cpp:691] Replica received learned notice 
for position 0 from @0.0.0.0:0
I0906 20:01:45.249241 29084 leveldb.cpp:341] Persisting action (16 bytes) to 
leveldb took 647885ns
I0906 20:01:45.249271 29084 r

[jira] [Created] (MESOS-6137) Segfault during DiskResource/PersistentVolumeTest.IncompatibleCheckpointedResources/0

2016-09-07 Thread Greg Mann (JIRA)
Greg Mann created MESOS-6137:


 Summary: Segfault during 
DiskResource/PersistentVolumeTest.IncompatibleCheckpointedResources/0
 Key: MESOS-6137
 URL: https://issues.apache.org/jira/browse/MESOS-6137
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.0.1
 Environment: Ubuntu 14.04, non-SSL, libev
Reporter: Greg Mann
Assignee: Greg Mann


Observed in our internal CI (master startup log identical to the excerpt in 
the description above).

[jira] [Updated] (MESOS-6137) Segfault during DiskResource/PersistentVolumeTest.IncompatibleCheckpointedResources/0

2016-09-07 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-6137:
-
Attachment: mesos-segfault.txt.zip

> Segfault during 
> DiskResource/PersistentVolumeTest.IncompatibleCheckpointedResources/0
> -
>
> Key: MESOS-6137
> URL: https://issues.apache.org/jira/browse/MESOS-6137
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
> Environment: Ubuntu 14.04, non-SSL, libev
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Attachments: mesos-segfault.txt.zip
>
>
> Observed on Jenkins CI: (startup log omitted; identical to the excerpt above)


[jira] [Created] (MESOS-6138) Add 'syntax=proto2' to all .proto files in Mesos

2016-09-07 Thread Zhitao Li (JIRA)
Zhitao Li created MESOS-6138:


 Summary: Add 'syntax=proto2' to all .proto files in Mesos
 Key: MESOS-6138
 URL: https://issues.apache.org/jira/browse/MESOS-6138
 Project: Mesos
  Issue Type: Bug
  Components: HTTP API
Reporter: Zhitao Li
Assignee: Zhitao Li


This will make life easier for people who need to generate protobuf messages 
from Mesos's public protobuf interfaces while also using protoc 3 for their 
own files.

Based on a chat with [~greggomann], this should not break anything inside Mesos.

Also related: MESOS-5186
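
Concretely, the change is a one-line declaration at the top of each .proto 
file (a sketch; the package and message below are illustrative, not from 
Mesos's actual interfaces):

```proto
// Declaring the syntax explicitly lets protoc 3 compile proto2-style files
// without complaining about a missing syntax statement.
syntax = "proto2";

package example;

message Task {
  required string name = 1;  // proto2-style 'required' fields remain valid
  optional int32 id = 2;
}
```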




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-6139) How can I config docker executor port range?

2016-09-07 Thread Vu Nguyen Duy (JIRA)
Vu Nguyen Duy created MESOS-6139:


 Summary: How can I config docker executor port range?
 Key: MESOS-6139
 URL: https://issues.apache.org/jira/browse/MESOS-6139
 Project: Mesos
  Issue Type: Bug
  Components: docker
Reporter: Vu Nguyen Duy


Hi,
I ran into a connection problem after installing a Jenkins Mesos master and a 
Mesos cluster in two separate environments, with ACL rules (block all) between 
the two. I had to open a rule allowing mesos-master to connect to the Jenkins 
scheduler (using LIBPROCESS_PORT to define a static port) and a rule allowing 
mesos-slave to connect to jenkins-master:8080 to download slave.jar. But I 
don't know how to configure the mesos-docker-executor port range; the ports 
are random, so I can't open an ACL rule for them. This raises the error: 

```
I0908 10:58:05.941046  5854 slave.cpp:2828] Got registration for executor 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393
I0908 10:58:05.942631  5854 docker.cpp:1443] Ignoring updating container 
'62248670-81ab-423c-952d-5002917f9fa4' with resources passed to update is 
identical to existing resources
I0908 10:58:05.943017  5854 slave.cpp:2005] Sending queued task 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' to executor 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 at executor(1)@x.x.x.x:33393
I0908 10:58:11.383888  5855 slave.cpp:3211] Handling status update TASK_FAILED 
(UUID: b5a40cfa-c6dc-4bf1-8d72-5ccf4e71f707) for task 
mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393

```

```
root@Mesos61:~# netstat -ntpl | grep mesos
tcp0  0 0.0.0.0:33393   0.0.0.0:*   LISTEN  
6289/mesos-docker-executor
```



Thanks
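
For reference, the static scheduler port mentioned above is set through 
libprocess environment variables before starting the scheduler process (a 
sketch; the IP and port values here are hypothetical):

```shell
# libprocess reads LIBPROCESS_IP/LIBPROCESS_PORT at startup; pinning them
# lets a firewall/ACL rule target the scheduler's listening socket.
export LIBPROCESS_IP=10.0.0.5    # hypothetical address of the Jenkins scheduler host
export LIBPROCESS_PORT=9090      # hypothetical fixed port opened in the ACL
echo "scheduler will listen on ${LIBPROCESS_IP}:${LIBPROCESS_PORT}"
```

No equivalent variable is documented for constraining the 
mesos-docker-executor's own port, which is the gap this issue describes.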



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6139) How can I config docker executor port range?

2016-09-07 Thread Vu Nguyen Duy (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vu Nguyen Duy updated MESOS-6139:
-
Description: 
Hi,
I ran into a connection problem after installing a Jenkins Mesos master and a 
Mesos cluster in two separate environments, with ACL rules (block all) between 
the two. I had to open a rule allowing mesos-master to connect to the Jenkins 
scheduler (using LIBPROCESS_PORT to define a static port), a rule allowing 
mesos-slave to connect to jenkins-master:8080 to download slave.jar, and a 
rule allowing Jenkins to connect to mesos-docker-executor. But I don't know 
how to configure the mesos-docker-executor port range; the ports are random, 
so I can't open an ACL rule for them. This raises the error: 

```
I0908 10:58:05.941046  5854 slave.cpp:2828] Got registration for executor 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393
I0908 10:58:05.942631  5854 docker.cpp:1443] Ignoring updating container 
'62248670-81ab-423c-952d-5002917f9fa4' with resources passed to update is 
identical to existing resources
I0908 10:58:05.943017  5854 slave.cpp:2005] Sending queued task 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' to executor 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 at executor(1)@x.x.x.x:33393
I0908 10:58:11.383888  5855 slave.cpp:3211] Handling status update TASK_FAILED 
(UUID: b5a40cfa-c6dc-4bf1-8d72-5ccf4e71f707) for task 
mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393

```

```
root@Mesos61:~# netstat -ntpl | grep mesos
tcp0  0 0.0.0.0:33393   0.0.0.0:*   LISTEN  
6289/mesos-docker-executor
```



Thanks

  was:
Hi,
I got a connection problem when install jenkins-mesos master and mesos cluster 
in two seperate enviroments, there are ACLs rule (block all) between two envs. 
I had to open rule allow mesos-master connect jenkins-scheduler (use 
LIBPROCESS_PORT to define static port), rule allow mesos-slave connect 
jenkins-master:8080 to downloading slave.jar. But I don't know how can I config 
mesos-docker-executor port range?, they're random, I can't open ACL rule for 
that. So it raise error: 

```
I0908 10:58:05.941046  5854 slave.cpp:2828] Got registration for executor 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393
I0908 10:58:05.942631  5854 docker.cpp:1443] Ignoring updating container 
'62248670-81ab-423c-952d-5002917f9fa4' with resources passed to update is 
identical to existing resources
I0908 10:58:05.943017  5854 slave.cpp:2005] Sending queued task 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' to executor 
'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 at executor(1)@x.x.x.x:33393
I0908 10:58:11.383888  5855 slave.cpp:3211] Handling status update TASK_FAILED 
(UUID: b5a40cfa-c6dc-4bf1-8d72-5ccf4e71f707) for task 
mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc of framework 
2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393

```

```
root@Mesos61:~# netstat -ntpl | grep mesos
tcp0  0 0.0.0.0:33393   0.0.0.0:*   LISTEN  
6289/mesos-docker-executor
```



Thanks


> How can I config docker executor port range?
> 
>
> Key: MESOS-6139
> URL: https://issues.apache.org/jira/browse/MESOS-6139
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Reporter: Vu Nguyen Duy
>
> Hi,
> I ran into a connection problem after installing a Jenkins Mesos master and 
> a Mesos cluster in two separate environments, with ACL rules (block all) 
> between the two. I had to open a rule allowing mesos-master to connect to 
> the Jenkins scheduler (using LIBPROCESS_PORT to define a static port), a 
> rule allowing mesos-slave to connect to jenkins-master:8080 to download 
> slave.jar, and a rule allowing Jenkins to connect to mesos-docker-executor. 
> But I don't know how to configure the mesos-docker-executor port range; the 
> ports are random, so I can't open an ACL rule for them. This raises the 
> error: 
> ```
> I0908 10:58:05.941046  5854 slave.cpp:2828] Got registration for executor 
> 'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' of framework 
> 2e7b1138-cc15-4ccf-9874-b452706cb8c9-0109 from executor(1)@x.x.x.x:33393
> I0908 10:58:05.942631  5854 docker.cpp:1443] Ignoring updating container 
> '62248670-81ab-423c-952d-5002917f9fa4' with resources passed to update is 
> identical to existing resources
> I0908 10:58:05.943017  5854 slave.cpp:2005] Sending queued task 
> 'mesos-jenkins-5ae4b388d0f34f42887e8ade1cbf6803-mesos-abc' to executor 
> '