[jira] [Commented] (MESOS-6577) Failed to run docker inspect

2016-11-11 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657292#comment-15657292
 ] 

Marc Villacorta commented on MESOS-6577:


I might be reaching the {{DOCKER_INSPECT_TIMEOUT}}:
https://github.com/apache/mesos/blob/bf7e9ce836d0fe9924adc2e94054469c4a1906a0/src/docker/executor.cpp#L70-L71
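
To make the suspected behaviour concrete, here is a minimal Python sketch (not Mesos code) of polling an inspect-like call at a fixed interval until an overall timeout expires. The names `inspect_with_retry`, `retry_interval` and `timeout` are hypothetical; the real constants live in the linked C++ sources and their values are not restated here.

```python
import time
from typing import Callable, Optional


def inspect_with_retry(
    inspect: Callable[[], Optional[dict]],
    retry_interval: float,
    timeout: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call `inspect` until it returns a result or `timeout` seconds elapse.

    `inspect` is a hypothetical stand-in for running `docker inspect <name>`;
    it returns None while the container is not yet visible.
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        result = inspect()
        if result is not None:
            return result
        # Give up if the next attempt would start after the deadline.
        if clock() + retry_interval > deadline:
            raise TimeoutError(
                f"container not visible after {attempts} attempt(s) "
                f"within {timeout}s"
            )
        sleep(retry_interval)
```

If pulling the image or attaching a volume takes longer than the total timeout, a loop of this shape gives up even though the container shows up shortly afterwards, which would be consistent with the symptom in this issue.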

> Failed to run docker inspect
> 
>
> Key: MESOS-6577
> URL: https://issues.apache.org/jira/browse/MESOS-6577
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: {code:none}
> core@kato-2 ~ $ cat /etc/systemd/system/mesos-agent.service
> [Unit]
> Description=Mesos agent
> After=go-dnsmasq.service
> [Service]
> Slice=machine.slice
> Restart=always
> RestartSec=10
> TimeoutStartSec=0
> KillMode=mixed
> EnvironmentFile=/etc/kato.env
> ExecStartPre=/usr/bin/sh -c "[ -d /var/lib/mesos/agent ] || mkdir -p /var/lib/mesos/agent"
> ExecStartPre=/usr/bin/sh -c "[ -d /etc/certs ] || mkdir -p /etc/certs"
> ExecStartPre=/usr/bin/sh -c "[ -d /etc/cni ] || mkdir -p /etc/cni"
> ExecStartPre=/opt/bin/zk-alive ${KATO_QUORUM_COUNT}
> ExecStartPre=/usr/bin/rkt fetch quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
> ExecStartPre=/usr/bin/docker pull quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
> ExecStart=/usr/bin/rkt run \
>  --net=host \
>  --dns=host \
>  --hosts-entry=host \
>  --volume cni,kind=host,source=/etc/cni \
>  --mount volume=cni,target=/etc/cni \
>  --volume certs,kind=host,source=/etc/certs \
>  --mount volume=certs,target=/etc/certs \
>  --volume docker,kind=host,source=/var/run/docker.sock \
>  --mount volume=docker,target=/var/run/docker.sock \
>  --volume data,kind=host,source=/var/lib/mesos \
>  --mount volume=data,target=/var/lib/mesos \
>  --stage1-name=coreos.com/rkt/stage1-fly \
>  quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 --exec /usr/sbin/mesos-agent -- \
>  --no-systemd_enable_support \
>  --docker_mesos_image=quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 \
>  --hostname=worker-${KATO_HOST_ID}.${KATO_DOMAIN} \
>  --ip=${KATO_HOST_IP} \
>  --containerizers=docker \
>  --executor_registration_timeout=2mins \
>  --master=zk://${KATO_ZK}/mesos \
>  --work_dir=/var/lib/mesos/agent \
>  --log_dir=/var/log/mesos/agent \
>  --network_cni_config_dir=/etc/cni \
>  --network_cni_plugins_dir=/var/lib/mesos/cni-plugins
> [Install]
> WantedBy=kato.target
> {code}
> {code:none}
> core@kato-2 ~ $ docker version
> Client:
>  Version:  1.12.3
>  API version:  1.24
>  Go version:   go1.6.3
>  Git commit:   34a2ead
>  Built:
>  OS/Arch:  linux/amd64
> Server:
>  Version:  1.12.3
>  API version:  1.24
>  Go version:   go1.6.3
>  Git commit:   34a2ead
>  Built:
>  OS/Arch:  linux/amd64
> {code}
>Reporter: Marc Villacorta
>
> I am running a _rocketized_ mesos agent.
> I am using the docker containerizer.
> My executors are _dockerized_.
> The very first time I deploy a sample platform I get some errors like the one 
> below:
> {code:none}
> Failed to launch container: Failed to run 'docker -H 
> unix:///var/run/docker.sock inspect 
> mesos-84a9df2b-be0e-459e-afc9-b95d4e8ced57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179':
>  exited with status 1; stderr='Error: No such image, container or task: 
> mesos-84a9df2b-be0e-459e-afc9-b95d4e8ced57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179
>  '
> {code}
> But when I check with {{docker ps}} I can see the supposedly missing
> container, and I can even run {{docker inspect}} on it successfully. Then
> Marathon reschedules the task and I get a duplicate. Neither Mesos nor
> Marathon lists any duplicate (only Docker does).
> Restarting the mesos-agent removes the container that was reported missing and
> leaves the other ones alive.
> When all my nodes have the docker image layers cached, I can deploy the sample
> platform smoothly and I don't get the previous errors.
> If a container needs a remote volume attached (EBS via REX-Ray), the error
> happens every time, whether or not the image layers are cached.
> Reading the code I suspect it is related to the _retryInterval_ of 
> _Docker::inspect_ 
> https://github.com/apache/mesos/blob/2e013890e47c30053b7b83cd205b432376589216/src/docker/docker.cpp#L950-L952
>  but there is no option to modify this setting.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-6577) Failed to run docker inspect

2016-11-11 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6577:
---
Environment: 
{code:none}
core@kato-2 ~ $ cat /etc/kato.env 
   KATO_CLUSTER_ID=cell-1-dub
   KATO_QUORUM_COUNT=3
   KATO_ROLES='quorum master worker '
   KATO_HOST_NAME=kato
   KATO_HOST_ID=2
   KATO_ZK=quorum-1:2181,quorum-2:2181,quorum-3:2181
   KATO_ALERT_MANAGERS=http://master-1:9093,http://master-2:9093,http://master-3:9093
   KATO_DOMAIN=cell-1.dub.xnood.com
   KATO_MESOS_DOMAIN=cell-1.dub.mesos
   KATO_HOST_IP=10.136.64.12 
   KATO_QUORUM=2
   DOCKER_VERSION=1.12.3
{code}

{code:none}
core@kato-2 ~ $ cat /etc/systemd/system/mesos-agent.service
[Unit]
Description=Mesos agent
After=go-dnsmasq.service

[Service]
Slice=machine.slice
Restart=always
RestartSec=10
TimeoutStartSec=0
KillMode=mixed
EnvironmentFile=/etc/kato.env
ExecStartPre=/usr/bin/sh -c "[ -d /var/lib/mesos/agent ] || mkdir -p /var/lib/mesos/agent"
ExecStartPre=/usr/bin/sh -c "[ -d /etc/certs ] || mkdir -p /etc/certs"
ExecStartPre=/usr/bin/sh -c "[ -d /etc/cni ] || mkdir -p /etc/cni"
ExecStartPre=/opt/bin/zk-alive ${KATO_QUORUM_COUNT}
ExecStartPre=/usr/bin/rkt fetch quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
ExecStartPre=/usr/bin/docker pull quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
ExecStart=/usr/bin/rkt run \
 --net=host \
 --dns=host \
 --hosts-entry=host \
 --volume cni,kind=host,source=/etc/cni \
 --mount volume=cni,target=/etc/cni \
 --volume certs,kind=host,source=/etc/certs \
 --mount volume=certs,target=/etc/certs \
 --volume docker,kind=host,source=/var/run/docker.sock \
 --mount volume=docker,target=/var/run/docker.sock \
 --volume data,kind=host,source=/var/lib/mesos \
 --mount volume=data,target=/var/lib/mesos \
 --stage1-name=coreos.com/rkt/stage1-fly \
 quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 --exec /usr/sbin/mesos-agent -- \
 --no-systemd_enable_support \
 --docker_mesos_image=quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 \
 --hostname=worker-${KATO_HOST_ID}.${KATO_DOMAIN} \
 --ip=${KATO_HOST_IP} \
 --containerizers=docker \
 --executor_registration_timeout=2mins \
 --master=zk://${KATO_ZK}/mesos \
 --work_dir=/var/lib/mesos/agent \
 --log_dir=/var/log/mesos/agent \
 --network_cni_config_dir=/etc/cni \
 --network_cni_plugins_dir=/var/lib/mesos/cni-plugins

[Install]
WantedBy=kato.target
{code}

{code:none}
core@kato-2 ~ $ docker version
Client:
 Version:  1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:  linux/amd64

Server:
 Version:  1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:  linux/amd64
{code}

  was:
{code:none}
core@kato-2 ~ $ cat /etc/systemd/system/mesos-agent.service
[Unit]
Description=Mesos agent
After=go-dnsmasq.service

[Service]
Slice=machine.slice
Restart=always
RestartSec=10
TimeoutStartSec=0
KillMode=mixed
EnvironmentFile=/etc/kato.env
ExecStartPre=/usr/bin/sh -c "[ -d /var/lib/mesos/agent ] || mkdir -p /var/lib/mesos/agent"
ExecStartPre=/usr/bin/sh -c "[ -d /etc/certs ] || mkdir -p /etc/certs"
ExecStartPre=/usr/bin/sh -c "[ -d /etc/cni ] || mkdir -p /etc/cni"
ExecStartPre=/opt/bin/zk-alive ${KATO_QUORUM_COUNT}
ExecStartPre=/usr/bin/rkt fetch quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
ExecStartPre=/usr/bin/docker pull quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
ExecStart=/usr/bin/rkt run \
 --net=host \
 --dns=host \
 --hosts-entry=host \
 --volume cni,kind=host,source=/etc/cni \
 --mount volume=cni,target=/etc/cni \
 --volume certs,kind=host,source=/etc/certs \
 --mount volume=certs,target=/etc/certs \
 --volume docker,kind=host,source=/var/run/docker.sock \
 --mount volume=docker,target=/var/run/docker.sock \
 --volume data,kind=host,source=/var/lib/mesos \
 --mount volume=data,target=/var/lib/mesos \
 --stage1-name=coreos.com/rkt/stage1-fly \
 quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 --exec /usr/sbin/mesos-agent -- \
 --no-systemd_enable_support \
 --docker_mesos_image=quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 \
 --hostname=worker-${KATO_HOST_ID}.${KATO_DOMAIN} \
 --ip=${KATO_HOST_IP} \
 --containerizers=docker \
 --executor_registration_timeout=2mins \
 --master=zk://${KATO_ZK}/mesos \
 --work_dir=/var/lib/mesos/agent \
 --log_dir=/var/log/mesos/agent \
 --network_cni_config_dir=/etc/cni \
 --network_cni_plugins_dir=/var/lib/mesos/cni-plugins

[Install]
WantedBy=kato.target
{code}

{code:none}
core@kato-2 ~ $ docker version
Client:
 Version:  1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:  linux/amd64

Server:
 Version:  1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:  linux/amd64
{code}


> Failed to run docker inspect
> 
>
> Key: MESOS-6577
> URL: 

[jira] [Updated] (MESOS-6577) Failed to run docker inspect

2016-11-11 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6577:
---
Description: 
I am running a _rocketized_ mesos agent.
I am using the docker containerizer.
My executors are _dockerized_.
The very first time I deploy a sample platform I get some errors like the one 
below:

{code:none}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock inspect 
mesos-84a9df2b-be0e-459e-afc9-b95d4e8ced57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179':
 exited with status 1; stderr='Error: No such image, container or task: 
mesos-84a9df2b-be0e-459e-afc9-b95d4e8ced57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179
 '
{code}

But when I check with {{docker ps}} I can see the supposedly missing container,
and I can even run {{docker inspect}} on it successfully. Then Marathon
reschedules the task and I get a duplicate. Neither Mesos nor Marathon lists any
duplicate (only Docker does).

Restarting the mesos-agent removes the container that was reported missing and
leaves the other ones alive.

When all my nodes have the docker image layers cached, I can deploy the sample
platform smoothly and I don't get the previous errors.

If a container needs a remote volume attached (EBS via REX-Ray), the error
happens every time, whether or not the image layers are cached.

Reading the code I suspect it is related to the _retryInterval_ of 
_Docker::inspect_ 
https://github.com/apache/mesos/blob/2e013890e47c30053b7b83cd205b432376589216/src/docker/docker.cpp#L950-L952
 but there is no option to modify this setting.

  was:
I am running a _rocketized_ mesos agent.
I am using the docker containerizer.
My executors are _dockerized_.
The very first time I deploy a sample platform I get some errors like the one 
below:

{code:none}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock inspect mesos-84a9df2b-be0e-459e-afc9-b95d4e8c
ed57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179': exited with status 1; 
stderr='Error: No such image, container or task: mesos-84a
9df2b-be0e-459e-afc9-b95d4e8ced57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179 '
{code}

But when I check with {{docker ps}} I can see the supposedly missing container 
and I can even successfully run {{docker inspect}} on it. Then marathon 
reschedules and I get a duplicate. Nor mesos neither marathon list any 
duplicate (only docker does).

Restarting the mesos-agent wipes out the reported missing container leaving the 
other ones alive.

When all my nodes have the docker image layers cached I can deploy the sample 
platform smoothly and I don't get the previous errors.

If a container needs a remote volume attached (EBS via REX-Ray) the error 
happens all the time. No matter if cached or not.

Reading the code I suspect it is related to the _retryInterval_ of 
_Docker::inspect_ 
https://github.com/apache/mesos/blob/2e013890e47c30053b7b83cd205b432376589216/src/docker/docker.cpp#L950-L952
 but there is no option to modify this setting.


> Failed to run docker inspect
> 
>
> Key: MESOS-6577
> URL: https://issues.apache.org/jira/browse/MESOS-6577
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: {code:none}
> core@kato-2 ~ $ cat /etc/systemd/system/mesos-agent.service
> [Unit]
> Description=Mesos agent
> After=go-dnsmasq.service
> [Service]
> Slice=machine.slice
> Restart=always
> RestartSec=10
> TimeoutStartSec=0
> KillMode=mixed
> EnvironmentFile=/etc/kato.env
> ExecStartPre=/usr/bin/sh -c "[ -d /var/lib/mesos/agent ] || mkdir -p /var/lib/mesos/agent"
> ExecStartPre=/usr/bin/sh -c "[ -d /etc/certs ] || mkdir -p /etc/certs"
> ExecStartPre=/usr/bin/sh -c "[ -d /etc/cni ] || mkdir -p /etc/cni"
> ExecStartPre=/opt/bin/zk-alive ${KATO_QUORUM_COUNT}
> ExecStartPre=/usr/bin/rkt fetch quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
> ExecStartPre=/usr/bin/docker pull quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
> ExecStart=/usr/bin/rkt run \
>  --net=host \
>  --dns=host \
>  --hosts-entry=host \
>  --volume cni,kind=host,source=/etc/cni \
>  --mount volume=cni,target=/etc/cni \
>  --volume certs,kind=host,source=/etc/certs \
>  --mount volume=certs,target=/etc/certs \
>  --volume docker,kind=host,source=/var/run/docker.sock \
>  --mount volume=docker,target=/var/run/docker.sock \
>  --volume data,kind=host,source=/var/lib/mesos \
>  --mount volume=data,target=/var/lib/mesos \
>  --stage1-name=coreos.com/rkt/stage1-fly \
>  quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 --exec /usr/sbin/mesos-agent -- \
>  --no-systemd_enable_support \
>  --docker_mesos_image=quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 \
>  --hostname=worker-${KATO_HOST_ID}.${KATO_DOMAIN} \
>  --ip=${KATO_HOST_IP} \
>  --containerizers=docker \
>  --executor_registration_timeout=2mins \
>  

[jira] [Created] (MESOS-6577) Failed to run docker inspect

2016-11-11 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6577:
--

 Summary: Failed to run docker inspect
 Key: MESOS-6577
 URL: https://issues.apache.org/jira/browse/MESOS-6577
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.0.1
 Environment: {code:none}
core@kato-2 ~ $ cat /etc/systemd/system/mesos-agent.service
[Unit]
Description=Mesos agent
After=go-dnsmasq.service

[Service]
Slice=machine.slice
Restart=always
RestartSec=10
TimeoutStartSec=0
KillMode=mixed
EnvironmentFile=/etc/kato.env
ExecStartPre=/usr/bin/sh -c "[ -d /var/lib/mesos/agent ] || mkdir -p /var/lib/mesos/agent"
ExecStartPre=/usr/bin/sh -c "[ -d /etc/certs ] || mkdir -p /etc/certs"
ExecStartPre=/usr/bin/sh -c "[ -d /etc/cni ] || mkdir -p /etc/cni"
ExecStartPre=/opt/bin/zk-alive ${KATO_QUORUM_COUNT}
ExecStartPre=/usr/bin/rkt fetch quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
ExecStartPre=/usr/bin/docker pull quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2
ExecStart=/usr/bin/rkt run \
 --net=host \
 --dns=host \
 --hosts-entry=host \
 --volume cni,kind=host,source=/etc/cni \
 --mount volume=cni,target=/etc/cni \
 --volume certs,kind=host,source=/etc/certs \
 --mount volume=certs,target=/etc/certs \
 --volume docker,kind=host,source=/var/run/docker.sock \
 --mount volume=docker,target=/var/run/docker.sock \
 --volume data,kind=host,source=/var/lib/mesos \
 --mount volume=data,target=/var/lib/mesos \
 --stage1-name=coreos.com/rkt/stage1-fly \
 quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 --exec /usr/sbin/mesos-agent -- \
 --no-systemd_enable_support \
 --docker_mesos_image=quay.io/kato/mesos:v1.0.1-${DOCKER_VERSION}-2 \
 --hostname=worker-${KATO_HOST_ID}.${KATO_DOMAIN} \
 --ip=${KATO_HOST_IP} \
 --containerizers=docker \
 --executor_registration_timeout=2mins \
 --master=zk://${KATO_ZK}/mesos \
 --work_dir=/var/lib/mesos/agent \
 --log_dir=/var/log/mesos/agent \
 --network_cni_config_dir=/etc/cni \
 --network_cni_plugins_dir=/var/lib/mesos/cni-plugins

[Install]
WantedBy=kato.target
{code}

{code:none}
core@kato-2 ~ $ docker version
Client:
 Version:  1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:  linux/amd64

Server:
 Version:  1.12.3
 API version:  1.24
 Go version:   go1.6.3
 Git commit:   34a2ead
 Built:
 OS/Arch:  linux/amd64
{code}
Reporter: Marc Villacorta


I am running a _rocketized_ mesos agent.
I am using the docker containerizer.
My executors are _dockerized_.
The very first time I deploy a sample platform I get some errors like the one 
below:

{code:none}
Failed to launch container: Failed to run 'docker -H 
unix:///var/run/docker.sock inspect mesos-84a9df2b-be0e-459e-afc9-b95d4e8c
ed57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179': exited with status 1; 
stderr='Error: No such image, container or task: mesos-84a
9df2b-be0e-459e-afc9-b95d4e8ced57-S0.0116a0a2-ccaf-4f1a-846c-361ec4e4a179 '
{code}

But when I check with {{docker ps}} I can see the supposedly missing container,
and I can even run {{docker inspect}} on it successfully. Then Marathon
reschedules the task and I get a duplicate. Neither Mesos nor Marathon lists any
duplicate (only Docker does).

Restarting the mesos-agent removes the container that was reported missing and
leaves the other ones alive.

When all my nodes have the docker image layers cached, I can deploy the sample
platform smoothly and I don't get the previous errors.

If a container needs a remote volume attached (EBS via REX-Ray), the error
happens every time, whether or not the image layers are cached.

Reading the code I suspect it is related to the _retryInterval_ of 
_Docker::inspect_ 
https://github.com/apache/mesos/blob/2e013890e47c30053b7b83cd205b432376589216/src/docker/docker.cpp#L950-L952
 but there is no option to modify this setting.
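
The duplicates described above are visible only to Docker, so one way to spot them is to compare Docker's container names against the container ids the agent still tracks. The following Python sketch assumes the name layout seen in the error message (`mesos-<agent id>.<container id>`); the function names and the `known_ids` input are hypothetical, and how that set is obtained from the agent is left open. The names could come from `docker ps --format '{{.Names}}'`.

```python
from typing import Iterable, List, Optional, Set, Tuple

PREFIX = "mesos-"  # assumed name prefix, taken from the error message above


def parse_name(name: str) -> Optional[Tuple[str, str]]:
    """Split 'mesos-<agent id>.<container id>' into its two parts.

    Returns None for names that do not follow this (assumed) layout.
    Anything after a further '.' in the name is ignored (assumption).
    """
    if not name.startswith(PREFIX):
        return None
    agent_id, sep, rest = name[len(PREFIX):].partition(".")
    container_id = rest.split(".", 1)[0]
    if not sep or not agent_id or not container_id:
        return None
    return agent_id, container_id


def find_untracked(docker_names: Iterable[str], known_ids: Set[str]) -> List[str]:
    """Return Docker container names whose container id is not in `known_ids`."""
    untracked = []
    for name in docker_names:
        parsed = parse_name(name)
        if parsed is not None and parsed[1] not in known_ids:
            untracked.append(name)
    return untracked
```

Containers that do not carry the `mesos-` prefix are skipped rather than reported, so unrelated workloads on the same host are never flagged.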





[jira] [Comment Edited] (MESOS-2115) Improve recovering Docker containers when slave is contained

2016-11-03 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15633471#comment-15633471
 ] 

Marc Villacorta edited comment on MESOS-2115 at 11/3/16 5:07 PM:
-

[~SailC] The docker image you specify in {{--docker_mesos_image}} must have a 
docker client embedded (not bind-mounted), because this image will be used to 
run the mesos executor. I personally use the same image for the mesos-agent and 
for the executor. In this 
[commit|https://github.com/katosys/kato/commit/50b7a82d8c63373b53072be33943cb6ff56a20b5]
 I switch from docker to rocket, and it might be of interest to you because it 
shows how this can be achieved with both container runtimes.
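
As a quick sanity check for the requirement above, a small Python sketch like the following could be run inside the image to confirm that a `docker` executable is found on the search path. The function name `find_docker_client` is hypothetical, and the check does not by itself tell an embedded binary from a bind-mounted one.

```python
import shutil
from typing import Optional


def find_docker_client(search_path: Optional[str] = None) -> Optional[str]:
    """Return the path of the `docker` executable, or None if it is absent.

    `search_path` uses the same format as the PATH environment variable;
    when it is None, the current PATH is searched.
    """
    return shutil.which("docker", path=search_path)


if __name__ == "__main__":
    location = find_docker_client()
    print(location if location else "docker client not found on PATH")
```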



was (Author: h0tbird):
[~SailC] The docker image you specify in {{--docker_mesos_image}} must have a 
docker client embedded (not bind-mounted) this image will be used to run the 
mesos executor. I personally use the same image for the mesos-agent and for the 
executor. In this 
[commit|https://github.com/katosys/kato/commit/50b7a82d8c63373b53072be33943cb6ff56a20b5]
 I switch from docker to rocker and it might be of interest to you because it 
shows how this can be achieved with both container runtimes.


> Improve recovering Docker containers when slave is contained
> 
>
> Key: MESOS-2115
> URL: https://issues.apache.org/jira/browse/MESOS-2115
> Project: Mesos
>  Issue Type: Epic
>  Components: docker
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>  Labels: docker
> Fix For: 0.23.0
>
>
> Currently, when the docker containerizer is recovering, it checks the
> checkpointed executor pids to determine which containers are still running,
> and removes the remaining containers listed by docker ps that it does not
> recognize.
> This is problematic when the slave itself runs in a docker container: when
> the slave container dies, all the forked processes are removed as well, so the
> checkpointed executor pids are no longer valid.
> We have to assume the docker containers might still be running even though
> the checkpointed executor pids are not.





[jira] [Issue Comment Deleted] (MESOS-6486) Mesos on Alpine Linux: JVM Segmentation fault

2016-10-26 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6486:
---
Comment: was deleted

(was: What do you think? Is this a problem with _libjvm.so_ or perhaps a JNI 
problem in _libmesos-1.0.1.so_?)

> Mesos on Alpine Linux: JVM Segmentation fault
> -
>
> Key: MESOS-6486
> URL: https://issues.apache.org/jira/browse/MESOS-6486
> Project: Mesos
>  Issue Type: Wish
>Affects Versions: 1.0.1
> Environment: *Docker*
> {code:none}
> ➜  ~ docker version
> Client:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.7.1
>  Git commit:   6f9534c
>  Built:   Thu Sep  8 10:31:18 2016
>  OS/Arch:  darwin/amd64
> Server:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.6.3
>  Git commit:   23cf638
>  Built:   Thu Aug 18 17:52:38 2016
>  OS/Arch:  linux/amd64
> {code}
> *Alpine*
> {code:none}
> ---  S Y S T E M  ---
> OS:NAME="Alpine Linux"
> ID=alpine
> VERSION_ID=3.4.4
> PRETTY_NAME="Alpine Linux v3.4"
> HOME_URL="http://alpinelinux.org"
> BUG_REPORT_URL="http://bugs.alpinelinux.org"
> uname:Linux 4.4.20-moby #1 SMP Thu Sep 15 12:10:20 UTC 2016 x86_64
> libc:glibc 2.9 NPTL
> rlimit: STACK 8192k, CORE infinity, NPROC infinity, NOFILE 1048576, AS infinity
> load average:0.01 0.39 0.89
> {code}
> *Java*
> {code:none}
> # JRE version: OpenJDK Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
> # Java VM: OpenJDK 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
> # Derivative: IcedTea 3.1.0
> # Distribution: Custom build (Tue Aug 30 20:38:19 GMT 2016)
> {code}
>Reporter: Marc Villacorta
>Priority: Minor
> Attachments: hs_err_pid1677.log
>
>
> I have compiled Mesos 1.0.1 inside a Docker container using Alpine Linux 
> (Dockerfile below):
> {code:none}
> # Set the base image for subsequent instructions:
> FROM alpine:3.4
> MAINTAINER Marc Villacorta Morera 
> # Environment variables:
> ENV TAG="1.0.1" \
> PREFIX="/usr/local" \
> JAVA_HOME="/usr/lib/jvm/default-jvm" \
> JAVA_JVM_LIBRARY="/usr/lib/jvm/default-jvm/jre/lib/amd64/server/libjvm.so" \
> LD_LIBRARY_PATH="/usr/lib/jvm/default-jvm/jre/lib/amd64/server" \
> EDGE_REPO="http://nl.alpinelinux.org/alpine/edge"
> # Install mesos:
> RUN apk add -U --no-cache -t dev git autoconf automake libtool g++ \
> zlib-dev fts-dev apr-dev curl-dev file cyrus-sasl-dev cyrus-sasl-crammd5 \
> subversion-dev make patch linux-headers binutils && apk add -U --no-cache \
> -t dev openjdk8 maven --repository ${EDGE_REPO}/community && apk add -U \
> --no-cache libstdc++ libgcc subversion-libs libcurl fts zlib coreutils \
> && git clone https://git-wip-us.apache.org/repos/asf/mesos.git && cd mesos \
> && { [ "${TAG}" != "master" ] && git checkout tags/${TAG} -b ${TAG}; }; \
> ./bootstrap && mkdir build && cd build && ../configure --prefix=${PREFIX} \
> --disable-dependency-tracking --disable-maintainer-mode --disable-python \
> --enable-optimize --enable-silent-rules \
> && CORES=$(cat /proc/cpuinfo | grep processor | wc -l) \
> && make -j${CORES} && make install && cd && rm -rf /mesos ${PREFIX}/include \
> && find ${PREFIX} -type f -perm /u=x,g=x,o=x | xargs strip -s 2>/dev/null; \
> apk del --purge dev && rm -rf /var/cache/apk/*
> # Command:
> CMD ["/bin/sh"]
> {code}
> Some tests are failing and my biggest concern is with this one:
> {code:none}
> make check GTEST_FILTER="ExamplesTest.JavaFramework"
> {code}
> {code:none}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ExamplesTest
> [ RUN  ] ExamplesTest.JavaFramework
> ../../src/tests/script.cpp:80: Failure
> Failed
> java_framework_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.JavaFramework (5655 ms)
> [--] 1 test from ExamplesTest (5656 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (5689 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] ExamplesTest.JavaFramework
> {code}
> An ugly SIGSEGV is dispatched by the kernel. It looks like _libjvm.so_ is the 
> offending library but I am not sure at all:
> {code:none}
> I1026 15:19:54.843340  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.843683  1706 replica.cpp:691] Replica received learned notice for position 7 from @0.0.0.0:0
> I1026 15:19:54.864063  1706 leveldb.cpp:341] Persisting action (690 bytes) to leveldb took 20.333769ms
> I1026 15:19:54.864123  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.864131  1706 replica.cpp:697] Replica learned APPEND action at 

[jira] [Commented] (MESOS-6486) Mesos on Alpine Linux: JVM Segmentation fault

2016-10-26 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15608833#comment-15608833
 ] 

Marc Villacorta commented on MESOS-6486:


What do you think? Is this a problem with _libjvm.so_ or perhaps a JNI problem 
in _libmesos-1.0.1.so_?

> Mesos on Alpine Linux: JVM Segmentation fault
> -
>
> Key: MESOS-6486
> URL: https://issues.apache.org/jira/browse/MESOS-6486
> Project: Mesos
>  Issue Type: Wish
>Affects Versions: 1.0.1
> Environment: *Docker*
> {code:none}
> ➜  ~ docker version
> Client:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.7.1
>  Git commit:   6f9534c
>  Built:   Thu Sep  8 10:31:18 2016
>  OS/Arch:  darwin/amd64
> Server:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.6.3
>  Git commit:   23cf638
>  Built:   Thu Aug 18 17:52:38 2016
>  OS/Arch:  linux/amd64
> {code}
> *Alpine*
> {code:none}
> ---  S Y S T E M  ---
> OS:NAME="Alpine Linux"
> ID=alpine
> VERSION_ID=3.4.4
> PRETTY_NAME="Alpine Linux v3.4"
> HOME_URL="http://alpinelinux.org"
> BUG_REPORT_URL="http://bugs.alpinelinux.org"
> uname:Linux 4.4.20-moby #1 SMP Thu Sep 15 12:10:20 UTC 2016 x86_64
> libc:glibc 2.9 NPTL
> rlimit: STACK 8192k, CORE infinity, NPROC infinity, NOFILE 1048576, AS infinity
> load average:0.01 0.39 0.89
> {code}
> *Java*
> {code:none}
> # JRE version: OpenJDK Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
> # Java VM: OpenJDK 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
> # Derivative: IcedTea 3.1.0
> # Distribution: Custom build (Tue Aug 30 20:38:19 GMT 2016)
> {code}
>Reporter: Marc Villacorta
>Priority: Minor
> Attachments: hs_err_pid1677.log
>
>
> I have compiled Mesos 1.0.1 inside a Docker container using Alpine Linux 
> (Dockerfile below):
> {code:none}
> # Set the base image for subsequent instructions:
> FROM alpine:3.4
> MAINTAINER Marc Villacorta Morera 
> # Environment variables:
> ENV TAG="1.0.1" \
> PREFIX="/usr/local" \
> JAVA_HOME="/usr/lib/jvm/default-jvm" \
> JAVA_JVM_LIBRARY="/usr/lib/jvm/default-jvm/jre/lib/amd64/server/libjvm.so" \
> LD_LIBRARY_PATH="/usr/lib/jvm/default-jvm/jre/lib/amd64/server" \
> EDGE_REPO="http://nl.alpinelinux.org/alpine/edge"
> # Install mesos:
> RUN apk add -U --no-cache -t dev git autoconf automake libtool g++ \
> zlib-dev fts-dev apr-dev curl-dev file cyrus-sasl-dev cyrus-sasl-crammd5 \
> subversion-dev make patch linux-headers binutils && apk add -U --no-cache \
> -t dev openjdk8 maven --repository ${EDGE_REPO}/community && apk add -U \
> --no-cache libstdc++ libgcc subversion-libs libcurl fts zlib coreutils \
> && git clone https://git-wip-us.apache.org/repos/asf/mesos.git && cd mesos \
> && { [ "${TAG}" != "master" ] && git checkout tags/${TAG} -b ${TAG}; }; \
> ./bootstrap && mkdir build && cd build && ../configure --prefix=${PREFIX} \
> --disable-dependency-tracking --disable-maintainer-mode --disable-python \
> --enable-optimize --enable-silent-rules \
> && CORES=$(cat /proc/cpuinfo | grep processor | wc -l) \
> && make -j${CORES} && make install && cd && rm -rf /mesos ${PREFIX}/include \
> && find ${PREFIX} -type f -perm /u=x,g=x,o=x | xargs strip -s 2>/dev/null; \
> apk del --purge dev && rm -rf /var/cache/apk/*
> # Command:
> CMD ["/bin/sh"]
> {code}
> Some tests are failing and my biggest concern is with this one:
> {code:none}
> make check GTEST_FILTER="ExamplesTest.JavaFramework"
> {code}
> {code:none}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ExamplesTest
> [ RUN  ] ExamplesTest.JavaFramework
> ../../src/tests/script.cpp:80: Failure
> Failed
> java_framework_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.JavaFramework (5655 ms)
> [--] 1 test from ExamplesTest (5656 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (5689 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] ExamplesTest.JavaFramework
> {code}
> An ugly SIGSEGV is dispatched by the kernel. It looks like _libjvm.so_ is the 
> offending library but I am not sure at all:
> {code:none}
> I1026 15:19:54.843340  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.843683  1706 replica.cpp:691] Replica received learned notice for position 7 from @0.0.0.0:0
> I1026 15:19:54.864063  1706 leveldb.cpp:341] Persisting action (690 bytes) to leveldb took 20.333769ms
> I1026 15:19:54.864123  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.864131  1706 replica.cpp:697] Replica learned APPEND 

[jira] [Updated] (MESOS-6486) Mesos on Alpine Linux: JVM Segmentation fault

2016-10-26 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6486:
---
Summary: Mesos on Alpine Linux: JVM Segmentation fault  (was: Mesos on 
Alpine Linux)

> Mesos on Alpine Linux: JVM Segmentation fault
> -
>
> Key: MESOS-6486
> URL: https://issues.apache.org/jira/browse/MESOS-6486
> Project: Mesos
>  Issue Type: Wish
>Affects Versions: 1.0.1
> Environment: *Docker*
> {code:none}
> ➜  ~ docker version
> Client:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.7.1
>  Git commit:   6f9534c
>  Built:   Thu Sep  8 10:31:18 2016
>  OS/Arch:  darwin/amd64
> Server:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.6.3
>  Git commit:   23cf638
>  Built:   Thu Aug 18 17:52:38 2016
>  OS/Arch:  linux/amd64
> {code}
> *Alpine*
> {code:none}
> ---  S Y S T E M  ---
> OS:NAME="Alpine Linux"
> ID=alpine
> VERSION_ID=3.4.4
> PRETTY_NAME="Alpine Linux v3.4"
> HOME_URL="http://alpinelinux.org"
> BUG_REPORT_URL="http://bugs.alpinelinux.org"
> uname:Linux 4.4.20-moby #1 SMP Thu Sep 15 12:10:20 UTC 2016 x86_64
> libc:glibc 2.9 NPTL
> rlimit: STACK 8192k, CORE infinity, NPROC infinity, NOFILE 1048576, AS infinity
> load average:0.01 0.39 0.89
> {code}
> *Java*
> {code:none}
> # JRE version: OpenJDK Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
> # Java VM: OpenJDK 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
> # Derivative: IcedTea 3.1.0
> # Distribution: Custom build (Tue Aug 30 20:38:19 GMT 2016)
> {code}
>Reporter: Marc Villacorta
>Priority: Minor
> Attachments: hs_err_pid1677.log
>
>
> I have compiled Mesos 1.0.1 inside a Docker container using Alpine Linux 
> (Dockerfile below):
> {code:none}
> # Set the base image for subsequent instructions:
> FROM alpine:3.4
> MAINTAINER Marc Villacorta Morera 
> # Environment variables:
> ENV TAG="1.0.1" \
> PREFIX="/usr/local" \
> JAVA_HOME="/usr/lib/jvm/default-jvm" \
> JAVA_JVM_LIBRARY="/usr/lib/jvm/default-jvm/jre/lib/amd64/server/libjvm.so" \
> LD_LIBRARY_PATH="/usr/lib/jvm/default-jvm/jre/lib/amd64/server" \
> EDGE_REPO="http://nl.alpinelinux.org/alpine/edge"
> # Install mesos:
> RUN apk add -U --no-cache -t dev git autoconf automake libtool g++ \
> zlib-dev fts-dev apr-dev curl-dev file cyrus-sasl-dev cyrus-sasl-crammd5 \
> subversion-dev make patch linux-headers binutils && apk add -U --no-cache \
> -t dev openjdk8 maven --repository ${EDGE_REPO}/community && apk add -U \
> --no-cache libstdc++ libgcc subversion-libs libcurl fts zlib coreutils \
> && git clone https://git-wip-us.apache.org/repos/asf/mesos.git && cd mesos \
> && { [ "${TAG}" != "master" ] && git checkout tags/${TAG} -b ${TAG}; }; \
> ./bootstrap && mkdir build && cd build && ../configure --prefix=${PREFIX} \
> --disable-dependency-tracking --disable-maintainer-mode --disable-python \
> --enable-optimize --enable-silent-rules \
> && CORES=$(cat /proc/cpuinfo | grep processor | wc -l) \
> && make -j${CORES} && make install && cd && rm -rf /mesos ${PREFIX}/include \
> && find ${PREFIX} -type f -perm /u=x,g=x,o=x | xargs strip -s 2>/dev/null; \
> apk del --purge dev && rm -rf /var/cache/apk/*
> # Command:
> CMD ["/bin/sh"]
> {code}
> Some tests are failing and my biggest concern is with this one:
> {code:none}
> make check GTEST_FILTER="ExamplesTest.JavaFramework"
> {code}
> {code:none}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ExamplesTest
> [ RUN  ] ExamplesTest.JavaFramework
> ../../src/tests/script.cpp:80: Failure
> Failed
> java_framework_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.JavaFramework (5655 ms)
> [--] 1 test from ExamplesTest (5656 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (5689 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] ExamplesTest.JavaFramework
> {code}
> An ugly SIGSEGV is dispatched by the kernel. It looks like _libjvm.so_ is the 
> offending library but I am not sure at all:
> {code:none}
> I1026 15:19:54.843340  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.843683  1706 replica.cpp:691] Replica received learned notice 
> for position 7 from @0.0.0.0:0
> I1026 15:19:54.864063  1706 leveldb.cpp:341] Persisting action (690 bytes) to 
> leveldb took 20.333769ms
> I1026 15:19:54.864123  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.864131  1706 replica.cpp:697] Replica learned APPEND action at 
> position 7
> I1026 15:19:54.864936  1705 

[jira] [Updated] (MESOS-6486) Mesos on Alpine Linux

2016-10-26 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6486:
---
Attachment: hs_err_pid1677.log

> Mesos on Alpine Linux
> -
>
> Key: MESOS-6486
> URL: https://issues.apache.org/jira/browse/MESOS-6486
> Project: Mesos
>  Issue Type: Wish
>Affects Versions: 1.0.1
> Environment: *Docker*
> {code:none}
> ➜  ~ docker version
> Client:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.7.1
>  Git commit:   6f9534c
>  Built:Thu Sep  8 10:31:18 2016
>  OS/Arch:  darwin/amd64
> Server:
>  Version:  1.12.1
>  API version:  1.24
>  Go version:   go1.6.3
>  Git commit:   23cf638
>  Built:Thu Aug 18 17:52:38 2016
>  OS/Arch:  linux/amd64
> {code}
> *Alpine*
> {code:none}
> ---  S Y S T E M  ---
> OS:NAME="Alpine Linux"
> ID=alpine
> VERSION_ID=3.4.4
> PRETTY_NAME="Alpine Linux v3.4"
> HOME_URL="http://alpinelinux.org"
> BUG_REPORT_URL="http://bugs.alpinelinux.org"
> uname:Linux 4.4.20-moby #1 SMP Thu Sep 15 12:10:20 UTC 2016 x86_64
> libc:glibc 2.9 NPTL
> rlimit: STACK 8192k, CORE infinity, NPROC infinity, NOFILE 1048576, AS 
> infinity
> load average:0.01 0.39 0.89
> {code}
> *Java*
> {code:none}
> # JRE version: OpenJDK Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
> # Java VM: OpenJDK 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 
> compressed oops)
> # Derivative: IcedTea 3.1.0
> # Distribution: Custom build (Tue Aug 30 20:38:19 GMT 2016)
> {code}
>Reporter: Marc Villacorta
>Priority: Minor
> Attachments: hs_err_pid1677.log
>
>
> I have compiled Mesos 1.0.1 inside a Docker container using Alpine Linux 
> (Dockerfile below):
> {code:none}
> # Set the base image for subsequent instructions:
> FROM alpine:3.4
> MAINTAINER Marc Villacorta Morera 
> # Environment variables:
> ENV TAG="1.0.1" \
>     PREFIX="/usr/local" \
>     JAVA_HOME="/usr/lib/jvm/default-jvm" \
>     JAVA_JVM_LIBRARY="/usr/lib/jvm/default-jvm/jre/lib/amd64/server/libjvm.so" \
>     LD_LIBRARY_PATH="/usr/lib/jvm/default-jvm/jre/lib/amd64/server" \
>     EDGE_REPO="http://nl.alpinelinux.org/alpine/edge"
> # Install mesos:
> RUN apk add -U --no-cache -t dev git autoconf automake libtool g++ \
>     zlib-dev fts-dev apr-dev curl-dev file cyrus-sasl-dev cyrus-sasl-crammd5 \
>     subversion-dev make patch linux-headers binutils && apk add -U --no-cache \
>     -t dev openjdk8 maven --repository ${EDGE_REPO}/community && apk add -U \
>     --no-cache libstdc++ libgcc subversion-libs libcurl fts zlib coreutils \
>     && git clone https://git-wip-us.apache.org/repos/asf/mesos.git && cd mesos \
>     && { [ "${TAG}" != "master" ] && git checkout tags/${TAG} -b ${TAG}; }; \
>     ./bootstrap && mkdir build && cd build && ../configure --prefix=${PREFIX} \
>     --disable-dependency-tracking --disable-maintainer-mode --disable-python \
>     --enable-optimize --enable-silent-rules \
>     && CORES=$(cat /proc/cpuinfo | grep processor | wc -l) \
>     && make -j${CORES} && make install && cd && rm -rf /mesos ${PREFIX}/include \
>     && find ${PREFIX} -type f -perm /u=x,g=x,o=x | xargs strip -s 2>/dev/null; \
>     apk del --purge dev && rm -rf /var/cache/apk/*
> # Command:
> CMD ["/bin/sh"]
> {code}
> Some tests are failing and my biggest concern is with this one:
> {code:none}
> make check GTEST_FILTER="ExamplesTest.JavaFramework"
> {code}
> {code:none}
> [==] Running 1 test from 1 test case.
> [--] Global test environment set-up.
> [--] 1 test from ExamplesTest
> [ RUN  ] ExamplesTest.JavaFramework
> ../../src/tests/script.cpp:80: Failure
> Failed
> java_framework_test.sh terminated with signal Segmentation fault
> [  FAILED  ] ExamplesTest.JavaFramework (5655 ms)
> [--] 1 test from ExamplesTest (5656 ms total)
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (5689 ms total)
> [  PASSED  ] 0 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] ExamplesTest.JavaFramework
> {code}
> An ugly SIGSEGV is dispatched by the kernel. It looks like _libjvm.so_ is the 
> offending library but I am not sure at all:
> {code:none}
> I1026 15:19:54.843340  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.843683  1706 replica.cpp:691] Replica received learned notice 
> for position 7 from @0.0.0.0:0
> I1026 15:19:54.864063  1706 leveldb.cpp:341] Persisting action (690 bytes) to 
> leveldb took 20.333769ms
> I1026 15:19:54.864123  1706 replica.cpp:712] Persisted action at 7
> I1026 15:19:54.864131  1706 replica.cpp:697] Replica learned APPEND action at 
> position 7
> I1026 15:19:54.864936  1705 registrar.cpp:509] Successfully updated the 
> 'registry' in 31.458048ms
> I1026 15:19:54.864989  1700 

[jira] [Commented] (MESOS-6310) Remove or define non-POSIX function

2016-10-14 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576711#comment-15576711
 ] 

Marc Villacorta commented on MESOS-6310:


It builds successfully after I applied that last patch.

> Remove or define non-POSIX function
> ---
>
> Key: MESOS-6310
> URL: https://issues.apache.org/jira/browse/MESOS-6310
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.2
>Reporter: Marc Villacorta
>Assignee: Kevin Klues
>Priority: Minor
> Fix For: 1.1.0
>
>
> I was trying to compile Mesos using _musl_ inside Alpine Linux 3.4.
> But this [commit| 
> https://github.com/apache/mesos/commit/498d14e934233e4501597b43da3924bfe8b2de20]
>  introduced the {{W_EXITCODE()}} macro which is not defined in _musl_ and 
> seems to be non-POSIX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-6310) Remove or define non-POSIX function

2016-10-14 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15576290#comment-15576290
 ] 

Marc Villacorta commented on MESOS-6310:


I think {{stout/os/wait.hpp}} should be included in 
{{src/slave/containerizer/mesos/launch.cpp}} too.

{code:none}
  CXX  slave/containerizer/mesos/libmesos_no_3rdparty_la-launch.lo
../../src/slave/containerizer/mesos/launch.cpp: In function 'void 
mesos::internal::slave::exitWithSignal(int)':
../../src/slave/containerizer/mesos/launch.cpp:224:44: error: 'W_EXITCODE' was 
not declared in this scope
 signalSafeWriteStatus(W_EXITCODE(0, sig));
^
../../src/slave/containerizer/mesos/launch.cpp: In function 'void 
mesos::internal::slave::exitWithStatus(int)':
../../src/slave/containerizer/mesos/launch.cpp:236:47: error: 'W_EXITCODE' was 
not declared in this scope
 signalSafeWriteStatus(W_EXITCODE(status, 0));
   ^
{code}

> Remove or define non-POSIX function
> ---
>
> Key: MESOS-6310
> URL: https://issues.apache.org/jira/browse/MESOS-6310
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.2
>Reporter: Marc Villacorta
>Assignee: Kevin Klues
>Priority: Minor
> Fix For: 1.1.0
>
>
> I was trying to compile Mesos using _musl_ inside Alpine Linux 3.4.
> But this [commit| 
> https://github.com/apache/mesos/commit/498d14e934233e4501597b43da3924bfe8b2de20]
>  introduced the {{W_EXITCODE()}} macro which is not defined in _musl_ and 
> seems to be non-POSIX.





[jira] [Created] (MESOS-6314) It looks like getgrouplist returns duplicated results

2016-10-05 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6314:
--

 Summary: It looks like getgrouplist returns duplicated results
 Key: MESOS-6314
 URL: https://issues.apache.org/jira/browse/MESOS-6314
 Project: Mesos
  Issue Type: Bug
  Components: tests
Affects Versions: 1.0.2
 Environment: Inside Docker container {{alpine:3.4}}
Reporter: Marc Villacorta


In my Alpine 3.4 system OsTest.User fails:
{code:none}
/mesos/build # id -G
0 1 2 3 4 6 10 11 20 26 27
{code}

{code:none}
[ RUN  ] OsTest.User
../../../3rdparty/stout/tests/os_tests.cpp:696: Failure
Value of: expected_gids
  Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
Expected: tokens.get()
Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
[  FAILED  ] OsTest.User (6 ms)
{code}
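One way to make such a comparison tolerant of both ordering and repeated entries is sketched below in C++. This is only an illustration of the idea, not the code in {{os_tests.cpp}}: the helper names {{normalizeGids}} and {{sameGroups}} are hypothetical, and whether duplicates should be silently dropped (rather than treated as a libc bug) is an assumption.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns a sorted copy of `gids` with duplicate
// entries removed.
std::vector<std::string> normalizeGids(std::vector<std::string> gids)
{
  std::sort(gids.begin(), gids.end());
  gids.erase(std::unique(gids.begin(), gids.end()), gids.end());
  return gids;
}

// Hypothetical helper: true when both lists name the same set of groups,
// regardless of order or repeated entries.
bool sameGroups(
    const std::vector<std::string>& a,
    const std::vector<std::string>& b)
{
  return normalizeGids(a) == normalizeGids(b);
}
```

With the two lists from the failure above, the duplicated "0" would be collapsed before the comparison, so the lists would compare equal.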





[jira] [Updated] (MESOS-6314) OsTest.User: It looks like getgrouplist returns duplicated results

2016-10-05 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6314:
---
Summary: OsTest.User: It looks like getgrouplist returns duplicated results 
 (was: It looks like getgrouplist returns duplicated results)

> OsTest.User: It looks like getgrouplist returns duplicated results
> --
>
> Key: MESOS-6314
> URL: https://issues.apache.org/jira/browse/MESOS-6314
> Project: Mesos
>  Issue Type: Bug
>  Components: tests
>Affects Versions: 1.0.2
> Environment: Inside Docker container {{alpine:3.4}}
>Reporter: Marc Villacorta
>
> In my Alpine 3.4 system OsTest.User fails:
> {code:none}
> /mesos/build # id -G
> 0 1 2 3 4 6 10 11 20 26 27
> {code}
> {code:none}
> [ RUN  ] OsTest.User
> ../../../3rdparty/stout/tests/os_tests.cpp:696: Failure
> Value of: expected_gids
>   Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
> Expected: tokens.get()
> Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
> [  FAILED  ] OsTest.User (6 ms)
> {code}





[jira] [Commented] (MESOS-5909) Stout "OsTest.User" test can fail on some systems

2016-10-05 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15548353#comment-15548353
 ] 

Marc Villacorta commented on MESOS-5909:


In my Alpine 3.4 system this test still fails:
{code:none}
/mesos/build # id -G
0 1 2 3 4 6 10 11 20 26 27
{code}

{code:none}
[ RUN  ] OsTest.User
../../../3rdparty/stout/tests/os_tests.cpp:696: Failure
Value of: expected_gids
  Actual: { "0", "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
Expected: tokens.get()
Which is: { "0", "1", "10", "11", "2", "20", "26", "27", "3", "4", "6" }
[  FAILED  ] OsTest.User (6 ms)
{code}

Should I open a new Jira?

> Stout "OsTest.User" test can fail on some systems
> -
>
> Key: MESOS-5909
> URL: https://issues.apache.org/jira/browse/MESOS-5909
> Project: Mesos
>  Issue Type: Bug
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Mao Geng
>  Labels: mesosphere
> Fix For: 1.1.0
>
> Attachments: MESOS-5909-fix.diff
>
>
> The libc call {{getgrouplist}} doesn't return the {{gid}} list in sorted order 
> (in my case, it's returning "471 100"), whereas {{id -G}} returns a sorted 
> list ("100 471" in my case), causing the validation inside the loop to fail.
> We should sort both lists before comparing the values.





[jira] [Updated] (MESOS-6310) Remove or define non-POSIX function

2016-10-04 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6310:
---
Summary: Remove or define non-POSIX function  (was: Remove or define 
non-posix function)

> Remove or define non-POSIX function
> ---
>
> Key: MESOS-6310
> URL: https://issues.apache.org/jira/browse/MESOS-6310
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Affects Versions: 1.0.2
>Reporter: Marc Villacorta
>Priority: Minor
>
> I was trying to compile Mesos using _musl_ inside Alpine Linux 3.4.
> But this [commit| 
> https://github.com/apache/mesos/commit/498d14e934233e4501597b43da3924bfe8b2de20]
>  introduced the {{W_EXITCODE()}} macro which is not defined in _musl_ and 
> seems to be non-POSIX.





[jira] [Created] (MESOS-6310) Remove or define non-posix function

2016-10-04 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6310:
--

 Summary: Remove or define non-posix function
 Key: MESOS-6310
 URL: https://issues.apache.org/jira/browse/MESOS-6310
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Affects Versions: 1.0.2
Reporter: Marc Villacorta
Priority: Minor


I was trying to compile Mesos using _musl_ inside Alpine Linux 3.4.
But this [commit| 
https://github.com/apache/mesos/commit/498d14e934233e4501597b43da3924bfe8b2de20]
 introduced the {{W_EXITCODE()}} macro which is not defined in _musl_ and seems 
to be non-POSIX.
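
A minimal sketch of the "define" option follows, in C++. It assumes the same encoding glibc uses (exit status in the high byte, terminating signal in the low byte); the helper names {{encodeExitStatus}} and {{encodeSignalStatus}} are hypothetical and exist only to show the macro in use. It is not the patch discussed in this issue.

```cpp
#include <sys/wait.h>

// Fallback for C libraries (such as musl) whose <sys/wait.h> does not
// provide W_EXITCODE. The encoding mirrors glibc's definition: exit
// status in the high byte, terminating signal in the low byte.
#ifndef W_EXITCODE
#define W_EXITCODE(ret, sig) ((ret) << 8 | (sig))
#endif

// Hypothetical helper: wait status of a process that exited normally
// with the given exit code.
inline int encodeExitStatus(int status)
{
  return W_EXITCODE(status, 0);
}

// Hypothetical helper: wait status of a process terminated by `sig`.
inline int encodeSignalStatus(int sig)
{
  return W_EXITCODE(0, sig);
}
```

Guarding the definition with {{#ifndef}} leaves the libc's own macro in place wherever it already exists, so the fallback only takes effect on systems that lack it.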





[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-20 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15505959#comment-15505959
 ] 

Marc Villacorta commented on MESOS-6202:


Sure, here you have it: MESOS-6212

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.





[jira] [Created] (MESOS-6212) Validate the name format of mesos-managed docker containers

2016-09-20 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6212:
--

 Summary: Validate the name format of mesos-managed docker 
containers
 Key: MESOS-6212
 URL: https://issues.apache.org/jira/browse/MESOS-6212
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Affects Versions: 1.0.1
Reporter: Marc Villacorta
Priority: Minor


Validate the name format of mesos-managed docker containers in order to avoid 
false positives when looking for orphaned mesos tasks.

Currently names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ are 
wrongly terminated when {{--docker_kill_orphans}} is set to true (default).
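
A rough sketch of what such a validation could look like, in C++. The accepted format is an assumption made for illustration: {{mesos-}}, then an optional {{<agent id>.}} part, then a canonical 36-character UUID. The function name {{looksMesosManaged}} is hypothetical, and the real naming scheme used by the Docker containerizer should be checked against the source before relying on this pattern.

```cpp
#include <regex>
#include <string>

// Hypothetical check: returns true only for names of the assumed form
// "mesos-[<agent id>.]<uuid>", so that unrelated containers which merely
// start with "mesos-" are not mistaken for orphaned tasks.
bool looksMesosManaged(const std::string& name)
{
  static const std::regex pattern(
      "^mesos-(?:[^.]+\\.)?"
      "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
      "[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$");

  return std::regex_match(name, pattern);
}
```

Under this assumption, names such as _'mesos-master'_, _'mesos-agent'_ and _'mesos-dns'_ would be rejected because nothing after the prefix parses as a UUID.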





[jira] [Commented] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-19 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15502699#comment-15502699
 ] 

Marc Villacorta commented on MESOS-6202:


Would you consider adding a validation to make sure {{id}} is a valid Docker 
UUID?

> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.





[jira] [Updated] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-18 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6202:
---
Description: 
I run 3 docker containers in my CoreOS system whose names start with _'mesos-'_ 
those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.

I can start the first two without any problem but when I start the third one 
_('mesos-agent')_ all three containers are killed by the docker daemon.

If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
_'m3s0s-agent'_ everything works.

I tracked down the problem to 
[this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
 code which is marked to be removed after deprecation cycle.

I was previously running Mesos 0.28.2 without this problem.

  was:
I run 3 docker containers in my CoreOS system whose names start with _'mesos-'_ 
those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.

I can start the first two without any problem but when I start the third one 
_('mesos-agent')_ all three containers are killed by the docker daemon.

If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
_'m3s0s-agent'_ everithing works.

I tracked down the problem to 
[this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
 code which is marked to be removed after deprecation cycle.

I was previously running Mesos 0.28.2 without this problem.


> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everything works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.





[jira] [Updated] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-18 Thread Marc Villacorta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marc Villacorta updated MESOS-6202:
---
Description: 
I run 3 docker containers in my CoreOS system whose names start with _'mesos-'_ 
those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.

I can start the first two without any problem but when I start the third one 
_('mesos-agent')_ all three containers are killed by the docker daemon.

If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
_'m3s0s-agent'_ everithing works.

I tracked down the problem to 
[this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
 code which is marked to be removed after deprecation cycle.

I was previously running Mesos 0.28.2 without this problem.

  was:
I run 3 docker containers in my CoreOS system whose names start with _'mesos-'_ 
those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.

I can start the first two without any problem but when I start the third one 
_('mesos-agent')_ all three containers are killed by the docker daemon.

If I rename the containers to 'm3s0s-master'_, _'m3s0s-dns'_ and 
_'m3s0s-agent'_ everithing works.

I tracked down the problem to 
[this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
 code which is marked to be removed after deprecation cycle.

I was previously running Mesos 0.28.2 without this problem.


> Docker containerizer kills containers whose name starts with 'mesos-'
> -
>
> Key: MESOS-6202
> URL: https://issues.apache.org/jira/browse/MESOS-6202
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.1
> Environment: Dockerized 
> {{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
>Reporter: Marc Villacorta
>
> I run 3 docker containers in my CoreOS system whose names start with 
> _'mesos-'_ those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.
> I can start the first two without any problem but when I start the third one 
> _('mesos-agent')_ all three containers are killed by the docker daemon.
> If I rename the containers to _'m3s0s-master'_, _'m3s0s-dns'_ and 
> _'m3s0s-agent'_ everithing works.
> I tracked down the problem to 
> [this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
>  code which is marked to be removed after deprecation cycle.
> I was previously running Mesos 0.28.2 without this problem.





[jira] [Created] (MESOS-6202) Docker containerizer kills containers whose name starts with 'mesos-'

2016-09-18 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-6202:
--

 Summary: Docker containerizer kills containers whose name starts 
with 'mesos-'
 Key: MESOS-6202
 URL: https://issues.apache.org/jira/browse/MESOS-6202
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.0.1
 Environment: Dockerized 
{{mesosphere/mesos-slave:1.0.1-2.0.93.ubuntu1404}}
Reporter: Marc Villacorta


I run 3 docker containers in my CoreOS system whose names start with _'mesos-'_ 
those are: _'mesos-master'_, _'mesos-dns'_ and _'mesos-agent'_.

I can start the first two without any problem but when I start the third one 
_('mesos-agent')_ all three containers are killed by the docker daemon.

If I rename the containers to 'm3s0s-master'_, _'m3s0s-dns'_ and 
_'m3s0s-agent'_ everithing works.

I tracked down the problem to 
[this|https://github.com/apache/mesos/blob/16a563aca1f226b021b8f8815c4d115a3212f02b/src/slave/containerizer/docker.cpp#L116-L120]
 code which is marked to be removed after deprecation cycle.

I was previously running Mesos 0.28.2 without this problem.





[jira] [Created] (MESOS-5472) Hadoop-free S3 fetcher

2016-05-27 Thread Marc Villacorta (JIRA)
Marc Villacorta created MESOS-5472:
--

 Summary: Hadoop-free S3 fetcher
 Key: MESOS-5472
 URL: https://issues.apache.org/jira/browse/MESOS-5472
 Project: Mesos
  Issue Type: Wish
  Components: fetcher
Reporter: Marc Villacorta
Priority: Minor


My mesos agents are running on systems without Hadoop.
I would like to fetch _S3_ uris into my sandboxes.
How about using the _'awscli'_?





[jira] [Commented] (MESOS-2115) Improve recovering Docker containers when slave is contained

2015-12-28 Thread Marc Villacorta (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15072769#comment-15072769
 ] 

Marc Villacorta commented on MESOS-2115:


Is there a way to define other volumes (or Docker parameters in general) to 
bind-mount to the container where the executor is running (the one set by 
_'docker_mesos_image'_)? I am trying to use:

{code:none}
--docker_mesos_image=mesosphere/mesos-slave:0.26.0-0.2.145.ubuntu1404
{code}

... but in CoreOS I must set some extra bind-mounts such as:
{code:none}
  --volume /usr/bin/docker:/usr/bin/docker:ro
  --volume /lib64/libdevmapper.so.1.02:/lib/libdevmapper.so.1.02:ro
  --volume /lib64/libsystemd.so.0:/lib/libsystemd.so.0:ro
  --volume /lib64/libgcrypt.so.20:/lib/libgcrypt.so.20:ro
{code}

Which Docker image do you set in _'--docker_mesos_image'_?

> Improve recovering Docker containers when slave is contained
> 
>
> Key: MESOS-2115
> URL: https://issues.apache.org/jira/browse/MESOS-2115
> Project: Mesos
>  Issue Type: Epic
>  Components: docker
>Reporter: Timothy Chen
>Assignee: Timothy Chen
>  Labels: docker
> Fix For: 0.23.0
>
>
> Currently, when the docker containerizer is recovering, it checks the checkpointed 
> executor pids to determine which containers are still running, and removes the 
> rest of the containers listed by docker ps that aren't recognized.
> This is problematic when the slave itself was in a docker container, as when 
> the slave container dies all the forked processes are removed as well, so the 
> checkpointed executor pids are no longer valid.
> We have to assume the docker containers might be still running even though 
> the checkpointed executor pids are not.


