Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-13 Thread Giulio Eulisse

Actually, no. The docker containers seem to be running just fine. It looks like 
mesos is not able to notice that. Did anything change in the way mesos looks 
them up? Note that I have both renamed my container to "agent" and added 
MESOS_DOCKER_KILL_ORPHANS=false.
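
For reference, the setup I am describing can be sketched roughly as follows. This is a dry run that only prints a `docker run` command line. The helper names `uses_reserved_prefix` and `agent_run_command` are made up for illustration, and the option list is trimmed to the parts relevant here; the image name and the `MESOS_DOCKER_KILL_ORPHANS` variable are the ones from this thread.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Mesos or Docker.

# Succeeds when the container name starts with "mesos-", the prefix that
# (according to this thread) Mesos reserves for containers it manages itself.
uses_reserved_prefix() {
  case "$1" in
    mesos-*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Prints (does not execute) a trimmed-down command line for the agent container.
agent_run_command() {
  name="$1"
  if uses_reserved_prefix "$name"; then
    echo "container name '$name' uses the reserved mesos- prefix" >&2
    return 1
  fi
  echo "docker run --net host --pid=host --privileged" \
       "-e MESOS_DOCKER_KILL_ORPHANS=false" \
       "-e MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1" \
       "-v /var/run/docker.sock:/var/run/docker.sock" \
       "--name $name alisw/mesos-slave:1.0.1"
}

agent_run_command agent
```

Calling `agent_run_command mesos-slave` instead is rejected, which mirrors the naming problem discussed below.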



On 13 Jan 2017, 02:14 +0100, haosdent <haosd...@gmail.com>, wrote:
> Is it because your riemann-elasticsearch container could not start 
> successfully?
>
> > On Fri, Jan 13, 2017 at 9:10 AM, Giulio Eulisse <giulio.euli...@gmail.com> 
> > wrote:
> > > MMm... it improved things, but now I get a bunch of:
> > >
> > > ```
> > > W0113 01:06:24.757287 17811 slave.cpp:5220] Failed to get resource statistics for executor 'riemann-elasticsearch.7fc1bc0b-d92c-11e6-9367-02426821a225' of framework 20150626-112246-2475462272-5050-5-: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c': exited with status 1; stderr='Error: No such image, container or task: mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
> > > ```
> > >
> > > and then leaves out a bunch of running containers.
> > >
> > > On 13 Jan 2017, 01:51 +0100, Joseph Wu <jos...@mesosphere.io>, wrote:
> > > > If Apache JIRA were up, I'd point you to a JIRA noting the problem with 
> > > > naming docker containers `mesos-*`, as Mesos reserves that prefix (and 
> > > > kills everything it considers "unknown").
> > > >
> > > > As a quick workaround, try setting this flag to false:
> > > > https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
> > > >
> > > > > On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse 
> > > > > <giulio.euli...@gmail.com> wrote:
> > > > > > MMm... it seems to die after a long sequence of forks, and mesos 
> > > > > > itself seems to be issuing the sigkill. I wonder if it's trying to 
> > > > > > do some cleanup and it does not realise one of the containers is 
> > > > > > the agent itself??? Notice I do have 
> > > > > > `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
> > > > > >
> > > > > > On 13 Jan 2017, 01:23 +0100, Giulio Eulisse 
> > > > > > <giulio.euli...@gmail.com>, wrote:
> > > > > > > Ciao,
> > > > > > >
> > > > > > > the only thing I could find is by running a parallel `docker 
> > > > > > > events`
> > > > > > >
> > > > > > > ```
> > > > > > > 2017-01-13T01:18:20.766593692+01:00 network connect 
> > > > > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> > > > > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> > > > > > >  name=host, type=host)
> > > > > > > 2017-01-13T01:18:20.846137793+01:00 container start 
> > > > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, 
> > > > > > > license=GPLv2, name=mesos-slave, vendor=CentOS)
> > > > > > > 2017-01-13T01:18:20.847965921+01:00 container resize 
> > > > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > > > > (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, 
> > > > > > > license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
> > > > > > > 2017-01-13T01:18:21.610141857+01:00 container kill 
> > > > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, 
> > > > > > > license=GPLv2, name=mesos-slave, signal=15, vendor=CentOS)
> > > > > > > 2017-01-13T01:18:21.610491564+01:00 container kill 
> > > > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, 
> > > > > > > license=GPLv2, name=mesos-slave, signal=9, vendor=CentOS)
> > > > > > > 2017-01-13T01:18:21.646229213+01:00 container die 
> > > > > > > 1fddd8e8f956

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread Giulio Eulisse
MMm... it improved things, but now I get a bunch of:

```
W0113 01:06:24.757287 17811 slave.cpp:5220] Failed to get resource statistics for executor 'riemann-elasticsearch.7fc1bc0b-d92c-11e6-9367-02426821a225' of framework 20150626-112246-2475462272-5050-5-: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c': exited with status 1; stderr='Error: No such image, container or task: mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
```

and then it leaves behind a bunch of running containers.

On 13 Jan 2017, 01:51 +0100, Joseph Wu <jos...@mesosphere.io>, wrote:
> If Apache JIRA were up, I'd point you to a JIRA noting the problem with 
> naming docker containers `mesos-*`, as Mesos reserves that prefix (and kills 
> everything it considers "unknown").
>
> As a quick workaround, try setting this flag to false:
> https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596
>
> > On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse <giulio.euli...@gmail.com> 
> > wrote:
> > > MMm... it seems to die after a long sequence of forks, and mesos itself 
> > > seems to be issuing the sigkill. I wonder if it's trying to do some 
> > > cleanup and it does not realise one of the containers is the agent 
> > > itself??? Notice I do have 
> > > `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
> > >
> > > On 13 Jan 2017, 01:23 +0100, Giulio Eulisse <giulio.euli...@gmail.com>, 
> > > wrote:
> > > > Ciao,
> > > >
> > > > the only thing I could find is by running a parallel `docker events`
> > > >
> > > > ```
> > > > 2017-01-13T01:18:20.766593692+01:00 network connect 
> > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> > > >  name=host, type=host)
> > > > 2017-01-13T01:18:20.846137793+01:00 container start 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, vendor=CentOS)
> > > > 2017-01-13T01:18:20.847965921+01:00 container resize 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, 
> > > > license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
> > > > 2017-01-13T01:18:21.610141857+01:00 container kill 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, signal=15, vendor=CentOS)
> > > > 2017-01-13T01:18:21.610491564+01:00 container kill 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, signal=9, vendor=CentOS)
> > > > 2017-01-13T01:18:21.646229213+01:00 container die 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, 
> > > > license=GPLv2, name=mesos-slave, vendor=CentOS)
> > > > 2017-01-13T01:18:21.652894124+01:00 network disconnect 
> > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> > > >  name=host, type=host)
> > > > 2017-01-13T01:18:21.705874041+01:00 container stop 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, vendor=CentOS)
> > > > ```
> > > >
> > > > Ciao,
> > > > Giulio
> > > >
> > > > On 13 Jan 2017, 01:06 +0100, haosdent <haosd...@gmail.com>, wrote:
> > > > > Hi @Giulio, according to your log, it looks normal. Do you have any 
> > > > > logs related to "SIGKILL"?
> > > > >
> > > > > > On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse 
> > > > > > <giulio.euli...@gmail.com> wrote:
> > > > > > > Hi,
> > > > > > > I’ve a setup where I run mesos in docker which works perfectly 

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread Giulio Eulisse
docker rm mesos-slave
/usr/bin/docker run --pids-limit -1 --net host -m 0b --privileged      \
  --oom-kill-disable \
    -e LIBPROCESS_SSL_KEY_FILE=/etc/grid-security/hostkey.pem \
    -e LIBPROCESS_SSL_CERT_FILE=/etc/grid-security/hostcert.pem      \
    -e LIBPROCESS_SSL_VERIFY_CERT=false      \
    -e LIBPROCESS_SSL_SUPPORT_DOWNGRADE=true      \
    -e LIBPROCESS_SSL_ENABLED=true      \
    -e MESOS_MASTER_ZK=zk://XXX:2181,XXX:2181,XXX:2181/mesos      \
    -e MESOS_ATTRIBUTES="os:Linux;is_virtual:true;cpu:GenuineIntel"      \
    -e MESOS_MASTER_WORKDIR=/build/mesos      \
    -e MESOS_SYSTEMD_ENABLE_SUPPORT=false      \
    -e MESOS_LAUNCHER=posix      \
    -e MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1      \
    -e MESOS_IMAGE_PROVIDERS=docker      \
    -e MESOS_ISOLATION=docker/runtime      \
    -e MESOS_EXTRA_CPUS=1      \
    -e MESOS_MODULES=file://etc/mesos-slave/modules      \
    -e MESOS_RESOURCE_ESTIMATOR=org_apache_mesos_FixedResourceEstimator      \
    -e MESOS_QOS_CONTROLLER=org_apache_mesos_LoadQoSController      \
    -e MESOS_LOGGING_LEVEL=WARNING \
    -e JENKINS_UID=203      \
    -e JENKINS_GID=992      \
    -v /var/run/docker.sock:/var/run/docker.sock      \
    -v /sys/fs/cgroup:/sys/fs/cgroup      \
    -v /build/docker:/var/lib/docker      \
    -v /build:/build      \
    -v /build/log:/var/log \
    -v /etc/grid-security:/etc/grid-security:ro \
    -it --pid=host --name mesos-agent \
    alisw/mesos-slave:1.0.1 /bin/bash


I also tried with `mesos-agent` as a name.

On 13 Jan 2017, 01:46 +0100, haosdent <haosd...@gmail.com>, wrote:
> Hi, what is the docker command you use to start the agents? I remember mesos 
> tries to recover containers whose names start with mesos-slave and kills them 
> if it cannot recover them successfully.
>
> > On Jan 13, 2017 8:43 AM, "Giulio Eulisse" <giulio.euli...@gmail.com> wrote:
> > > MMm... it seems to die after a long sequence of forks, and mesos itself 
> > > seems to be issuing the sigkill. I wonder if it's trying to do some 
> > > cleanup and it does not realise one of the containers is the agent 
> > > itself??? Notice I do have 
> > > `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
> > >
> > > On 13 Jan 2017, 01:23 +0100, Giulio Eulisse <giulio.euli...@gmail.com>, 
> > > wrote:
> > > > Ciao,
> > > >
> > > > the only thing I could find is by running a parallel `docker events`
> > > >
> > > > ```
> > > > 2017-01-13T01:18:20.766593692+01:00 network connect 
> > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> > > >  name=host, type=host)
> > > > 2017-01-13T01:18:20.846137793+01:00 container start 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, vendor=CentOS)
> > > > 2017-01-13T01:18:20.847965921+01:00 container resize 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, 
> > > > license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
> > > > 2017-01-13T01:18:21.610141857+01:00 container kill 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, signal=15, vendor=CentOS)
> > > > 2017-01-13T01:18:21.610491564+01:00 container kill 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> > > > name=mesos-slave, signal=9, vendor=CentOS)
> > > > 2017-01-13T01:18:21.646229213+01:00 container die 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, 
> > > > license=GPLv2, name=mesos-slave, vendor=CentOS)
> > > > 2017-01-13T01:18:21.652894124+01:00 network disconnect 
> > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71,
> > > >  name=host, type=host)
> > > > 2017-01-13T01:18:21.705874041+01:00 container stop 
> > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, licens

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread Giulio Eulisse
MMm... it seems to die after a long sequence of forks, and mesos itself seems 
to be issuing the sigkill. I wonder if it's trying to do some cleanup and it 
does not realise one of the containers is the agent itself??? Notice I do have 
`MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.

On 13 Jan 2017, 01:23 +0100, Giulio Eulisse <giulio.euli...@gmail.com>, wrote:
> Ciao,
>
> the only thing I could find is by running a parallel `docker events`
>
> ```
> 2017-01-13T01:18:20.766593692+01:00 network connect 
> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, 
> name=host, type=host)
> 2017-01-13T01:18:20.846137793+01:00 container start 
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> name=mesos-slave, vendor=CentOS)
> 2017-01-13T01:18:20.847965921+01:00 container resize 
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, 
> license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
> 2017-01-13T01:18:21.610141857+01:00 container kill 
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> name=mesos-slave, signal=15, vendor=CentOS)
> 2017-01-13T01:18:21.610491564+01:00 container kill 
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> name=mesos-slave, signal=9, vendor=CentOS)
> 2017-01-13T01:18:21.646229213+01:00 container die 
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, 
> license=GPLv2, name=mesos-slave, vendor=CentOS)
> 2017-01-13T01:18:21.652894124+01:00 network disconnect 
> 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
> (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, 
> name=host, type=host)
> 2017-01-13T01:18:21.705874041+01:00 container stop 
> 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
> (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
> name=mesos-slave, vendor=CentOS)
> ```
>
> Ciao,
> Giulio
>
> On 13 Jan 2017, 01:06 +0100, haosdent <haosd...@gmail.com>, wrote:
> > Hi @Giulio, according to your log, it looks normal. Do you have any logs 
> > related to "SIGKILL"?
> >
> > > On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse 
> > > <giulio.euli...@gmail.com> wrote:
> > > > Hi,
> > > > I’ve a setup where I run mesos in docker which works perfectly when I 
> > > > use 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 
> > > > 1.0.0) and it seems to receive a sigkill right after saying:
> > > >
> > > > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > > > I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by 
> > > > centos
> > > > I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
> > > > I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
> > > > I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 
> > > > 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> > > > W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections 
> > > > will be downgraded to a non-SSL socket
> > > > W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections 
> > > > will be downgraded to a non-SSL socket
> > > > E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 
> > > > 2>&1' failed; this is the output:
> > > > sh: hadoop: command not found
> > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: 
> > > > Client environment:zookeeper.version=zookeeper C client 3.4.8
> > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: 
> > > > Client environment:host.name=.XXX.ch
> > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: 
> > > > Client environment:os.name=Linux
> > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: 
> > > > Client environment:os.arch=3.10.0-229.14.1.el7.x86_64
> > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: 
> > > > Client environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
> > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700)

Re: Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread Giulio Eulisse
Ciao,

the only thing I could find came from running `docker events` in parallel:

```
2017-01-13T01:18:20.766593692+01:00 network connect 
32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
(container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, 
name=host, type=host)
2017-01-13T01:18:20.846137793+01:00 container start 
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
name=mesos-slave, vendor=CentOS)
2017-01-13T01:18:20.847965921+01:00 container resize 
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
(build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
name=mesos-slave, vendor=CentOS, width=134)
2017-01-13T01:18:21.610141857+01:00 container kill 
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
name=mesos-slave, signal=15, vendor=CentOS)
2017-01-13T01:18:21.610491564+01:00 container kill 
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
name=mesos-slave, signal=9, vendor=CentOS)
2017-01-13T01:18:21.646229213+01:00 container die 
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
(build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, 
license=GPLv2, name=mesos-slave, vendor=CentOS)
2017-01-13T01:18:21.652894124+01:00 network disconnect 
32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 
(container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, 
name=host, type=host)
2017-01-13T01:18:21.705874041+01:00 container stop 
1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 
(build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, 
name=mesos-slave, vendor=CentOS)
```
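
In case it helps with reading the above: the exit code 143 is 128 + 15, so the container died from the SIGTERM (signal=15), and the SIGKILL (signal=9) followed less than a millisecond later. Below is a small sketch for pulling the signals out of saved `docker events` output. It assumes one event per line, as docker prints it (the lines above are wrapped by the mail client), and `kill_signals` and `exit_code_signal` are names I made up.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical helpers, not docker commands.

# Reads `docker events` lines on stdin and prints the signal number of
# every "container kill" event, one per line.
kill_signals() {
  grep ' container kill ' | sed -n 's/.*signal=\([0-9][0-9]*\).*/\1/p'
}

# Exit codes above 128 conventionally mean "terminated by signal (code - 128)".
exit_code_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

# Example with one of the events above, unwrapped onto a single line.
event='2017-01-13T01:18:21.610141857+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=15, vendor=CentOS)'
printf '%s\n' "$event" | kill_signals   # prints 15
exit_code_signal 143                    # prints 15
```

On a live system, something like `docker events --filter event=kill` should narrow the stream down to these events before piping it into `kill_signals`.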

Ciao,
Giulio

On 13 Jan 2017, 01:06 +0100, haosdent <haosd...@gmail.com>, wrote:
> Hi @Giulio, according to your log, it looks normal. Do you have any logs 
> related to "SIGKILL"?
>
> > On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse <giulio.euli...@gmail.com> 
> > wrote:
> > > Hi,
> > > I’ve a setup where I run mesos in docker which works perfectly when I use 
> > > 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0) 
> > > and it seems to receive a sigkill right after saying:
> > >
> > > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > > I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by 
> > > centos
> > > I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
> > > I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
> > > I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 
> > > 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
> > > W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will 
> > > be downgraded to a non-SSL socket
> > > W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will 
> > > be downgraded to a non-SSL socket
> > > E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' 
> > > failed; this is the output:
> > > sh: hadoop: command not found
> > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client 
> > > environment:zookeeper.version=zookeeper C client 3.4.8
> > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client 
> > > environment:host.name=.XXX.ch
> > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client 
> > > environment:os.name=Linux
> > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client 
> > > environment:os.arch=3.10.0-229.14.1.el7.x86_64
> > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client 
> > > environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
> > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client 
> > > environment:user.name=(null)
> > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client 
> > > environment:user.home=/root
> > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client 
> > > environment:user.dir=/
> > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: 
> > > Initiating client connection, 
> > > host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 
> > > sessionTimeout=1 watcher=0x7f950ee20300 sessionId=0 
> > > sessionPasswd= context=0x
> > > 7f94fc60 flags=0
> > > 2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_

Sigkill while running mesos agent (1.0.1) in docker

2017-01-12 Thread Giulio Eulisse
Hi,

I’ve a setup where I run mesos in docker which works perfectly when I use
0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0)
and it seems to receive a sigkill right after saying:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0112 23:22:09.889120  4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos
I0112 23:22:09.889181  4934 main.cpp:244] Version: 1.0.1
I0112 23:22:09.889184  4934 main.cpp:247] Git tag: 1.0.1
I0112 23:22:09.889188  4934 main.cpp:251] Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
W0112 23:22:09.890808  4934 openssl.cpp:398] Failed SSL connections will be downgraded to a non-SSL socket
W0112 23:22:09.891237  4934 process.cpp:881] Failed SSL connections will be downgraded to a non-SSL socket
E0112 23:22:10.129096  4934 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
sh: hadoop: command not found
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client environment:host.name=.XXX.ch
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-229.14.1.el7.x86_64
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client environment:user.dir=/
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=1 watcher=0x7f950ee20300 sessionId=0 sessionPasswd= context=0x7f94fc60 flags=0
2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: initiated connection to server [XX.YY.ZZ.WW:2181]
2017-01-12 23:22:10,146:4934(0x7f9501fd7700):ZOO_INFO@check_events@1775: session establishment complete on server [XX.YY.ZZ.WW:2181], sessionId=0x35828ae70fb2065, negotiated timeout=1

Any idea of what might be going on? Looks like an OOM, but I do not see it
in /var/log/messages and it also happens with --oom-kill-disable.

--
Ciao,
Giulio


Re: Documentation for ACCEPT HTTP API

2016-07-04 Thread Giulio Eulisse
Dear Jay,

thank you for your reply.

Yes, I am aware of those pages and that's what I used so far. As I said,
they are actually quite clear and allowed me to get a simple "reject all
offers" framework up and running in one evening.

However, if you look at the scheduler-http-api page, you will see that the
documentation for the "ACCEPT" message is completely lacking the
description of what should go into the "operations" field of the JSON
payload, unless I am missing something completely trivial. My question is
therefore where I can find the schema of what should go there (and possibly
a simple example).
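
To make the question concrete, this is my current best guess at what an ACCEPT call with a single LAUNCH operation looks like, pieced together from the `Offer.Operation` message in the v1 protobuf definitions. The field names are assumptions to be checked against the proto, all the IDs are placeholders, and `accept_payload` is just a helper name I made up:

```shell
#!/bin/sh
# Sketch only: accept_payload is a hypothetical helper and the JSON field
# names are my reading of the v1 protobufs, not confirmed documentation.

# Prints an ACCEPT call with one LAUNCH operation.
# Arguments: framework id, offer id, agent id, task id (all placeholders).
accept_payload() {
  framework_id="$1"; offer_id="$2"; agent_id="$3"; task_id="$4"
  cat <<EOF
{
  "framework_id": {"value": "$framework_id"},
  "type": "ACCEPT",
  "accept": {
    "offer_ids": [{"value": "$offer_id"}],
    "operations": [
      {
        "type": "LAUNCH",
        "launch": {
          "task_infos": [
            {
              "name": "example-task",
              "task_id": {"value": "$task_id"},
              "agent_id": {"value": "$agent_id"},
              "command": {"shell": true, "value": "sleep 10"},
              "resources": [
                {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
                {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
              ]
            }
          ]
        }
      }
    ],
    "filters": {"refuse_seconds": 5.0}
  }
}
EOF
}

accept_payload "framework-id" "offer-id" "agent-id" "task-1"
```

As far as I understand, the output would then be POSTed to the master's `/api/v1/scheduler` endpoint with `Content-Type: application/json` and the `Mesos-Stream-Id` header returned by the SUBSCRIBE response.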

-- 
Ciao,
Giulio

On Mon, Jul 4, 2016 at 1:24 PM Jay JN Guo <guojian...@cn.ibm.com> wrote:

> Hi Giulio,
>
> For scheduler/executor HTTP API, please refer to:
> http://mesos.apache.org/documentation/latest/scheduler-http-api/
> http://mesos.apache.org/documentation/latest/executor-http-api/
>
> If you find anything missing there, let us know.
>
> Also, we are working on Operator HTTP API documentation and hopefully
> could get it out soon.
>
> cheers,
> /Jay
>
>
> - Original message -
> From: Giulio Eulisse <giulio.euli...@gmail.com>
> To: "user@mesos.apache.org" <user@mesos.apache.org>
> Cc:
> Subject: Documentation for ACCEPT HTTP API
>
> Date: Mon, Jul 4, 2016 6:01 PM
>
> Dear all,
>
> I've started writing a simple framework using node.js and the HTTP
> Scheduler API. I've managed to subscribe to the event stream, parse
> messages and decline offers quite easily, however I'm having a bit of
> trouble accepting the offers and launching tasks, since I cannot find any
> complete example for the JSON format the various operations should have. I
> assume I can reverse engineer mesos.proto and do a bit of trial and error,
> but I was wondering if I was simply missing some proper documentation. Any
> suggestions?
>
> --
> Ciao,
> Giulio
>
>


Documentation for ACCEPT HTTP API

2016-07-04 Thread Giulio Eulisse
Dear all,

I've started writing a simple framework using node.js and the HTTP
Scheduler API. I've managed to subscribe to the event stream, parse
messages and decline offers quite easily, however I'm having a bit of
trouble accepting the offers and launching tasks, since I cannot find any
complete example for the JSON format the various operations should have. I
assume I can reverse engineer mesos.proto and do a bit of trial and error,
but I was wondering if I was simply missing some proper documentation. Any
suggestions?

-- 
Ciao,
Giulio


Unable to start mesos slave 0.25+ in docker

2016-01-29 Thread Giulio Eulisse
Ciao,

I filed:

https://issues.apache.org/jira/browse/MESOS-4543

about not being able to start a mesos slave in docker since 0.25 (including
0.26, did not test master). Has anyone seen the same problem?

-- 
Ciao,
Giulio


Re: Running mesos-execute inside docker.

2015-05-21 Thread Giulio Eulisse
Mmm, no this does not seem to work. The message is still there. Any 
other suggestions?


--
Ciao,
Giulio

On 21 May 2015, at 17:43, Tyson Norris wrote:

You might try adding --pid=host. I found that when running a docker-based 
executor with the slave also running as a docker container, I had to do 
this so that the pids are visible between containers.


Tyson

On May 21, 2015, at 6:04 AM, Giulio Eulisse <giulio.euli...@cern.ch> wrote:



Hi,

I've a problem which can be reduced to running:

mesos-execute --name=foo --command=uname -a  hostname 
--master=leader.mesos:5050



inside a docker container. If I run without --net=host, it blocks 
completely (I guess the master / slave cannot communicate back to the 
framework). If I run with --net=host everything works, but I get:


May 21 14:59:13 cmsbuild30 mesos-slave[1514]: I0521 14:59:13.115659  1546 slave.cpp:1533] Asked to shut down framework 20150418-223037-3834547840-5050-6-2757 by master@128.142.142.228:5050
May 21 14:59:13 cmsbuild30 mesos-slave[1514]: W0521 14:59:13.117231  1546 slave.cpp:1548] Cannot shut down unknown framework 20150418-223037-3834547840-5050-6-2757



in my host machine logs, which is not ideal. Any idea on how to do 
this correctly?


The actual problem I'm trying to solve is using the mesos plugin for a 
jenkins instance which runs inside docker.


--
Ciao
Giulio


Re: Mesos slaves connecting but not active.

2015-03-24 Thread Giulio Eulisse

Ciao,

I updated to 0.21.1 and it seems to have fixed the issue (at least the 
slave reconnects). docker is still slow at deleting stuff.


--
Ciao,
Giulio

On 23 Mar 2015, at 18:20, Tim Chen wrote:


How many containers are you running, and what is your system like?

Also are you able to capture through perf or strace what docker rm is
blocked on?

Tim


On Mon, Mar 23, 2015 at 10:12 AM, Giulio Eulisse <giulio.euli...@cern.ch> wrote:

I suspect my problem is that docker rm takes forever in my case. I'm not
running docker in docker though.


On 23 Mar 2015, at 18:01, haosdent wrote:

Is your issue related to this?

https://issues.apache.org/jira/browse/MESOS-2115

On Tue, Mar 24, 2015 at 12:52 AM, Giulio Eulisse <giulio.euli...@cern.ch> wrote:

Hi,

I'm running 0.20.1 and I seem to have trouble because a mesos slave is not
able to recover the docker containers after a restart, resulting in a very
long wait.

Is this some known issue?

--
Ciao,
Giulio





--
Best Regards,
Haosdent Huang





Re: Mesos slaves connecting but not active.

2015-03-23 Thread Giulio Eulisse
I suspect my problem is that docker rm takes forever in my case. I'm 
not running docker in docker though.


On 23 Mar 2015, at 18:01, haosdent wrote:


Is your issue related to this?
https://issues.apache.org/jira/browse/MESOS-2115

On Tue, Mar 24, 2015 at 12:52 AM, Giulio Eulisse <giulio.euli...@cern.ch> wrote:

Hi,

I'm running 0.20.1 and I seem to have trouble because a mesos slave is not
able to recover the docker containers after a restart, resulting in a very
long wait.

Is this some known issue?

--
Ciao,
Giulio





--
Best Regards,
Haosdent Huang


Re: Mesos slaves connecting but not active.

2015-03-23 Thread Giulio Eulisse

Ciao,


How many containers are you running, and what is your system like?


I have about a dozen slaves with 2 or 3 containers per slave. I'm 
running on a CentOS 6 derived distribution (Scientific Linux CERN). On 
the specific slave I do not have any running containers:


```
[root@cmsbuild11 ~]# docker ps -q  | wc
  0   0   0
```

but I do have a bunch of dead ones:

```
[root@cmsbuild11 ~]# docker ps -qa  | wc
999 999   12987
```

due to some runaway process.
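
What I am considering for the cleanup is removing the dead containers in small batches rather than all at once, so that a single slow `docker rm` does not hold up everything else. A rough sketch follows. `remove_in_batches` and `DRY_RUN` are names I made up, and by default the commands are only printed, not run.

```shell
#!/bin/sh
# Sketch only: remove_in_batches and DRY_RUN are hypothetical names.

# Reads container IDs on stdin and runs `docker rm` on at most $1 IDs at a
# time. With DRY_RUN=1 (the default here) the commands are only printed.
remove_in_batches() {
  batch_size="$1"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    xargs -n "$batch_size" echo docker rm
  else
    xargs -n "$batch_size" docker rm
  fi
}

# Dry-run example with three fake IDs and a batch size of two.
printf '%s\n' aaa111 bbb222 ccc333 | remove_in_batches 2
```

For the real thing the IDs would come from `docker ps -aq --filter status=exited` with `DRY_RUN` set to 0, assuming the status filter is available on this docker version.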

By attaching via gdb to the docker daemon I get:

```
#0  0x005b0ad4 in syscall.Syscall ()
#1  0x0084f91b in github.com/docker/docker/pkg/devicemapper.ioctlBlkDiscard ()
#2  0x0010 in ?? ()
#3  0x000b in ?? ()
#4  0x1277 in ?? ()
#5  0x7f06d004e128 in ?? ()
#6  0x00c209341e68 in ?? ()
#7  0x7f06d004e140 in ?? ()
#8  0x0018 in ?? ()
#9  0x00c209341e40 in ?? ()
#10 0x in ?? ()
```

for a few of the running threads (the other ones are blocked in some 
futex). Note that I'm running on a Ceph volume.


--
Ciao,
Giulio



Also are you able to capture through perf or strace what docker rm is
blocked on?

Tim


On Mon, Mar 23, 2015 at 10:12 AM, Giulio Eulisse <giulio.euli...@cern.ch> wrote:

I suspect my problem is that docker rm takes forever in my case. I'm not
running docker in docker though.


On 23 Mar 2015, at 18:01, haosdent wrote:

Is your issue related to this?

https://issues.apache.org/jira/browse/MESOS-2115

On Tue, Mar 24, 2015 at 12:52 AM, Giulio Eulisse <giulio.euli...@cern.ch> wrote:

Hi,

I'm running 0.20.1 and I seem to have trouble because a mesos slave is not
able to recover the docker containers after a restart, resulting in a very
long wait.

Is this some known issue?

--
Ciao,
Giulio





--
Best Regards,
Haosdent Huang





Migration from mesos 0.19 to mesos 0.20

2014-08-27 Thread Giulio Eulisse

Hi,

are there any best practices or recommendations for upgrading from mesos 
0.19 to mesos 0.20?


--
Ciao,
Giulio