Re: Sigkill while running mesos agent (1.0.1) in docker
As the log shows, it failed when running the command below to check the container status.

```
docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
```

Have you mounted the sock file from the host into your agent container?
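The socket question above can be checked from inside the agent container with something like the sketch below. `check_docker_sock` is a hypothetical helper name, not part of Mesos or Docker, and the default path is simply the one from the failing `docker -H unix:///var/run/docker.sock inspect` command.

```shell
# Hypothetical helper (not part of Mesos or Docker): report whether the
# given path exists and is a Unix socket. The default path is the one
# used in the failing `docker -H unix:///var/run/docker.sock` call.
check_docker_sock() {
  sock="${1:-/var/run/docker.sock}"
  if [ -S "$sock" ]; then
    echo "socket present: $sock"
  else
    echo "socket missing: $sock"
    return 1
  fi
}

# Example, to be run inside the agent container:
#   check_docker_sock
```

If this reports the socket as missing inside the container, the `-v /var/run/docker.sock:/var/run/docker.sock` bind mount probably did not take effect.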
Re: Sigkill while running mesos agent (1.0.1) in docker
Actually, no. The docker containers seem to be running just fine. Looks like mesos is not able to notice that. Did anything change in the way mesos looks them up? Notice I've both renamed my container to "agent" and added MESOS_DOCKER_KILL_ORPHANS=false.
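For reference, `MESOS_DOCKER_KILL_ORPHANS` is the environment form of an agent flag. A small sketch of the naming convention follows, assuming the usual Mesos rule that a flag is set by upper-casing its name and adding a `MESOS_` prefix. The helper name `mesos_flag_to_env` is made up for illustration, and `docker_kill_orphans` is assumed to be the flag the linked `flags.cpp` lines refer to.

```shell
# Hypothetical helper: map an agent flag name to its MESOS_-prefixed
# environment variable name (upper-case the name, add the prefix).
mesos_flag_to_env() {
  printf 'MESOS_%s\n' "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
}

# Prints the variable name used earlier in this thread.
mesos_flag_to_env docker_kill_orphans
```

The resulting name is what would be passed to the agent container with `-e`, as in `-e MESOS_DOCKER_KILL_ORPHANS=false`.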
Re: Sigkill while running mesos agent (1.0.1) in docker
Is it because your riemann-elasticsearch container could not start successfully?
Re: Sigkill while running mesos agent (1.0.1) in docker
MMm... it improved things, but now I get a bunch of:

```
W0113 01:06:24.757287 17811 slave.cpp:5220] Failed to get resource statistics for executor 'riemann-elasticsearch.7fc1bc0b-d92c-11e6-9367-02426821a225' of framework 20150626-112246-2475462272-5050-5-: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c': exited with status 1; stderr='Error: No such image, container or task: mesos-498ff8de-782e-482a-9478-69d3faf5a853-S5.a242fc24-0d32-46e6-af63-299cb82fc01c
```

and then it leaves a bunch of running containers behind.
Re: Sigkill while running mesos agent (1.0.1) in docker
Yep, it was fixed in 1.1.0: https://www.mail-archive.com/issues@mesos.apache.org/msg33959.html
Re: Sigkill while running mesos agent (1.0.1) in docker
If Apache JIRA were up, I'd point you to a JIRA noting the problem with naming docker containers `mesos-*`, as Mesos reserves that prefix (and kills everything it considers "unknown").

As a quick workaround, try setting this flag to false:
https://github.com/apache/mesos/blob/1.1.x/src/slave/flags.cpp#L590-L596

On Thu, Jan 12, 2017 at 4:41 PM, Giulio Eulisse wrote:

> MMm... it seems to die after a long sequence of forks, and mesos itself
> seems to be issuing the sigkill. I wonder if it's trying to do some cleanup
> and it does not realise one of the containers is the agent itself??? Notice
> I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
>
> On 13 Jan 2017, 01:23 +0100, Giulio Eulisse wrote:
>
> Ciao,
>
> the only thing I could find is by running a parallel `docker events`
>
> ```
> 2017-01-13T01:18:20.766593692+01:00 network connect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host)
> 2017-01-13T01:18:20.846137793+01:00 container start 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
> 2017-01-13T01:18:20.847965921+01:00 container resize 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
> 2017-01-13T01:18:21.610141857+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=15, vendor=CentOS)
> 2017-01-13T01:18:21.610491564+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=9, vendor=CentOS)
> 2017-01-13T01:18:21.646229213+01:00 container die 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
> 2017-01-13T01:18:21.652894124+01:00 network disconnect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host)
> 2017-01-13T01:18:21.705874041+01:00 container stop 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
> ```
>
> Ciao,
> Giulio
>
> On 13 Jan 2017, 01:06 +0100, haosdent wrote:
>
> Hi, @Giuliio According to your log, it looks normal. Do you have any logs
> related to "SIGKILL"?
>
> On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse wrote:
>
>> Hi,
>>
>> I’ve a setup where I run mesos in docker which works perfectly when I use
>> 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0)
>> and it seems to receive a sigkill right after saying:
>>
>> WARNING: Logging before InitGoogleLogging() is written to STDERR
>> I0112 23:22:09.889120 4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos
>> I0112 23:22:09.889181 4934 main.cpp:244] Version: 1.0.1
>> I0112 23:22:09.889184 4934 main.cpp:247] Git tag: 1.0.1
>> I0112 23:22:09.889188 4934 main.cpp:251] Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
>> W0112 23:22:09.890808 4934 openssl.cpp:398] Failed SSL connections will be downgraded to a non-SSL socket
>> W0112 23:22:09.891237 4934 process.cpp:881] Failed SSL connections will be downgraded to a non-SSL socket
>> E0112 23:22:10.129096 4934 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
>> sh: hadoop: command not found
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client environment:host.name=.XXX.ch
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-229.14.1.el7.x86_64
>> 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client environment:user.home=/root
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client environment:user.dir=/
>> 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.
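The reserved-prefix problem described at the top of this message can be guarded against before starting the agent container. The following is a minimal sketch, assuming the `mesos-*` pattern is exactly as stated in this message; `uses_reserved_prefix` is a hypothetical helper, and the candidate names are the ones that appear in this thread.

```shell
# Hypothetical helper: succeed if a container name matches the
# "mesos-*" pattern that Mesos is said to reserve in this thread.
uses_reserved_prefix() {
  case "$1" in
    mesos-*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Check the container names mentioned in this thread.
for name in mesos-slave mesos-agent agent; do
  if uses_reserved_prefix "$name"; then
    echo "$name: matches mesos-*, may be treated as an unknown container"
  else
    echo "$name: ok"
  fi
done
```

By this check, only a name such as `agent` would stay clear of the prefix, so it is a safe value to pass to `docker run --name`.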
Re: Sigkill while running mesos agent (1.0.1) in docker
docker rm mesos-slave
/usr/bin/docker run --pids-limit -1 --net host -m 0b --privileged \
  --oom-kill-disable \
  -e LIBPROCESS_SSL_KEY_FILE=/etc/grid-security/hostkey.pem \
  -e LIBPROCESS_SSL_CERT_FILE=/etc/grid-security/hostcert.pem \
  -e LIBPROCESS_SSL_VERIFY_CERT=false \
  -e LIBPROCESS_SSL_SUPPORT_DOWNGRADE=true \
  -e LIBPROCESS_SSL_ENABLED=true \
  -e MESOS_MASTER_ZK=zk://XXX:2181,XXX:2181,XXX:2181/mesos \
  -e MESOS_ATTRIBUTES="os:Linux;is_virtual:true;cpu:GenuineIntel" \
  -e MESOS_MASTER_WORKDIR=/build/mesos \
  -e MESOS_SYSTEMD_ENABLE_SUPPORT=false \
  -e MESOS_LAUNCHER=posix \
  -e MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1 \
  -e MESOS_IMAGE_PROVIDERS=docker \
  -e MESOS_ISOLATION=docker/runtime \
  -e MESOS_EXTRA_CPUS=1 \
  -e MESOS_MODULES=file://etc/mesos-slave/modules \
  -e MESOS_RESOURCE_ESTIMATOR=org_apache_mesos_FixedResourceEstimator \
  -e MESOS_QOS_CONTROLLER=org_apache_mesos_LoadQoSController \
  -e MESOS_LOGGING_LEVEL=WARNING \
  -e JENKINS_UID=203 \
  -e JENKINS_GID=992 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /sys/fs/cgroup:/sys/fs/cgroup \
  -v /build/docker:/var/lib/docker \
  -v /build:/build \
  -v /build/log:/var/log \
  -v /etc/grid-security:/etc/grid-security:ro -it --pid=host --name mesos-agent -it alisw/mesos-slave:1.0.1 /bin/bash

I also tried with `mesos-agent` as a name.

On 13 Jan 2017, 01:46 +0100, haosdent wrote:
> Hi, what is the docker command you use to start agents? I remember mesos
> would try to recover containers whose names start with mesos-slave and kill
> them if it could not recover them successfully.
>
> > On Jan 13, 2017 8:43 AM, "Giulio Eulisse" wrote:
> > > MMm... it seems to die after a long sequence of forks, and mesos itself
> > > seems to be issuing the sigkill. I wonder if it's trying to do some
> > > cleanup and it does not realise one of the containers is the agent
> > > itself??? Notice I do have
> > > `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set.
> > > > > > On 13 Jan 2017, 01:23 +0100, Giulio Eulisse , > > > wrote: > > > > Ciao, > > > > > > > > the only thing I could find is by running a parallel `docker events` > > > > > > > > ``` > > > > 2017-01-13T01:18:20.766593692+01:00 network connect > > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 > > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, > > > > name=host, type=host) > > > > 2017-01-13T01:18:20.846137793+01:00 container start > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > > > > name=mesos-slave, vendor=CentOS) > > > > 2017-01-13T01:18:20.847965921+01:00 container resize > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > > > > (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, > > > > license=GPLv2, name=mesos-slave, vendor=CentOS, width=134) > > > > 2017-01-13T01:18:21.610141857+01:00 container kill > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > > > > name=mesos-slave, signal=15, vendor=CentOS) > > > > 2017-01-13T01:18:21.610491564+01:00 container kill > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > > > > name=mesos-slave, signal=9, vendor=CentOS) > > > > 2017-01-13T01:18:21.646229213+01:00 container die > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > > > > (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, > > > > license=GPLv2, name=mesos-slave, vendor=CentOS) > > > > 2017-01-13T01:18:21.652894124+01:00 network disconnect > > > > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 > > > > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, > > > > name=host, type=host) > > > 
> 2017-01-13T01:18:21.705874041+01:00 container stop > > > > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > > > > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > > > > name=mesos-slave, vendor=CentOS) > > > > ``` > > > > > > > > Ciao, > > > > Giulio > > > > > > > > On 13 Jan 2017, 01:06 +0100, haosdent , wrote: > > > > > Hi, @Giuliio According to your log, it looks normal. Do you have any > > > > > logs related to "SIGKILL"? > > > > > > > > > > > On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse > > > > > > wrote: > > > > > > > Hi, > > > > > > > I’ve a setup where I run mesos in docker which works perfectly > > > > > > > when I use 0.28.2. I now migrated to 1.0.1 (but it’s the same > > > > > > > with 1.1.0 and 1.0.0) and it seems to receive a sigkill right > > > > > > >
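For anyone reading this later: elsewhere in this thread it is noted that Mesos reserves the `mesos-` container-name prefix and kills containers with that prefix that it cannot recover. Assuming that prefix is what matters, a quick check of a candidate name before passing it to `--name` might look like the sketch below. `has_reserved_prefix` is a hypothetical helper written for this example, not a Docker or Mesos command.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a container name starts with the
# "mesos-" prefix (assumed here to be the prefix Mesos treats as its own).
has_reserved_prefix() {
    case "$1" in
        mesos-*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Try the names that came up in this thread.
for name in mesos-slave mesos-agent agent; do
    if has_reserved_prefix "$name"; then
        echo "$name: starts with mesos-, may be treated as an orphan"
    else
        echo "$name: does not use the reserved prefix"
    fi
done
```

Under that assumption, both `mesos-slave` and `mesos-agent` would collide with the prefix, while a name such as `agent` would not.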
Re: Sigkill while running mesos agent (1.0.1) in docker
Hi, what docker command do you use to start the agents? As I remember, Mesos tries to recover containers whose names start with mesos-slave and kills them if it cannot recover them successfully.

On Jan 13, 2017 8:43 AM, "Giulio Eulisse" wrote: MMm... it seems to die after a long sequence of forks, and mesos itself seems to be issuing the sigkill. I wonder if it's trying to do some cleanup and it does not realise one of the containers is the agent itself??? Notice I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set. On 13 Jan 2017, 01:23 +0100, Giulio Eulisse , wrote: Ciao, the only thing I could find is by running a parallel `docker events` ``` 2017-01-13T01:18:20.766593692+01:00 network connect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host) 2017-01-13T01:18:20.846137793+01:00 container start 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS) 2017-01-13T01:18:20.847965921+01:00 container resize 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS, width=134) 2017-01-13T01:18:21.610141857+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=15, vendor=CentOS) 2017-01-13T01:18:21.610491564+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=9, vendor=CentOS) 2017-01-13T01:18:21.646229213+01:00 container die 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, license=GPLv2,
name=mesos-slave, vendor=CentOS) 2017-01-13T01:18:21.652894124+01:00 network disconnect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container= 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host) 2017-01-13T01:18:21.705874041+01:00 container stop 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS) ``` Ciao, Giulio On 13 Jan 2017, 01:06 +0100, haosdent , wrote: Hi, @Giuliio According to your log, it looks normal. Do you have any logs related to "SIGKILL"? On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse wrote: > Hi, > > I’ve a setup where I run mesos in docker which works perfectly when I use > 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0) > and it seems to receive a sigkill right after saying: > > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0112 23:22:09.889120 4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos > I0112 23:22:09.889181 4934 main.cpp:244] Version: 1.0.1 > I0112 23:22:09.889184 4934 main.cpp:247] Git tag: 1.0.1 > I0112 23:22:09.889188 4934 main.cpp:251] Git SHA: > 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > W0112 23:22:09.890808 4934 openssl.cpp:398] Failed SSL connections will be > downgraded to a non-SSL socket > W0112 23:22:09.891237 4934 process.cpp:881] Failed SSL connections will be > downgraded to a non-SSL socket > E0112 23:22:10.129096 4934 shell.hpp:106] Command 'hadoop version 2>&1' > failed; this is the output: > sh: hadoop: command not found > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client > environment:zookeeper.version=zookeeper C client 3.4.8 > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client > environment:host.name=.XXX.ch > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client > environment:os.name=Linux > 2017-01-12 
23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client > environment:os.arch=3.10.0-229.14.1.el7.x86_64 > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client > environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015 > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client > environment:user.name=(null) > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client > environment:user.home=/root > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client > environment:user.dir=/ > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: > Initiating client connection, > host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=1 > watcher=0x7f950ee20300 sessionId=0 sessionPasswd= context=0x > 7f94fc60 flags=0 > 2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: > initiated connection to server [XX.YY.ZZ.WW:2181] > 2017-01-12 23:22:10,146:4934(0x7f95
Re: Sigkill while running mesos agent (1.0.1) in docker
MMm... it seems to die after a long sequence of forks, and mesos itself seems to be issuing the sigkill. I wonder if it's trying to do some cleanup and it does not realise one of the containers is the agent itself??? Notice I do have `MESOS_DOCKER_MESOS_IMAGE=alisw/mesos-slave:1.0.1` set. On 13 Jan 2017, 01:23 +0100, Giulio Eulisse , wrote: > Ciao, > > the only thing I could find is by running a parallel `docker events` > > ``` > 2017-01-13T01:18:20.766593692+01:00 network connect > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 > (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, > name=host, type=host) > 2017-01-13T01:18:20.846137793+01:00 container start > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > name=mesos-slave, vendor=CentOS) > 2017-01-13T01:18:20.847965921+01:00 container resize > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, > license=GPLv2, name=mesos-slave, vendor=CentOS, width=134) > 2017-01-13T01:18:21.610141857+01:00 container kill > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > name=mesos-slave, signal=15, vendor=CentOS) > 2017-01-13T01:18:21.610491564+01:00 container kill > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > name=mesos-slave, signal=9, vendor=CentOS) > 2017-01-13T01:18:21.646229213+01:00 container die > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, > license=GPLv2, name=mesos-slave, vendor=CentOS) > 2017-01-13T01:18:21.652894124+01:00 network disconnect > 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 > 
(container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, > name=host, type=host) > 2017-01-13T01:18:21.705874041+01:00 container stop > 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 > (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, > name=mesos-slave, vendor=CentOS) > ``` > > Ciao, > Giulio > > On 13 Jan 2017, 01:06 +0100, haosdent , wrote: > > Hi, @Giuliio According to your log, it looks normal. Do you have any logs > > related to "SIGKILL"? > > > > > On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse > > > wrote: > > > > Hi, > > > > I’ve a setup where I run mesos in docker which works perfectly when I > > > > use 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and > > > > 1.0.0) and it seems to receive a sigkill right after saying: > > > > > > > > WARNING: Logging before InitGoogleLogging() is written to STDERR > > > > I0112 23:22:09.889120 4934 main.cpp:243] Build: 2016-08-26 23:06:27 by > > > > centos > > > > I0112 23:22:09.889181 4934 main.cpp:244] Version: 1.0.1 > > > > I0112 23:22:09.889184 4934 main.cpp:247] Git tag: 1.0.1 > > > > I0112 23:22:09.889188 4934 main.cpp:251] Git SHA: > > > > 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > > > > W0112 23:22:09.890808 4934 openssl.cpp:398] Failed SSL connections > > > > will be downgraded to a non-SSL socket > > > > W0112 23:22:09.891237 4934 process.cpp:881] Failed SSL connections > > > > will be downgraded to a non-SSL socket > > > > E0112 23:22:10.129096 4934 shell.hpp:106] Command 'hadoop version > > > > 2>&1' failed; this is the output: > > > > sh: hadoop: command not found > > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: > > > > Client environment:zookeeper.version=zookeeper C client 3.4.8 > > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: > > > > Client environment:host.name=.XXX.ch > > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: > > > > Client 
environment:os.name=Linux > > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: > > > > Client environment:os.arch=3.10.0-229.14.1.el7.x86_64 > > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: > > > > Client environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015 > > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: > > > > Client environment:user.name=(null) > > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: > > > > Client environment:user.home=/root > > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: > > > > Client environment:user.dir=/ > > > > 2017-01-12 > > > > 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: > > > > Initiating client connection, > > > > host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 > > > > sessionTimeout=1 watcher=0x7f950ee20300 sessionId=0 > > > > sessionPasswd= context=0x > > > > 7f94fc60 flags=0 > > > > 2017-
Re: Sigkill while running mesos agent (1.0.1) in docker
Ciao,

the only thing I could find came from running `docker events` in parallel:

```
2017-01-13T01:18:20.766593692+01:00 network connect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host)
2017-01-13T01:18:20.846137793+01:00 container start 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
2017-01-13T01:18:20.847965921+01:00 container resize 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, height=16, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS, width=134)
2017-01-13T01:18:21.610141857+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=15, vendor=CentOS)
2017-01-13T01:18:21.610491564+01:00 container kill 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, signal=9, vendor=CentOS)
2017-01-13T01:18:21.646229213+01:00 container die 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, exitCode=143, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
2017-01-13T01:18:21.652894124+01:00 network disconnect 32441cb5f42b009580e104a8360e544beec7120bb6fff800f16dbee421454267 (container=1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71, name=host, type=host)
2017-01-13T01:18:21.705874041+01:00 container stop 1fddd8e8f956f4545c8b36b088eeca74d157eb1923867d28bf2d919d27babb71 (build-date=20161214, image=alisw/mesos-slave:1.0.1, license=GPLv2, name=mesos-slave, vendor=CentOS)
```

Ciao,
Giulio

On 13 Jan 2017, 01:06 +0100, haosdent , wrote:
> Hi, @Giulio According to your log, it looks normal.
Do you have any logs > related to "SIGKILL"? > > > On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse > > wrote: > > > Hi, > > > I’ve a setup where I run mesos in docker which works perfectly when I use > > > 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0) > > > and it seems to receive a sigkill right after saying: > > > > > > WARNING: Logging before InitGoogleLogging() is written to STDERR > > > I0112 23:22:09.889120 4934 main.cpp:243] Build: 2016-08-26 23:06:27 by > > > centos > > > I0112 23:22:09.889181 4934 main.cpp:244] Version: 1.0.1 > > > I0112 23:22:09.889184 4934 main.cpp:247] Git tag: 1.0.1 > > > I0112 23:22:09.889188 4934 main.cpp:251] Git SHA: > > > 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > > > W0112 23:22:09.890808 4934 openssl.cpp:398] Failed SSL connections will > > > be downgraded to a non-SSL socket > > > W0112 23:22:09.891237 4934 process.cpp:881] Failed SSL connections will > > > be downgraded to a non-SSL socket > > > E0112 23:22:10.129096 4934 shell.hpp:106] Command 'hadoop version 2>&1' > > > failed; this is the output: > > > sh: hadoop: command not found > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client > > > environment:zookeeper.version=zookeeper C client 3.4.8 > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client > > > environment:host.name=.XXX.ch > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client > > > environment:os.name=Linux > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client > > > environment:os.arch=3.10.0-229.14.1.el7.x86_64 > > > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client > > > environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015 > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client > > > environment:user.name=(null) > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client > > > environment:user.home=/root > > > 
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client > > > environment:user.dir=/ > > > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: > > > Initiating client connection, > > > host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 > > > sessionTimeout=1 watcher=0x7f950ee20300 sessionId=0 > > > sessionPasswd= context=0x > > > 7f94fc60 flags=0 > > > 2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: > > > initiated connection to server [XX.YY.ZZ.WW:2181] > > > 2017-01-12 23:22:10,146:4934(0x7f9501fd7700):ZOO_INFO@check_events@1775: > > > session establishment complete on server [XX.YY.ZZ.WW:2181 > > > ], sessionId=0x35828ae70fb2065, negotiated timeout=1 > > > > > > Any idea of what might be going on? Looks like an OOM, but I do not see > > > it in /var/log/messages and it also happens with --oom-kill-disable. > > > -- > > > Ciao, > > > Giulio > > >
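A note on reading the `docker events` output in this message: the `die` event reports `exitCode=143`. By the usual shell convention, an exit status above 128 is 128 plus the number of the signal that terminated the process, so 143 corresponds to signal 15 (SIGTERM), which matches the first `kill` event. A minimal sketch of that arithmetic follows; `signal_from_exit_code` is a hypothetical helper name chosen for this example, not part of Docker or Mesos.

```shell
#!/bin/sh
# Hypothetical helper: map an exit status to the number of the signal
# that terminated the process, following the 128 + N convention.
signal_from_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo $((code - 128))
    else
        # A status of 128 or below is not a signal-style exit status.
        echo 0
    fi
}

signal_from_exit_code 143   # prints 15, the signal number of SIGTERM
```

The same helper gives 9 (SIGKILL) for an exit status of 137, which is the value one would expect to see if the second `kill` event had been the one that ended the process.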
Sigkill while running mesos agent (1.0.1) in docker
Hi,

I’ve a setup where I run mesos in docker which works perfectly when I use 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0) and it seems to receive a sigkill right after saying:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0112 23:22:09.889120 4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos
I0112 23:22:09.889181 4934 main.cpp:244] Version: 1.0.1
I0112 23:22:09.889184 4934 main.cpp:247] Git tag: 1.0.1
I0112 23:22:09.889188 4934 main.cpp:251] Git SHA: 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3
W0112 23:22:09.890808 4934 openssl.cpp:398] Failed SSL connections will be downgraded to a non-SSL socket
W0112 23:22:09.891237 4934 process.cpp:881] Failed SSL connections will be downgraded to a non-SSL socket
E0112 23:22:10.129096 4934 shell.hpp:106] Command 'hadoop version 2>&1' failed; this is the output:
sh: hadoop: command not found
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client environment:host.name=.XXX.ch
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client environment:os.arch=3.10.0-229.14.1.el7.x86_64
2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client environment:user.home=/root
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client environment:user.dir=/
2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=1 watcher=0x7f950ee20300 sessionId=0 sessionPasswd= context=0x 7f94fc60 flags=0
2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: initiated connection to server [XX.YY.ZZ.WW:2181]
2017-01-12 23:22:10,146:4934(0x7f9501fd7700):ZOO_INFO@check_events@1775: session establishment complete on server [XX.YY.ZZ.WW:2181], sessionId=0x35828ae70fb2065, negotiated timeout=1

Any idea of what might be going on? Looks like an OOM, but I do not see it in /var/log/messages and it also happens with --oom-kill-disable.

--
Ciao,
Giulio
Re: Sigkill while running mesos agent (1.0.1) in docker
Hi @Giulio, according to your log, everything looks normal. Do you have any logs related to "SIGKILL"?

On Fri, Jan 13, 2017 at 8:00 AM, Giulio Eulisse wrote: > Hi, > > I’ve a setup where I run mesos in docker which works perfectly when I use > 0.28.2. I now migrated to 1.0.1 (but it’s the same with 1.1.0 and 1.0.0) > and it seems to receive a sigkill right after saying: > > WARNING: Logging before InitGoogleLogging() is written to STDERR > I0112 23:22:09.889120 4934 main.cpp:243] Build: 2016-08-26 23:06:27 by centos > I0112 23:22:09.889181 4934 main.cpp:244] Version: 1.0.1 > I0112 23:22:09.889184 4934 main.cpp:247] Git tag: 1.0.1 > I0112 23:22:09.889188 4934 main.cpp:251] Git SHA: > 3611eb0b7eea8d144e9b2e840e0ba16f2f659ee3 > W0112 23:22:09.890808 4934 openssl.cpp:398] Failed SSL connections will be > downgraded to a non-SSL socket > W0112 23:22:09.891237 4934 process.cpp:881] Failed SSL connections will be > downgraded to a non-SSL socket > E0112 23:22:10.129096 4934 shell.hpp:106] Command 'hadoop version 2>&1' > failed; this is the output: > sh: hadoop: command not found > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@726: Client > environment:zookeeper.version=zookeeper C client 3.4.8 > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@730: Client > environment:host.name=.XXX.ch > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@737: Client > environment:os.name=Linux > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@738: Client > environment:os.arch=3.10.0-229.14.1.el7.x86_64 > 2017-01-12 23:22:10,130:4934(0x7f950503b700):ZOO_INFO@log_env@739: Client > environment:os.version=#1 SMP Tue Sep 15 15:05:51 UTC 2015 > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@747: Client > environment:user.name=(null) > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@755: Client > environment:user.home=/root > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@log_env@767: Client >
environment:user.dir=/ > 2017-01-12 23:22:10,131:4934(0x7f950503b700):ZOO_INFO@zookeeper_init@800: > Initiating client connection, > host=XXX1.YYY.ch:2181,XXX2.YYY.ch:2181,XXX3.YYY.ch:2181 sessionTimeout=1 > watcher=0x7f950ee20300 sessionId=0 sessionPasswd= context=0x > 7f94fc60 flags=0 > 2017-01-12 23:22:10,134:4934(0x7f9501fd7700):ZOO_INFO@check_events@1728: > initiated connection to server [XX.YY.ZZ.WW:2181] > 2017-01-12 23:22:10,146:4934(0x7f9501fd7700):ZOO_INFO@check_events@1775: > session establishment complete on server [XX.YY.ZZ.WW:2181 > ], sessionId=0x35828ae70fb2065, negotiated timeout=1 > > Any idea of what might be going on? Looks like an OOM, but I do not see it > in /var/log/messages and it also happens with --oom-kill-disable. > > -- > Ciao, > Giulio > -- Best Regards, Haosdent Huang