[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243232#comment-15243232 ]

Tyson Norris commented on MESOS-4279:

Thanks for the updates. One note I wanted to add: we see exactly what [~bydga] describes above in the "there are actually 2 bugs" comment:
- the task stdout is truncated (compared to the docker container json.log)
- the task status is KILLED (instead of FINISHED)

For example, regarding "You are calling the run->discard method (which causes to close the stderr/stdout streams) too early - during the 'stopping period' the container can (and usually will) write something about the termination": if I check the docker container log file on disk, it has a series of lines that are emitted during shutdown, so I can see that "docker stop" is called and the container does actually perform a graceful shutdown. HOWEVER, the task stdout does not receive any of these lines after docker stop is called.

> Graceful restart of docker task
> ---
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0, 0.26.0, 0.27.2
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>              Labels: docker, mesosphere
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came to the following issue:
> (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere got to a point that it's probably a docker containerizer problem...)
> To sum it up: when I deploy a simple python script to all mesos-slaves:
> {code}
> #!/usr/bin/python
> from time import sleep
> import signal
> import sys
> import datetime
>
> def sigterm_handler(_signo, _stack_frame):
>     print "got %i" % _signo
>     print datetime.datetime.now().time()
>     sys.stdout.flush()
>     sleep(2)
>     print datetime.datetime.now().time()
>     print "ending"
>     sys.stdout.flush()
>     sys.exit(0)
>
> signal.signal(signal.SIGTERM, sigterm_handler)
> signal.signal(signal.SIGINT, sigterm_handler)
>
> try:
>     print "Hello"
>     i = 0
>     while True:
>         i += 1
>         print datetime.datetime.now().time()
>         print "Iteration #%i" % i
>         sys.stdout.flush()
>         sleep(1)
> finally:
>     print "Goodbye"
> {code}
> and I run it through Marathon like:
> {code:javascript}
> data = {
>     args: ["/tmp/script.py"],
>     instances: 1,
>     cpus: 0.1,
>     mem: 256,
>     id: "marathon-test-api"
> }
> {code}
> During the app restart I get the expected result - the task receives SIGTERM and dies peacefully (during my script-specified 2-second period).
> But when I wrap this python script in a docker image:
> {code}
> FROM node:4.2
> RUN mkdir /app
> ADD . /app
> WORKDIR /app
> ENTRYPOINT []
> {code}
> and run the appropriate application by Marathon:
> {code:javascript}
> data = {
>     args: ["./script.py"],
>     container: {
>         type: "DOCKER",
>         docker: {
>             image: "bydga/marathon-test-api"
>         },
>         forcePullImage: yes
>     },
>     cpus: 0.1,
>     mem: 256,
>     instances: 1,
>     id: "marathon-test-api"
> }
> {code}
> The task during restart (issued from marathon) dies immediately without having a chance to do any cleanup.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
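For reference, the reporter's shutdown pattern translates to Python 3 roughly as follows. This is a minimal standard-library sketch, not the exact task from the ticket; the `shutdown` flag and the self-delivered signal are illustrative additions so the handler can be exercised without a container.

```python
import os
import signal
import sys

shutdown = {"requested": False}

def sigterm_handler(signo, _frame):
    # Real cleanup (flushing logs, closing sockets) would run here,
    # inside the docker_stop_timeout window, before exiting.
    shutdown["requested"] = True
    print("got %d" % signo)
    sys.stdout.flush()

signal.signal(signal.SIGTERM, sigterm_handler)

# Deliver SIGTERM to this process to exercise the handler, standing in
# for the SIGTERM that "docker stop" sends to the container's PID 1.
os.kill(os.getpid(), signal.SIGTERM)
```

The key point for this ticket is that whatever the handler prints during the stop window is exactly the output reported above as present in the container's json.log but missing from the task's stdout.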
[jira] [Commented] (MESOS-4279) Graceful restart of docker task
[ https://issues.apache.org/jira/browse/MESOS-4279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15241564#comment-15241564 ]

Tyson Norris commented on MESOS-4279:

We are seeing this as well:
* on the mesos-slave we use --docker_stop_timeout=50secs
* outside of mesos, "docker stop" produces some logged output from the container, based on the container process handling the SIGTERM signal
* inside of mesos, when the task is stopped by marathon, no output is generated

Is there any issue reproducing this?

> Graceful restart of docker task
> ---
>
>                 Key: MESOS-4279
>                 URL: https://issues.apache.org/jira/browse/MESOS-4279
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.25.0
>            Reporter: Martin Bydzovsky
>            Assignee: Qian Zhang
>
> I'm implementing graceful restarts of our mesos-marathon-docker setup and I came to the following issue:
> (it was already discussed on https://github.com/mesosphere/marathon/issues/2876 and the guys from mesosphere got to a point that it's probably a docker containerizer problem...)
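The outside-of-mesos behavior described above can be checked with a short recipe. This is a sketch, not from the ticket: the container name is arbitrary, and it assumes the `bydga/marathon-test-api` image from the issue description is available locally.

```shell
# Start the container directly, bypassing Mesos/Marathon.
docker run -d --name sigterm-test bydga/marathon-test-api ./script.py

# Stop it with a 50-second grace period, mirroring the
# --docker_stop_timeout=50secs setting on the mesos-slave.
# docker sends SIGTERM first, then SIGKILL after the timeout.
docker stop -t 50 sigterm-test

# The graceful-shutdown lines printed by the SIGTERM handler should
# appear here - this is the output that never reaches the task stdout
# when the same stop is issued through Mesos.
docker logs sigterm-test
```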
[jira] [Commented] (MESOS-2587) libprocess should allow configuration of ip/port separate from the ones it binds to
[ https://issues.apache.org/jira/browse/MESOS-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595313#comment-14595313 ]

Tyson Norris commented on MESOS-2587:

We see a similar problem with slaves, where the slave MESOS_HOSTNAME is either reachable by the master (using the private IP) or the browser UI works (using the public IP) - but we cannot make both work properly at the same time. Generally, anywhere that HOSTNAME or IP is configurable, this should ideally include both a public and a private value, for networks that expose different values depending on the actual client.

> libprocess should allow configuration of ip/port separate from the ones it binds to
> ---
>
>                 Key: MESOS-2587
>                 URL: https://issues.apache.org/jira/browse/MESOS-2587
>             Project: Mesos
>          Issue Type: Bug
>          Components: libprocess
>            Reporter: Cosmin Lehene
>
> Currently libprocess will advertise {{LIBPROCESS_IP}}:{{LIBPROCESS_PORT}}, but if a framework runs in a container without an interface that has a publicly accessible IP (e.g. a container in bridge mode) it will advertise an IP that will not be reachable by the master.
> With this, we could advertise the external IP (reachable from the master) of the bridge from within a container.
> This should allow frameworks running in containers to work in the safer bridged mode.
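The bind-vs-advertise split this ticket asks for can be sketched as a configuration fragment. The `LIBPROCESS_ADVERTISE_*` variables below are the mechanism later Mesos releases added for this purpose - check your release's documentation for availability; all addresses here are placeholders.

```shell
# Bind on the address that actually exists inside the container / on
# the private interface (placeholder values):
export LIBPROCESS_IP=172.17.0.2
export LIBPROCESS_PORT=9000

# Advertise the externally reachable address (e.g. the bridge host's
# public IP and a mapped port) so the master can call back:
export LIBPROCESS_ADVERTISE_IP=203.0.113.10
export LIBPROCESS_ADVERTISE_PORT=31900
```

With such a split, the framework binds privately but registers with an address that is reachable from the master - which would also resolve the private-IP-vs-public-UI tension described in the comment above.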