[ https://issues.apache.org/jira/browse/YARN-5818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15689892#comment-15689892 ]
Shane Kumpf commented on YARN-5818:
-----------------------------------

{{docker wait}} will have to be removed to support Docker live restore. Retrying the {{docker wait}} is brittle, as it requires parsing stderr and looking for a specific string, which could change without notice.

I propose we replace the {{docker wait}} approach with the following to support live restore:
# {{docker run}} to start the container.
# {{docker inspect}} to get the pid.
# Null signal ({{kill -0 pid}}) liveness loop waiting for the container to complete.
# {{docker inspect}} the finished container for the exit code.
# Write the exitcode file to be picked up by the NM.

The null signal loop has pitfalls, but this is the pattern we rely upon elsewhere when wait/waitpid aren't possible (container re-acquisition on NM restart, for example).

I'll put up a patch that does the above as a starting point. Please provide your thoughts on the approach.

> Support the Docker Live Restore feature
> ---------------------------------------
>
>                 Key: YARN-5818
>                 URL: https://issues.apache.org/jira/browse/YARN-5818
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Shane Kumpf
>
> Docker 1.12.x introduced the docker [Live Restore|https://docs.docker.com/engine/admin/live-restore/] feature, which allows docker containers to survive docker daemon restarts/upgrades. Support for this feature should be added to YARN so that docker changes and upgrades are less impactful to existing containers.
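
For discussion, the five steps proposed in the comment could be sketched roughly as follows. This is only an illustration in Python, not the patch itself and not how the container executor is written. The function names ({{launch_and_wait}}, {{pid_is_alive}}, {{run_docker}}), the one-second poll interval, the temp-file-then-rename write, and the plain-integer exit code file format are all assumptions made for the example. The {{docker inspect --format}} templates {{.State.Pid}} and {{.State.ExitCode}} are standard Docker fields.

```python
import os
import subprocess
import time


def run_docker(args):
    """Run a docker CLI command and return its trimmed stdout."""
    return subprocess.check_output(["docker"] + args, text=True).strip()


def pid_is_alive(pid):
    """Null-signal probe: signal 0 checks existence without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True   # EPERM: process exists but is owned by another user
    return True


def launch_and_wait(name, image, command, exit_code_file,
                    docker=run_docker, alive=pid_is_alive, poll_seconds=1.0):
    # 1. docker run to start the container (detached, so we do not block on it).
    docker(["run", "-d", "--name", name, image] + command)

    # 2. docker inspect to get the pid (0 means the container already exited).
    pid = int(docker(["inspect", "--format", "{{.State.Pid}}", name]))

    # 3. Null-signal liveness loop. One known pitfall: if the pid is reused
    #    by an unrelated process, this loop keeps waiting.
    while pid > 0 and alive(pid):
        time.sleep(poll_seconds)

    # 4. docker inspect the finished container for the exit code.
    exit_code = int(docker(["inspect", "--format", "{{.State.ExitCode}}", name]))

    # 5. Write the exit code file. The format here (a bare integer) is
    #    hypothetical; rename makes the file appear atomically to a reader.
    tmp = exit_code_file + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(exit_code))
    os.replace(tmp, exit_code_file)
    return exit_code
```

The {{docker}} and {{alive}} parameters exist only so the flow can be exercised without a Docker daemon. Because nothing in the loop talks to the daemon, a daemon restart during step 3 should not disturb the wait, which is the point of moving away from {{docker wait}}.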