[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469061#comment-16469061 ]
Eric Yang commented on YARN-7654:
---------------------------------

[~jlowe] [~Jim_Brennan] I misread the last message in the discussion forum. The logs feature can redirect the stdout and stderr streams correctly. However, I am not thrilled about calling an extra docker logs command to fetch logs and having to maintain the liveness of that command. In my view, this is more fragile: the docker logs command can receive an external signal that prevents the whole log from being sent to YARN, and subsequent tailing will report duplicated information. If we attach to the real stdout and stderr of the running program, we avoid the headache of additional process management and get no duplicate information.

I don't believe a blocking call is the correct way to determine the liveness of a docker container. The blocking call to wait for docker detach has several problems:

1. Docker run could get stuck pulling docker images when a large number of containers start at the same time and the image is not cached locally. This happens a lot with repositories that are hosted on Docker Hub.
2. The docker run cli can also get stuck when the docker daemon hangs, and no exit code is returned.
3. Some docker images are not built to run in detached mode. Some developers might have built their system to require foreground mode. These images will terminate in detached mode.

When the "docker run -d" and "docker logs" combination is employed, some progress is not logged, e.g. the download progress and docker daemon error messages. The current patch logs any errors coming from the docker run cli to provide more information for users who are troubleshooting problems.

Regarding the racy problem, this is something that can be tuned by the system administrator. On a cluster that downloads all images from the internet over a slow link, it is perfectly reasonable to set the retry and timeout values to 30 minutes to wait for the download to complete.
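To make the retry and timeout idea concrete, here is a minimal sketch of a bounded polling loop, as opposed to a single blocking call. It is not the code in the patch: the function names, the parameters, and the use of `docker inspect` as the liveness probe are all assumptions chosen for illustration.

```python
import subprocess
import time
from typing import Callable


def docker_is_running(container_name: str) -> bool:
    """Illustrative probe (an assumption, not the patch's mechanism):
    ask the docker CLI whether the named container is running."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", container_name],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


def wait_until_running(
    probe: Callable[[], bool],
    retries: int,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll probe() at most `retries` times, sleeping between attempts.

    Returns True as soon as the probe succeeds and False once the retry
    budget is exhausted, so the caller never blocks indefinitely.
    """
    for attempt in range(retries):
        if probe():
            return True
        # Skip the sleep after the final attempt; there is nothing left to wait for.
        if attempt < retries - 1:
            sleep(interval_seconds)
    return False
```

Under these assumptions, an administrator on a slow link might pass something like `retries=360, interval_seconds=5` (roughly 30 minutes), while a latency-sensitive deployment might pass `retries=5, interval_seconds=1` (roughly 5 seconds). The probe is passed in as a callable so the loop can be exercised without a docker daemon.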
In a highly automated system, such as a cloud vendor trying to spin up images in a fraction of a second for a large number of users, the timeout value might be set as short as 5 seconds. If an image comes up in 6 seconds and misses the SLA, another container takes its place within the next 5 seconds to provide a smooth user experience. The 6-second container is recycled and rebuilt. At large scale, a race condition is easier to deal with than a blocking call that prevents the entire automated system from working. I can update the code to make retry a configurable setting in the short term.

I am not discounting the possibility of supporting docker run -d and docker logs, but this requires more development experiments to ensure all the mechanics are covered well. The current approach has been in use in my environment for the past 6 months, and it works well. For the 3.1.1 release, it would be safer to use the current approach, which gives us better coverage of the types of containers that can be supported.

Thoughts?

> Support ENTRY_POINT for docker container
> ----------------------------------------
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Affects Versions: 3.1.0
> Reporter: Eric Yang
> Assignee: Eric Yang
> Priority: Blocker
> Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch,
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch,
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch,
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch,
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch,
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch,
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch,
> YARN-7654.021.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in
> the current implementation.
> It would be nice if we could detect the existence of
> {{launch_command}} and, based on this variable, launch the docker container in
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}
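The two launch paths in the issue description can be sketched as a small function that only builds the command lines. This is an illustrative sketch, not the implementation in the attached patches: the function name and its parameters are hypothetical, and a real launch would need additional docker options that the description does not list.

```python
from typing import List, Optional


def build_docker_commands(
    image: str,
    version: str,
    container_id: str,
    launch_command: Optional[List[str]] = None,
) -> List[List[str]]:
    """Return the docker command lines for the two cases in the description.

    With a launch command: `docker run` followed by `docker exec`.
    Without one: `docker run` alone, leaving the image's entry point in charge.
    """
    run_cmd = ["docker", "run", f"{image}:{version}"]
    if launch_command:
        # Launch command exists: start the container, then exec the command in it.
        exec_cmd = ["docker", "exec", container_id] + list(launch_command)
        return [run_cmd, exec_cmd]
    # No launch command: rely on the entry point predefined in the image.
    return [run_cmd]
```

Returning argument lists rather than shell strings keeps the sketch free of quoting concerns and makes the branch on `launch_command` easy to check in isolation.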