[jira] [Resolved] (YARN-7218) ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2
[ https://issues.apache.org/jira/browse/YARN-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang resolved YARN-7218.
Resolution: Won't Fix

It looks like v1 of the YARN REST API is still evolving. The namespace used by services is independent of other paths, so the incompatibility concern is a non-issue at this time. We can close this as Won't Fix.

> ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2
>
> Key: YARN-7218
> URL: https://issues.apache.org/jira/browse/YARN-7218
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: api, applications
> Reporter: Eric Yang
> Assignee: Eric Yang
>
> In YARN-6626, there is a desire to run the ApiServer REST API inside the Resource Manager, which would eliminate the need to deploy a separate daemon for submitting Docker applications. In YARN-5698, a new UI was implemented as a separate web application. This arrangement can cause conflicts in how Java sessions are managed. The root context of the Resource Manager web application is /ws, hardcoded in the startWebapp method of ResourceManager.java, so all session management applies to web URLs under the /ws prefix. /ui2 is independent of the /ws context, therefore the session management code does not apply to /ui2. This could become a session management problem if servlet-based code is introduced into the /ui2 web application.
> The ApiServer code base is designed as a separate web application. There is no easy way to inject a separate web application into the same /ws context, because the ResourceManager is already set up to bind /ws to RMWebServices. Unless the ApiServer code is moved into RMWebServices, the two will not share the same session management.
> The alternate solution is to keep the ApiServer prefix URL independent of the /ws context. However, this is a departure from the YARN web services naming convention.
> It can be loaded as a separate web application in the Resource Manager's Jetty server. One possible proposal is /app/v1/services, which keeps the ApiServer code modular and independent of the Resource Manager.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
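The point of the /app/v1/services proposal above is that the service API namespace stays disjoint from the existing /ws/v1 tree, so the two web applications never shadow each other's paths. A minimal illustrative sketch (not YARN code; the prefixes are the ones discussed in this issue):

```python
# Illustrative sketch only: shows why a dedicated prefix such as
# /app/v1/services cannot collide with the existing /ws/v1 namespace.

WS_PREFIX = "/ws/v1"                   # existing RM web services context
SERVICES_PREFIX = "/app/v1/services"   # proposed ApiServer context

def route(path):
    """Return which web application would own a request path."""
    if path == SERVICES_PREFIX or path.startswith(SERVICES_PREFIX + "/"):
        return "apiserver"
    if path == WS_PREFIX or path.startswith(WS_PREFIX + "/"):
        return "rm-webservices"
    return "unmatched"
```

Because the prefixes share no common ancestor under the servlet context, session management configured for /ws never silently applies to the service API, and vice versa.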
[jira] [Commented] (YARN-6187) Auto-generate REST API resources and server side stubs from swagger definition
[ https://issues.apache.org/jira/browse/YARN-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227785#comment-16227785 ] Eric Yang commented on YARN-6187:

[~gsaha] Swagger is good for generating initial classes to get development going, but every change to the Swagger definition regenerates the code as empty classes. I don't see a way to continuously update the Swagger YAML file while keeping the generated code in line with hand-written logic. Do we still need this?

> Auto-generate REST API resources and server side stubs from swagger definition
>
> Key: YARN-6187
> URL: https://issues.apache.org/jira/browse/YARN-6187
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Gour Saha
> Fix For: yarn-native-services
>
> Currently the REST API resource package is generated offline using the swagger-codegen library, formatted with the basic Eclipse formatter, and then checked in. It is not entirely in line with YARN documentation and coding guidelines. To streamline this effort we need to:
> # Auto-generate the resource package and the server-side API interfaces/stubs using swagger-codegen libraries
> # Use a template framework like jmustache or similar (or better) to align/add documentation and code formatting in line with YARN project standards
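The clobbering problem raised in the comment above is commonly handled by confining regeneration to a dedicated output area and never letting the generator overwrite files that carry hand-written logic. A minimal sketch of that guard, with a hypothetical `generated/` directory convention (not the YARN build's actual layout):

```python
# Illustrative sketch only: decide whether codegen may (re)write a file.
# Files under the generated root are always regenerable; files elsewhere
# (hand-written implementations) must never be overwritten once they exist.

def should_write(path, already_exists, generated_root="generated/"):
    """Return True if the code generator is allowed to write this file."""
    if path.startswith(generated_root):
        return True          # generated stubs: safe to regenerate
    return not already_exists  # hand-written area: only create, never clobber
```

Under this scheme, updating the Swagger YAML refreshes only the stubs, while human-added logic in the implementation area survives every regeneration.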
[jira] [Commented] (YARN-6387) Provide a flag in Rest API GET response to notify if the app launch delay is due to docker image download.
[ https://issues.apache.org/jira/browse/YARN-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227789#comment-16227789 ] Eric Yang commented on YARN-6387:

Do we still need this? There is no description. The current REST API only returns the final result. If the goal is to report progress while the REST API creates containers, then we probably need an extension to the REST API: each operation (create, start, stop, flex) would be referenced by an operation ID, and the front end could invoke the REST API with that operation ID to inspect the operation's current progress. Without an operation-centric API, it is not possible to determine whether a container is still downloading its image or has started and is running.

> Provide a flag in Rest API GET response to notify if the app launch delay is due to docker image download.
>
> Key: YARN-6387
> URL: https://issues.apache.org/jira/browse/YARN-6387
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: sriharsha devineni
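The operation-centric extension suggested above could look roughly like the following in-memory sketch. This is purely hypothetical (no such tracker exists in YARN today); the class and state names are invented for illustration:

```python
# Hypothetical sketch of an operation-centric progress API: each
# long-running action (create/start/stop/flex) gets an operation ID that
# the front end can poll, e.g. to tell "downloading docker image" apart
# from "container running".

import itertools

class OperationTracker:
    def __init__(self):
        self._ids = itertools.count(1)   # monotonically increasing op IDs
        self._ops = {}

    def begin(self, action):
        """Register a new operation and return its ID."""
        op_id = next(self._ids)
        self._ops[op_id] = {"action": action, "state": "ACCEPTED"}
        return op_id

    def update(self, op_id, state):
        """Advance an operation's progress state."""
        self._ops[op_id]["state"] = state

    def progress(self, op_id):
        """What a GET on the operation ID would return."""
        return self._ops[op_id]
```

A REST endpoint keyed by operation ID could then answer the "why is my app launch delayed" question directly, rather than exposing only the final result.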
[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers
[ https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452 ] Eric Yang edited comment on YARN-7197 at 10/30/17 6:18 PM:

[~jlowe] {quote}Either /run isn't in the whitelist in the first place rendering the blacklist entry moot or /run is in the whitelist and the user can simply mount {{/run}} and access the blacklist path.{quote}

Let's expand on a real-world example. An attacker tries to take control of {{/run/docker.socket}} to acquire root privileges, spawning root containers that access vital system areas and make him root on the host. Without blacklist capability, the system admin has placed {{/var}} in the read-write whitelist so containers can write to database and log directories. The attacker explicitly requests that {{/var/run/docker.socket}} be mounted, and generates a Docker image whose {{/etc/group}} is modified to include his own account, or which carries a setuid binary. With that, he can gain control of the host-level Docker daemon without much effort. {{/run}} hosts a growing list of software that puts pid files or sockets there; the admin cannot forbid other software (e.g. HDFS short-circuit read) from placing sockets in {{/run}} and sharing them between containers, since company requirements demand it, yet he still does not want the attacker to gain root access.

h3. Solution 1: The admin carefully places {{/var/*}}, {{/run/\*}} (except /run/docker.socket), and {{/mnt/hdfs/user/\*}} (except yarn) in the read-write whitelist. No symlink is exposed, and the attacker cannot get in.

h3. Solution 2 (all symlinks and hardcoded locations are banned; current proposed patch): The admin specifies white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), {{/mnt/hdfs/user/\*}} (except yarn); black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}. An attempt to mount a symlinked location is denied at container startup, and an explicitly hardcoded blacklisted location is likewise banned.

h3. Solution 3 (replace blacklisted locations with empty directories; Jason's proposed implementation): The admin specifies white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}; black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}. An attempt to mount a symlinked location is denied at container startup, and mounting /run/docker.socket directly yields only an empty file.

All solutions require the system administrator to ensure that only vetted images can be uploaded to the private registry, to prevent trojan horses in Docker images. I can see the appeal of the new proposal: it avoids the high upkeep of white-list-read-write directories. The third solution can throw people off if they do not know that blacklisted paths are redirected to an empty location, but deeply nested directories might defeat the second solution. If the community favors the third solution, I can revise the patch accordingly.
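Solution 3's substitution step can be sketched as follows. This is purely illustrative, not the actual patch or container-executor code; the placeholder root is hypothetical:

```python
# Illustrative sketch of solution 3: keep broad whitelist entries and,
# instead of rejecting a blacklisted mount outright, rewrite its bind-mount
# source to an empty placeholder so the container sees nothing useful.

EMPTY_ROOT = "/var/lib/yarn/empty"  # hypothetical placeholder location

def rewrite_mount(source, blacklist, empty_root=EMPTY_ROOT):
    """Return the bind-mount source the container will actually receive."""
    for banned in blacklist:
        if source == banned or source.startswith(banned.rstrip("/") + "/"):
            # Blacklisted: redirect to an empty stand-in under empty_root.
            return empty_root + source
    return source  # not blacklisted: mount the real path
```

This is what "black-list is hijacked to an empty location" means in practice: the mount succeeds, but /run/docker.socket inside the container is an empty file rather than the host Docker socket.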
[jira] [Commented] (YARN-7197) Add support for a volume blacklist for docker containers
[ https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452 ] Eric Yang commented on YARN-7197:

[~jlowe] {quote}Either /run isn't in the whitelist in the first place rendering the blacklist entry moot or /run is in the whitelist and the user can simply mount /run and access the blacklist path.{quote}

Let's expand on a real-world example. An attacker tries to take control of {{/run/docker.socket}} to acquire root privileges, spawning root containers that access vital system areas and make him root on the host. Without blacklist capability, the system admin has placed {{/var}} in the read-write whitelist so containers can write to database and log directories. The attacker explicitly requests that {{/var/run/docker.socket}} be mounted, and generates a Docker image whose /etc/group is modified to include his own account, or which carries a setuid binary. With that, he can gain control of the host-level Docker daemon without much effort. {{/run}} hosts a growing list of software that puts pid files or sockets there; the admin cannot forbid other software from placing sockets in {{/run}} and sharing them between containers, since company requirements demand it, yet he still does not want the attacker to gain root access.

Solution 1: The admin carefully places {{/var/*}} and {{/run/*}} (except /run/docker.socket) in the read-write whitelist. No symlink is exposed, and the attacker cannot get in.

Solution 2 (all symlinks are banned, plus explicit hardcoded locations; current proposed patch): The admin specifies white-list-read-write: {{/var}}, {{/run/*}} (except /run/docker.socket), {{/mnt/hdfs/user/*}} (except yarn); black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}. An attempt to mount a symlinked location is denied at container startup, and an explicitly hardcoded blacklisted location is likewise banned.

Solution 3 (ban symlinks and replace blacklisted locations with empty directories; Jason's proposed implementation): The admin specifies white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}; black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}. An attempt to mount a symlinked location is denied at container startup, and mounting /run/docker.socket directly yields only an empty file.

All solutions require the system administrator to ensure that only vetted images can be uploaded to the private registry, to prevent trojan horses in Docker images. I can see the appeal of the new proposal: it avoids the high upkeep of white-list-read-write directories. The third solution can throw people off if they do not know that blacklisted paths are redirected to an empty location. However, the deeper the directory nesting, the harder it is to secure with the second solution. If the community favors the third solution, I can revise the patch accordingly.

> Add support for a volume blacklist for docker containers
>
> Key: YARN-7197
> URL: https://issues.apache.org/jira/browse/YARN-7197
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: yarn
> Reporter: Shane Kumpf
> Assignee: Eric Yang
> Attachments: YARN-7197.001.patch, YARN-7197.002.patch
>
> Docker supports bind-mounting host directories into containers. Work is underway to allow admins to configure a whitelist of volume mounts. While this is a much needed and useful feature, it opens the door for misconfiguration that may lead to users being able to compromise or crash the system. One example would be allowing users to mount /run from a host running systemd, and then running systemd in that container, rendering the host mostly unusable. This issue is to add support for a default blacklist. The default blacklist would be where we put files and directories that, if mounted into a container, are likely to have negative consequences. Users are encouraged not to remove items from the default blacklist, but may do so if necessary.
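The checks the three solutions above have in common can be sketched as follows. This is purely illustrative (not container-executor code): a requested mount must fall under a whitelisted prefix, must not touch a blacklisted path, and symlinks must not smuggle a blacklisted target past the check. Symlink resolution is passed in as a callable here so the logic stays self-contained; a real implementation would resolve against the host filesystem (realpath).

```python
# Illustrative sketch of whitelist/blacklist mount validation with
# symlink resolution, as debated in this issue.

import posixpath

def _under(path, prefix):
    """True if path equals prefix or lies inside it."""
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")

def mount_allowed(requested, whitelist, blacklist, resolve=lambda p: p):
    """Validate a requested bind-mount source.

    `resolve` stands in for realpath-style symlink resolution; checking
    the resolved path is what stops e.g. /var/run -> /run tricks.
    """
    real = posixpath.normpath(resolve(requested))
    if any(_under(real, banned) for banned in blacklist):
        return False   # blacklist wins, even inside a whitelisted tree
    return any(_under(real, allowed) for allowed in whitelist)
```

Checking the blacklist against the resolved path, before the whitelist, is what closes the {{/var/run/docker.socket}} symlink hole described in the comment above.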
[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers
[ https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452 ] Eric Yang edited comment on YARN-7197 at 10/30/17 6:19 PM: --- [~jlowe] {quote}Either /run isn't in the whitelist in the first place rendering the blacklist entry moot or /run is in the whitelist and the user can simply mount {{/run}} and access the blacklist path.{quote} Let's expand on the real world example. A hacker tries to take control of {{/run/docker.socket}} to acquire root privileges and spawn root containers to access vital system area to become root on the host system. The system admin placed {{/var}} in read-write white list for ability to write to database and log directories, without black list capability. Hacker explicitly specify {{/var/run/docker.socket}} to be included, put the socket in {{/tmp/docker.socket}}. Hacker generates a docker image with {{/etc/group}} modified to include his own name or setuid bit binary in container. Hack can successfully gain control to host level docker without much effort. {{/run}} contains a growing list of software that put their pid file or socket in this location. System admin can't say no to not allow other software (i.e. hdfs short circuit read) to place their socket in {{/run}} location and share between containers due to company requirement. However, he still doesn't want to let hacker gain root access. h3. Solution 1: System admin placed {{/var/*}}, {{/run/\*}} (except /run/docker.socket), and {{/mnt/hdfs/user/\*}} (except yarn), carefully in read-write white list. None of the symlink is exposed. Hacker can not get in. h3. 
Solution 2 (All symlinks, and hardcoded locations are banned): (Current proposed patch) System admin specifies: white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), {{/mnt/hdfs/user/\*}} (exception yarn) black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}} Hacker attempt to mount a symlink location resulting in access denied from container startup, or explicit hard coded location also result in ban. h3. Solution 3: (Replace black list location with empty directories): (Jason proposed implementation) System admin specifies: white-list-read-write: {{/var}},{{/run}},{{/mnt/hdfs/user}} black-list: {{/run/docker.socket}},{{/mnt/hdfs/user/yarn}} Hacker attempt to mount a symlink location resulting in access denied from container startup, or mount /run/docker.socket manually, but result in empty file. All solutions requires system administrator to enforce ability to upload secure image to private registry to prevent torjan horse in docker image. I can see the appeal that without having to do a high upkeep of white-list-read-write directories by the new proposal. The third solution can throw people off, if they do not know about black-list is hijacked to empty location. However, the depth of directories might defeat second solution. If community favors the third solution, I can revise patch accordingly. was (Author: eyang): [~jlowe] {quote}Either /run isn't in the whitelist in the first place rendering the blacklist entry moot or /run is in the whitelist and the user can simply mount {{/run}} and access the blacklist path.{quote} Let's expand on the real world example. A hacker tries to take control of {{/run/docker.socket}} to acquire root privileges and spawn root containers to access vital system area to become root on the host system. The system admin placed {{/var}} in read-write white list for ability to write to database and log directories, without black list capability. 
Hacker explicitly specify {{/var/run/docker.socket}} to be included, put the socket in {{/tmp/docker.socket}}. Hacker generates a docker image with {{/etc/group}} modified to include his own name or setuid bit binary in container. Hack can successfully gain control to host level docker without much effort. {{/run}} contains a growing list of software that put their pid file or socket in this location. System admin can't say no to not allow other software (i.e. hdfs short circuit read) to place their socket in {{/run}} location and share between containers due to company requirement. However, he still doesn't want to let hacker gain root access. h3. Solution 1: System admin placed {{/var/*}}, {{/run/\*}} (except /run/docker.socket), and {{/mnt/hdfs/user/*}} (except yarn), carefully in read-write white list. None of the symlink is exposed. Hacker can not get in. h3. Solution 2 (All symlinks, and hardcoded locations are banned): (Current proposed patch) System admin specifies: white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), {{/mnt/hdfs/user/\*}} (exception yarn) black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}} Hacker attempt to mount a symlink location resulting in access
[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers
[ https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452 ] Eric Yang edited comment on YARN-7197 at 10/30/17 6:09 PM: --- [~jlowe] {quote}Either /run isn't in the whitelist in the first place rendering the blacklist entry moot or /run is in the whitelist and the user can simply mount {{/run}} and access the blacklist path.{quote} Let's expand on the real world example. A hacker tries to take control of {{/run/docker.socket}} to acquire root privileges and spawn root containers to access vital system area to become root on the host system. The system admin placed {{/var}} in read-write white list for ability to write to database and log directories, without black list capability. Hacker explicitly specify {{/var/run/docker.socket}} to be included, put the socket in {{/tmp/docker.socket}}. Hacker generates a docker image with {{/etc/group}} modified to include his own name or setuid bit binary in container. Hack can successfully gain control to host level docker without much effort. {{/run}} contains a growing list of software that put their pid file or socket in this location. System admin can't say no to not allow other software to place their socket in {{/run}} location and share between containers due to company requirement. However, he still doesn't want to let hacker gain root access. Solution 1: System admin placed {{/var/*}} and {{/run/\*}} (except /run/docker.socket), carefully in read-write white list. None of the symlink is exposed. Hacker can not get in. 
Solution 2 (all symlinks are banned, explicit hardcoded locations; the current proposed patch): The system admin specifies white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), {{/mnt/hdfs/user/\*}} (except yarn); black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}. A hacker's attempt to mount a symlinked location results in access denied at container startup, and an explicitly hardcoded location is likewise banned. Solution 3 (ban symlinks and replace blacklisted locations with empty directories; Jason's proposed implementation): The system admin specifies white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}; black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}. A hacker's attempt to mount a symlinked location results in access denied at container startup, and mounting /run/docker.socket directly yields only an empty file. All solutions require the system administrator to enforce uploading of secure images to a private registry, to prevent trojan horses in docker images. I can see the appeal of the new proposal: it avoids high-upkeep maintenance of white-list-read-write directories. The third solution can throw people off if they do not know the blacklisted path has been redirected to an empty location; however, directory depth will defeat the second solution. If the community favors the third solution, I can revise the patch accordingly. 
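For illustration only, the prefix checks in solutions 2 and 3 can be sketched as follows. The names and the resolver hook are hypothetical (the actual enforcement lives in the C container-executor); the resolver parameter only exists so the symlink check can be exercised without touching the host filesystem:

```python
import os

# Hypothetical whitelist/blacklist taken from the example above.
WHITE_LIST_READ_WRITE = ["/var", "/run", "/mnt/hdfs/user"]
BLACK_LIST = ["/run/docker.socket", "/mnt/hdfs/user/yarn"]

def _is_under(path, prefix):
    # True if path equals prefix or lives underneath it.
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")

def mount_allowed(requested, resolve=os.path.realpath):
    """Ban symlinks outright, then apply blacklist before whitelist."""
    normalized = os.path.normpath(requested)
    real = resolve(normalized)
    if real != normalized:
        return False  # some component was a symlink (e.g. /var/run -> /run)
    if any(_is_under(real, b) for b in BLACK_LIST):
        return False  # explicitly banned location
    return any(_is_under(real, w) for w in WHITE_LIST_READ_WRITE)
```

With an identity resolver, {{/var/log}} is allowed and {{/run/docker.socket}} is denied; with a resolver that maps {{/var/run}} onto {{/run}}, the symlinked {{/var/run/docker.socket}} is denied as well.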
[jira] [Commented] (YARN-7565) Yarn service pre-maturely releases the container after AM restart
[ https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298837#comment-16298837 ] Eric Yang commented on YARN-7565: - More information revealed that there was a problem with a znode on my cluster. I am not sure how it reached that state. After removing the faulty znode for the DNS registry, the null pointer exception no longer occurs. > Yarn service pre-maturely releases the container after AM restart > -- > > Key: YARN-7565 > URL: https://issues.apache.org/jira/browse/YARN-7565 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh > Fix For: 3.1.0 > > Attachments: YARN-7565.001.patch, YARN-7565.002.patch, > YARN-7565.003.patch, YARN-7565.004.patch, YARN-7565.005.patch, > YARN-7565.addendum.001.patch > > > With YARN-6168, recovered containers can be reported to the AM in response to the AM heartbeat. > Currently, the Service Master immediately releases containers that are not reported in the AM registration response. > Instead, the master can wait a configured amount of time for the containers to be recovered by the RM. These containers are sent to the AM in the heartbeat response. Once a container is not reported within the configured interval, it can be released by the master. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart
[ https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298747#comment-16298747 ] Eric Yang edited comment on YARN-7565 at 12/20/17 5:33 PM: --- Thank you for pointing out that ServiceRecord.description maps to the container name (and not the Service Spec description field), but it appears to be a race condition for a newly created application: serviceStart invokes recoverComponent first, before the application has registered with the Registry. This looks like the reason we get the null pointer exception.
[jira] [Comment Edited] (YARN-8080) YARN native service should support component restart policy
[ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464158#comment-16464158 ] Eric Yang edited comment on YARN-8080 at 5/4/18 5:23 PM: - [~suma.shivaprasad] Thank you for the patch. Flex is a black-box operation; it is not context-aware of whether an application requires more or fewer containers, so it relies on the user/program to make the decision. Here are possible usages of each case: *Retry policy = NEVER and Flex Up* A data scientist might be training datasets and find that the data produced by the first two completed containers is insufficient, and he would like more iterations to train on the same dataset. The input parameters could stay the same, but more of the same iterations run in parallel. The flex operation comes in handy here: flex up to reach a desired state of 4 containers (2 currently running plus 2 additional containers). This can produce more data models for him in the same run. *Retry policy = NEVER and Flex down* The system administrator asks the data scientist to save system resources for his bitcoin mining operation. Flexing down saves system resources, and the ML training iterations are performed in a later run. *Retry policy = ON_FAILURE and Flex Up* This fits cases where container workloads are stateful, such as SparkSQL translating a query into multiple partitions. The SparkSQL driver can decide whether to attempt multiple retries on failure with smaller datasets to ensure query completion. It may decide to increase the number of containers, changing a hint file on HDFS to reduce the workload computed per container while increasing the container count to complete the query computation. In this case, the counter should reset to 0 for successful container runs, and all containers restart. *Retry policy = ON_FAILURE and Flex down* In some cases, merging data from many partitions at the same time can hit an unbalanced dataset and prevent the merge from happening. 
The SparkSQL driver might decide to use an alternate technique to merge using fewer containers. In this case, the YARN Service AM reduces the container count and lets the Spark executor program communicate directly with the Spark driver program to compute by the alternate strategy. Here too, the counter should reset to 0 for successful container runs, and all containers restart. There are possible use cases for each scenario, and we provide the knobs to enable each of them. Some additional programming is needed from the application's point of view to take advantage of these advanced features. I also agree that some stateful programs might not work under certain combinations of retry policy and flex operations, and we provide an option to disable flex for that type of program.
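A minimal sketch of how the restart policy interacts with a container exit and a flex operation, using the NEVER / ON_FAILURE / ALWAYS values discussed here. The function and parameter names are invented for illustration and are not the actual service AM code:

```python
# Hypothetical restart decision per component instance.
def should_restart(policy, exit_code):
    if policy == "ALWAYS":
        return True            # long-running service: always restart
    if policy == "NEVER":
        return False           # job-like workload: never restart, even on failure
    if policy == "ON_FAILURE":
        return exit_code != 0  # restart only failed containers
    raise ValueError("unknown restart policy: %r" % policy)

# On a flex under ON_FAILURE, the discussion above suggests resetting the
# success counter so the new desired container count takes full effect.
def flex(policy, running, desired, completed_ok):
    if policy == "ON_FAILURE":
        completed_ok = 0       # reset the counter for successful runs
    return desired - running, completed_ok  # containers to add, new counter
```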
[jira] [Updated] (YARN-8223) ClassNotFoundException when auxiliary service is loaded from HDFS
[ https://issues.apache.org/jira/browse/YARN-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8223: Target Version/s: 3.2.0, 3.1.1 > ClassNotFoundException when auxiliary service is loaded from HDFS > - > > Key: YARN-8223 > URL: https://issues.apache.org/jira/browse/YARN-8223 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Charan Hebri >Assignee: Zian Chen >Priority: Blocker > Attachments: YARN-8223.001.patch, YARN-8223.002.patch > > > Loading an auxiliary jar from a local location on a node manager works as > expected, > {noformat} > 2018-04-26 15:09:26,179 INFO util.ApplicationClassLoader > (ApplicationClassLoader.java:(98)) - classpath: > [file:/grid/0/hadoop/yarn/local/aux-service-local.jar] > 2018-04-26 15:09:26,179 INFO util.ApplicationClassLoader > (ApplicationClassLoader.java:(99)) - system classes: [java., > javax.accessibility., javax.activation., javax.activity., javax.annotation., > javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., > javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., > javax.net., javax.print., javax.rmi., javax.script., > -javax.security.auth.message., javax.security.auth., javax.security.cert., > javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., > javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., > org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., > -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, > hdfs-default.xml, mapred-default.xml, yarn-default.xml] > 2018-04-26 15:09:26,181 INFO containermanager.AuxServices > (AuxServices.java:serviceInit(252)) - The aux service:test_aux_local are > using the custom classloader > 2018-04-26 15:09:26,182 WARN containermanager.AuxServices > (AuxServices.java:serviceInit(268)) - The Auxiliary Service named > 'test_aux_local' in the configuration is for class > 
org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader > which has a name of 'org.apache.auxtest.AuxServiceFromLocal with custom > class loader'. Because these are not the same tools trying to send > ServiceData and read Service Meta Data may have issues unless the refer to > the name in the config. > 2018-04-26 15:09:26,182 INFO containermanager.AuxServices > (AuxServices.java:addService(103)) - Adding auxiliary service > org.apache.auxtest.AuxServiceFromLocal with custom class loader, > "test_aux_local"{noformat} > But loading the same jar from a location on HDFS fails with a > ClassNotFoundException. > {noformat} > 018-04-26 15:14:39,683 INFO util.ApplicationClassLoader > (ApplicationClassLoader.java:(98)) - classpath: [] > 2018-04-26 15:14:39,683 INFO util.ApplicationClassLoader > (ApplicationClassLoader.java:(99)) - system classes: [java., > javax.accessibility., javax.activation., javax.activity., javax.annotation., > javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., > javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., > javax.net., javax.print., javax.rmi., javax.script., > -javax.security.auth.message., javax.security.auth., javax.security.cert., > javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., > javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., > org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., > -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, > hdfs-default.xml, mapred-default.xml, yarn-default.xml] > 2018-04-26 15:14:39,687 INFO service.AbstractService > (AbstractService.java:noteFailure(267)) - Service > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed > in state INITED > java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromLocal > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at 
java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189) > at > org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:249) > at >
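The classpath difference between the two logs ({{[file:/grid/0/hadoop/yarn/local/aux-service-local.jar]}} versus {{[]}}) suggests the remote jar never reaches the classloader. A hypothetical sketch of the localization step that would be needed; the {{fetch}} callback stands in for an HDFS copy-to-local operation, and none of these names come from the actual patch:

```python
import os

def localize_aux_jar(jar_uri, local_dir, fetch):
    """Return a local filesystem path for the aux-service jar.

    URLClassLoader-style loaders need a local file; a remote scheme such
    as hdfs:// must be fetched to the local filesystem first, otherwise
    the classpath ends up empty and loading fails with
    ClassNotFoundException.
    """
    if "://" in jar_uri and not jar_uri.startswith("file:"):
        local_path = os.path.join(local_dir, os.path.basename(jar_uri))
        fetch(jar_uri, local_path)  # placeholder for the HDFS copy step
        return local_path
    # Already local: strip a file: prefix if present.
    return jar_uri[len("file:"):] if jar_uri.startswith("file:") else jar_uri
```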
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464035#comment-16464035 ] Eric Yang commented on YARN-8207: - [~jlowe] I see your concerns now. Thanks for the explanation. I will update the code to use the typedef data structure above, and ensure a null terminator is passed to execvp after extracting the data out of args. > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch > > > Container-executor code utilizes a string buffer to construct the docker run command and passes that buffer to popen for execution. Popen spawns a shell to run the command, so some arguments for docker run are still vulnerable to shell expansion. The possible solution is to convert the char * buffer to a string array for execv, which avoids shell expansion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
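The expansion risk is easy to demonstrate. The container-executor fix itself is in C (execvp with a NULL-terminated argv instead of popen), but the same contrast can be sketched in Python: an argument vector bypasses the shell and is passed through literally, while a command string handed to a shell gets expanded. The {{$(...)}} payload below is only an illustration:

```python
import subprocess

payload = "$(echo pwned)"  # a shell-expandable argument

# Argument vector, no shell: the payload stays literal. This is what
# switching from popen to execvp achieves.
safe = subprocess.run(["echo", payload],
                      capture_output=True, text=True).stdout.strip()

# Single string through a shell (what popen does): the embedded command runs.
unsafe = subprocess.run("echo " + payload, shell=True,
                        capture_output=True, text=True).stdout.strip()
```

`safe` keeps the literal `$(echo pwned)` text, while `unsafe` comes back as `pwned`.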
[jira] [Commented] (YARN-8080) YARN native service should support component restart policy
[ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464158#comment-16464158 ] Eric Yang commented on YARN-8080: - [~suma.shivaprasad] Thank you for the patch. > YARN native service should support component restart policy > --- > > Key: YARN-8080 > URL: https://issues.apache.org/jira/browse/YARN-8080 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Suma Shivaprasad >Priority: Critical > Attachments: YARN-8080.001.patch, YARN-8080.002.patch, > YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, > YARN-8080.007.patch > > > Existing native service assumes the service is long running and never finishes. Containers will be restarted even if exit code == 0. > To support broader use cases, we need to allow the restart policy of a component to be specified by users. Propose to have the following policies: > 1) Always: containers are always restarted by the framework regardless of container exit status. This is the existing/default behavior. > 2) Never: do not restart containers in any case after a container finishes. To support job-like workloads (for example a Tensorflow training job): if a task exits with code == 0, we should not restart the task. This can be used by services which are not restart/recovery-able. > 3) On-failure: similar to the above, only restart tasks with exit code != 0. > Behaviors after a component *instance* finalizes (Succeeded or Failed when restart_policy != ALWAYS): > 1) For single component, single instance: complete service. 
> 2) For single component, multiple instances: other running instances from the same component won't be affected by the finalized component instance. The service will be terminated once all instances finalize. > 3) For multiple components: the service will be terminated once all components finalize.
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: YARN-8207.007.patch
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467950#comment-16467950 ] Eric Yang commented on YARN-8207: - [~jlowe] Thank you for the persistent reviews to make this better. :)
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467853#comment-16467853 ] Eric Yang commented on YARN-8207: - [~jlowe] Patch 10 is posted.
[jira] [Created] (YARN-8265) AM should retrieve new IP for restarted container
Eric Yang created YARN-8265: --- Summary: AM should retrieve new IP for restarted container Key: YARN-8265 URL: https://issues.apache.org/jira/browse/YARN-8265 Project: Hadoop YARN Issue Type: Bug Components: yarn-native-services Affects Versions: 3.1.0 Reporter: Eric Yang Assignee: Eric Yang Fix For: 3.2.0, 3.1.1 When a docker container is restarted, it gets a new IP, but the service AM only retrieves one IP for a container and then cancels the container status retriever. I suspect the issue would be solved by restarting the retriever (if it has been canceled) when the onContainerRestart callback is received, but we'll have to do some testing to make sure this works.
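A hypothetical sketch of the proposed behavior. The class and field names are invented for illustration; the real retriever lives in the service AM:

```python
class ContainerStatusRetriever:
    """Polls container status until an IP is found, then cancels itself.

    The proposed fix re-arms the retriever from the onContainerRestart
    callback so a restarted container's new IP is picked up.
    """
    def __init__(self, get_status):
        self._get_status = get_status  # callable returning a status dict
        self.cancelled = False
        self.ip = None

    def poll(self):
        if self.cancelled:
            return
        status = self._get_status()
        if status.get("ip"):
            self.ip = status["ip"]
            self.cancelled = True  # current behavior: stop after the first IP

    def on_container_restart(self):
        self.cancelled = False     # proposed: resume polling for the new IP
```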
[jira] [Updated] (YARN-8265) AM should retrieve new IP for restarted container
[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8265: Attachment: YARN-8265.001.patch
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474510#comment-16474510 ] Eric Yang commented on YARN-8108: - Kerberos SPNs, as browsers define them, take the form HTTP/<host>, where <host> is either a whitelisted server name or the canonical DNS name of the server. Chrome, IE, and Firefox all share similar logic. Firefox and IE disallow canonical DNS resolution to prevent MITM attacks; Safari and Chrome support canonical DNS with an option to disable it. From the server's point of view, a single server can host multiple virtual hosts with different web applications, so it is technically possible to configure a web server to run with multiple SPNs. It is incorrect, however, to assume that the same virtual host can serve two different SPNs for two different subsets of URLs: no browser supports serving one subset of URLs under one SPN while another subset is served under a second SPN. In Hadoop 0.2x, Hadoop components were designed to serve a collection of servlets (log, static, cluster) per port. AuthenticationFilter could therefore cover the entire port by targeting that fixed set of servlets, which matched browser expectations without problems. AuthenticationFilter was later reused in Hadoop 1.x and 2.x as the Kerberos SPNEGO filter. The current problem only surfaces when multiple web contexts are configured to share the same port and server hostname, and each web context tries to initialize its own SPN. This is not by design; it happened through code reuse and a lack of testing. For Hadoop 2.x+ to offer embedded services securely, the individual AuthenticationFilter instances could be turned into one [security handler|http://www.eclipse.org/jetty/documentation/9.3.x/architecture.html#_handlers] to match the Jetty design specification. 
This fell through the cracks in open source because Hadoop's first security mechanism was an XSS filter (committed as part of Chukwa) rather than a security handler. Hadoop security mechanisms then followed a bottom-up approach, implemented as filters instead of as Handlers per web application design, without recognizing that session persistence requires the authentication and authorization mechanisms to be built differently from web filters. The one-line change is to loop through all contexts and register the same AuthenticationFilter with each of them, applying one filter globally to all URLs. This is why a one-line patch can plug this security hole as a short-term bug fix. The long-term solution is to write a security handler that matches Jetty's handler design, which would avoid API breakage during Jetty version upgrades and improve session persistence in Hadoop web applications, but that is beyond the scope of this JIRA. 
> RM metrics rest API throws GSSException in kerberized environment > - > > Key: YARN-8108 > URL: https://issues.apache.org/jira/browse/YARN-8108 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Kshitij Badani >Assignee: Eric Yang >Priority: Blocker > Attachments: YARN-8108.001.patch > > > The test tries to pull metrics data from SHS after kiniting as 'test_user' > and throws a GSSException as follows > {code:java} > b2b460b80713|RUNNING: curl --silent -k -X GET -D > /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : > http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15 > 07:15:48,757|INFO|MainThread|machine.py:194 - > run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0 > 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - > getMetricsJsonData()|metrics: > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /proxy/application_1518674952153_0070/metrics/json. > Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {code} > Root cause: the proxy server on the RM cannot be supported in a Kerberos-enabled > cluster because AuthenticationFilter is applied twice in the Hadoop code (once in > HttpServer2 for the RM, and another instance from AmFilterInitializer for the proxy > server). This will require code changes to the hadoop-yarn-server-web-proxy > project
[jira] [Commented] (YARN-8284) get_docker_command refactoring
[ https://issues.apache.org/jira/browse/YARN-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474826#comment-16474826 ] Eric Yang commented on YARN-8284: - +1 looks good to me. > get_docker_command refactoring > -- > > Key: YARN-8284 > URL: https://issues.apache.org/jira/browse/YARN-8284 > Project: Hadoop YARN > Issue Type: Improvement >Affects Versions: 3.2.0, 3.1.1 >Reporter: Jason Lowe >Assignee: Eric Badger >Priority: Minor > Attachments: YARN-8284.001.patch > > > YARN-8274 occurred because get_docker_command's helper functions each have to > remember to put the docker binary as the first argument. This is error prone > and causes code duplication for each of the helper functions. It would be > safer and simpler if get_docker_command initialized the docker binary > argument in one place and each of the helper functions only added the > arguments specific to their particular docker sub-command.
[jira] [Commented] (YARN-8206) Sending a kill does not immediately kill docker containers
[ https://issues.apache.org/jira/browse/YARN-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466616#comment-16466616 ] Eric Yang commented on YARN-8206: - [~ebadger] +1 for proposal 2. This is the safer option, in my opinion. > Sending a kill does not immediately kill docker containers > -- > > Key: YARN-8206 > URL: https://issues.apache.org/jira/browse/YARN-8206 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-8206.001.patch, YARN-8206.002.patch, > YARN-8206.003.patch, YARN-8206.004.patch > > > {noformat} > if (ContainerExecutor.Signal.KILL.equals(signal) > || ContainerExecutor.Signal.TERM.equals(signal)) { > handleContainerStop(containerId, env); > {noformat} > Currently in the code, we are handling both SIGKILL and SIGTERM as equivalent > for docker containers. However, they should actually be separate. When YARN > sends a SIGKILL to a process, it means for it to die immediately and not sit > around waiting for anything. This ensures an immediate reclamation of > resources. Additionally, if a SIGTERM is sent before the SIGKILL, the task > might not handle the signal correctly, and will then end up as a failed task > instead of a killed task. This is especially bad for preemption.
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Priority: Blocker (was: Major)
[jira] [Commented] (YARN-8255) Allow option to disable flex for a service component
[ https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466448#comment-16466448 ] Eric Yang commented on YARN-8255: - Instead of introducing another field to enable or disable flex, we can determine whether a workload supports the flex operation based on restart_policy. When restart_policy=ON_FAILURE or ALWAYS, the data can be recomputed or the process can resume from failure, so the flex operation can be enabled. When restart_policy=NEVER, the data is stateful and cannot be reprocessed (i.e. mapreduce writing to HBase without transaction support), so this type of container is not allowed to flex. By this deduction, it is possible to reduce the number of combinations that must be supported. It also implies that restart_policy=NEVER does not have to support upgrade. > Allow option to disable flex for a service component > - > > Key: YARN-8255 > URL: https://issues.apache.org/jira/browse/YARN-8255 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > > YARN-8080 implements restart capabilities for service component instances. > YARN service components should add an option to disallow flexing to support > workloads which are essentially batch/iterative jobs which terminate with > restart_policy=NEVER/ON_FAILURE. This could be disabled by default for > components where restart_policy=NEVER/ON_FAILURE and enabled by default when > restart_policy=ALWAYS(which is the default restart_policy) unless explicitly > set at the service spec. > The option could be exposed as part of the component spec as "allow_flexing". > cc [~billie.rinaldi] [~gsaha] [~eyang] [~csingh] [~wangda]
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466512#comment-16466512 ] Eric Yang commented on YARN-8207: - [~jlowe] Hadoop 3.1.1 release date was proposed for May 7th. This is a blocking issue for YARN-7654. I think this JIRA is very close to completion, and I like to make sure that we can catch the release train. Are you comfortable to the last iteration of this patch?
[jira] [Comment Edited] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466512#comment-16466512 ] Eric Yang edited comment on YARN-8207 at 5/7/18 9:47 PM: - [~jlowe] Hadoop 3.1.1 release date was proposed for May 7th. This is a blocking issue for YARN-7654. I think this JIRA is very close to completion, and I like to make sure that we can catch the release train. Are you comfortable with the latest iteration of this patch? was (Author: eyang): [~jlowe] Hadoop 3.1.1 release date was proposed for May 7th. This is a blocking issue for YARN-7654. I think this JIRA is very close to completion, and I like to make sure that we can catch the release train. Are you comfortable to the last iteration of this patch?
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466683#comment-16466683 ] Eric Yang commented on YARN-8207: - [~jlowe] {quote}Rather than make an expensive deep copy of the arguments, construct_docker_command only needs to copy the args vector then set the number of arguments to zero. At that point we'd be effectively transferring ownership of the already allocated arg strings to the caller without requiring full copies.{quote} Struct args is still evolving. I think it is safer to keep the data structure private as an opaque type and deep-copy to the caller. This avoids putting responsibility on the external caller to free the internals of struct args, and if we later want the ability to trim or truncate the string array based on allowed parameters, we have a way to fix it. {quote}add_param_to_command_if_allowed (and many other places) doesn't check for make_string failure, and add_to_args will segfault when it tries to dereference the NULL argument. Does it make sense to have add_to_args return failure if the caller tried to add a NULL argument?{quote} At this time, add_to_args treats a NULL argument as a no-op to avoid having to check make_string for NULL at every call site. The proposed reverse change would add more null-pointer checks, which makes the code harder to read again and contradicts the original intent of your reviews to make the code easier to read. {quote}flatten adds 1 to the strlen length in the loop, but there is only a need for one NUL terminator which is already accounted for in the total initial value.{quote} The +1 is for a space, not a NUL terminator; flatten renders an HTML page that looks like a command line. The last space is replaced with the NUL terminator. {quote}flatten is using stpcpy incorrectly as it ignores the return values from the function. 
stpcpy returns a pointer to the terminating NUL of the resulting string which is exactly what we need for appending, so each invocation of stpcpy should be like: to = stpcpy(to, ...){quote} This is fixed in the YARN-7654 patch. It is hard to rebase n times, and changes end up in the wrong patch; I will fix this. {quote}This change doesn't look related to the execv changes? Also looks like a case that could be simplified quite a bit with strndup and strdup.{quote} There is an off-by-one memory corruption where pattern is not NUL-terminated properly. This was detected by valgrind, and I decided to fix it because it causes a segfault if left in the code. I will fix the rest of the issues that you found. Thank you again for the review. 
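The review comment quoted above describes the correct stpcpy idiom. A small self-contained sketch of it (this `flatten` signature is hypothetical, not the exact container-executor function): each call's return value, a pointer to the NUL just written, becomes the next copy destination, and the trailing separator slot doubles as the final terminator.

```c
#define _POSIX_C_SOURCE 200809L  /* stpcpy is POSIX.1-2008 */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Join n arguments with single spaces. Unlike repeated strcat, the
 * returned pointer from stpcpy avoids rescanning the buffer on every
 * append. Each argument contributes strlen+1 bytes (the +1 is the
 * separating space); the initial 1 reserves room for the final NUL. */
char *flatten(char *const args[], int n) {
    size_t total = 1;
    for (int i = 0; i < n; i++) {
        total += strlen(args[i]) + 1;
    }
    char *buf = malloc(total);
    if (buf == NULL) {
        return NULL;
    }
    char *to = buf;
    for (int i = 0; i < n; i++) {
        to = stpcpy(to, args[i]);  /* advance to the new end of string */
        *to++ = ' ';
    }
    *(n > 0 ? to - 1 : to) = '\0'; /* last space becomes the terminator */
    return buf;
}
```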
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466716#comment-16466716 ] Eric Yang commented on YARN-8207: - [~jlowe] Patch 008 fixes the issues discovered, except the char array copy. There were approximately 900 kB of leaks in container-executor prior to this patch, and based on the valgrind report from exercising the test cases, we saved 20 kB from leaking. Execvp will wipe out all the leaks anyhow, unless we find more buffer overflow problems. I am going to stop making styling changes because they have diminishing returns at this point.
[jira] [Commented] (YARN-8255) Allow option to disable flex for a service component
[ https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466800#comment-16466800 ] Eric Yang commented on YARN-8255: - [~leftnoteasy] Recompute and expandable are intertwined. They are the same thing. At conceptual level, teragen has no dependency of input format. You can add more partitions to get more data generated. Hadoop's own implementation limited this from happening, but this does not mean docker containers should be imposed by the same initialization time limitation. On the other hand, we must optimize the framework for general purpose usage and prevent ourselves from giving too many untested and unsupported options. I think it make sense to reduce the flex options to 2 main types instead of giving all 6 options.
[jira] [Comment Edited] (YARN-8255) Allow option to disable flex for a service component
[ https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466800#comment-16466800 ] Eric Yang edited comment on YARN-8255 at 5/8/18 3:08 AM: - [~leftnoteasy] Recompute and expandable are intertwined. They are not the same thing. At conceptual level, teragen has no dependency of input format. You can add more partitions to get more data generated. Hadoop's own implementation limited this from happening, but this does not mean docker containers should be imposed by the same initialization time limitation. On the other hand, we must optimize the framework for general purpose usage and prevent ourselves from giving too many untested and unsupported options. I think it make sense to reduce the flex options to 2 main types instead of giving all 6 options. was (Author: eyang): [~leftnoteasy] Recompute and expandable are intertwined. They are the same thing. At conceptual level, teragen has no dependency of input format. You can add more partitions to get more data generated. Hadoop's own implementation limited this from happening, but this does not mean docker containers should be imposed by the same initialization time limitation. On the other hand, we must optimize the framework for general purpose usage and prevent ourselves from giving too many untested and unsupported options. I think it make sense to reduce the flex options to 2 main types instead of giving all 6 options.
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: YARN-8207.008.patch
[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7654: Attachment: YARN-7654.021.patch > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch > > > A docker image may have an ENTRY_POINT predefined, but this is not supported in the current implementation. It would be nice if we could detect the existence of {{launch_command}} and, based on this variable, launch the docker container in different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code}
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468111#comment-16468111 ] Eric Yang commented on YARN-7654: - [~jlowe] {quote}I'll try to find time to take a closer look at this patch tomorrow, but I'm wondering if we really need to separate the detached vs. foreground launching for override vs. entry-point containers. The main problem with running containers in the foreground is that we have no idea how long it takes to actually start a container. As I mentioned above, any required localization for the image is likely to cause the container launch to fail due to docker inspect retries hitting the retry limit and failing, leaving the container uncontrolled or at best finally killed sometime later if Shane's lifecycle changes cause the container to get recognized long afterwards and killed.{quote} The detach option only obtains a container id; the container process continues to update information in the background. We call docker inspect by name reference instead of container id. Detach does not produce a more accurate docker inspect result than running in the foreground, because operations issued to the docker daemon through the docker CLI are asynchronous against the daemon's REST API, and the JSON output from docker inspect may contain partial information. Since we know exactly which information to parse, retrying provides a better success rate. For ENTRY_POINT, docker run stays in the foreground to capture the stdout and stderr of the ENTRY_POINT process without relying on mounting a host log directory into the docker container. This prevents the host log path from appearing inside the container, which may look odd to users. {quote}I think a cleaner approach would be to always run containers as detached, so when the docker run command returns we will know the docker inspect command will work. 
If I understand correctly, the main obstacle to this approach is finding out what to do with the container's standard out and standard error streams which aren't directly visible when the container runs detached. However I think we can use the docker logs command after the container is launched to reacquire the container's stdout and stderr streams and tie them to the intended files. At least my local experiments show docker logs is able to obtain the separate stdout and stderr streams for containers whether they were started detached or not. Thoughts?{quote} If we run in the background, we again have problems capturing logs, based on issues found in prior meetings. # The docker logs command shows logs from the beginning of the launch up to the point where it is invoked; without frequent calls to docker logs, we do not get the complete log. Calling docker logs, with its fork and exec, is more expensive than reading a local log file. If we use the --tail option, there is still one extra fork plus managing the child process's liveness and resource usage, which complicates how resource usage should be computed. # docker logs does not seem to separate stdout from stderr. [This issue|https://github.com/moby/moby/issues/7440] is unresolved in docker. This differs from YARN log file management, and it would be nice to follow the YARN approach to make the output less confusing in many situations. After many experiments, I settled on foreground and dup for simplicity. Foreground plus retrying docker inspect is a fair concern, but there is a way to find a reasonable timeout value for deciding when a docker container should be marked as failed. 
> Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code}
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469061#comment-16469061 ] Eric Yang commented on YARN-7654: - [~jlowe] [~Jim_Brennan] I misread the last message in the discussion forum. The logs feature can redirect stdout and stderr streams correctly. However, I am not thrilled to call an extra docker logs command to fetch logs and to maintain the liveness of the docker logs command. In my view, this is more fragile because the docker logs command can receive an external signal that prevents the whole log from being sent to YARN, and subsequent tailing will report duplicated information. If we attach to the real stdout and stderr of the running program, we avoid the headache of additional process management and get no duplicated information. I don't believe a blocking call is the correct answer to help determine the liveness of a docker container. The blocking call to wait for docker detach has several problems: 1. Docker run could get stuck pulling docker images when a massive number of containers all start at the same time and the image is not cached locally. This happens a lot with repositories hosted on Docker Hub. 2. The docker run CLI can also get stuck when the docker daemon hangs, and no exit code is returned. 3. Some docker images are not built to run in detached mode. Some developers might have built their systems to require foreground mode; these images will terminate in detach mode. When the "docker run -d" and "docker logs" combination is employed, some progress is not logged, i.e. download progress and docker daemon error messages. The current patch logs any errors coming from the docker run CLI to provide more information for users troubleshooting problems. Regarding the racy problem, this can be tuned by the system administrator. Consider a cluster that downloads all images from the internet via a slow link.
It is perfectly reasonable to set the retry and timeout values to 30 minutes to wait for the download to complete. In a highly automated system, such as a cloud vendor spinning up images in a fraction of a second for a massive number of users, the timeout value might be set as short as 5 seconds. If the image came up in 6 seconds and missed the SLA, another container takes its place in the next 5 seconds to provide a smooth user experience, and the 6-second container is recycled and rebuilt. At massive scale, the race condition problem is easier to deal with than a blocking call that prevents the entire automated system from working. I can make the retry count a configurable setting in the short term. I am not discounting the possibility of supporting docker run -d and docker logs, but this requires more development experiments to ensure all mechanics are covered well. The current approach has been in use in my environment for the past 6 months, and it works well. For the 3.1.1 release, it would be safer to use the current approach to get us better coverage of the types of containers that can be supported. Thoughts? > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation.
It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
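The configurable retry-and-timeout policy discussed above can be sketched as a generic poll-with-retries helper. The function and parameter names here are illustrative assumptions; the constant mentioned in the discussion is MAX_RETRIES inside container-executor:

```c
#include <unistd.h>

/* Poll a probe (standing in for "docker inspect") up to max_retries
 * times, sleeping between attempts. An administrator could tune
 * max_retries and sleep_secs for slow image downloads or tight SLAs. */
typedef int (*probe_fn)(void *ctx);     /* returns nonzero once ready */

int retry_until_ready(probe_fn probe, void *ctx,
                      int max_retries, unsigned int sleep_secs) {
  for (int attempt = 1; attempt <= max_retries; attempt++) {
    if (probe(ctx)) {
      return attempt;                   /* ready after this many attempts */
    }
    if (attempt < max_retries) {
      sleep(sleep_secs);                /* back off before the next probe */
    }
  }
  return -1;                            /* never became ready: mark failed */
}

/* Example probe: becomes ready only after *(int *)ctx calls, simulating
 * a container that takes a while to show up in docker inspect. */
int countdown_probe(void *ctx) {
  int *remaining = ctx;
  return --(*remaining) <= 0;
}
```

With a slow link, a large max_retries times sleep_secs budget tolerates long image pulls; with a 5-second SLA, a small budget fails fast so the automation can replace the container.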
[jira] [Updated] (YARN-8261) Docker container launch fails due to .cmd file creation failure
[ https://issues.apache.org/jira/browse/YARN-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8261: Fix Version/s: 3.1.1 3.2.0 > Docker container launch fails due to .cmd file creation failure > --- > > Key: YARN-8261 > URL: https://issues.apache.org/jira/browse/YARN-8261 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0, 3.1.1 >Reporter: Eric Badger >Assignee: Jason Lowe >Priority: Blocker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8261.001.patch, YARN-8261.002.patch > > > Due to YARN-8064, the location of the docker .cmd files was changed. These > files are now being placed in the nmPrivate directory of the container. > However, this directory will not always be created. If the localizer does not > run or the credentials are written to a different disk, then this directory > will not exist and so the .cmd file creation will fail, thus causing the > container launch to fail.
[jira] [Commented] (YARN-8261) Docker container launch fails due to .cmd file creation failure
[ https://issues.apache.org/jira/browse/YARN-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469152#comment-16469152 ] Eric Yang commented on YARN-8261: - +1 looks good to me. > Docker container launch fails due to .cmd file creation failure > --- > > Key: YARN-8261 > URL: https://issues.apache.org/jira/browse/YARN-8261 > Project: Hadoop YARN > Issue Type: Sub-task >Affects Versions: 3.2.0, 3.1.1 >Reporter: Eric Badger >Assignee: Jason Lowe >Priority: Blocker > Attachments: YARN-8261.001.patch, YARN-8261.002.patch > > > Due to YARN-8064, the location of the docker .cmd files was changed. These > files are now being placed in the nmPrivate directory of the container. > However, this directory will not always be created. If the localizer does not > run or the credentials are written to a different disk, then this directory > will not exist and so the .cmd file creation will fail, thus causing the > container launch to fail.
[jira] [Commented] (YARN-7799) YARN Service dependency follow up work
[ https://issues.apache.org/jira/browse/YARN-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456808#comment-16456808 ] Eric Yang commented on YARN-7799: - [~billie.rinaldi] The summary of our discussion: 1. We can check the prefix directories of yarn.service.framework.path to ensure all sub-directories are world readable and executable, so that other users can access this path. 2. If the user calling -enableFastLaunch is one of yarn.admin.acl, and yarn.service.framework.path is pre-configured, the user is allowed to upload service-dep.tar.gz. 3. If the calling user is in dfs.cluster.administrators, the user is allowed to upload service-dep.tar.gz. 4. Auto-upload follows the same logic. > YARN Service dependency follow up work > -- > > Key: YARN-7799 > URL: https://issues.apache.org/jira/browse/YARN-7799 > Project: Hadoop YARN > Issue Type: Bug > Components: client, resourcemanager >Reporter: Gour Saha >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-7799.1.patch > > > As per [~jianhe] these are some followup items that make sense to do after > YARN-7766. Quoting Jian's comment below - > Currently, if user doesn't supply location when run yarn app > -enableFastLaunch, the jars will be put under this location > {code} > hdfs:///yarn-services//service-dep.tar.gz > {code} > Since API server is embedded in RM, should RM look for this location too if > "yarn.service.framework.path" is not specified ? > And if "yarn.service.framework.path" is not specified and still the file > doesn't exist at above default location, I think RM can try to upload the > jars to above default location instead, currently RM is uploading the jars to > the location defined by below code. This folder is per app and also > inconsistent with CLI location.
> {code} > protected Path addJarResource(String serviceName, > MaplocalResources) > throws IOException, SliderException { > Path libPath = fs.buildClusterDirPath(serviceName); > {code} > By doing this, the next time a submission request comes, RM doesn't need to > upload the jars again.
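The ancestor-permission check in item 1 of the discussion above could look like the following local-filesystem sketch. The function name is hypothetical and the real check would inspect HDFS permissions rather than local stat() results:

```c
#include <string.h>
#include <sys/stat.h>

/* Walk every ancestor directory of 'path' and verify it is
 * world-readable and world-executable, so users other than the uploader
 * can reach yarn.service.framework.path. Local-filesystem illustration
 * only; the real check would query HDFS permissions instead. */
int ancestors_world_accessible(const char *path) {
  char buf[4096];
  strncpy(buf, path, sizeof(buf) - 1);
  buf[sizeof(buf) - 1] = '\0';
  for (char *p = strchr(buf + 1, '/'); p != NULL; p = strchr(p + 1, '/')) {
    *p = '\0';                          /* truncate to the current ancestor */
    struct stat st;
    if (stat(buf, &st) != 0 ||
        (st.st_mode & (S_IROTH | S_IXOTH)) != (S_IROTH | S_IXOTH)) {
      return 0;                         /* missing or not world-accessible */
    }
    *p = '/';                           /* restore the slash and continue */
  }
  return 1;
}
```

Both the read and execute bits matter: execute grants traversal into a directory, read grants listing it, and a single restrictive ancestor hides the whole subtree from other users.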
[jira] [Commented] (YARN-8209) NPE in DeletionService
[ https://issues.apache.org/jira/browse/YARN-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456680#comment-16456680 ] Eric Yang commented on YARN-8209: - [~jlowe] [~ebadger] Yes we have agreement on this issue. The most frequent commands have structure: {code} docker-command format name {code} If we export those key value pair to container-executor environment, this approach will cover most of the cases. Given that we have some idea to contain this problem, I think we can do this without reverting YARN-8064. Thoughts? > NPE in DeletionService > -- > > Key: YARN-8209 > URL: https://issues.apache.org/jira/browse/YARN-8209 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Chandni Singh >Assignee: Eric Badger >Priority: Major > > {code:java} > 2018-04-25 23:38:41,039 WARN concurrent.ExecutorHelper > (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in > thread DeletionService #1: > java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerClient.writeCommandToTempFile(DockerClient.java:109) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:85) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:192) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:128) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:935) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code}
[jira] [Updated] (YARN-8204) Yarn Service Upgrade: Add a flag to disable upgrade
[ https://issues.apache.org/jira/browse/YARN-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8204: Fix Version/s: 3.1.1 3.2.0 I just committed this, thank you [~csingh]. > Yarn Service Upgrade: Add a flag to disable upgrade > --- > > Key: YARN-8204 > URL: https://issues.apache.org/jira/browse/YARN-8204 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Chandni Singh >Assignee: Chandni Singh >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8204.001.patch, YARN-8204.002.patch > > > Add a flag that will enable/disable service upgrade on the cluster. > By default it is set to false since upgrade is in early stages.
[jira] [Commented] (YARN-8211) Yarn registry dns log finds BufferUnderflowException on port ping
[ https://issues.apache.org/jira/browse/YARN-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456685#comment-16456685 ] Eric Yang commented on YARN-8211: - Thank you [~billie.rinaldi] for the review and commit. > Yarn registry dns log finds BufferUnderflowException on port ping > - > > Key: YARN-8211 > URL: https://issues.apache.org/jira/browse/YARN-8211 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Yesha Vora >Assignee: Eric Yang >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8211.001.patch, YARN-8211.002.patch > > > Yarn registry dns server is constantly getting BufferUnderflowException. > {code:java} > 2018-04-25 01:36:56,139 WARN concurrent.ExecutorHelper > (ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception > when running task in RegistryDNS 76 > 2018-04-25 01:36:56,139 WARN concurrent.ExecutorHelper > (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in > thread RegistryDNS 76: > java.nio.BufferUnderflowException > at java.nio.Buffer.nextGetIndex(Buffer.java:500) > at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135) > at > org.apache.hadoop.registry.server.dns.RegistryDNS.getMessgeLength(RegistryDNS.java:820) > at > org.apache.hadoop.registry.server.dns.RegistryDNS.nioTCPClient(RegistryDNS.java:767) > at > org.apache.hadoop.registry.server.dns.RegistryDNS$3.call(RegistryDNS.java:846) > at > org.apache.hadoop.registry.server.dns.RegistryDNS$3.call(RegistryDNS.java:843) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748){code}
[jira] [Comment Edited] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457139#comment-16457139 ] Eric Yang edited comment on YARN-8207 at 4/27/18 11:03 PM: --- [~jlowe] Thank you for the review. Good suggestions on coding style issues. I will fix the coding style issues. {quote} stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 these file descriptors to 1 and 2 before the execv so any errors from docker run appear in those output files?{quote} When using launch_script.sh, there is stdout and stderr redirection inside launch_script.sh which bind-mount to host log directory. This is the reason that there is fopen and fclosed immediately until YARN-7654 logic are added. {quote}The parent process that is responsible for obtaining the pid is not waiting for the child to complete before running the inspect command. That's why retries had to be added to get it to work when they were not needed before. The parent should simply wait and check for error exit codes as it did before when it was using popen. After that we can ditch the retries since they won't be necessary.{quote} Using launch_script.sh, container-executor runs "docker run" with detach option. It assumes the exit code can be obtained quickly. This is the reason there is no logic for retry "docker inspect". This assumption is some what flawed. If the docker image is unavailable on the host, docker will show download progress and some other information and errors. The progression are not captured, which is difficult to debug. When docker inspect is probed, there is no information of what failed. Without launch_script.sh, container-executor runs "docker run" in the foreground, and obtain pid when the first process is started. Inspect command is checked asynchronously because docker run exit code is only reported when the docker process is terminated. 
There is a balance between how long that we should wait before we decide if the system is hang. We can make MAX_RETRIES configurable in case people like to wait for longer or period of time before deciding if the container should fail. {quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote} This change makes make_string function twice faster than sample code while waste 1% or less space if recursion is required. It is probably a reasonable trade off for modern day computers. was (Author: eyang): [~jlowe] Thank you for the review. Good suggestions on coding style issues. I will fix the coding style issues. {quote} stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 these file descriptors to 1 and 2 before the execv so any errors from docker run appear in those output files?{quote} When using launch_script.sh, there is stdout and stderr redirection inside launch_script.sh which bind-mount to host log directory. This is the reason that there is fopen and fclosed immediately until YARN-7654 logic are added. {quote}The parent process that is responsible for obtaining the pid is not waiting for the child to complete before running the inspect command. That's why retries had to be added to get it to work when they were not needed before. The parent should simply wait and check for error exit codes as it did before when it was using popen. After that we can ditch the retries since they won't be necessary.{quote} Using launch_script.sh, container-executor runs "docker run" with detach option. It assumes the exit code can be obtained quickly. This is the reason there is no logic for retry "docker inspect". This assumption is some what flawed. If the docker image is unavailable on the host, docker will show download progress and some other information and errors. The progression are not captured, which is difficult to debug. When docker inspect is probed, there is no information of what failed. 
Without launch_script.sh, container-executor runs "docker run" in the foreground, and obtain pid when the first process is started. Inspect command is checked asynchronously because docker run exit code is only reported when the docker process is terminated. There is a balance between how long that we should wait before we decide if the system is hang. We can make MAX_RETRIES configurable in case people like to wait for longer or period of time before deciding if the container should fail. {quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote} This change make make_string function twice faster than sample code while waste 1% or less space if recursion is required. It is probably a reasonable trade off for modern day computers. > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL:
[jira] [Comment Edited] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457139#comment-16457139 ] Eric Yang edited comment on YARN-8207 at 4/27/18 11:06 PM: --- [~jlowe] Thank you for the review. Good suggestions on coding style issues. I will fix the coding style issues. {quote} stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 these file descriptors to 1 and 2 before the execv so any errors from docker run appear in those output files?{quote} When using launch_script.sh, there is stdout and stderr redirection inside launch_script.sh which bind-mount to host log directory. This is the reason that there is fopen and fclosed immediately until YARN-7654 logic are added. {quote}The parent process that is responsible for obtaining the pid is not waiting for the child to complete before running the inspect command. That's why retries had to be added to get it to work when they were not needed before. The parent should simply wait and check for error exit codes as it did before when it was using popen. After that we can ditch the retries since they won't be necessary.{quote} Using launch_script.sh, container-executor runs "docker run" with detach option. It assumes the exit code can be obtained quickly. This is the reason there is no logic for retry "docker inspect". This assumption is some what flawed. If the docker image is unavailable on the host, docker will show download progress and some other information and errors. The progression are not captured, which is difficult to debug. When docker inspect is probed, there is no information of what failed. Without launch_script.sh, container-executor runs "docker run" in the foreground, and obtain pid when the first process is started. Inspect command is checked asynchronously because docker run exit code is only reported when the docker process is terminated. 
There is a balance between how long that we should wait before we decide if the system is hang. We can make MAX_RETRIES configurable in case people have a difference preference of wait time for docker inspect. {quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote} This change makes make_string function twice faster than sample code while waste 1% or less space if recursion is required. It is probably a reasonable trade off for modern day computers. was (Author: eyang): [~jlowe] Thank you for the review. Good suggestions on coding style issues. I will fix the coding style issues. {quote} stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 these file descriptors to 1 and 2 before the execv so any errors from docker run appear in those output files?{quote} When using launch_script.sh, there is stdout and stderr redirection inside launch_script.sh which bind-mount to host log directory. This is the reason that there is fopen and fclosed immediately until YARN-7654 logic are added. {quote}The parent process that is responsible for obtaining the pid is not waiting for the child to complete before running the inspect command. That's why retries had to be added to get it to work when they were not needed before. The parent should simply wait and check for error exit codes as it did before when it was using popen. After that we can ditch the retries since they won't be necessary.{quote} Using launch_script.sh, container-executor runs "docker run" with detach option. It assumes the exit code can be obtained quickly. This is the reason there is no logic for retry "docker inspect". This assumption is some what flawed. If the docker image is unavailable on the host, docker will show download progress and some other information and errors. The progression are not captured, which is difficult to debug. When docker inspect is probed, there is no information of what failed. 
Without launch_script.sh, container-executor runs "docker run" in the foreground, and obtain pid when the first process is started. Inspect command is checked asynchronously because docker run exit code is only reported when the docker process is terminated. There is a balance between how long that we should wait before we decide if the system is hang. We can make MAX_RETRIES configurable in case people like to wait for longer or period of time before deciding if the container should fail. {quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote} This change makes make_string function twice faster than sample code while waste 1% or less space if recursion is required. It is probably a reasonable trade off for modern day computers. > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL:
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457139#comment-16457139 ] Eric Yang commented on YARN-8207: - [~jlowe] Thank you for the review. Good suggestions on the coding style issues; I will fix them. {quote} stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 these file descriptors to 1 and 2 before the execv so any errors from docker run appear in those output files?{quote} When using launch_script.sh, there is stdout and stderr redirection inside launch_script.sh, which bind-mounts to the host log directory. This is the reason the file is fopen'd and fclose'd immediately, until the YARN-7654 logic is added. {quote}The parent process that is responsible for obtaining the pid is not waiting for the child to complete before running the inspect command. That's why retries had to be added to get it to work when they were not needed before. The parent should simply wait and check for error exit codes as it did before when it was using popen. After that we can ditch the retries since they won't be necessary.{quote} Using launch_script.sh, container-executor runs "docker run" with the detach option. It assumes the exit code can be obtained quickly, which is why there is no retry logic for "docker inspect". This assumption is somewhat flawed. If the docker image is unavailable on the host, docker will show download progress and other information and errors. That progress is not captured, which makes debugging difficult, and when docker inspect is probed there is no information about what failed. Without launch_script.sh, container-executor runs "docker run" in the foreground and obtains the pid when the first process is started. The inspect command is checked asynchronously because the docker run exit code is only reported when the docker process terminates. There is a balance in how long we should wait before deciding that the system is hung.
We can make MAX_RETRIES configurable in case people would like to wait for a longer period of time before deciding that the container should fail. {quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote} This change makes the make_string function twice as fast as the sample code while wasting 1% or less space if recursion is required. It is probably a reasonable trade-off for modern computers. > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Attachments: YARN-8207.001.patch > > > Container-executor code utilizes a string buffer to construct the docker run > command, and passes the string buffer to popen for execution. Popen spawns a > shell to run the command. Some arguments for docker run are still vulnerable > to shell expansion. The possible solution is to convert from a char * buffer > to a string array for execv to avoid shell expansion.
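For context on the size = n + 2 question above, a common shape for such a helper is the measure-then-allocate vsnprintf pattern. This is a hedged sketch of that pattern only, not the exact container-executor implementation:

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* vsnprintf(NULL, 0, ...) returns the formatted length n, excluding the
 * terminating NUL, so n + 1 bytes are sufficient for the result;
 * allocating n + 2 simply leaves one spare byte of slack. */
char *make_string(const char *fmt, ...) {
  va_list args;
  va_start(args, fmt);
  int n = vsnprintf(NULL, 0, fmt, args);   /* measure required length */
  va_end(args);
  if (n < 0) {
    return NULL;                           /* formatting error */
  }
  char *buf = malloc(n + 2);               /* n + 1 suffices; + 2 adds slack */
  if (buf == NULL) {
    return NULL;                           /* allocation failure */
  }
  va_start(args, fmt);
  vsnprintf(buf, n + 1, fmt, args);        /* writes the string plus NUL */
  va_end(args);
  return buf;
}
```

Measuring first avoids the guess-and-retry loop of fixed-size buffers, which is where the speed comparison against a retrying sample implementation comes from.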
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469823#comment-16469823 ] Eric Yang commented on YARN-7654: - [~jlowe] Thank you for the review; the styling improvements will be addressed. {quote} DockerClient is creating the environment file in /tmp which has the same leaking problem we had with the docker .cmd files. {quote} The patch writes the .env file in the same nmPrivate directory as the .cmd file. It doesn't write to /tmp. {quote} The code is now writing "Launching docker container..." etc. even when not using the entry point. Are these smashed by the container_launch.sh script when not using the entry point? If not it could be an issue since it's changing what the user's code is writing to those files today.{quote} Yes, these lines are overwritten by container_launch.sh in non-ENTRY_POINT mode, so this doesn't break existing compatibility. > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3.
Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code}
[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470814#comment-16470814 ] Eric Yang commented on YARN-8108: - [~daryn] The fact that this issue doesn't present itself in Hadoop 2.7.5 does not mean it was done properly there. It is not possible to configure different HTTP principals for RM and the Proxy Server on the same host/port, and it was only half working. This is because Hadoop only has the yarn.resourcemanager.webapp.spnego-keytab-file and yarn.resourcemanager.webapp.spnego-principal settings to define the HTTP principal to use on the RM server. It does not have yarn.web-proxy.webapp.spnego-keytab-file and yarn.web-proxy.webapp.spnego-principal settings to make the differentiation. Even if those settings were defined, they are not being used. Further analysis of Hadoop 2.7.5 shows that the /proxy URL is not secured by any HTTP principal when running in RM embedded mode. > RM metrics rest API throws GSSException in kerberized environment > - > > Key: YARN-8108 > URL: https://issues.apache.org/jira/browse/YARN-8108 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Kshitij Badani >Assignee: Eric Yang >Priority: Major > Attachments: YARN-8108.001.patch > > > Test is trying to pull up metrics data from SHS after kiniting as 'test_user' > It is throwing GSSException as follows > {code:java} > b2b460b80713|RUNNING: curl --silent -k -X GET -D > /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : > http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15 > 07:15:48,757|INFO|MainThread|machine.py:194 - > run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0 > 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - > getMetricsJsonData()|metrics: > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /proxy/application_1518674952153_0070/metrics/json.
> Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {code} > Rootcausing : proxyserver on RM can't be supported for Kerberos enabled > cluster because AuthenticationFilter is applied twice in Hadoop code (once in > httpServer2 for RM, and another instance from AmFilterInitializer for proxy > server). This will require code changes to hadoop-yarn-server-web-proxy > project -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
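The configuration gap described in the comment above can be illustrated with a yarn-site.xml sketch. The yarn.resourcemanager.* keys are the real, existing settings; the yarn.web-proxy.* key below is one of the hypothetical keys the comment says does not exist (and would not be honored even if defined). Principal and keytab values are illustrative placeholders.

```xml
<!-- Existing keys: the only way to set the HTTP/SPNEGO principal for the RM web app. -->
<property>
  <name>yarn.resourcemanager.webapp.spnego-principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>yarn.resourcemanager.webapp.spnego-keytab-file</name>
  <value>/etc/security/keytabs/spnego.service.keytab</value>
</property>

<!-- Hypothetical key discussed above: NOT implemented in Hadoop; shown only to
     mark the missing knob that would let the embedded proxy server use a
     different principal than the RM on the same host/port. -->
<property>
  <name>yarn.web-proxy.webapp.spnego-principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
```

Because both filters end up behind the same HttpServer2 instance in embedded mode, there is no per-path principal selection even with the extra key, which is why the comment calls the 2.7.5 behavior only half working.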
[jira] [Updated] (YARN-7799) YARN Service dependency follow up work
[ https://issues.apache.org/jira/browse/YARN-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7799: Fix Version/s: 3.1.1 3.2.0 > YARN Service dependency follow up work > -- > > Key: YARN-7799 > URL: https://issues.apache.org/jira/browse/YARN-7799 > Project: Hadoop YARN > Issue Type: Bug > Components: client, resourcemanager >Reporter: Gour Saha >Assignee: Billie Rinaldi >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-7799.1.patch, YARN-7799.2.patch, YARN-7799.3.patch, > YARN-7799.4.patch, YARN-7799.5.patch > > > As per [~jianhe] these are some followup items that make sense to do after > YARN-7766. Quoting Jian's comment below - > Currently, if user doesn't supply location when run yarn app > -enableFastLaunch, the jars will be put under this location > {code} > hdfs:///yarn-services//service-dep.tar.gz > {code} > Since API server is embedded in RM, should RM look for this location too if > "yarn.service.framework.path" is not specified ? > And if "yarn.service.framework.path" is not specified and still the file > doesn't exist at above default location, I think RM can try to upload the > jars to above default location instead, currently RM is uploading the jars to > the location defined by below code. This folder is per app and also > inconsistent with CLI location. > {code} > protected Path addJarResource(String serviceName, > MaplocalResources) > throws IOException, SliderException { > Path libPath = fs.buildClusterDirPath(serviceName); > {code} > By doing this, the next time a submission request comes, RM doesn't need to > upload the jars again. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
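The lookup order Jian proposes above (honor an explicit "yarn.service.framework.path", otherwise fall back to the shared default location that -enableFastLaunch uses) can be sketched in plain Java. This is a hypothetical illustration, not the actual RM code; the class name and the versionTag parameter stand in for the elided path segment in the default location quoted above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the fallback order discussed above: an explicit
// yarn.service.framework.path always wins; otherwise use the shared default
// location so RM and CLI agree and jars are not re-uploaded per submission.
public class ServiceDepResolver {
    static String resolve(Map<String, String> conf, String versionTag) {
        String configured = conf.get("yarn.service.framework.path");
        if (configured != null && !configured.isEmpty()) {
            return configured; // explicit setting always wins
        }
        // Shared default; "versionTag" is an illustrative placeholder for the
        // path segment elided in the quoted location.
        return "hdfs:///yarn-services/" + versionTag + "/service-dep.tar.gz";
    }
}
```

With this shape, a second submission request finds the tarball at the same default path and the RM can skip the upload entirely.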
[jira] [Updated] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM
[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8265: Target Version/s: 3.2.0, 3.1.1 (was: 3.2.0) Fix Version/s: 3.1.1 3.2.0 > Service AM should retrieve new IP for docker container relaunched by NM > --- > > Key: YARN-8265 > URL: https://issues.apache.org/jira/browse/YARN-8265 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8265.001.patch, YARN-8265.002.patch, > YARN-8265.003.patch > > > When a docker container is restarted, it gets a new IP, but the service AM > only retrieves one IP for a container and then cancels the container status > retriever. I suspect the issue would be solved by restarting the retriever > (if it has been canceled) when the onContainerRestart callback is received, > but we'll have to do some testing to make sure this works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM
[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472923#comment-16472923 ] Eric Yang commented on YARN-8265: - +1 looks good to me. I just committed this on trunk and branch-3.1. Thank you [~billie.rinaldi] for the review and patch. > Service AM should retrieve new IP for docker container relaunched by NM > --- > > Key: YARN-8265 > URL: https://issues.apache.org/jira/browse/YARN-8265 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8265.001.patch, YARN-8265.002.patch, > YARN-8265.003.patch > > > When a docker container is restarted, it gets a new IP, but the service AM > only retrieves one IP for a container and then cancels the container status > retriever. I suspect the issue would be solved by restarting the retriever > (if it has been canceled) when the onContainerRestart callback is received, > but we'll have to do some testing to make sure this works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM
[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472901#comment-16472901 ] Eric Yang commented on YARN-8265: - The "onContainerRestart" event is currently not working, so the workaround is the only feasible solution. Therefore, I am inclined to commit patch 003 for the 3.1.1 release. > Service AM should retrieve new IP for docker container relaunched by NM > --- > > Key: YARN-8265 > URL: https://issues.apache.org/jira/browse/YARN-8265 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-8265.001.patch, YARN-8265.002.patch, > YARN-8265.003.patch > > > When a docker container is restarted, it gets a new IP, but the service AM > only retrieves one IP for a container and then cancels the container status > retriever. I suspect the issue would be solved by restarting the retriever > (if it has been canceled) when the onContainerRestart callback is received, > but we'll have to do some testing to make sure this works.
[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM
[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473179#comment-16473179 ] Eric Yang commented on YARN-8265: - [~billie.rinaldi] The plan looks good. Thank you. > Service AM should retrieve new IP for docker container relaunched by NM > --- > > Key: YARN-8265 > URL: https://issues.apache.org/jira/browse/YARN-8265 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8265.001.patch, YARN-8265.002.patch, > YARN-8265.003.patch > > > When a docker container is restarted, it gets a new IP, but the service AM > only retrieves one IP for a container and then cancels the container status > retriever. I suspect the issue would be solved by restarting the retriever > (if it has been canceled) when the onContainerRestart callback is received, > but we'll have to do some testing to make sure this works. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8286) Add NMClient callback on container relaunch
[ https://issues.apache.org/jira/browse/YARN-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8286: Description: The AM may need to perform actions when a container has been relaunched. For example, the service AM would want to change the state it has recorded for the container and retrieve new container status for the container, in case the container IP has changed. (The NM would also need to remove the IP it has stored for the container, so container status calls don't return an IP for a container that is not currently running.) > Add NMClient callback on container relaunch > --- > > Key: YARN-8286 > URL: https://issues.apache.org/jira/browse/YARN-8286 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Priority: Critical > > The AM may need to perform actions when a container has been relaunched. For > example, the service AM would want to change the state it has recorded for > the container and retrieve new container status for the container, in case > the container IP has changed. (The NM would also need to remove the IP it has > stored for the container, so container status calls don't return an IP for a > container that is not currently running.) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8286) Add NMClient callback on container relaunch
[ https://issues.apache.org/jira/browse/YARN-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8286: Environment: (was: The AM may need to perform actions when a container has been relaunched. For example, the service AM would want to change the state it has recorded for the container and retrieve new container status for the container, in case the container IP has changed. (The NM would also need to remove the IP it has stored for the container, so container status calls don't return an IP for a container that is not currently running.)) > Add NMClient callback on container relaunch > --- > > Key: YARN-8286 > URL: https://issues.apache.org/jira/browse/YARN-8286 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Billie Rinaldi >Priority: Critical > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM
[ https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472837#comment-16472837 ] Eric Yang commented on YARN-8265: - [~billie.rinaldi] I am struggling to understand why the node manager would decide to restart the docker container without consulting the application master. The AM makes the decisions about container state, and the node manager only follows orders from the AM. This helps prevent race conditions between the AM and NM in deciding which container should stay up and running. The AM follows state transitions to ensure it stays on a pre-defined path. With container relaunch implemented in YARN-7973, the AM still decides when to restart a container. The "onContainerRestart" event will be received by the AM. If we run ContainerStartedTransition again, it will check for IP changes and cancel the scheduled timer thread. I think this will lead to a more desirable outcome without leaving the timer thread open ended. An alternate approach is to move the ContainerStatusRetriever to ContainerBecomeReadyTransition, and use the BECOME_READY transition to check for the IP address. > Service AM should retrieve new IP for docker container relaunched by NM > --- > > Key: YARN-8265 > URL: https://issues.apache.org/jira/browse/YARN-8265 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-native-services >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Billie Rinaldi >Priority: Critical > Attachments: YARN-8265.001.patch, YARN-8265.002.patch, > YARN-8265.003.patch > > > When a docker container is restarted, it gets a new IP, but the service AM > only retrieves one IP for a container and then cancels the container status > retriever. I suspect the issue would be solved by restarting the retriever > (if it has been canceled) when the onContainerRestart callback is received, > but we'll have to do some testing to make sure this works. 
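The restart-the-retriever idea discussed in the comments above can be sketched with a small, self-contained poller. Class and method names here are hypothetical, not the actual service AM code; the point is that a cancelled status retriever can be safely re-armed from a relaunch callback such as onContainerRestart.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a container-status poller that is cancelled once an IP
// has been retrieved, and restarted when the container is relaunched.
class StatusRetriever {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "status-retriever");
            t.setDaemon(true); // do not keep the JVM alive
            return t;
        });
    private ScheduledFuture<?> task;

    // (Re)start polling if nothing is currently scheduled; safe to call from a
    // restart callback even when the previous retriever was cancelled.
    synchronized void start(Runnable poll) {
        if (task == null || task.isCancelled() || task.isDone()) {
            task = scheduler.scheduleAtFixedRate(poll, 0, 1, TimeUnit.SECONDS);
        }
    }

    // Called once the (new) container IP has been retrieved.
    synchronized void stop() {
        if (task != null) {
            task.cancel(false);
        }
    }

    synchronized boolean isRunning() {
        return task != null && !task.isCancelled() && !task.isDone();
    }
}
```

The start()/stop()/start() cycle mirrors the proposed fix: cancel after the first IP is found, then re-arm on relaunch so the new IP is picked up.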
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472839#comment-16472839 ] Eric Yang commented on YARN-7654: - [~jlowe] Thank you for the great reviews and commit. [~shaneku...@gmail.com] [~Jim_Brennan] [~ebadger] Thank you for the reviews. > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch, > YARN-7654.024.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
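The two launch modes in the YARN-7654 description above can be sketched as a small decision helper. This is illustrative logic only, not the actual DockerLinuxContainerRuntime implementation; the helper class and the literal command strings (including the <container_id> placeholder) are assumptions for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the decision described above: when a launch_command is
// supplied we run the image and exec the command inside it; when it is absent we
// rely on the image's predefined ENTRY_POINT.
class LaunchModeChooser {
    static List<String> buildCommands(String image, String launchCommand) {
        List<String> commands = new ArrayList<>();
        commands.add("docker run " + image);
        if (launchCommand != null && !launchCommand.trim().isEmpty()) {
            // explicit command: exec it in the started container
            commands.add("docker exec <container_id> " + launchCommand);
        }
        return commands;
    }
}
```

The empty-command branch is what makes ENTRY_POINT images work: nothing is exec'd, so whatever the image author baked in runs as the container process.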
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471387#comment-16471387 ] Eric Yang commented on YARN-7654: - [~jlowe] Patch 22 contains all requested changes except refactoring the code in AbstractProviderService and DockerProviderService. I tried to refactor the code, but I have not yet got a working implementation. Due to time constraints, I am uploading the latest revision for your review first. > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch, YARN-7654.022.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code}
[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7654: Attachment: YARN-7654.022.patch > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch, YARN-7654.022.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8108: Target Version/s: 3.2.0, 3.1.1, 3.0.3 > RM metrics rest API throws GSSException in kerberized environment > - > > Key: YARN-8108 > URL: https://issues.apache.org/jira/browse/YARN-8108 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Kshitij Badani >Assignee: Eric Yang >Priority: Blocker > Attachments: YARN-8108.001.patch > > > Test is trying to pull up metrics data from SHS after kiniting as 'test_user' > It is throwing GSSException as follows > {code:java} > b2b460b80713|RUNNING: curl --silent -k -X GET -D > /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : > http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15 > 07:15:48,757|INFO|MainThread|machine.py:194 - > run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0 > 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - > getMetricsJsonData()|metrics: > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /proxy/application_1518674952153_0070/metrics/json. > Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {code} > Rootcausing : proxyserver on RM can't be supported for Kerberos enabled > cluster because AuthenticationFilter is applied twice in Hadoop code (once in > httpServer2 for RM, and another instance from AmFilterInitializer for proxy > server). This will require code changes to hadoop-yarn-server-web-proxy > project -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment
[ https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8108: Priority: Blocker (was: Major) > RM metrics rest API throws GSSException in kerberized environment > - > > Key: YARN-8108 > URL: https://issues.apache.org/jira/browse/YARN-8108 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.0.0 >Reporter: Kshitij Badani >Assignee: Eric Yang >Priority: Blocker > Attachments: YARN-8108.001.patch > > > Test is trying to pull up metrics data from SHS after kiniting as 'test_user' > It is throwing GSSException as follows > {code:java} > b2b460b80713|RUNNING: curl --silent -k -X GET -D > /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : > http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15 > 07:15:48,757|INFO|MainThread|machine.py:194 - > run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0 > 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - > getMetricsJsonData()|metrics: > > > > Error 403 GSSException: Failure unspecified at GSS-API level > (Mechanism level: Request is a replay (34)) > > HTTP ERROR 403 > Problem accessing /proxy/application_1518674952153_0070/metrics/json. > Reason: > GSSException: Failure unspecified at GSS-API level (Mechanism level: > Request is a replay (34)) > > > {code} > Rootcausing : proxyserver on RM can't be supported for Kerberos enabled > cluster because AuthenticationFilter is applied twice in Hadoop code (once in > httpServer2 for RM, and another instance from AmFilterInitializer for proxy > server). This will require code changes to hadoop-yarn-server-web-proxy > project -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472407#comment-16472407 ] Eric Yang commented on YARN-7654: - [~jlowe] Patch 23 includes all your suggestions. > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8274) Docker command error during container relaunch
[ https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472392#comment-16472392 ] Eric Yang commented on YARN-8274: - Sorry the code was missed during refactoring. +1 The change looks good. > Docker command error during container relaunch > -- > > Key: YARN-8274 > URL: https://issues.apache.org/jira/browse/YARN-8274 > Project: Hadoop YARN > Issue Type: Task >Reporter: Billie Rinaldi >Assignee: Jason Lowe >Priority: Critical > Attachments: YARN-8274.001.patch, YARN-8274.002.patch > > > I initiated container relaunch with a "sleep 60; exit 1" launch command and > saw a "not a docker command" error on relaunch. Haven't figured out why this > is happening, but it seems like it has been introduced recently to > trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger] > {noformat} > org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: > Relaunch container failed > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from > container-launch. > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: > container_1525897486447_0003_01_02 > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7 > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception > message: Relaunch container failed > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error > output: docker: 'container_1525897486447_0003_01_02' is not a docker > command. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7654: Attachment: YARN-7654.023.patch > Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch > > > Docker image may have ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we can detect existence of > {{launch_command}} and base on this variable launch docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8274) Docker command error during container relaunch
[ https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8274: Fix Version/s: 3.1.1 3.2.0 > Docker command error during container relaunch > -- > > Key: YARN-8274 > URL: https://issues.apache.org/jira/browse/YARN-8274 > Project: Hadoop YARN > Issue Type: Task >Reporter: Billie Rinaldi >Assignee: Jason Lowe >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8274.001.patch, YARN-8274.002.patch > > > I initiated container relaunch with a "sleep 60; exit 1" launch command and > saw a "not a docker command" error on relaunch. Haven't figured out why this > is happening, but it seems like it has been introduced recently to > trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger] > {noformat} > org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: > Relaunch container failed > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562) > at > org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111) > at > org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from > container-launch. > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: > container_1525897486447_0003_01_02 > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7 > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception > message: Relaunch container failed > 2018-05-09 21:41:46,631 INFO > org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error > output: docker: 'container_1525897486447_0003_01_02' is not a docker > command. > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472318#comment-16472318 ] Eric Yang commented on YARN-7654: - [~jlowe] Thanks for the reply. Some answers: {quote} In DockerProviderService#buildContainerLaunchContext it's calling processArtifact then super.buildContainerLaunchContext, but the parent's buildContainerLaunchContext calls processArtifact as well. Is the double-call intentional? {quote} Not intentional; this is fixed in patch 23. {quote}Note I'm not sure if we really need to rebuild tokensForSubtitution in DockerProviderService, I'm just preserving what the patch was doing. AFAICT the only difference between what the patch had DockerProviderService build for tokens and what AbstractProviderService builds is the latter is doing a pass adding ${env} forms of every env var to the map. If DockerProviderService is supposed to be doing that as well then it can just use the tokenProviderService arg directly rather than building it from scratch.{quote} I was able to make the refactoring happen this morning with a clear head. This is more readable without the repetition in patch 23. 
> Support ENTRY_POINT for docker container > > > Key: YARN-7654 > URL: https://issues.apache.org/jira/browse/YARN-7654 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn >Affects Versions: 3.1.0 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-7654.001.patch, YARN-7654.002.patch, > YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, > YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, > YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, > YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, > YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, > YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, > YARN-7654.021.patch, YARN-7654.022.patch > > > A Docker image may have an ENTRY_POINT predefined, but this is not supported in > the current implementation. It would be nice if we could detect the existence of > {{launch_command}} and, based on this variable, launch the docker container in > different ways: > h3. Launch command exists > {code} > docker run [image]:[version] > docker exec [container_id] [launch_command] > {code} > h3. Use ENTRY_POINT > {code} > docker run [image]:[version] > {code}
[jira] [Commented] (YARN-8274) Docker command error during container relaunch
[ https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472771#comment-16472771 ] Eric Yang commented on YARN-8274: - [~ebadger] Your earnest advocacy has not gone unheard. I am sorry that I introduced bugs during the rebase; there is no excuse for making mistakes while the patch is snowballing. It won't happen again. [~jlowe] Nit: it would be nice if the code were refactored to add docker_binary in construct_docker_command, to avoid the duplicated add_to_args call for docker_binary in every get_docker_*_command function, but the priority is to reach a good stable state for the release. I am sorry that I committed this prematurely without listening to my inner voice.
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472778#comment-16472778 ] Eric Yang commented on YARN-7654: - [~jlowe] All 5 scenarios passed in my local Kerberos-enabled cluster tests.
[jira] [Commented] (YARN-8274) Docker command error during container relaunch
[ https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472780#comment-16472780 ] Eric Yang commented on YARN-8274: - [~jlowe] Thank you for all your efforts. It is greatly appreciated.
[jira] [Commented] (YARN-8274) Docker command error during container relaunch
[ https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472548#comment-16472548 ] Eric Yang commented on YARN-8274: - [~ebadger] Sorry, my mistake; I thought the report was for the second patch. With the 3.1.1 code freeze on Saturday, it is easy to make mistakes, and I would like to get YARN-7654 committed before the end of today. YARN-7654 and YARN-8207 have probably been left uncommitted for too long, and it is easy to make rebase mistakes in changes that include logic from other patches, including YARN-7973, YARN-8209, YARN-8261, and YARN-8064. I recommend going through YARN-7654 to make sure the rebase was done correctly for those patches.
[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7654: Attachment: YARN-7654.024.patch
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472661#comment-16472661 ] Eric Yang commented on YARN-7654: - [~jlowe] Patch 24 fixes the issues above. I still need time to test all 5 scenarios to make sure the command doesn't get pre-processed by mistake. The 5 scenarios are: # Mapreduce # LLAP app # Docker app with command override # Docker app with entry point # Docker app with entry point and no launch command
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471073#comment-16471073 ] Eric Yang commented on YARN-7654: - [~jlowe] I am struggling with the following problems: {quote}AbstractProviderService#buildContainerLaunchContext so the pieces needed by DockerProviderService can be reused without requiring the launcher command to be clobbered afterwards?{quote} The launch command is overridden to bash -c 'launch-command' in DockerLinuxContainerRuntime, the log redirection '2> /stderr.txt 1> /stdout.txt' is subsequently appended, and the placeholder is then replaced with the actual container log directory. The number of preprocessing steps before the command is written to the .cmd file complicates refactoring the code base without breaking things. This is the reason setCommand was created: to flush out the overridden commands and ensure the command is not tampered with during the hand-off from DockerLinuxContainerRuntime to DockerClient to container-executor. For safety, I keep setCommand so that the command is not altered by string substitutions and the YARN v2 API is not broken. {quote}The instance checking and downcasting in writeCommandToTempFile looks pretty ugly. It would be cleaner to encapsulate this in the DockerCommand abstraction. One example way to do this is to move the logic of writing a docker command file into the DockerCommand abstract class. DockerRunCommand can then override that method to call the parent method and then separately write the env file. Worst case we can add a getEnv method to DockerCommand that returns the collection of environment variables to write out for a command. DockerCommand would return null or an empty collection while DockerRunCommand can return its environment.{quote} DockerCommand is a data-structure class; it does not handle I/O. If we move I/O operations into it, it would no longer be a clean data structure representing the docker command. I think it is more self-explanatory that for DockerRunCommand we also write out the environment file. With the changes in YARN-8261, we want to ensure that the directory is created, then create the cmd file and the env file. For safety, I think we should not make stylistic changes in this area at this time, because we are out of time to thoroughly retest what was tested in the previous patch set.
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: YARN-8207.006.patch > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, > YARN-8207.006.patch > > > The container-executor code uses a string buffer to construct the docker run > command and passes that buffer to popen for execution. popen spawns a > shell to run the command, so some arguments to docker run are still vulnerable > to shell expansion. A possible solution is to convert the char * buffer > to a string array and use execv, which avoids shell expansion.
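The YARN-8207 description proposes converting the popen-based launch to an execv argument vector. As an illustration only (not the actual container-executor implementation), a minimal fork + execv launcher that avoids spawning a shell might look like this:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Sketch only, not the container-executor code: run a command via
 * fork + execv. The argv vector is handed to the kernel verbatim, so
 * no shell is involved and no shell expansion can occur. Returns the
 * child's exit code, or -1 on failure. */
int run_no_shell(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) {
        return -1;                /* fork failed */
    }
    if (pid == 0) {
        execv(argv[0], argv);     /* execv only returns on failure */
        _exit(127);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)) {
        return -1;
    }
    return WEXITSTATUS(status);
}
```

With popen, an argument such as `$(touch /tmp/pwned)` would be expanded by the shell before docker ever sees it; passed through execv, it reaches the docker binary as a literal string.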
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: (was: YARN-8207.006.patch)
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467673#comment-16467673 ] Eric Yang commented on YARN-8207: - [~jlowe] {quote}At a bare minimum there should be a utility method, e.g: extract_execv_args(args* args){quote} I agree with this point and will do it. {quote}Please create an init function, e.g.: init_args(args* args), or a macro to encapsulate initialization of the structure.{quote} init_args would only assign 0 to the length. I prefer to write: {code} struct args buffer = { 0 }; {code} instead of: {code} struct args *buffer = malloc(sizeof(args)); init_args(buffer); {code} I understand the desire for code perfection, but I am trying to restrain myself from making more of a mess during crunch time. {quote}As add_to_args works today, the lack of a NULL check on the make_string result will cause the program to crash. {quote} Sorry, I thought I had a null check, but it was changed to a length check. This will be fixed. {quote}would be safer and easier to understand written with strdup/strndup, e.g.: {code} dst = strndup(values[i], tmp_ptr - values[i]); pattern = strdup(permitted_values[j] + 6); {code} {quote} This will be optimized. {quote}make_string is still not checking for vsnprintf failure. If the first vsnprintf fails and returns -1, the code will allocate a 0-byte buffer.{quote} No, it doesn't: malloc(-1) returns NULL instead of a 0-byte buffer, so the second check will not succeed.
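The vsnprintf point discussed above can be made concrete. The following is a hedged sketch of the pattern under discussion (the real make_string in container-executor may be written differently): probe the required length first, and bail out before allocating if formatting fails.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative make_string: format into a freshly allocated buffer.
 * The first vsnprintf call only measures the required length; a
 * negative return means formatting failed, so we return NULL instead
 * of calling malloc with a bogus size. The caller frees the result. */
char *make_string(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int needed = vsnprintf(NULL, 0, fmt, ap);   /* measure only */
    va_end(ap);
    if (needed < 0) {
        return NULL;                            /* formatting error */
    }
    char *buf = malloc((size_t)needed + 1);
    if (buf == NULL) {
        return NULL;                            /* allocation error */
    }
    va_start(ap, fmt);
    vsnprintf(buf, (size_t)needed + 1, fmt, ap);
    va_end(ap);
    return buf;
}
```

A caller that checks the result for NULL, as suggested in the review, never dereferences a failed allocation.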
[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-7654: Attachment: YARN-7654.020.patch
[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container
[ https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466258#comment-16466258 ] Eric Yang commented on YARN-7654: - Rebased patch 20 on top of YARN-8207 patch 007.
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467727#comment-16467727 ] Eric Yang commented on YARN-8207: - [~jlowe] Patch 9 fixes most of the nits from your comments, except init_args. I did not write init_args, to keep myself from making a mess. If you have strong feelings about the initialization, please open a separate issue for it. Thanks
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467762#comment-16467762 ] Eric Yang commented on YARN-8207: - [~jlowe] I see what you mean now, and patch 10 is updated accordingly for the args initialization and make_string check. One concern about the shallow copy: the struct args buffer is supposed to disappear after construct_docker_command. This was the reason I used a deep copy to extract the data. Now I am retaining pointer references to strings internal to the struct args buffer instead of a deep copy. Wouldn't those strings get overwritten at some point, or will they be retained until the copy is freed? > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, > YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, > YARN-8207.009.patch > > > Container-executor code utilizes a string buffer to construct the docker run > command, and passes the string buffer to popen for execution. Popen spawns a > shell to run the command. Some arguments for docker run are still vulnerable > to shell expansion. The possible solution is to convert from a char * buffer > to a string array for execv to avoid shell expansion.
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: YARN-8207.009.patch > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, > YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, > YARN-8207.009.patch > > > Container-executor code utilize a string buffer to construct docker run > command, and pass the string buffer to popen for execution. Popen spawn a > shell to run the command. Some arguments for docker run are still vulnerable > to shell expansion. The possible solution is to convert from char * buffer > to string array for execv to avoid shell expansion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: YARN-8207.010.patch > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Blocker > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, > YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, > YARN-8207.009.patch, YARN-8207.010.patch > > > Container-executor code utilize a string buffer to construct docker run > command, and pass the string buffer to popen for execution. Popen spawn a > shell to run the command. Some arguments for docker run are still vulnerable > to shell expansion. The possible solution is to convert from char * buffer > to string array for execv to avoid shell expansion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8207: Attachment: YARN-8207.006.patch > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, > YARN-8207.006.patch > > > Container-executor code utilize a string buffer to construct docker run > command, and pass the string buffer to popen for execution. Popen spawn a > shell to run the command. Some arguments for docker run are still vulnerable > to shell expansion. The possible solution is to convert from char * buffer > to string array for execv to avoid shell expansion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion
[ https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464873#comment-16464873 ] Eric Yang commented on YARN-8207: - [~jlowe] Patch 006 contains all style fixes from your recommendations. > Docker container launch use popen have risk of shell expansion > -- > > Key: YARN-8207 > URL: https://issues.apache.org/jira/browse/YARN-8207 > Project: Hadoop YARN > Issue Type: Sub-task > Components: yarn-native-services >Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2 >Reporter: Eric Yang >Assignee: Eric Yang >Priority: Major > Labels: Docker > Attachments: YARN-8207.001.patch, YARN-8207.002.patch, > YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, > YARN-8207.006.patch > > > Container-executor code utilize a string buffer to construct docker run > command, and pass the string buffer to popen for execution. Popen spawn a > shell to run the command. Some arguments for docker run are still vulnerable > to shell expansion. The possible solution is to convert from char * buffer > to string array for execv to avoid shell expansion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8079) Support static and archive unmodified local resources in service AM
[ https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482076#comment-16482076 ] Eric Yang commented on YARN-8079: - [~leftnoteasy] When the files are placed in the resources directory, the patch 10 implementation prevents mistakenly overwriting system-generated files, such as the .token file and launch_container.sh. However, this design can create inconvenience for some users because existing Hadoop workloads may already be using the top-level localized directory instead of the resources directory. We may not need to worry about launch_container.sh getting overwritten because container-executor generates the file after static files are localized. Apps will avoid overwriting .token files because they cannot contact HDFS from containers if the token files are overwritten. In summary, from my point of view it is likely safe to remove the "resources" directory requirement. > Support static and archive unmodified local resources in service AM > --- > > Key: YARN-8079 > URL: https://issues.apache.org/jira/browse/YARN-8079 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Suma Shivaprasad >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8079.001.patch, YARN-8079.002.patch, > YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, > YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, > YARN-8079.009.patch, YARN-8079.010.patch > > > Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly > read srcFile; instead it always constructs {{remoteFile}} by using > componentDir and the fileName of {{destFile}}: > {code} > Path remoteFile = new Path(compInstanceDir, fileName); > {code} > To me it is a common use case in which services have files that already exist in HDFS > and need to be localized when components get launched. (For example, if we > want to serve a Tensorflow model, we need to localize the Tensorflow model > (typically not huge, less than a GB) to local disk. Otherwise the launched docker > container has to access HDFS.)
[jira] [Comment Edited] (YARN-8079) Support static and archive unmodified local resources in service AM
[ https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482076#comment-16482076 ] Eric Yang edited comment on YARN-8079 at 5/21/18 12:20 AM: --- [~leftnoteasy] When the files are placed in the resources directory, the patch 10 implementation prevents mistakenly overwriting system-generated files, such as the .token file and launch_container.sh. However, this design can create inconvenience for some users because existing Hadoop workloads may already be using the top-level localized directory instead of the resources directory. We may not need to worry about launch_container.sh getting overwritten because container-executor generates the file after static files are localized. Apps will avoid overwriting .token files because they cannot contact HDFS from containers if the token files are overwritten. With the resources directory, it may be easier for the end user to specify a single relative directory to bind-mount instead of specifying individual files to bind-mount in the yarnfile. By removing the resources directory, the user will need to think a bit more about how to manage the bind-mount directories to reduce wordy syntax. With both approaches considered, it all comes down to which approach is easiest to use while not creating too much clutter. In summary, from my point of view it might be safe to remove the "resources" directory requirement. was (Author: eyang): [~leftnoteasy] When the files are placed in resources directory, patch 10 implementation prevents mistake to overwrite system level generated files, such as .token file, and launch_container.sh. However, this design can created inconvenience for some users because existing Hadoop workload may already be using the top level localized directory instead of resource directory. We may not need to worry about launch_container.sh getting overwritten because container-executor generates the file after static files are localized. Apps will try to avoid .token files because they can not contact HDFS from containers, if they overwrites the token files. In summary, it is likely safe to remove the requirement of "resources" directory from my point of view. > Support static and archive unmodified local resources in service AM > --- > > Key: YARN-8079 > URL: https://issues.apache.org/jira/browse/YARN-8079 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Wangda Tan >Assignee: Suma Shivaprasad >Priority: Critical > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8079.001.patch, YARN-8079.002.patch, > YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, > YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, > YARN-8079.009.patch, YARN-8079.010.patch > > > Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly > read srcFile; instead it always constructs {{remoteFile}} by using > componentDir and the fileName of {{destFile}}: > {code} > Path remoteFile = new Path(compInstanceDir, fileName); > {code} > To me it is a common use case in which services have files that already exist in HDFS > and need to be localized when components get launched. (For example, if we > want to serve a Tensorflow model, we need to localize the Tensorflow model > (typically not huge, less than a GB) to local disk. Otherwise the launched docker > container has to access HDFS.)
[jira] [Updated] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart
[ https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8290: Attachment: YARN-8290.002.patch > Yarn application failed to recover with "Error Launching job : User is not > set in the application report" error after RM restart > > > Key: YARN-8290 > URL: https://issues.apache.org/jira/browse/YARN-8290 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Eric Yang >Priority: Major > Attachments: YARN-8290.001.patch, YARN-8290.002.patch > > > Scenario: > 1) Start 5 streaming application in background > 2) Kill Active RM and cause RM failover > After RM failover, The application failed with below error. > {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception on [rm2] : > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1517520038847_0003' doesn't exist in RM. Please check > that the job submission was successful. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) > , so propagating back to caller. > 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application > application_1517520038847_0003 > 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area > /user/hrt_qa/.staging/job_1517520038847_0003 > 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is > not set in the application report > Streaming Command Failed!{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart
[ https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479255#comment-16479255 ] Eric Yang commented on YARN-8290: - - Patch 002 Fixed white space. > Yarn application failed to recover with "Error Launching job : User is not > set in the application report" error after RM restart > > > Key: YARN-8290 > URL: https://issues.apache.org/jira/browse/YARN-8290 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Eric Yang >Priority: Major > Attachments: YARN-8290.001.patch, YARN-8290.002.patch > > > Scenario: > 1) Start 5 streaming application in background > 2) Kill Active RM and cause RM failover > After RM failover, The application failed with below error. > {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception on [rm2] : > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1517520038847_0003' doesn't exist in RM. Please check > that the job submission was successful. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) > , so propagating back to caller. > 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application > application_1517520038847_0003 > 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area > /user/hrt_qa/.staging/job_1517520038847_0003 > 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is > not set in the application report > Streaming Command Failed!{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Commented] (YARN-8080) YARN native service should support component restart policy
[ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477665#comment-16477665 ] Eric Yang commented on YARN-8080: - [~suma.shivaprasad] {quote} {quote} restart_policy=ON_FAILURE, and each component instance failed 3 times, and application goes into FINISHED state instead of FAILED state. Is this expected?{quote} Can you please explain which part of code you are referring to? Or was it found during testing?{quote} This was found during testing and review of the code. The decision-making process is based on {code} nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers() {code} If a user specifies 2 containers and purposely fails them, the first failed container triggers one retry, then the second container fails. The total failed count is 3 (first container failure + second container failure + the first container's failed retry), which is greater than the number of containers. This triggers the program to terminate and report FINISHED. This is almost working for restart_policy=NEVER, but it should report FAILED if the number of failed containers is greater than 50% of the total containers. For restart_policy=ON_FAILURE, we will want to compare the total succeeded containers to getNumberOfContainers, and otherwise continue to retry. This lets the measurement count toward success while retrying on a best-effort basis. For restart_policy=ALWAYS, shouldTerminate is always false. Checkstyle still reports indentation and unused-import problems. It would be good to automate the cleanup using IDE features. 
> YARN native service should support component restart policy > --- > > Key: YARN-8080 > URL: https://issues.apache.org/jira/browse/YARN-8080 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Suma Shivaprasad >Priority: Critical > Attachments: YARN-8080.001.patch, YARN-8080.002.patch, > YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, > YARN-8080.007.patch, YARN-8080.009.patch, YARN-8080.010.patch, > YARN-8080.011.patch, YARN-8080.012.patch, YARN-8080.013.patch, > YARN-8080.014.patch, YARN-8080.015.patch > > > The existing native service assumes the service is long-running and never > finishes. Containers will be restarted even if exit code == 0. > To support broader use cases, we need to allow the restart policy of a component > to be specified by users. Propose to have the following policies: > 1) Always: containers are always restarted by the framework regardless of container > exit status. This is existing/default behavior. > 2) Never: Do not restart containers in any case after a container finishes: To > support job-like workloads (for example a Tensorflow training job). If a task > exits with code == 0, we should not restart the task. This can be used by > services which are not restartable/recoverable. > 3) On-failure: Similar to above, only restart tasks with exitcode != 0. > Behaviors after a component *instance* finalizes (Succeeded or Failed when > restart_policy != ALWAYS): > 1) For single component, single instance: complete service. > 2) For single component, multiple instances: other running instances from the > same component won't be affected by the finalized component instance. Service > will be terminated once all instances finalize. > 3) For multiple components: Service will be terminated once all components > finalize.
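The termination rule argued for in the comment above can be sketched as a small decision function. This is a hedged illustration in C (the real logic lives in the Java service AM); the enum and function names are invented for the example, and the ON_FAILURE branch encodes the suggested fix: terminate on enough successes rather than on the retry-inflated nSucceeded + nFailed count.

```c
#include <assert.h>

/* Illustrative only: the actual implementation is Java inside the YARN
 * service AM; these names are invented for the sketch. */
enum restart_policy { RP_ALWAYS, RP_ON_FAILURE, RP_NEVER };

/* Decide whether a component should stop scheduling containers, per the
 * rules proposed in the comment:
 *  - ALWAYS: never terminate.
 *  - ON_FAILURE: terminate only once enough containers have succeeded;
 *    failures keep triggering retries, so the retry-inflated failure count
 *    must not count against the container quota.
 *  - NEVER: terminate once every requested container has finished,
 *    one way or the other. */
static int should_terminate(enum restart_policy p,
                            int n_succeeded, int n_failed, int n_containers) {
    switch (p) {
    case RP_ALWAYS:
        return 0;
    case RP_ON_FAILURE:
        return n_succeeded >= n_containers;
    case RP_NEVER:
    default:
        return n_succeeded + n_failed >= n_containers;
    }
}
```

With 2 requested containers and the failure sequence described in the comment (two failures plus one failed retry), the ON_FAILURE branch keeps retrying because no container has succeeded yet, instead of terminating the way the nSucceeded + nFailed check does.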
[jira] [Commented] (YARN-8293) In YARN Services UI, "User Name for service" should be completely removed in secure clusters
[ https://issues.apache.org/jira/browse/YARN-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477594#comment-16477594 ] Eric Yang commented on YARN-8293: - [~sunilg] The changes in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/app/templates/components/deploy-service.hbs will hide the username column from the displayed table. Does this imply that the user interface can only display jobs for the login user, and not all jobs from all users for the yarn admin? This seems to be a usability limitation for yarn admin users. We might need follow-up JIRAs to make sure that we can support the case where the yarn admin looks at all jobs from all users. Other than this nitpick, I think this patch is ready. > In YARN Services UI, "User Name for service" should be completely removed in > secure clusters > > > Key: YARN-8293 > URL: https://issues.apache.org/jira/browse/YARN-8293 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Sunil G >Assignee: Sunil G >Priority: Major > Attachments: YARN-8293.001.patch > > > "User Name for service" should be completely removed in secure clusters.
[jira] [Commented] (YARN-8293) In YARN Services UI, "User Name for service" should be completely removed in secure clusters
[ https://issues.apache.org/jira/browse/YARN-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476034#comment-16476034 ] Eric Yang commented on YARN-8293: - YARN services can have duplicate application names across users. If the user name field is removed, this will cause confusion for an administrator who is looking at all jobs from all users. > In YARN Services UI, "User Name for service" should be completely removed in > secure clusters > > > Key: YARN-8293 > URL: https://issues.apache.org/jira/browse/YARN-8293 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn-ui-v2 >Reporter: Sunil G >Assignee: Sunil G >Priority: Major > Attachments: YARN-8293.001.patch > > > "User Name for service" should be completely removed in secure clusters.
[jira] [Comment Edited] (YARN-8080) YARN native service should support component restart policy
[ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477665#comment-16477665 ] Eric Yang edited comment on YARN-8080 at 5/16/18 4:19 PM: -- Thank you for the patch, [~suma.shivaprasad]. {quote} Can you please explain which part of code you are referring to? Or was it found during testing?{quote} This was found during testing, and review of code. The decision making process is based on {code} nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers() {code} If a user specifies 2 containers, and purposely failed containers. The first failed container will trigger retries once. The second container failed. The total failed containers are 3 because first container failed + second container failed + first container retires failed, which is greater than number of containers. This triggers the program to terminate, and report FINISHED. This is almost working for restart_policy=NEVER, and it should report FAILED if number of failed containers is greater than 50% of total containers. For restart_policy=ON_FAILURE, we will want to compare the total succeed containers = getNumberOfContainers, otherwise continue to retry. This helps the measurement to count toward success and best effort to retry. For restart_policy=ALWAYS, shouldTerminate always = false. Checkstyle still reports indentation and unused import problems. It would be good to automate the clean up using IDE features. was (Author: eyang): [~suma.shivaprasad] {quote} {quote} restart_policy=ON_FAILURE, and each component instance failed 3 times, and application goes into FINISHED state instead of FAILED state. Is this expected?{quote} Can you please explain which part of code you are referring to? Or was it found during testing?{quote} This was found during testing, and review of code. 
The decision making process is based on {code} nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers() {code} If a user specifies 2 containers, and purposely failed containers. The first failed container will trigger retries once. The second container failed. The total failed containers are 3 because first container failed + second container failed + first container retires failed, which is greater than number of containers. This triggers the program to terminate, and report FINISHED. This is almost working for restart_policy=NEVER, and it should report FAILED if number of failed containers is greater than 50% of total containers. For restart_policy=ON_FAILURE, we will want to compare the total succeed containers = getNumberOfContainers, otherwise continue to retry. This helps the measurement to count toward success and best effort to retry. For restart_policy=ALWAYS, shouldTerminate always = false. Checkstyle still reports indentation and unused import problems. It would be good to automate the clean up using IDE features. > YARN native service should support component restart policy > --- > > Key: YARN-8080 > URL: https://issues.apache.org/jira/browse/YARN-8080 > Project: Hadoop YARN > Issue Type: Task >Reporter: Wangda Tan >Assignee: Suma Shivaprasad >Priority: Critical > Attachments: YARN-8080.001.patch, YARN-8080.002.patch, > YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, > YARN-8080.007.patch, YARN-8080.009.patch, YARN-8080.010.patch, > YARN-8080.011.patch, YARN-8080.012.patch, YARN-8080.013.patch, > YARN-8080.014.patch, YARN-8080.015.patch > > > Existing native service assumes the service is long running and never > finishes. Containers will be restarted even if exit code == 0. > To support boarder use cases, we need to allow restart policy of component > specified by users. Propose to have following policies: > 1) Always: containers always restarted by framework regardless of container > exit status. 
This is existing/default behavior. > 2) Never: Do not restart containers in any cases after container finishes: To > support job-like workload (for example Tensorflow training job). If a task > exit with code == 0, we should not restart the task. This can be used by > services which is not restart/recovery-able. > 3) On-failure: Similar to above, only restart task with exitcode != 0. > Behaviors after component *instance* finalize (Succeeded or Failed when > restart_policy != ALWAYS): > 1) For single component, single instance: complete service. > 2) For single component, multiple instance: other running instances from the > same component won't be affected by the finalized component instance. Service > will be terminated once all instances finalized. > 3) For multiple components: Service will be terminated once all components > finalized. -- This message was sent by
[jira] [Commented] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder
[ https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477697#comment-16477697 ] Eric Yang commented on YARN-8300: - [~giovanni.fumarola] Patch 003 looks good to me. I can help with the commit. > Fix NPE in DefaultUpgradeComponentsFinder > -- > > Key: YARN-8300 > URL: https://issues.apache.org/jira/browse/YARN-8300 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Fix For: 3.1.1 > > Attachments: YARN-8300.1.patch, YARN-8300.2.patch, YARN-8300.3.patch > > > In current upgrades for Yarn native services, we do not support > addition/deletion of components during upgrade. On trying to upgrade with the > same number of components in target spec as the current service spec but with > one of the components having a new target spec and name, we see the > following NPE in the service AM logs > {noformat} > 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR > service.ClientAMService - Error while trying to upgrade service {} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103) > at java.util.ArrayList.forEach(ArrayList.java:1257) > at > org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100) > at > org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259) > at > org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163) > at > org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81) > at > org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972) > at > 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
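As context for the patch discussion above, here is a minimal Java sketch of the failure mode and the kind of null guard that resolves it. All class and method names are illustrative stand-ins, not the actual org.apache.hadoop.yarn.service sources:

```java
import java.util.List;

// Illustrative sketch only: Component stands in for the service spec types.
public class UpgradeSketch {
    public static class Component {
        public final String name;
        public final String spec;
        public Component(String name, String spec) { this.name = name; this.spec = spec; }
    }

    // Returns null when no current component has the given name, e.g. when the
    // target spec renamed a component -- the situation the bug report describes.
    public static Component findByName(List<Component> components, String name) {
        for (Component c : components) {
            if (c.name.equals(name)) {
                return c;
            }
        }
        return null;
    }

    public static void findTargetComponents(List<Component> target, List<Component> current) {
        target.forEach(t -> {
            Component c = findByName(current, t.name);
            if (c == null) {
                // Guarding the lookup turns the NPE into a clear, actionable error.
                throw new IllegalArgumentException(
                        "Addition/deletion/rename of components is not supported: " + t.name);
            }
            // ... otherwise compare t.spec with c.spec to decide what to upgrade
        });
    }
}
```

The actual fix in the attached patches may differ; the point is that a by-name lookup returns null when a component was renamed, and the lambda must handle that case instead of dereferencing the result.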
[jira] [Commented] (YARN-7960) Add no-new-privileges flag to docker run
[ https://issues.apache.org/jira/browse/YARN-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477861#comment-16477861 ] Eric Yang commented on YARN-7960: - [~ebadger] You are right. Selinux presence is not a good indicator of whether the option should be enabled. no-new-privileges can work with selinux on CentOS 7.5 and newer. A config knob for this feature is the better choice. > Add no-new-privileges flag to docker run > > > Key: YARN-7960 > URL: https://issues.apache.org/jira/browse/YARN-7960 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-7960.001.patch > > > Minimally, this should be used for unprivileged containers. It's a cheap way > to add an extra layer of security to the docker model. For privileged > containers, it might be appropriate to omit this flag > https://github.com/moby/moby/pull/20727
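The config-knob approach discussed above could look roughly like this Java sketch. The property key and helper class are hypothetical illustrations, not actual YARN configuration names:

```java
import java.util.Properties;

// Sketch of the "config knob" idea: gate the docker flag on a boolean property.
// The property key below is hypothetical, not an actual YARN key.
public class NoNewPrivilegesConfig {
    public static final String KEY =
            "yarn.nodemanager.runtime.linux.docker.no-new-privileges.enabled";

    // Returns the docker security option to append, or "" when it should be omitted.
    public static String dockerSecurityOpt(Properties conf, boolean privileged) {
        boolean enabled = Boolean.parseBoolean(conf.getProperty(KEY, "false"));
        // Privileged containers omit the flag, per the issue description.
        return (enabled && !privileged) ? "--security-opt=no-new-privileges" : "";
    }
}
```

With a knob like this, administrators who rely on selinux auditing can leave the flag off, while everyone else can opt in cluster-wide.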
[jira] [Updated] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder
[ https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8300: Affects Version/s: 3.1.1 Target Version/s: 3.2.0, 3.1.1 Fix Version/s: 3.2.0 Description: In current upgrades for Yarn native services, we do not support addition/deletion of components during upgrade. On trying to upgrade with the same number of components in target spec as the current service spec, but with one of the components having a new target spec and name, we see the following NPE in service AM logs {noformat} 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR service.ClientAMService - Error while trying to upgrade service {} java.lang.NullPointerException at org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103) at java.util.ArrayList.forEach(ArrayList.java:1257) at org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100) at org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259) at org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163) at org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81) at org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) {noformat} was: In current upgrades for Yarn native services, we do not support addition/deletion of compoents during upgrade. On trying to upgrade with the same number of components in target spec as the current service spec but with the one of the components having a new target spec and name, see the following NPE in service AM logs {noformat} 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR service.ClientAMService - Error while trying to upgrade service {} java.lang.NullPointerException at org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103) at java.util.ArrayList.forEach(ArrayList.java:1257) at org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100) at org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259) at org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163) at org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81) at org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) 
{noformat} > Fix NPE in DefaultUpgradeComponentsFinder > -- > > Key: YARN-8300 > URL: https://issues.apache.org/jira/browse/YARN-8300 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.1 >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8300.1.patch, YARN-8300.2.patch, YARN-8300.3.patch > > > In current upgrades for Yarn native services, we do not support > addition/deletion of components during upgrade. On trying to upgrade with the > same number of components in target
[jira] [Comment Edited] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder
[ https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477708#comment-16477708 ] Eric Yang edited comment on YARN-8300 at 5/16/18 4:44 PM: -- Thank you [~suma.shivaprasad] for the patch. Thank you [~giovanni.fumarola] for the review. +1 I committed this to branch 3.1 and trunk. was (Author: eyang): Thank you [~suma.shivaprasad] for the patch. Thank you [~giovanni.fumarola] for the review. I committed this to branch 3.1 and trunk. > Fix NPE in DefaultUpgradeComponentsFinder > -- > > Key: YARN-8300 > URL: https://issues.apache.org/jira/browse/YARN-8300 > Project: Hadoop YARN > Issue Type: Bug > Components: yarn >Affects Versions: 3.1.1 >Reporter: Suma Shivaprasad >Assignee: Suma Shivaprasad >Priority: Major > Fix For: 3.2.0, 3.1.1 > > Attachments: YARN-8300.1.patch, YARN-8300.2.patch, YARN-8300.3.patch > > > In current upgrades for Yarn native services, we do not support > addition/deletion of components during upgrade. On trying to upgrade with the > same number of components in target spec as the current service spec, but with > one of the components having a new target spec and name, we see the > following NPE in service AM logs > {noformat} > 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR > service.ClientAMService - Error while trying to upgrade service {} > java.lang.NullPointerException > at > org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103) > at java.util.ArrayList.forEach(ArrayList.java:1257) > at > org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100) > at > org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259) > at > org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163) > at > 
org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81) > at > org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678) > {noformat}
[jira] [Commented] (YARN-7960) Add no-new-privileges flag to docker run
[ https://issues.apache.org/jira/browse/YARN-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476416#comment-16476416 ] Eric Yang commented on YARN-7960: - [~ebadger] Can we run sestatus to check instead of depending on config values? If sestatus is not found, then the no-new-privileges option is enabled. As you said, selinux auditing is the exception. I am OK with this option being enabled by default in the absence of selinux. This can prevent configuration mistakes made by system administrators. > Add no-new-privileges flag to docker run > > > Key: YARN-7960 > URL: https://issues.apache.org/jira/browse/YARN-7960 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-7960.001.patch > > > Minimally, this should be used for unprivileged containers. It's a cheap way > to add an extra layer of security to the docker model. For privileged > containers, it might be appropriate to omit this flag > https://github.com/moby/moby/pull/20727
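The sestatus probe suggested above could be sketched as follows. This is only an illustration of the detection idea, not the actual container-executor implementation, and the class name is hypothetical:

```java
import java.io.IOException;

// Illustrative sketch: probe for the sestatus binary and emit
// --security-opt=no-new-privileges only when selinux tooling is absent.
public class NoNewPrivilegesProbe {
    public static boolean selinuxToolingPresent() {
        try {
            Process p = new ProcessBuilder("sestatus").start();
            p.waitFor(); // exit code is irrelevant; being able to run it means the tooling exists
            return true;
        } catch (IOException e) {
            return false; // command not found: assume no selinux tooling
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        if (!selinuxToolingPresent()) {
            System.out.println("--security-opt=no-new-privileges");
        }
    }
}
```

As the follow-up comments note, binary presence alone is a weak signal (selinux may be installed but disabled, or auditing may work on newer releases), which is why the thread ultimately favors an explicit config knob.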
[jira] [Commented] (YARN-7960) Add no-new-privileges flag to docker run
[ https://issues.apache.org/jira/browse/YARN-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476330#comment-16476330 ] Eric Yang commented on YARN-7960: - [~ebadger] The no-new-privileges option will block [selinux auditing|https://github.com/projectatomic/container-selinux/issues/51]. This feature would prevent enterprise customers from auditing security inside the container. Some effort has been put in place to ensure selinux auditing is unblocked for CentOS 7.5 and newer. It might be a good idea to check whether the Hadoop cluster has selinux enforced before this option is appended to non-privileged containers. > Add no-new-privileges flag to docker run > > > Key: YARN-7960 > URL: https://issues.apache.org/jira/browse/YARN-7960 > Project: Hadoop YARN > Issue Type: Sub-task >Reporter: Eric Badger >Assignee: Eric Badger >Priority: Major > Labels: Docker > Attachments: YARN-7960.001.patch > > > Minimally, this should be used for unprivileged containers. It's a cheap way > to add an extra layer of security to the docker model. For privileged > containers, it might be appropriate to omit this flag > https://github.com/moby/moby/pull/20727
[jira] [Updated] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart
[ https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang updated YARN-8290: Attachment: YARN-8290.001.patch > Yarn application failed to recover with "Error Launching job : User is not > set in the application report" error after RM restart > > > Key: YARN-8290 > URL: https://issues.apache.org/jira/browse/YARN-8290 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Priority: Major > Attachments: YARN-8290.001.patch > > > Scenario: > 1) Start 5 streaming applications in the background > 2) Kill the Active RM and cause RM failover > After RM failover, the application failed with the below error. > {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception on [rm2] : > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1517520038847_0003' doesn't exist in RM. Please check > that the job submission was successful. > at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) > , so 
propagating back to caller. > 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application > application_1517520038847_0003 > 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area > /user/hrt_qa/.staging/job_1517520038847_0003 > 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is > not set in the application report > Streaming Command Failed!{code}
[jira] [Assigned] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart
[ https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Yang reassigned YARN-8290: --- Assignee: Eric Yang Affects Version/s: 3.1.1 [~leftnoteasy] As you suggested, the ACL information is set too late, and killing the AM before the ACL information is propagated can cause RM recovery to load a partial application record. The suggested change is to move the ACL setup into ApplicationToSchedulerTransition. The patch moves the block of code accordingly. Let me know if this is the correct fix. Thanks > Yarn application failed to recover with "Error Launching job : User is not > set in the application report" error after RM restart > > > Key: YARN-8290 > URL: https://issues.apache.org/jira/browse/YARN-8290 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 3.1.1 >Reporter: Yesha Vora >Assignee: Eric Yang >Priority: Major > Attachments: YARN-8290.001.patch > > > Scenario: > 1) Start 5 streaming applications in the background > 2) Kill the Active RM and cause RM failover > After RM failover, the application failed with the below error. > {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: > Invocation returned exception on [rm2] : > org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application > with id 'application_1517520038847_0003' doesn't exist in RM. Please check > that the job submission was successful. 
> at > org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338) > at > org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175) > at > org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) > , so propagating back to caller. > 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application > application_1517520038847_0003 > 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area > /user/hrt_qa/.staging/job_1517520038847_0003 > 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is > not set in the application report > Streaming Command Failed!{code}