[jira] [Resolved] (YARN-7218) ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2

2017-10-31 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang resolved YARN-7218.
-
Resolution: Won't Fix

It looks like v1 of the YARN REST API is still evolving.  The namespace used by 
services is independent of other paths, hence the incompatibility concern is a 
non-issue at this time.  We can close this, as the proposed fix is not needed.

> ApiServer REST API naming convention /ws/v1 is already used in Hadoop v2
> 
>
> Key: YARN-7218
> URL: https://issues.apache.org/jira/browse/YARN-7218
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: api, applications
>Reporter: Eric Yang
>Assignee: Eric Yang
>
> In YARN-6626, there is a desire to be able to run the ApiServer REST API in the 
> Resource Manager, which would eliminate the requirement to deploy another daemon 
> service for submitting Docker applications.  In YARN-5698, a new UI has been 
> implemented as a separate web application.  There are some problems in this 
> arrangement that can cause conflicts in how Java sessions are managed.  
> The root context of the Resource Manager web application is /ws.  This is 
> hard-coded in the startWebapp method in ResourceManager.java, which means all 
> session management is applied to web URLs under the /ws prefix.  /ui2 is 
> independent of the /ws context, therefore the session management code doesn't 
> apply to /ui2.  This could become a session management problem if servlet-based 
> code is introduced into the /ui2 web application.
> The ApiServer code base is designed as a separate web application.  There is no 
> easy way to inject a separate web application into the same /ws context 
> because the ResourceManager is already set up to bind to RMWebServices.  Unless 
> the ApiServer code is moved into RMWebServices, the two will not share the same 
> session management.
> The alternative is to keep the ApiServer URL prefix independent of the /ws 
> context.  However, this would be a departure from the YARN web services naming 
> convention.  ApiServer could then be loaded as a separate web application in the 
> Resource Manager Jetty server.  One possible proposal is /app/v1/services.  This 
> keeps the ApiServer code modular and independent of the Resource Manager.
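
To make the proposal concrete, here is a minimal JAX-RS sketch of a resource 
rooted at /app/v1/services, deployed as its own web application; the class and 
method names are hypothetical and not the actual ApiServer code.

{code:java}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical sketch: a resource bound under /app/v1/services, loaded as a
// separate web application so it does not share the /ws context (and its
// session management) with RMWebServices.
@Path("/app/v1/services")
public class ServicesResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public String listServices() {
    // Placeholder body; a real ApiServer would return service records here.
    return "[]";
  }
}
{code}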






[jira] [Commented] (YARN-6187) Auto-generate REST API resources and server side stubs from swagger definition

2017-10-31 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227785#comment-16227785
 ] 

Eric Yang commented on YARN-6187:
-

[~gsaha] Swagger is good for generating the initial classes to get development 
going.  Changes to the Swagger definition result in newly generated code with 
empty classes.  I don't see a way to continually update the Swagger YAML file 
while keeping the generated code in line with hand-written logic.  Do we still 
need this?

> Auto-generate REST API resources and server side stubs from swagger definition
> --
>
> Key: YARN-6187
> URL: https://issues.apache.org/jira/browse/YARN-6187
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Gour Saha
> Fix For: yarn-native-services
>
>
> Currently the REST API resource package is generated offline using the 
> swagger-codegen library, formatted with a basic Eclipse formatter, and then 
> checked in.  It is not entirely in line with YARN documentation and coding 
> guidelines.  We need to do the following to streamline this effort -
> # Auto-generate the resource package and the server side API interfaces/stubs 
> using the swagger-codegen libraries
> # Use a template framework like jmustache or similar (or better) to align/add 
> documentation and code formatting in line with YARN project standards
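
For context, a hedged sketch of the kind of server-side stub swagger-codegen 
typically emits for a path in the YAML definition; the class and method names 
below are illustrative only, not the checked-in resource package.

{code:java}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

// Illustrative only: swagger-codegen generally produces a resource class per
// path with empty method bodies that a developer must fill in. Regenerating
// after a YAML change produces fresh empty stubs, which is why keeping
// hand-written logic in sync is the concern raised in the comment above.
@Path("/services")
public class ServicesApi {

  @GET
  public Response getServices() {
    // Generated stub: no implementation until a developer adds one.
    return Response.ok().build();
  }
}
{code}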






[jira] [Commented] (YARN-6387) Provide a flag in Rest API GET response to notify if the app launch delay is due to docker image download.

2017-10-31 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16227789#comment-16227789
 ] 

Eric Yang commented on YARN-6387:
-

Do we still need this?  There is no description.  The current REST API only 
responds with the final result.  If the goal is to report progress during the 
REST API call that creates containers, then we probably need to add an extension 
to the REST API.  Each operation (create, start, stop, flex) could be referenced 
by an operation ID.  The front end could then invoke the REST API with the 
operation ID to inspect the current progress of the operation.  Without an 
operation-centric API, it is not possible to determine whether a container is 
still downloading its image or has already started and is running.
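
As a hedged illustration of the operation-centric idea; the endpoint path, class 
name, and state values below are hypothetical, not an existing YARN API.

{code:java}
import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical sketch: each long-running operation (create/start/stop/flex)
// returns an operation ID, and the front end polls this resource to learn
// whether containers are still pulling the docker image or already running.
@Path("/app/v1/services/{serviceName}/operations/{operationId}")
public class OperationStatusResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public String getStatus(@PathParam("serviceName") String serviceName,
                          @PathParam("operationId") String operationId) {
    // Placeholder: a real implementation would look up the tracked operation
    // and report a state such as IMAGE_DOWNLOADING or RUNNING.
    return "{\"operationId\":\"" + operationId
        + "\",\"state\":\"IMAGE_DOWNLOADING\"}";
  }
}
{code}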

> Provide a flag in Rest API GET response to notify if the app launch delay is 
> due to docker image download.
> --
>
> Key: YARN-6387
> URL: https://issues.apache.org/jira/browse/YARN-6387
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: sriharsha devineni
>







[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang edited comment on YARN-7197 at 10/30/17 6:18 PM:
---

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with {{/etc/group}} modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software (e.g. hdfs 
short-circuit read) place its socket under {{/run}} and share it between 
containers, due to company requirements.  However, he still doesn't want to let 
the hacker gain root access.

h3. Solution 1:
The system admin carefully places {{/var/*}}, {{/run/\*}} (except 
/run/docker.socket), and {{/mnt/hdfs/user/*}} (except yarn) in the read-write 
white list.  None of the symlinks are exposed.  The hacker cannot get in.

h3. Solution 2 (all symlinks and hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

h3. Solution 3 (replace black-listed locations with empty directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the depth of the directory tree might defeat the second solution.  If 
the community favors the third solution, I can revise the patch accordingly.
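
A minimal sketch of the symlink-resolution check behind Solution 2, assuming a 
hypothetical validator class (not the actual container-executor or 
DockerLinuxContainerRuntime code): the requested mount is canonicalized first, 
so a symlink such as /var/run cannot be used to reach a black-listed path.

{code:java}
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch of Solution 2: resolve symlinks before comparing the
// requested mount against the black list, so /var/run/docker.socket is
// rejected even though /var is on the read-write white list.
public class MountValidator {

  public static boolean isAllowed(String requested, List<String> blackList)
      throws IOException {
    // toRealPath() follows symlinks; /var/run resolves to /run on most distros.
    Path real = Paths.get(requested).toRealPath();
    for (String banned : blackList) {
      if (real.startsWith(Paths.get(banned))) {
        return false;  // the mount is, or lives under, a banned path
      }
    }
    return true;
  }
}
{code}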


was (Author: eyang):
[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on the real world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers to 
access vital system area to become root on the host system.  The system admin 
placed {{/var}} in read-write white list for ability to write to database and 
log directories, without black list capability.  Hacker explicitly specify 
{{/var/run/docker.socket}} to be included, put the socket in 
{{/tmp/docker.socket}}.  Hacker generates a docker image with {{/etc/group}} 
modified to include his own name or setuid bit binary in container.  Hack can 
successfully gain control to host level docker without much effort.

{{/run}} contains a growing list of software that put their pid file or socket 
in this location.  System admin can't say no to not allow other software (i.e. 
hdfs short circuit read) to place their socket in {{/run}} location and share 
between containers due to company requirement.  However, he still doesn't want 
to let hacker gain root access.

h3. Solution 1:
System admin placed {{/var/*}} and {{/run/\*}} (except /run/docker.socket), 
carefully in read-write white list.  None of the symlink is exposed.  Hacker 
can not get in.

h3. Solution 2 (All symlinks, and hardcoded locations are banned):
(Current proposed patch)
System admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (exception yarn)
  black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}}
Hacker attempt to mount a symlink location resulting in access denied from 
container startup, or 

[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang edited comment on YARN-7197 at 10/30/17 6:13 PM:
---

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with {{/etc/group}} modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software place its 
socket under {{/run}} and share it between containers, due to company 
requirements.  However, he still doesn't want to let the hacker gain root 
access.

h3. Solution 1:
The system admin carefully places {{/var/*}} and {{/run/\*}} (except 
/run/docker.socket) in the read-write white list.  None of the symlinks are 
exposed.  The hacker cannot get in.

h3. Solution 2 (all symlinks and hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

h3. Solution 3 (replace black-listed locations with empty directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the depth of the directory tree might defeat the second solution.  If 
the community favors the third solution, I can revise the patch accordingly.


was (Author: eyang):
[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on the real world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers to 
access vital system area to become root on the host system.  The system admin 
placed {{/var}} in read-write white list for ability to write to database and 
log directories, without black list capability.  Hacker explicitly specify 
{{/var/run/docker.socket}} to be included, put the socket in 
{{/tmp/docker.socket}}.  Hacker generates a docker image with {{/etc/group}} 
modified to include his own name or setuid bit binary in container.  Hack can 
successfully gain control to host level docker without much effort.

{{/run}} contains a growing list of software that put their pid file or socket 
in this location.  System admin can't say no to not allow other software to 
place their socket in {{/run}} location and share between containers due to 
company requirement.  However, he still doesn't want to let hacker gain root 
access.

Solution 1:
System admin placed {{/var/*}} and {{/run/\*}} (except /run/docker.socket), 
carefully in read-write white list.  None of the symlink is exposed.  Hacker 
can not get in.

Solution 2 (All symlinks are banned and explicit hardcoded locations):
(Current proposed patch)
System admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (exception yarn)
  black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}}
Hacker attempt to mount a symlink location resulting in access denied from 
container startup, or explicit hard coded location also result in ban.

Solution 3: (Ban symlink and replace black list 

[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang edited comment on YARN-7197 at 10/30/17 6:17 PM:
---

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with {{/etc/group}} modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software (e.g. hdfs 
short-circuit read) place its socket under {{/run}} and share it between 
containers, due to company requirements.  However, he still doesn't want to let 
the hacker gain root access.

h3. Solution 1:
The system admin carefully places {{/var/*}} and {{/run/\*}} (except 
/run/docker.socket) in the read-write white list.  None of the symlinks are 
exposed.  The hacker cannot get in.

h3. Solution 2 (all symlinks and hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

h3. Solution 3 (replace black-listed locations with empty directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the depth of the directory tree might defeat the second solution.  If 
the community favors the third solution, I can revise the patch accordingly.


was (Author: eyang):
[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on the real world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers to 
access vital system area to become root on the host system.  The system admin 
placed {{/var}} in read-write white list for ability to write to database and 
log directories, without black list capability.  Hacker explicitly specify 
{{/var/run/docker.socket}} to be included, put the socket in 
{{/tmp/docker.socket}}.  Hacker generates a docker image with {{/etc/group}} 
modified to include his own name or setuid bit binary in container.  Hack can 
successfully gain control to host level docker without much effort.

{{/run}} contains a growing list of software that put their pid file or socket 
in this location.  System admin can't say no to not allow other software to 
place their socket in {{/run}} location and share between containers due to 
company requirement.  However, he still doesn't want to let hacker gain root 
access.

h3. Solution 1:
System admin placed {{/var/*}} and {{/run/\*}} (except /run/docker.socket), 
carefully in read-write white list.  None of the symlink is exposed.  Hacker 
can not get in.

h3. Solution 2 (All symlinks, and hardcoded locations are banned):
(Current proposed patch)
System admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (exception yarn)
  black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}}
Hacker attempt to mount a symlink location resulting in access denied from 
container startup, or explicit hard coded location also result in ban.

h3. Solution 3: (Replace 

[jira] [Commented] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang commented on YARN-7197:
-

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
/run and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with /etc/group modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software place its 
socket under {{/run}} and share it between containers, due to company 
requirements.  However, he still doesn't want to let the hacker gain root 
access.

Solution 1:
The system admin carefully places {{/var/*}} and {{/run/*}} (except 
/run/docker.socket) in the read-write white list.  None of the symlinks are 
exposed.  The hacker cannot get in.

Solution 2 (all symlinks and explicit hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

Solution 3 (ban symlinks and replace black-listed locations with empty 
directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the more deeply nested the directories are, the harder they are to 
secure with the second solution.  If the community favors the third solution, I 
can revise the patch accordingly.

> Add support for a volume blacklist for docker containers
> 
>
> Key: YARN-7197
> URL: https://issues.apache.org/jira/browse/YARN-7197
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Reporter: Shane Kumpf
>Assignee: Eric Yang
> Attachments: YARN-7197.001.patch, YARN-7197.002.patch
>
>
> Docker supports bind mounting host directories into containers. Work is 
> underway to allow admins to configure a whitelist of volume mounts. While 
> this is a much needed and useful feature, it opens the door for 
> misconfiguration that may lead to users being able to compromise or crash the 
> system. 
> One example would be allowing users to mount /run from a host running 
> systemd, and then running systemd in that container, rendering the host 
> mostly unusable.
> This issue is to add support for a default blacklist. The default blacklist 
> would be where we put files and directories that if mounted into a container, 
> are likely to have negative consequences. Users are encouraged not to remove 
> items from the default blacklist, but may do so if necessary.






[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang edited comment on YARN-7197 at 10/30/17 6:19 PM:
---

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with {{/etc/group}} modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software (e.g. hdfs 
short-circuit read) place its socket under {{/run}} and share it between 
containers, due to company requirements.  However, he still doesn't want to let 
the hacker gain root access.

h3. Solution 1:
The system admin carefully places {{/var/*}}, {{/run/\*}} (except 
/run/docker.socket), and {{/mnt/hdfs/user/*}} (except yarn) in the read-write 
white list.  None of the symlinks are exposed.  The hacker cannot get in.

h3. Solution 2 (all symlinks and hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

h3. Solution 3 (replace black-listed locations with empty directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the depth of the directory tree might defeat the second solution.  If 
the community favors the third solution, I can revise the patch accordingly.


was (Author: eyang):
[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on the real world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers to 
access vital system area to become root on the host system.  The system admin 
placed {{/var}} in read-write white list for ability to write to database and 
log directories, without black list capability.  Hacker explicitly specify 
{{/var/run/docker.socket}} to be included, put the socket in 
{{/tmp/docker.socket}}.  Hacker generates a docker image with {{/etc/group}} 
modified to include his own name or setuid bit binary in container.  Hack can 
successfully gain control to host level docker without much effort.

{{/run}} contains a growing list of software that put their pid file or socket 
in this location.  System admin can't say no to not allow other software (i.e. 
hdfs short circuit read) to place their socket in {{/run}} location and share 
between containers due to company requirement.  However, he still doesn't want 
to let hacker gain root access.

h3. Solution 1:
System admin placed {{/var/*}}, {{/run/\*}} (except /run/docker.socket), and 
{{/mnt/hdfs/user/*}} (except yarn), carefully in read-write white list.  None 
of the symlink is exposed.  Hacker can not get in.

h3. Solution 2 (All symlinks, and hardcoded locations are banned):
(Current proposed patch)
System admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (exception yarn)
  black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}}
Hacker attempt to mount a symlink location resulting in access 

[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang edited comment on YARN-7197 at 10/30/17 6:09 PM:
---

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
{{/run}} and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with {{/etc/group}} modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software place its 
socket under {{/run}} and share it between containers, due to company 
requirements.  However, he still doesn't want to let the hacker gain root 
access.

Solution 1:
The system admin carefully places {{/var/*}} and {{/run/\*}} (except 
/run/docker.socket) in the read-write white list.  None of the symlinks are 
exposed.  The hacker cannot get in.

Solution 2 (all symlinks and explicit hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

Solution 3 (ban symlinks and replace black-listed locations with empty 
directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the depth of the directory tree will defeat the second solution.  If 
the community favors the third solution, I can revise the patch accordingly.


was (Author: eyang):
[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
/run and access the blacklist path.{quote}

Let's expand on the real world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers to 
access vital system area to become root on the host system.  The system admin 
placed {{/var}} in read-write white list for ability to write to database and 
log directories, without black list capability.  Hacker explicitly specify 
{{/var/run/docker.socket}} to be included, put the socket in 
{{/tmp/docker.socket}}.  Hacker generates a docker image with /etc/group 
modified to include his own name or setuid bit binary in container.  Hack can 
successfully gain control to host level docker without much effort.

{{/run}} contains a growing list of software that put their pid file or socket 
in this location.  System admin can't say no to not allow other software to 
place their socket in {{/run}} location and share between containers due to 
company requirement.  However, he still doesn't want to let hacker gain root 
access.

Solution 1:
System admin placed {{/var/*}} and {{/run/\*}} (except /run/docker.socket), 
carefully in read-write white list.  None of the symlink is exposed.  Hacker 
can not get in.

Solution 2 (All symlinks are banned and explicit hardcoded locations):
(Current proposed patch)
System admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (exception yarn)
  black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}}
Hacker attempt to mount a symlink location resulting in access denied from 
container startup, or explicit hard coded location also result in ban.

Solution 3: (Ban symlink and replace black list 

[jira] [Comment Edited] (YARN-7197) Add support for a volume blacklist for docker containers

2017-10-30 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16225452#comment-16225452
 ] 

Eric Yang edited comment on YARN-7197 at 10/30/17 6:08 PM:
---

[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
/run and access the blacklist path.{quote}

Let's expand on a real-world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers that 
access vital system areas, becoming root on the host system.  The system admin 
placed {{/var}} in the read-write white list for the ability to write to 
database and log directories, without any black list capability.  The hacker 
explicitly specifies {{/var/run/docker.socket}} as a mount, placing the socket 
at {{/tmp/docker.socket}} inside the container.  The hacker builds a docker 
image with /etc/group modified to include his own name, or with a setuid 
binary inside the container.  The hacker can gain control of the host-level 
docker daemon without much effort.

{{/run}} hosts a growing list of software that puts its pid file or socket in 
this location.  The system admin can't refuse to let other software place its 
socket under {{/run}} and share it between containers, due to company 
requirements.  However, he still doesn't want to let the hacker gain root 
access.

Solution 1:
The system admin carefully places {{/var/*}} and {{/run/\*}} (except 
/run/docker.socket) in the read-write white list.  None of the symlinks are 
exposed.  The hacker cannot get in.

Solution 2 (all symlinks and explicit hard-coded locations are banned):
(Current proposed patch)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run/\*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/\*}} (except yarn)
  black-list: {{/var/run}}, {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and an explicitly hard-coded location is likewise banned.

Solution 3 (ban symlinks and replace black-listed locations with empty 
directories):
(Jason's proposed implementation)
The system admin specifies:
  white-list-read-write: {{/var}}, {{/run}}, {{/mnt/hdfs/user}}
  black-list: {{/run/docker.socket}}, {{/mnt/hdfs/user/yarn}}
A hacker's attempt to mount a symlinked location results in access denied at 
container startup, and mounting /run/docker.socket explicitly yields only an 
empty file.

All solutions require the system administrator to control which images can be 
uploaded to the private registry, to prevent a Trojan horse in a docker image.
 
I can see the appeal of the new proposal: it avoids the high upkeep of the 
white-list-read-write directories.  The third solution can throw people off if 
they do not know that black-listed paths are redirected to empty locations.  
However, the depth of the directory tree will defeat the second solution.  If 
the community favors the third solution, I can revise the patch accordingly.


was (Author: eyang):
[~jlowe] 

{quote}Either /run isn't in the whitelist in the first place rendering the 
blacklist entry moot or /run is in the whitelist and the user can simply mount 
/run and access the blacklist path.{quote}

Let's expand on the real world example.  A hacker tries to take control of 
{{/run/docker.socket}} to acquire root privileges and spawn root containers to 
access vital system area to become root on the host system.  The system admin 
placed {{/var}} in read-write white list for ability to write to database and 
log directories, without black list capability.  Hacker explicitly specify 
{{/var/run/docker.socket}} to be included, put the socket in 
{{/tmp/docker.socket}}.  Hacker generates a docker image with /etc/group 
modified to include his own name or setuid bit binary in container.  Hack can 
successfully gain control to host level docker without much effort.

{{/run}} contains a growing list of software that put their pid file or socket 
in this location.  System admin can't say no to not allow other software to 
place their socket in {{/run}} location and share between containers due to 
company requirement.  However, he still doesn't want to let hacker gain root 
access.

Solution 1:
System admin placed {{/var/*}} and {{/run/*}} (except /run/docker.socket), 
carefully in read-write white list.  None of the symlink is exposed.  Hacker 
can not get in.

Solution 2 (All symlinks are banned and explicit hardcoded locations):
(Current proposed patch)
System admin specifies:
  white-list-read-write: {{/var}}, {{/run/*}} (except /run/docker.socket), 
{{/mnt/hdfs/user/*}} (exception yarn)
  black-list: {{/var/run}},{{/run/docker.socket}},{{/mnt/hdfs/user/yarn}}
Hacker attempt to mount a symlink location resulting in access denied from 
container startup, or explicit hard coded location also result in ban.

Solution 3: (Ban symlink and replace black list location with 

[jira] [Commented] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298837#comment-16298837
 ] 

Eric Yang commented on YARN-7565:
-

More information revealed that there was a problem with a znode on my cluster.  
I am not sure how it reached that state.  After removing the faulty znode for 
the DNS registry, the null pointer exception no longer occurs.

> Yarn service pre-maturely releases the container after AM restart 
> --
>
> Key: YARN-7565
> URL: https://issues.apache.org/jira/browse/YARN-7565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
> Fix For: 3.1.0
>
> Attachments: YARN-7565.001.patch, YARN-7565.002.patch, 
> YARN-7565.003.patch, YARN-7565.004.patch, YARN-7565.005.patch, 
> YARN-7565.addendum.001.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master immediately releases containers that are not 
> reported in the AM registration response.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by the RM. These containers are sent to the AM in 
> the heartbeat response. If a container is not reported within the configured 
> interval, it can be released by the master.






[jira] [Comment Edited] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298747#comment-16298747
 ] 

Eric Yang edited comment on YARN-7565 at 12/20/17 5:33 PM:
---

Thank you for pointing out that ServiceRecord.description maps to the container 
name (and not the Service Spec description field).  However, this appears to be 
a race condition for a newly created application: serviceStart invokes 
recoverComponent first, before the application has registered with the Registry. 
 This looks like the reason we get the null pointer exception.
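
A hedged sketch of the kind of defensive check that would avoid the NPE; the 
class and method names are illustrative only, not the actual service AM code.

{code:java}
import org.apache.hadoop.registry.client.types.ServiceRecord;

// Illustrative only: guard against the race where recovery runs before the
// application has registered, so the registry lookup may return null.
public class RecoveryGuard {

  public static boolean canRecover(ServiceRecord record) {
    // ServiceRecord.description carries the container name in this code path;
    // a null record (or description) means registration has not happened yet.
    return record != null && record.description != null;
  }
}
{code}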


was (Author: eyang):
Thank you for point out the record.description maps to container name, but it 
appears to be a race condition for newly created application.  serviceStart is 
invoked recoverComponent first.  Application hasn't registered with Registry 
yet.  This looks like the reason that we get null pointer exception.

> Yarn service pre-maturely releases the container after AM restart 
> --
>
> Key: YARN-7565
> URL: https://issues.apache.org/jira/browse/YARN-7565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
> Fix For: 3.1.0
>
> Attachments: YARN-7565.001.patch, YARN-7565.002.patch, 
> YARN-7565.003.patch, YARN-7565.004.patch, YARN-7565.005.patch, 
> YARN-7565.addendum.001.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master immediately releases containers that are not 
> reported in the AM registration response.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by the RM. These containers are sent to the AM in 
> the heartbeat response. If a container is not reported within the configured 
> interval, it can be released by the master.






[jira] [Commented] (YARN-7565) Yarn service pre-maturely releases the container after AM restart

2017-12-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298747#comment-16298747
 ] 

Eric Yang commented on YARN-7565:
-

Thank you for pointing out that record.description maps to the container name, 
but this appears to be a race condition for a newly created application: 
serviceStart invokes recoverComponent first, and the application hasn't 
registered with the Registry yet.  This looks like the reason we get the null 
pointer exception.

> Yarn service pre-maturely releases the container after AM restart 
> --
>
> Key: YARN-7565
> URL: https://issues.apache.org/jira/browse/YARN-7565
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
> Fix For: 3.1.0
>
> Attachments: YARN-7565.001.patch, YARN-7565.002.patch, 
> YARN-7565.003.patch, YARN-7565.004.patch, YARN-7565.005.patch, 
> YARN-7565.addendum.001.patch
>
>
> With YARN-6168, recovered containers can be reported to AM in response to the 
> AM heartbeat. 
> Currently, the Service Master immediately releases containers that are not 
> reported in the AM registration response.
> Instead, the master can wait for a configured amount of time for the 
> containers to be recovered by the RM. These containers are sent to the AM in 
> the heartbeat response. If a container is not reported within the configured 
> interval, it can be released by the master.






[jira] [Comment Edited] (YARN-8080) YARN native service should support component restart policy

2018-05-04 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464158#comment-16464158
 ] 

Eric Yang edited comment on YARN-8080 at 5/4/18 5:23 PM:
-

[~suma.shivaprasad] Thank you for the patch.

Flex is a black-box operation; it is not context-aware of whether the 
application requires more or fewer containers.  Therefore, it relies on the 
user/program to make that decision.  Here are possible usages of each case:

*Retry policy = NEVER and Flex Up*
A data scientist might be training on datasets and find that the data produced 
by the first two completed containers is insufficient, and would like more 
training iterations on the same dataset.  The input parameters could stay the 
same, but more of the same iterations run in parallel.  The flex operation comes 
in handy to flex up to a desired state of 4 containers (2 currently running and 
2 additional containers).  This produces more data models in the same run.

*Retry policy = NEVER and Flex down*
The system administrator asks the data scientist to save system resources for 
his bitcoin mining operation.  Flexing down saves system resources, and the ML 
training iterations can be performed in a later run.

*Retry policy = ON_FAILURE and Flex Up*
This fits cases where the container workload is stateful, such as SparkSQL 
translating a query into multiple partitions.  The SparkSQL driver can decide 
whether it wants to attempt multiple retries on failure with a smaller dataset 
to ensure query completion.  It may decide to increase the number of containers, 
changing a hint file on HDFS to reduce the workload computed per container and 
increasing the number of containers to complete the query computation.  In this 
case, the counter should reset to 0 for successful container runs, and all 
containers should restart.

*Retry policy = ON_FAILURE and Flex down*
In some cases, merging data from many partitions at the same time produces an 
unbalanced dataset and prevents the merge from happening.  The SparkSQL driver 
might decide to use an alternate merge technique with fewer containers.  In this 
case, the YARN Service AM reduces the container count and lets the Spark 
executor program communicate directly with the Spark driver program to compute 
with the alternate strategy.  The counter should reset to 0 for successful 
container runs, and all containers should restart.

There are possible use cases for each scenario, and we provide the knobs to 
enable each of them.  Some additional programming is required on the 
application side to take advantage of the advanced feature.  I also agree that 
some stateful programs might not work with certain combinations of retry policy 
and flex operations, and we provide an option to disable flex for that type of 
program.
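
As a hedged illustration of how a component restart policy interacts with a 
container's exit status; the enum and method below are illustrative only, not 
the actual service AM implementation.

{code:java}
// Illustrative sketch: the decision a service AM could make when a container
// finishes, under the restart policies discussed above.
public class RestartPolicyExample {

  enum RestartPolicy { ALWAYS, ON_FAILURE, NEVER }

  static boolean shouldRelaunch(RestartPolicy policy, int exitCode) {
    switch (policy) {
      case ALWAYS:
        return true;                 // long-running component, always restart
      case ON_FAILURE:
        return exitCode != 0;        // restart only failed containers
      case NEVER:
      default:
        return false;                // run-to-completion component
    }
  }
}
{code}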


was (Author: eyang):
[~suma.shivaprasad] Thank you for the patch.

Flex is a black box operation, it is not context aware of how application 
requires more or less containers.  Therefore, it reliant on the user/program to 
make decision.  Here are the possible usage of each case:

Retry policy = NEVER and Flex Up
A data scientist might be training datasets and found that the dataset produced 
by the first two completed container is insufficient, and he like to get more 
iteration to train on the same dataset.  The input parameters could stay the 
same, but perform more of the same iterations in parallel.  Flex operation can 
come in handy that flex up to reach the desired state of 4 containers (2 
currently running and 2 additional containers).  This can produce more data 
model for him in the same run.

Retry policy = NEVER and Flex down
When system administrator ask data scientist to save system resources for his 
bitcoin mining operation.  Flex down could mean to save system resources and 
perform ML training iterations at a later run.  

Retry policy = ON_FAILURE and Flex Up
In the case where container workload are stateful, such as SparkSQL that 
translated query into multiple partitions.  SparkSQL driver can decide if it 
wants to attempt multiple retries on failure with smaller dataset to ensure 
query completion.  It may decide to increase the number of containers, and 
change some hint file on hdfs to reduce the workload computed per container and 
increase number of containers to complete the query computation.

Retry policy = ON_FAILURE and Flex down
In some case, merging data from many partitions at the same time, it might have 
unbalanced dataset, and prevent merging from happening.  SparkSQL driver might 
decide to use alternate technique to merge using few containers.  In this case, 
Yarn Service AM reduce the container count, and let Spark executor program to 
communicate directly with Spark driver program to compute by alternate strategy.

There are possible use cases for each of the scenario, and we provide the knobs 
to enable each scenario.  There are some additional programming 

[jira] [Updated] (YARN-8223) ClassNotFoundException when auxiliary service is loaded from HDFS

2018-05-04 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8223:

Target Version/s: 3.2.0, 3.1.1

> ClassNotFoundException when auxiliary service is loaded from HDFS
> -
>
> Key: YARN-8223
> URL: https://issues.apache.org/jira/browse/YARN-8223
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Charan Hebri
>Assignee: Zian Chen
>Priority: Blocker
> Attachments: YARN-8223.001.patch, YARN-8223.002.patch
>
>
> Loading an auxiliary jar from a local location on a node manager works as 
> expected,
> {noformat}
> 2018-04-26 15:09:26,179 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:(98)) - classpath: 
> [file:/grid/0/hadoop/yarn/local/aux-service-local.jar]
> 2018-04-26 15:09:26,179 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:(99)) - system classes: [java., 
> javax.accessibility., javax.activation., javax.activity., javax.annotation., 
> javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., 
> javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., 
> javax.net., javax.print., javax.rmi., javax.script., 
> -javax.security.auth.message., javax.security.auth., javax.security.cert., 
> javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., 
> javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., 
> org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2018-04-26 15:09:26,181 INFO  containermanager.AuxServices 
> (AuxServices.java:serviceInit(252)) - The aux service:test_aux_local are 
> using the custom classloader
> 2018-04-26 15:09:26,182 WARN  containermanager.AuxServices 
> (AuxServices.java:serviceInit(268)) - The Auxiliary Service named 
> 'test_aux_local' in the configuration is for class 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader
>  which has a name of 'org.apache.auxtest.AuxServiceFromLocal with custom 
> class loader'. Because these are not the same tools trying to send 
> ServiceData and read Service Meta Data may have issues unless the refer to 
> the name in the config.
> 2018-04-26 15:09:26,182 INFO  containermanager.AuxServices 
> (AuxServices.java:addService(103)) - Adding auxiliary service 
> org.apache.auxtest.AuxServiceFromLocal with custom class loader, 
> "test_aux_local"{noformat}
> But loading the same jar from a location on HDFS fails with a 
> ClassNotFoundException.
> {noformat}
> 018-04-26 15:14:39,683 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:(98)) - classpath: []
> 2018-04-26 15:14:39,683 INFO  util.ApplicationClassLoader 
> (ApplicationClassLoader.java:(99)) - system classes: [java., 
> javax.accessibility., javax.activation., javax.activity., javax.annotation., 
> javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., 
> javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., 
> javax.net., javax.print., javax.rmi., javax.script., 
> -javax.security.auth.message., javax.security.auth., javax.security.cert., 
> javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., 
> javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., 
> org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., 
> -org.apache.hadoop.hbase., org.apache.hadoop., core-default.xml, 
> hdfs-default.xml, mapred-default.xml, yarn-default.xml]
> 2018-04-26 15:14:39,687 INFO  service.AbstractService 
> (AbstractService.java:noteFailure(267)) - Service 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices failed 
> in state INITED
> java.lang.ClassNotFoundException: org.apache.auxtest.AuxServiceFromLocal
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:189)
>   at 
> org.apache.hadoop.util.ApplicationClassLoader.loadClass(ApplicationClassLoader.java:157)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxiliaryServiceWithCustomClassLoader.getInstance(AuxiliaryServiceWithCustomClassLoader.java:169)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:249)
>   at 
> 

[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-04 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464035#comment-16464035
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] I see your concerns now.  Thanks for the explanation.  I will update 
the code to use the typedef data structure above, and ensure a null terminator 
is passed to execvp after extracting the data from args.

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8080) YARN native service should support component restart policy

2018-05-04 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464158#comment-16464158
 ] 

Eric Yang commented on YARN-8080:
-

[~suma.shivaprasad] Thank you for the patch.

Flex is a black-box operation; it is not context-aware of whether the 
application requires more or fewer containers.  Therefore, it relies on the 
user/program to make the decision.  Here are possible usages for each case:

Retry policy = NEVER and Flex Up
A data scientist might be training on datasets and find that the data produced 
by the first two completed containers is insufficient, and would like more 
iterations to train on the same dataset.  The input parameters could stay the 
same, but more of the same iterations run in parallel.  A flex operation comes 
in handy here: flex up to reach the desired state of 4 containers (2 currently 
running and 2 additional containers).  This can produce more data models in 
the same run.

Retry policy = NEVER and Flex down
When the system administrator asks the data scientist to save system resources 
for his bitcoin mining operation, flexing down can free system resources and 
defer the ML training iterations to a later run.

Retry policy = ON_FAILURE and Flex Up
This fits cases where the container workload is stateful, such as SparkSQL 
translating a query into multiple partitions.  The SparkSQL driver can decide 
whether it wants to attempt multiple retries on failure with a smaller dataset 
to ensure query completion.  It may decide to increase the number of containers 
and change a hint file on HDFS to reduce the workload computed per container, 
using more containers to complete the query computation.

Retry policy = ON_FAILURE and Flex down
In some cases, merging data from many partitions at the same time can hit an 
unbalanced dataset and prevent the merge from happening.  The SparkSQL driver 
might decide to use an alternate technique to merge with fewer containers.  In 
this case, the YARN Service AM reduces the container count, and the Spark 
executor program communicates directly with the Spark driver program to compute 
with the alternate strategy.

There are possible use cases for each scenario, and we provide the knobs to 
enable each of them.  Some additional programming is needed on the application 
side to take advantage of the advanced feature.  I also agree that some 
stateful programs might not work with certain combinations of retry policy and 
flex operations, and we provide an option to disable flex for that type of 
program.

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, 
> YARN-8080.007.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow the restart policy of a 
> component to be specified by users. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: Do not restart containers in any case after a container finishes: to 
> support job-like workloads (for example a Tensorflow training job). If a task 
> exits with code == 0, we should not restart the task. This can be used by 
> services which are not restartable/recoverable.
> 3) On-failure: Similar to above, only restart tasks with exit code != 0. 
> Behaviors after component *instance* finalize (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For single component, single instance: complete service.
> 2) For single component, multiple instance: other running instances from the 
> same component won't be affected by the finalized component instance. Service 
> will be terminated once all instances finalized. 
> 3) For multiple components: Service will be terminated once all components 
> finalized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: YARN-8207.007.patch

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467950#comment-16467950
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Thank you for the persistent reviews to make this better.  :)

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, 
> YARN-8207.009.patch, YARN-8207.010.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467853#comment-16467853
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Patch 10 is posted.

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, 
> YARN-8207.009.patch, YARN-8207.010.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Created] (YARN-8265) AM should retrieve new IP for restarted container

2018-05-08 Thread Eric Yang (JIRA)
Eric Yang created YARN-8265:
---

 Summary: AM should retrieve new IP for restarted container
 Key: YARN-8265
 URL: https://issues.apache.org/jira/browse/YARN-8265
 Project: Hadoop YARN
  Issue Type: Bug
  Components: yarn-native-services
Affects Versions: 3.1.0
Reporter: Eric Yang
Assignee: Eric Yang
 Fix For: 3.2.0, 3.1.1


When a docker container is restarted, it gets a new IP, but the service AM only 
retrieves one IP for a container and then cancels the container status 
retriever. I suspect the issue would be solved by restarting the retriever (if 
it has been canceled) when the onContainerRestart callback is received, but 
we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8265) AM should retrieve new IP for restarted container

2018-05-08 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8265:

Attachment: YARN-8265.001.patch

> AM should retrieve new IP for restarted container
> -
>
> Key: YARN-8265
> URL: https://issues.apache.org/jira/browse/YARN-8265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8265.001.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-14 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474510#comment-16474510
 ] 

Eric Yang commented on YARN-8108:
-

Kerberos SPN support, by browser definition, is:

HTTP/<server>, where <server> is either a whitelisted server name or the 
canonical DNS name of the server.  Chrome, IE, and Firefox all share similar 
logic.  Firefox and IE don't allow canonical DNS, to prevent MITM attacks.  
Safari and Chrome support canonical DNS, with an option to disable it.

From the server point of view, a single server can host multiple virtual hosts 
with different web applications.  It is technically possible to configure a web 
server to run with multiple SPNs.  It is incorrect to assume that the same 
virtual host can serve two different SPNs for two different subsets of URLs.  
No browser supports serving one subset of URLs with one SPN while another 
subset of URLs is served by another SPN.
  
In Hadoop 0.2x, Hadoop components were designed to serve a collection of 
servlets (log, static, cluster) per port.  Therefore, AuthenticationFilter 
could cover the entire port by targeting the fixed set of servlets for 
filtering, which matched browser expectations without problems.  
AuthenticationFilter was later reused in Hadoop 1.x and 2.x as the Kerberos 
SPNEGO filter.

The current problem only surfaces when multiple web contexts are configured to 
share the same port with the same server hostname, and each web context tries 
to initialize its own SPN.  This is not by design; it just happened due to code 
reuse and lack of testing.  For Hadoop 2.x+ to offer embedded services 
securely, the individual AuthenticationFilter can be turned into one 
[security 
handler|http://www.eclipse.org/jetty/documentation/9.3.x/architecture.html#_handlers]
 to match the Jetty design specification.  This fell through the cracks in open 
source when no one was looking, because the first security mechanism for Hadoop 
was an XSS filter (committed as part of Chukwa) instead of a security handler.  
Unfortunately, Hadoop security mechanisms followed a bottom-up approach and 
were implemented as filters, instead of following the web application design of 
writing security handlers as Handlers, due to a lack of understanding that 
session persistence requires authentication and authorization mechanisms to be 
built differently from web filters.

The one-line change is to loop through all contexts and ensure all of them are 
registered with the same AuthenticationFilter, applying one filter globally to 
all URLs.  This is why the one-line patch can plug this security hole as a 
short-term bug fix.  The long-term solution is writing a security handler that 
matches the handler design, to ensure no API breakage during Jetty version 
upgrades and to improve session persistence in Hadoop web applications, which 
is beyond the scope of this JIRA.

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Root cause: the proxy server on RM can't be supported for a Kerberos-enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> HttpServer2 for RM, and another instance from AmFilterInitializer for the 
> proxy server). This will require code changes to the 
> hadoop-yarn-server-web-proxy project.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8284) get_docker_command refactoring

2018-05-14 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16474826#comment-16474826
 ] 

Eric Yang commented on YARN-8284:
-

+1 looks good to me.

> get_docker_command refactoring
> --
>
> Key: YARN-8284
> URL: https://issues.apache.org/jira/browse/YARN-8284
> Project: Hadoop YARN
>  Issue Type: Improvement
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Jason Lowe
>Assignee: Eric Badger
>Priority: Minor
> Attachments: YARN-8284.001.patch
>
>
> YARN-8274 occurred because get_docker_command's helper functions each have to 
> remember to put the docker binary as the first argument.  This is error prone 
> and causes code duplication for each of the helper functions.  It would be 
> safer and simpler if get_docker_command initialized the docker binary 
> argument in one place and each of the helper functions only added the 
> arguments specific to their particular docker sub-command.
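
As an illustration of the idea only (not the actual container-executor code; 
the types and helper names below are simplified assumptions), the docker binary 
can be added exactly once in {{get_docker_command}} while the per-sub-command 
helpers append only their own arguments:

{code}
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 64

/* Illustrative argument vector; not the real container-executor type. */
typedef struct { int length; char *data[MAX_ARGS]; } args;

/* Append one argument; reject NULL and overflow. */
static int add_to_args(args *a, const char *s) {
    if (s == NULL || a->length >= MAX_ARGS) {
        return -1;
    }
    a->data[a->length++] = strdup(s);
    return 0;
}

/* Sub-command helpers append only their own arguments... */
static int add_inspect_args(args *a, const char *container) {
    if (add_to_args(a, "inspect") != 0) return -1;
    return add_to_args(a, container);
}

static int add_rm_args(args *a, const char *container) {
    if (add_to_args(a, "rm") != 0) return -1;
    return add_to_args(a, container);
}

/* ...while the docker binary itself is added exactly once, here. */
static int get_docker_command(const char *docker_binary, const char *subcmd,
                              const char *container, args *out) {
    if (add_to_args(out, docker_binary) != 0) {
        return -1;
    }
    if (strcmp(subcmd, "inspect") == 0) return add_inspect_args(out, container);
    if (strcmp(subcmd, "rm") == 0)      return add_rm_args(out, container);
    return -1;   /* unknown sub-command */
}
{code}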



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8206) Sending a kill does not immediately kill docker containers

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466616#comment-16466616
 ] 

Eric Yang commented on YARN-8206:
-

[~ebadger] +1 for proposal 2.  This is safer option in my opinion.

> Sending a kill does not immediately kill docker containers
> --
>
> Key: YARN-8206
> URL: https://issues.apache.org/jira/browse/YARN-8206
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8206.001.patch, YARN-8206.002.patch, 
> YARN-8206.003.patch, YARN-8206.004.patch
>
>
> {noformat}
> if (ContainerExecutor.Signal.KILL.equals(signal)
> || ContainerExecutor.Signal.TERM.equals(signal)) {
>   handleContainerStop(containerId, env);
> {noformat}
> Currently in the code, we are handling both SIGKILL and SIGTERM as equivalent 
> for docker containers. However, they should actually be separate. When YARN 
> sends a SIGKILL to a process, it means for it to die immediately and not sit 
> around waiting for anything. This ensures an immediate reclamation of 
> resources. Additionally, if a SIGTERM is sent before the SIGKILL, the task 
> might not handle the signal correctly, and will then end up as a failed task 
> instead of a killed task. This is especially bad for preemption. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Priority: Blocker  (was: Major)

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8255) Allow option to disable flex for a service component

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466448#comment-16466448
 ] 

Eric Yang commented on YARN-8255:
-

Instead of introducing another field to enable or disable flex, we can identify 
whether the workload can perform flex operations based on the restart_policy.

When restart_policy=ON_FAILURE or ALWAYS, the data can be recomputed, or the 
process can resume from failure.  Flex operations can be enabled.

When restart_policy=NEVER, the data is stateful and cannot be reprocessed 
(i.e. MapReduce writes to HBase without transactional properties).  This type 
of container should not allow flex operations.

By this reasoning, it is possible to reduce the combinations that will be 
supported.  This also implies that restart_policy=NEVER doesn't have to support 
upgrade.

> Allow option to disable flex for a service component 
> -
>
> Key: YARN-8255
> URL: https://issues.apache.org/jira/browse/YARN-8255
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
>
> YARN-8080 implements restart capabilities for service component instances. 
> YARN service components should add an option to disallow flexing to support 
> workloads which are essentially batch/iterative jobs which terminate with 
> restart_policy=NEVER/ON_FAILURE. This could be disabled by default for 
> components where restart_policy=NEVER/ON_FAILURE and enabled by default when 
> restart_policy=ALWAYS (which is the default restart_policy), unless explicitly 
> set in the service spec.
> The option could be exposed as part of the component spec as "allow_flexing". 
> cc [~billie.rinaldi] [~gsaha] [~eyang] [~csingh] [~wangda]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466512#comment-16466512
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Hadoop 3.1.1's release date was proposed for May 7th.  This is a 
blocking issue for YARN-7654.  I think this JIRA is very close to completion, 
and I'd like to make sure that we can catch the release train.  Are you 
comfortable with the latest iteration of this patch?

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466512#comment-16466512
 ] 

Eric Yang edited comment on YARN-8207 at 5/7/18 9:47 PM:
-

[~jlowe] Hadoop 3.1.1 release date was proposed for May 7th.  This is a 
blocking issue for YARN-7654.  I think this JIRA is very close to completion, 
and I like to make sure that we can catch the release train.  Are you 
comfortable with the latest iteration of this patch?


was (Author: eyang):
[~jlowe] Hadoop 3.1.1 release date was proposed for May 7th.  This is a 
blocking issue for YARN-7654.  I think this JIRA is very close to completion, 
and I like to make sure that we can catch the release train.  Are you 
comfortable to the last iteration of this patch?

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466683#comment-16466683
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] 

{quote}Rather than make an expensive deep copy of the arguments, 
construct_docker_command only needs to copy the args vector then set the number 
of arguments to zero. At that point we'd be effectively transferring ownership 
of the already allocated arg strings to the caller without requiring full 
copies.{quote}

Struct args is still evolving.  I think it would be safer to keep the data 
structure private as an opaque data structure and deep-copy to the caller.  
This avoids putting responsibility on the external caller to free the internal 
implementation of struct args.  In case we want the ability to trim or truncate 
the string array based on allowed parameters, we have a way to fix it later.

{quote}add_param_to_command_if_allowed (and many other places) doesn't check 
for make_string failure, and add_to_args will segfault when it tries to 
dereference the NULL argument. Does it make sense to have add_to_args return 
failure if the caller tried to add a NULL argument?{quote}

At this time, add_to_args treats a NULL argument as a no-op to avoid having to 
check for null on make_string.  I think the proposed reverse change would add 
more null pointer checks, which makes the code harder to read again.  It would 
contradict the original intent of your reviews to make the code easier to read.

{quote}flatten adds 1 to the strlen length in the loop, but there is only a 
need for one NUL terminator which is already accounted for in the total initial 
value.{quote}

The +1 is for a space, not a NUL terminator; it is for rendering an HTML page 
that looks like a command line.  The last space is replaced with a NUL 
terminator.

{quote}flatten is using stpcpy incorrectly as it ignores the return values from 
the function. stpcpy returns a pointer to the terminating NUL of the resulting 
string which is exactly what we need for appending, so each invocation of 
stpcpy should be like: to = stpcpy(to, ...){quote}

This is fixed in the YARN-7654 patch.  It's hard to rebase n times, and changes 
end up in the wrong patch.  I will fix this.

{quote}This change doesn't look related to the execv changes? Also looks like a 
case that could be simplified quite a bit with strndup and strdup.{quote}

There is an off-by-one memory corruption where the pattern is not null 
terminated properly.  This was detected by valgrind, and I decided to fix it 
here because it causes a segfault if left in the code.

I will fix the rest of the issues that you found.  Thank you again for the 
review.
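
For reference, a sketch of a space-joining flatten that accumulates via the 
return value of stpcpy, under the assumption of a plain NULL-terminated 
argument array (the real patch may differ):

{code}
#define _GNU_SOURCE          /* stpcpy is POSIX.1-2008 / a GNU extension */
#include <stdlib.h>
#include <string.h>

/* Join a NULL-terminated argument array into one space-separated string,
 * e.g. for rendering the equivalent command line.  Illustrative only. */
static char *flatten(char *const argv[]) {
    size_t total = 1;                      /* room for the final NUL */
    for (int i = 0; argv[i] != NULL; i++) {
        total += strlen(argv[i]) + 1;      /* +1 for the separating space */
    }
    char *buf = malloc(total);
    if (buf == NULL) {
        return NULL;
    }
    char *to = buf;
    for (int i = 0; argv[i] != NULL; i++) {
        to = stpcpy(to, argv[i]);          /* stpcpy returns the new end */
        *to++ = ' ';
    }
    if (to > buf) {
        to--;                              /* drop the trailing space */
    }
    *to = '\0';                            /* last space becomes the NUL */
    return buf;
}
{code}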

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466716#comment-16466716
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Patch 008 fixed the issues discovered, except the char array copy.  
There are approximately 900 KB of leaks in container-executor prior to this 
patch, and we saved 20 KB from leaking, based on the valgrind report from 
exercising the test cases.  Execvp will wipe out the leaks anyhow, unless we 
find more buffer overflow problems.  I am going to stop making styling changes 
because they have diminishing returns at this point.

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8255) Allow option to disable flex for a service component

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466800#comment-16466800
 ] 

Eric Yang commented on YARN-8255:
-

[~leftnoteasy] Recompute and expandable are intertwined.  They are the same 
thing.  At a conceptual level, teragen has no dependency on input format: you 
can add more partitions to generate more data.  Hadoop's own implementation 
prevented this from happening, but this does not mean docker containers should 
be bound by the same initialization-time limitation.  On the other hand, we 
must optimize the framework for general-purpose usage and prevent ourselves 
from offering too many untested and unsupported options.  I think it makes 
sense to reduce the flex options to 2 main types instead of offering all 6 
options.

> Allow option to disable flex for a service component 
> -
>
> Key: YARN-8255
> URL: https://issues.apache.org/jira/browse/YARN-8255
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
>
> YARN-8080 implements restart capabilities for service component instances. 
> YARN service components should add an option to disallow flexing to support 
> workloads which are essentially batch/iterative jobs which terminate with 
> restart_policy=NEVER/ON_FAILURE. This could be disabled by default for 
> components where restart_policy=NEVER/ON_FAILURE and enabled by default when 
> restart_policy=ALWAYS (which is the default restart_policy), unless explicitly 
> set in the service spec.
> The option could be exposed as part of the component spec as "allow_flexing". 
> cc [~billie.rinaldi] [~gsaha] [~eyang] [~csingh] [~wangda]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8255) Allow option to disable flex for a service component

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466800#comment-16466800
 ] 

Eric Yang edited comment on YARN-8255 at 5/8/18 3:08 AM:
-

[~leftnoteasy] Recompute and expandable are intertwined.  They are not the same 
thing.  At conceptual level, teragen has no dependency of input format.  You 
can add more partitions to get more data generated.  Hadoop's own 
implementation limited this from happening, but this does not mean docker 
containers should be imposed by the same initialization time limitation.  On 
the other hand, we must optimize the framework for general purpose usage and 
prevent ourselves from giving too many untested and unsupported options.  I 
think it make sense to reduce the flex options to 2 main types instead of 
giving all 6 options.


was (Author: eyang):
[~leftnoteasy] Recompute and expandable are intertwined.  They are the same 
thing.  At conceptual level, teragen has no dependency of input format.  You 
can add more partitions to get more data generated.  Hadoop's own 
implementation limited this from happening, but this does not mean docker 
containers should be imposed by the same initialization time limitation.  On 
the other hand, we must optimize the framework for general purpose usage and 
prevent ourselves from giving too many untested and unsupported options.  I 
think it make sense to reduce the flex options to 2 main types instead of 
giving all 6 options.

> Allow option to disable flex for a service component 
> -
>
> Key: YARN-8255
> URL: https://issues.apache.org/jira/browse/YARN-8255
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
>
> YARN-8080 implements restart capabilities for service component instances. 
> YARN service components should add an option to disallow flexing to support 
> workloads which are essentially batch/iterative jobs which terminate with 
> restart_policy=NEVER/ON_FAILURE. This could be disabled by default for 
> components where restart_policy=NEVER/ON_FAILURE and enabled by default when 
> restart_policy=ALWAYS (which is the default restart_policy), unless explicitly 
> set in the service spec.
> The option could be exposed as part of the component spec as "allow_flexing". 
> cc [~billie.rinaldi] [~gsaha] [~eyang] [~csingh] [~wangda]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-07 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: YARN-8207.008.patch

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch
>
>
> Container-executor code utilizes a string buffer to construct the docker run 
> command, and passes the string buffer to popen for execution.  Popen spawns a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  A possible solution is to convert from a char * buffer 
> to a string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-08 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7654:

Attachment: YARN-7654.021.patch

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch
>
>
> A Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we could detect the existence 
> of {{launch_command}} and, based on this variable, launch the docker container 
> in different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-08 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468111#comment-16468111
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] {quote}I'll try to find time to take a closer look at this patch 
tomorrow, but I'm wondering if we really need to separate the detached vs. 
foreground launching for override vs. entry-point containers. The main problem 
with running containers in the foreground is that we have no idea how long it 
takes to actually start a container. As I mentioned above, any required 
localization for the image is likely to cause the container launch to fail due 
to docker inspect retries hitting the retry limit and failing, leaving the 
container uncontrolled or at best finally killed sometime later if Shane's 
lifecycle changes cause the container to get recognized long afterwards and 
killed.{quote}

The detach option only obtains a container id, and the container process 
continues to update information in the background.  We call docker inspect by 
name reference instead of container id.  From the docker inspect point of view, 
detach does not produce a more accurate result than running in the foreground, 
because operations sent to the docker daemon via the docker CLI are 
asynchronous through the docker daemon's REST API.  The JSON output from docker 
inspect may contain partial information.  Since we know exactly which 
information to parse, retrying provides a better success rate.  For 
ENTRY_POINT, docker run stays in the foreground to capture stdout and stderr of 
the ENTRY_POINT process without relying on mounting the host log directory into 
the docker container.  This helps to prevent the host log path from sticking 
out inside the container, which may look odd to users.

{quote}I think a cleaner approach would be to always run containers as 
detached, so when the docker run command returns we will know the docker 
inspect command will work. If I understand correctly, the main obstacle to this 
approach is finding out what to do with the container's standard out and 
standard error streams which aren't directly visible when the container runs 
detached. However I think we can use the docker logs command after the 
container is launched to reacquire the container's stdout and stderr streams 
and tie them to the intended files. At least my local experiments show docker 
logs is able to obtain the separate stdout and stderr streams for containers 
whether they were started detached or not. Thoughts?{quote}

If we want to run in the background, then we have problems capturing logs 
again, based on issues found in prior meetings.

# The docker logs command shows logs from the beginning of the launch to the 
point where it was captured.  Without frequent calls to the docker logs 
command, we don't get the complete log.  It is more expensive to call docker 
logs with fork and exec than to read a local log file.  If we use the --tail 
option, it is still one extra fork, plus managing the child process's liveness 
and resource usage.  This complicates how the resource usage should be computed.
# docker logs does not seem to separate out stdout from stderr.  [This 
issue|https://github.com/moby/moby/issues/7440] is unresolved in docker.  This 
is different from YARN log file management.  It would be nice to follow the 
YARN approach to make the output less confusing in many situations.

After many experiments, I settled on foreground and dup for simplicity.  The 
concern about running in the foreground and retrying docker inspect is valid; 
however, there is a way to find a reasonable timeout value to decide whether a 
docker container should be marked as failed.
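
A rough sketch of the "foreground and dup" idea, with illustrative log file 
paths and without the real container-executor plumbing:

{code}
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Redirect stdout/stderr to the container log files, then exec docker
 * in the foreground so the ENTRY_POINT output lands in those files.
 * Paths and argv are illustrative placeholders. */
static void launch_docker_foreground(char *const argv[],
                                     const char *out_path,
                                     const char *err_path) {
    int out_fd = open(out_path, O_WRONLY | O_CREAT | O_APPEND, 0640);
    int err_fd = open(err_path, O_WRONLY | O_CREAT | O_APPEND, 0640);
    if (out_fd < 0 || err_fd < 0) {
        perror("open log file");
        exit(1);
    }
    dup2(out_fd, STDOUT_FILENO);   /* container stdout -> stdout log file */
    dup2(err_fd, STDERR_FILENO);   /* container stderr -> stderr log file */
    close(out_fd);
    close(err_fd);
    execvp(argv[0], argv);         /* does not return on success */
    perror("execvp docker");
    exit(1);
}
{code}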


> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch
>
>
> A Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we could detect the existence 
> of {{launch_command}} and, based on this variable, launch the docker container 
> in different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message 

[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-09 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469061#comment-16469061
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] [~Jim_Brennan] I misread the last message in the discussion forum.  
The logs feature can redirect the stdout and stderr streams correctly.  
However, I am not thrilled about calling an extra docker logs command to fetch 
logs and maintaining the liveness of that docker logs command.  In my view, 
this is more fragile because the docker logs command can receive an external 
signal that prevents the whole log from being sent to YARN, and subsequent 
tailing will report duplicated information.  If we attach to the real stdout 
and stderr of the running program, we reduce the headache of additional process 
management and avoid duplicate information.

I don't believe a blocking call is the correct answer for determining the 
liveness of a docker container.  A blocking call that waits for docker to 
detach has several problems: 1. Docker run can get stuck pulling docker images 
when a mass number of containers all start at the same time and the image is 
not cached locally; this happens a lot with repositories hosted on Docker Hub.  
2. The docker run CLI can also get stuck when the docker daemon hangs, and no 
exit code is returned.  3. Some docker images are not built to run in detached 
mode; some developers might have built their system to require foreground mode, 
and these images will terminate in detach mode.

When "docker run -d", and "docker logs" combination are employed, there is some 
progress are not logged.  i.e. the downloading progress, docker daemon error 
message.  The current patch would log any errors coming from docker run cli to 
provide more information for user who is troubleshooting the problems.

Regarding the racy problem, this is something a system administrator can tune.  
On a cluster that downloads all images from the internet via a slow link, it is 
perfectly reasonable to set the retry and timeout values to 30 minutes to wait 
for the download to complete.  In a highly automated system, such as a cloud 
vendor trying to spin up images in a fraction of a second for a mass number of 
users, the timeout value might be set as short as 5 seconds.  If the image came 
up in 6 seconds and missed the SLA, another container takes its place in the 
next 5 seconds to provide a smooth user experience; the 6-second container is 
recycled and rebuilt.  At mass scale, the race condition problem is easier to 
deal with than a blocking call that prevents the entire automated system from 
working.  I can update the code to make the retry count a configurable setting 
in the short term.
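
A sketch of what a configurable retry around docker inspect could look like; 
the check is passed in as a callback here because the real inspect logic lives 
elsewhere, and the retry count and interval are assumed to come from 
configuration:

{code}
#include <stdio.h>
#include <unistd.h>

/* Poll until the container is reported running or the configured budget
 * is spent.  is_running stands in for the real docker inspect check and
 * returns nonzero once the container is up; max_retries and interval_secs
 * are assumed to come from configuration. */
static int wait_for_container(const char *container_name,
                              int (*is_running)(const char *),
                              int max_retries, int interval_secs) {
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        if (is_running(container_name)) {
            return 0;                       /* container is up */
        }
        fprintf(stderr, "attempt %d/%d: %s not ready, retrying\n",
                attempt, max_retries, container_name);
        sleep(interval_secs);
    }
    return -1;                              /* give up and mark as failed */
}
{code}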

I am not discounting the possibility of supporting docker run -d and docker 
logs, but this requires more development experiments to ensure all the 
mechanics are covered well.  The current approach has been in use in my 
environment for the past 6 months, and it works well.  For the 3.1.1 release, 
it would be safer to use the current approach to get better coverage of the 
types of containers that can be supported.  Thoughts?

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch
>
>
> A Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we could detect the existence 
> of {{launch_command}} and, based on this variable, launch the docker container 
> in different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8261) Docker container launch fails due to .cmd file creation failure

2018-05-09 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8261:

Fix Version/s: 3.1.1
   3.2.0

> Docker container launch fails due to .cmd file creation failure
> ---
>
> Key: YARN-8261
> URL: https://issues.apache.org/jira/browse/YARN-8261
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Eric Badger
>Assignee: Jason Lowe
>Priority: Blocker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8261.001.patch, YARN-8261.002.patch
>
>
> Due to YARN-8064, the location of the docker .cmd files was changed. These 
> files are now being placed in the nmPrivate directory of the container. 
> However, this directory will not always be created. If the localizer does not 
> run or the credentials are written to a different disk, then this directory 
> will not exist and so the .cmd file creation will fail, thus causing the 
> container launch to fail. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8261) Docker container launch fails due to .cmd file creation failure

2018-05-09 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469152#comment-16469152
 ] 

Eric Yang commented on YARN-8261:
-

+1 looks good to me.

> Docker container launch fails due to .cmd file creation failure
> ---
>
> Key: YARN-8261
> URL: https://issues.apache.org/jira/browse/YARN-8261
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Affects Versions: 3.2.0, 3.1.1
>Reporter: Eric Badger
>Assignee: Jason Lowe
>Priority: Blocker
> Attachments: YARN-8261.001.patch, YARN-8261.002.patch
>
>
> Due to YARN-8064, the location of the docker .cmd files was changed. These 
> files are now being placed in the nmPrivate directory of the container. 
> However, this directory will not always be created. If the localizer does not 
> run or the credentials are written to a different disk, then this directory 
> will not exist and so the .cmd file creation will fail, thus causing the 
> container launch to fail. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7799) YARN Service dependency follow up work

2018-04-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456808#comment-16456808
 ] 

Eric Yang commented on YARN-7799:
-

[~billie.rinaldi] The summary of our discussion:

1.  We can check the prefix directories of yarn.service.framework.path to 
ensure all sub-directories are world readable and executable, so that other 
users can access this path.
2.  If the user calling -enableFastLaunch is in yarn.admin.acl and 
yarn.service.framework.path is pre-configured, the user is allowed to upload 
service-dep.tar.gz.
3.  If the calling user is in dfs.cluster.administrators, the user is allowed 
to upload service-dep.tar.gz.
4.  Auto-upload follows the same logic.

> YARN Service dependency follow up work
> --
>
> Key: YARN-7799
> URL: https://issues.apache.org/jira/browse/YARN-7799
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client, resourcemanager
>Reporter: Gour Saha
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-7799.1.patch
>
>
> As per [~jianhe] these are some followup items that make sense to do after 
> YARN-7766. Quoting Jian's comment below -
> Currently, if user doesn't supply location when run yarn app 
> -enableFastLaunch, the jars will be put under this location
> {code}
> hdfs:///yarn-services//service-dep.tar.gz
> {code}
> Since API server is embedded in RM, should RM look for this location too if 
> "yarn.service.framework.path" is not specified ?
> And if "yarn.service.framework.path" is not specified and still the file 
> doesn't exist at above default location, I think RM can try to upload the 
> jars to above default location instead, currently RM is uploading the jars to 
> the location defined by below code. This folder is per app and also 
> inconsistent with CLI location.
> {code}
>   protected Path addJarResource(String serviceName,
>   Map localResources)
>   throws IOException, SliderException {
> Path libPath = fs.buildClusterDirPath(serviceName);
> {code}
> By doing this, the next time a submission request comes, RM doesn't need to 
> upload the jars again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8209) NPE in DeletionService

2018-04-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456680#comment-16456680
 ] 

Eric Yang commented on YARN-8209:
-

[~jlowe] [~ebadger] Yes, we have agreement on this issue.  The most frequent 
commands have the structure:

{code}
docker-command
format
name
{code}

If we export those key-value pairs to the container-executor environment, this 
approach will cover most of the cases.  Given that we have some idea of how to 
contain this problem, I think we can do this without reverting YARN-8064.  
Thoughts?
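
A minimal sketch of that direction, assuming hypothetical environment variable 
names (the actual variable names and plumbing would be decided in the patch):

{code}
#include <stdio.h>
#include <stdlib.h>

/* Read the most common docker command fields from the environment
 * instead of a .cmd file.  The variable names here are hypothetical. */
static int read_docker_command_from_env(const char **docker_command,
                                        const char **format,
                                        const char **name) {
    *docker_command = getenv("NM_DOCKER_COMMAND");
    *format = getenv("NM_DOCKER_FORMAT");
    *name = getenv("NM_DOCKER_CONTAINER_NAME");
    if (*docker_command == NULL || *name == NULL) {
        fprintf(stderr, "missing docker command environment variables\n");
        return -1;
    }
    return 0;   /* format may legitimately be absent for some commands */
}
{code}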

> NPE in DeletionService
> --
>
> Key: YARN-8209
> URL: https://issues.apache.org/jira/browse/YARN-8209
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Chandni Singh
>Assignee: Eric Badger
>Priority: Major
>
> {code:java}
> 2018-04-25 23:38:41,039 WARN  concurrent.ExecutorHelper 
> (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in 
> thread DeletionService #1:
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerClient.writeCommandToTempFile(DockerClient.java:109)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeDockerCommand(DockerCommandExecutor.java:85)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.executeStatusCommand(DockerCommandExecutor.java:192)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.docker.DockerCommandExecutor.getContainerStatus(DockerCommandExecutor.java:128)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.removeDockerContainer(LinuxContainerExecutor.java:935)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.deletion.task.DockerContainerDeletionTask.run(DockerContainerDeletionTask.java:61)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748){code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8204) Yarn Service Upgrade: Add a flag to disable upgrade

2018-04-27 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8204:

Fix Version/s: 3.1.1
   3.2.0

I just committed this, thank you [~csingh].

> Yarn Service Upgrade: Add a flag to disable upgrade
> ---
>
> Key: YARN-8204
> URL: https://issues.apache.org/jira/browse/YARN-8204
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chandni Singh
>Assignee: Chandni Singh
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8204.001.patch, YARN-8204.002.patch
>
>
> Add a flag that will enable/disable service upgrade on the cluster. 
> By default it is set to false since upgrade is in early stages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8211) Yarn registry dns log finds BufferUnderflowException on port ping

2018-04-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16456685#comment-16456685
 ] 

Eric Yang commented on YARN-8211:
-

Thank you [~billie.rinaldi] for the review and commit.

> Yarn registry dns log finds BufferUnderflowException on port ping
> -
>
> Key: YARN-8211
> URL: https://issues.apache.org/jira/browse/YARN-8211
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Yesha Vora
>Assignee: Eric Yang
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8211.001.patch, YARN-8211.002.patch
>
>
> Yarn registry dns server is constantly getting BufferUnderflowException. 
> {code:java}
> 2018-04-25 01:36:56,139 WARN  concurrent.ExecutorHelper 
> (ExecutorHelper.java:logThrowableFromAfterExecute(50)) - Execution exception 
> when running task in RegistryDNS 76
> 2018-04-25 01:36:56,139 WARN  concurrent.ExecutorHelper 
> (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in 
> thread RegistryDNS 76:
> java.nio.BufferUnderflowException
>         at java.nio.Buffer.nextGetIndex(Buffer.java:500)
>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:135)
>         at 
> org.apache.hadoop.registry.server.dns.RegistryDNS.getMessgeLength(RegistryDNS.java:820)
>         at 
> org.apache.hadoop.registry.server.dns.RegistryDNS.nioTCPClient(RegistryDNS.java:767)
>         at 
> org.apache.hadoop.registry.server.dns.RegistryDNS$3.call(RegistryDNS.java:846)
>         at 
> org.apache.hadoop.registry.server.dns.RegistryDNS$3.call(RegistryDNS.java:843)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748){code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-04-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457139#comment-16457139
 ] 

Eric Yang edited comment on YARN-8207 at 4/27/18 11:03 PM:
---

[~jlowe] Thank you for the review.  Good suggestions on the coding style 
issues; I will fix them.

{quote}
stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 
these file descriptors to 1 and 2 before the execv so any errors from docker 
run appear in those output files?{quote}

When using launch_script.sh, stdout and stderr are redirected inside 
launch_script.sh, which is bind-mounted to the host log directory.  This is why 
the files are fopen'd and fclose'd immediately, until the YARN-7654 logic is 
added.

{quote}The parent process that is responsible for obtaining the pid is not 
waiting for the child to complete before running the inspect command. That's 
why retries had to be added to get it to work when they were not needed before. 
The parent should simply wait and check for error exit codes as it did before 
when it was using popen. After that we can ditch the retries since they won't 
be necessary.{quote}

With launch_script.sh, container-executor runs "docker run" with the detach 
option and assumes the exit code can be obtained quickly.  This is why there is 
no retry logic around "docker inspect".  That assumption is somewhat flawed: if 
the docker image is unavailable on the host, docker will show download progress 
along with other information and errors.  That progress is not captured, which 
makes debugging difficult, and when docker inspect is probed there is no 
information about what failed.

Without launch_script.sh, container-executor runs "docker run" in the 
foreground and obtains the pid when the first process is started.  The inspect 
command is checked asynchronously because the docker run exit code is only 
reported when the docker process terminates.  There is a balance in how long we 
should wait before deciding that the system is hung.  We can make MAX_RETRIES 
configurable in case people prefer to wait longer before deciding that the 
container should fail.

{quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote}

This change makes the make_string function about twice as fast as the sample 
code, while wasting 1% or less space when recursion is required.  It is 
probably a reasonable trade-off on modern computers.



was (Author: eyang):
[~jlowe] Thank you for the review.  Good suggestions on coding style issues.  I 
will fix the coding style issues.

{quote}
stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 
these file descriptors to 1 and 2 before the execv so any errors from docker 
run appear in those output files?{quote}

When using launch_script.sh, there is stdout and stderr redirection inside 
launch_script.sh which bind-mount to host log directory.  This is the reason 
that there is fopen and fclosed immediately until YARN-7654 logic are added.

{quote}The parent process that is responsible for obtaining the pid is not 
waiting for the child to complete before running the inspect command. That's 
why retries had to be added to get it to work when they were not needed before. 
The parent should simply wait and check for error exit codes as it did before 
when it was using popen. After that we can ditch the retries since they won't 
be necessary.{quote}

Using launch_script.sh, container-executor runs "docker run" with detach 
option.  It assumes the exit code can be obtained quickly.  This is the reason 
there is no logic for retry "docker inspect".  This assumption is some what 
flawed.  If the docker image is unavailable on the host, docker will show 
download progress and some other information and errors.  The progression are 
not captured, which is difficult to debug.  When docker inspect is probed, 
there is no information of what failed.

Without launch_script.sh, container-executor runs "docker run" in the 
foreground, and obtain pid when the first process is started.  Inspect command 
is checked asynchronously because docker run exit code is only reported when 
the docker process is terminated.  There is a balance between how long that we 
should wait before we decide if the system is hang.  We can make MAX_RETRIES 
configurable in case people like to wait for longer or period of time before 
deciding if the container should fail.

{quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote}

This change make make_string function twice faster than sample code while waste 
1% or less space if recursion is required.  It is probably a reasonable trade 
off for modern day computers.


> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: 

[jira] [Comment Edited] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-04-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457139#comment-16457139
 ] 

Eric Yang edited comment on YARN-8207 at 4/27/18 11:06 PM:
---

[~jlowe] Thank you for the review.  Good suggestions on the coding style 
issues; I will fix them.

{quote}
stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 
these file descriptors to 1 and 2 before the execv so any errors from docker 
run appear in those output files?{quote}

When using launch_script.sh, stdout and stderr are redirected inside 
launch_script.sh, which is bind-mounted to the host log directory.  This is why 
the files are fopen'd and fclose'd immediately, until the YARN-7654 logic is 
added.

{quote}The parent process that is responsible for obtaining the pid is not 
waiting for the child to complete before running the inspect command. That's 
why retries had to be added to get it to work when they were not needed before. 
The parent should simply wait and check for error exit codes as it did before 
when it was using popen. After that we can ditch the retries since they won't 
be necessary.{quote}

With launch_script.sh, container-executor runs "docker run" with the detach 
option and assumes the exit code can be obtained quickly.  This is why there is 
no retry logic around "docker inspect".  That assumption is somewhat flawed: if 
the docker image is unavailable on the host, docker will show download progress 
along with other information and errors.  That progress is not captured, which 
makes debugging difficult, and when docker inspect is probed there is no 
information about what failed.

Without launch_script.sh, container-executor runs "docker run" in the 
foreground and obtains the pid when the first process is started.  The inspect 
command is checked asynchronously because the docker run exit code is only 
reported when the docker process terminates.  There is a balance in how long we 
should wait before deciding that the system is hung.  We can make MAX_RETRIES 
configurable in case people have a different preference for how long to wait on 
docker inspect.

{quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote}

This change makes the make_string function about twice as fast as the sample 
code, while wasting 1% or less space when recursion is required.  It is 
probably a reasonable trade-off on modern computers.



was (Author: eyang):
[~jlowe] Thank you for the review.  Good suggestions on coding style issues.  I 
will fix the coding style issues.

{quote}
stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 
these file descriptors to 1 and 2 before the execv so any errors from docker 
run appear in those output files?{quote}

When using launch_script.sh, there is stdout and stderr redirection inside 
launch_script.sh which bind-mount to host log directory.  This is the reason 
that there is fopen and fclosed immediately until YARN-7654 logic are added.

{quote}The parent process that is responsible for obtaining the pid is not 
waiting for the child to complete before running the inspect command. That's 
why retries had to be added to get it to work when they were not needed before. 
The parent should simply wait and check for error exit codes as it did before 
when it was using popen. After that we can ditch the retries since they won't 
be necessary.{quote}

Using launch_script.sh, container-executor runs "docker run" with detach 
option.  It assumes the exit code can be obtained quickly.  This is the reason 
there is no logic for retry "docker inspect".  This assumption is some what 
flawed.  If the docker image is unavailable on the host, docker will show 
download progress and some other information and errors.  The progression are 
not captured, which is difficult to debug.  When docker inspect is probed, 
there is no information of what failed.

Without launch_script.sh, container-executor runs "docker run" in the 
foreground, and obtain pid when the first process is started.  Inspect command 
is checked asynchronously because docker run exit code is only reported when 
the docker process is terminated.  There is a balance between how long that we 
should wait before we decide if the system is hang.  We can make MAX_RETRIES 
configurable in case people like to wait for longer or period of time before 
deciding if the container should fail.

{quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote}

This change makes make_string function twice faster than sample code while 
waste 1% or less space if recursion is required.  It is probably a reasonable 
trade off for modern day computers.


> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: 

[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-04-27 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16457139#comment-16457139
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Thank you for the review.  Good suggestions on the coding style 
issues; I will fix them.

{quote}
stderr.txt is fopen'd and never used before the fclose? Are we supposed to dup2 
these file descriptors to 1 and 2 before the execv so any errors from docker 
run appear in those output files?{quote}

When using launch_script.sh, stdout and stderr are redirected inside 
launch_script.sh, which is bind-mounted to the host log directory.  This is why 
the files are fopen'd and fclose'd immediately, until the YARN-7654 logic is 
added.

{quote}The parent process that is responsible for obtaining the pid is not 
waiting for the child to complete before running the inspect command. That's 
why retries had to be added to get it to work when they were not needed before. 
The parent should simply wait and check for error exit codes as it did before 
when it was using popen. After that we can ditch the retries since they won't 
be necessary.{quote}

With launch_script.sh, container-executor runs "docker run" with the detach 
option and assumes the exit code can be obtained quickly.  This is why there is 
no retry logic around "docker inspect".  That assumption is somewhat flawed: if 
the docker image is unavailable on the host, docker will show download progress 
along with other information and errors.  That progress is not captured, which 
makes debugging difficult, and when docker inspect is probed there is no 
information about what failed.

Without launch_script.sh, container-executor runs "docker run" in the 
foreground and obtains the pid when the first process is started.  The inspect 
command is checked asynchronously because the docker run exit code is only 
reported when the docker process terminates.  There is a balance in how long we 
should wait before deciding that the system is hung.  We can make MAX_RETRIES 
configurable in case people prefer to wait longer before deciding that the 
container should fail.

{quote}Why does make_string calculate size = n + 2 instead of n + 1?{quote}

This change makes the make_string function about twice as fast as the sample 
code, while wasting 1% or less space when recursion is required.  It is 
probably a reasonable trade-off on modern computers.


> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8207.001.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-09 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469823#comment-16469823
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] Thank you for the review; the styling improvements will be addressed.

{quote}
DockerClient is creating the environment file in /tmp which has the same 
leaking problem we had with the docker .cmd files.
{quote}

The patch writes the .env file in the same nmPrivate directory as the .cmd 
file.  It doesn't write to /tmp.

{quote}
The code is now writing "Launching docker container..." etc. even when not 
using the entry point. Are these smashed by the container_launch.sh script when 
not using the entry point? If not it could be an issue since it's changing what 
the user's code is writing to those files today.{quote}

Yes, these lines are overwritten by container_launch.sh in non-ENTRY_POINT 
mode, so existing compatibility is not broken.
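
A toy illustration of the layout being described (the paths and file names 
below are placeholders, not the exact ones the patch uses):

{code:java}
import java.io.File;

// Hedged sketch: the .env file sits next to the .cmd file under the
// per-container nmPrivate directory, not under /tmp.
public class NmPrivateLayout {
  public static void main(String[] args) {
    File nmPrivateContainerDir =
        new File("/hadoop/yarn/local/nmPrivate/app_X/container_Y");
    File cmdFile = new File(nmPrivateContainerDir, "docker.container_Y.cmd");
    File envFile = new File(nmPrivateContainerDir, "docker.container_Y.env");
    System.out.println(cmdFile);
    System.out.println(envFile);
  }
}
{code}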


> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-10 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470814#comment-16470814
 ] 

Eric Yang commented on YARN-8108:
-

[~daryn] The fact that this issue doesn't appear in Hadoop 2.7.5 does not mean 
it was done properly there.  It is not possible to configure different HTTP 
principals for the RM and the Proxy Server on the same host/port, and it was 
only half working.  This is because Hadoop only has the 
yarn.resourcemanager.webapp.spnego-keytab-file and 
yarn.resourcemanager.webapp.spnego-principal settings to define the HTTP 
principal used by the RM server.  It has no 
yarn.web-proxy.webapp.spnego-keytab-file or 
yarn.web-proxy.webapp.spnego-principal settings to differentiate the two, and 
even if such settings were defined, they would not be used.  Further analysis 
of Hadoop 2.7.5 shows that the /proxy URL is not secured by any HTTP principal 
when running in RM embedded mode.
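
For reference, the only SPNEGO-related keys mentioned above are RM-scoped; a 
small illustration of reading them (illustration only, there is no proxy-scoped 
counterpart to read):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustration only: these RM-level keys are the ones referenced above.
public class SpnegoSettings {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    System.out.println("RM SPNEGO principal: "
        + conf.get("yarn.resourcemanager.webapp.spnego-principal"));
    System.out.println("RM SPNEGO keytab: "
        + conf.get("yarn.resourcemanager.webapp.spnego-keytab-file"));
  }
}
{code}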

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7799) YARN Service dependency follow up work

2018-05-10 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7799:

Fix Version/s: 3.1.1
   3.2.0

> YARN Service dependency follow up work
> --
>
> Key: YARN-7799
> URL: https://issues.apache.org/jira/browse/YARN-7799
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: client, resourcemanager
>Reporter: Gour Saha
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-7799.1.patch, YARN-7799.2.patch, YARN-7799.3.patch, 
> YARN-7799.4.patch, YARN-7799.5.patch
>
>
> As per [~jianhe] these are some followup items that make sense to do after 
> YARN-7766. Quoting Jian's comment below -
> Currently, if user doesn't supply location when run yarn app 
> -enableFastLaunch, the jars will be put under this location
> {code}
> hdfs:///yarn-services//service-dep.tar.gz
> {code}
> Since API server is embedded in RM, should RM look for this location too if 
> "yarn.service.framework.path" is not specified ?
> And if "yarn.service.framework.path" is not specified and still the file 
> doesn't exist at above default location, I think RM can try to upload the 
> jars to above default location instead, currently RM is uploading the jars to 
> the location defined by below code. This folder is per app and also 
> inconsistent with CLI location.
> {code}
>   protected Path addJarResource(String serviceName,
>   Map localResources)
>   throws IOException, SliderException {
> Path libPath = fs.buildClusterDirPath(serviceName);
> {code}
> By doing this, the next time a submission request comes, RM doesn't need to 
> upload the jars again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM

2018-05-11 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8265:

Target Version/s: 3.2.0, 3.1.1  (was: 3.2.0)
   Fix Version/s: 3.1.1
  3.2.0

> Service AM should retrieve new IP for docker container relaunched by NM
> ---
>
> Key: YARN-8265
> URL: https://issues.apache.org/jira/browse/YARN-8265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8265.001.patch, YARN-8265.002.patch, 
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472923#comment-16472923
 ] 

Eric Yang commented on YARN-8265:
-

+1 looks good to me.  I just committed this on trunk and branch-3.1.
Thank you [~billie.rinaldi] for the review and patch.

> Service AM should retrieve new IP for docker container relaunched by NM
> ---
>
> Key: YARN-8265
> URL: https://issues.apache.org/jira/browse/YARN-8265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8265.001.patch, YARN-8265.002.patch, 
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472901#comment-16472901
 ] 

Eric Yang commented on YARN-8265:
-

"onContainerRestart" event is currently not working.  Therefore the workaround 
solution is the only feasible solution.  Therefore, I am inclined to commit the 
patch 003 for 3.1.1 release.

> Service AM should retrieve new IP for docker container relaunched by NM
> ---
>
> Key: YARN-8265
> URL: https://issues.apache.org/jira/browse/YARN-8265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8265.001.patch, YARN-8265.002.patch, 
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM

2018-05-12 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473179#comment-16473179
 ] 

Eric Yang commented on YARN-8265:
-

[~billie.rinaldi] The plan looks good.  Thank you.

> Service AM should retrieve new IP for docker container relaunched by NM
> ---
>
> Key: YARN-8265
> URL: https://issues.apache.org/jira/browse/YARN-8265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8265.001.patch, YARN-8265.002.patch, 
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8286) Add NMClient callback on container relaunch

2018-05-12 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8286:

Description: The AM may need to perform actions when a container has been 
relaunched. For example, the service AM would want to change the state it has 
recorded for the container and retrieve new container status for the container, 
in case the container IP has changed. (The NM would also need to remove the IP 
it has stored for the container, so container status calls don't return an IP 
for a container that is not currently running.)
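
A hypothetical shape for such a callback (this interface does not exist in 
NMClientAsync today; the name and signature are illustrative only):

{code:java}
import org.apache.hadoop.yarn.api.records.ContainerId;

// Hypothetical sketch only; the actual callback added for this issue may
// differ in name and signature.
public interface ContainerRelaunchListener {

  // Invoked after the NM relaunches a container in place, so the AM can
  // refresh the container status (for example, its IP address).
  void onContainerRelaunch(ContainerId containerId);
}
{code}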

> Add NMClient callback on container relaunch
> ---
>
> Key: YARN-8286
> URL: https://issues.apache.org/jira/browse/YARN-8286
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Priority: Critical
>
> The AM may need to perform actions when a container has been relaunched. For 
> example, the service AM would want to change the state it has recorded for 
> the container and retrieve new container status for the container, in case 
> the container IP has changed. (The NM would also need to remove the IP it has 
> stored for the container, so container status calls don't return an IP for a 
> container that is not currently running.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8286) Add NMClient callback on container relaunch

2018-05-12 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8286:

Environment: (was: The AM may need to perform actions when a container 
has been relaunched. For example, the service AM would want to change the state 
it has recorded for the container and retrieve new container status for the 
container, in case the container IP has changed. (The NM would also need to 
remove the IP it has stored for the container, so container status calls don't 
return an IP for a container that is not currently running.))

> Add NMClient callback on container relaunch
> ---
>
> Key: YARN-8286
> URL: https://issues.apache.org/jira/browse/YARN-8286
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Billie Rinaldi
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8265) Service AM should retrieve new IP for docker container relaunched by NM

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472837#comment-16472837
 ] 

Eric Yang commented on YARN-8265:
-

[~billie.rinaldi] I am struggling to understand why the node manager would 
decide to restart the docker container without consulting the application 
master.  The AM decides the state of the containers, and the node manager only 
follows orders from the AM.  This helps prevent race conditions between the AM 
and NM when deciding which containers should stay up and running, and the AM 
follows state transitions to ensure it stays on a pre-defined path.  With 
container relaunch implemented in YARN-7973, the AM still decides when to 
restart a container, and the "onContainerRestart" event will be received by the 
AM.  If we run ContainerStartedTransition again, it will check for IP changes 
and cancel the scheduled timer thread.  I think this leads to a more desirable 
outcome without leaving the timer thread open ended.

An alternate approach is to move the ContainerStatusRetriever to 
ContainerBecomeReadyTransition and use the BECOME_READY transition to check for 
the IP address.
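
A minimal sketch of the retriever-restart idea in plain Java (all names are 
hypothetical; the real logic would live in the service AM's component state 
machine):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hedged sketch: cancel any previous status retriever and schedule a new one
// when the AM learns that a container was relaunched.
public class StatusRetrieverRestart {

  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final AtomicReference<ScheduledFuture<?>> retriever =
      new AtomicReference<>();

  void onContainerRelaunched(String containerId) {
    ScheduledFuture<?> previous = retriever.get();
    if (previous != null) {
      previous.cancel(false); // stop the old retriever, if still scheduled
    }
    // Poll container status again until the new IP is observed.
    retriever.set(scheduler.scheduleAtFixedRate(
        () -> System.out.println("poll status/IP for " + containerId),
        0, 1, TimeUnit.SECONDS));
  }

  public static void main(String[] args) throws InterruptedException {
    StatusRetrieverRestart restart = new StatusRetrieverRestart();
    restart.onContainerRelaunched("container_example");
    Thread.sleep(3000);
    restart.scheduler.shutdownNow();
  }
}
{code}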

> Service AM should retrieve new IP for docker container relaunched by NM
> ---
>
> Key: YARN-8265
> URL: https://issues.apache.org/jira/browse/YARN-8265
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-native-services
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Billie Rinaldi
>Priority: Critical
> Attachments: YARN-8265.001.patch, YARN-8265.002.patch, 
> YARN-8265.003.patch
>
>
> When a docker container is restarted, it gets a new IP, but the service AM 
> only retrieves one IP for a container and then cancels the container status 
> retriever. I suspect the issue would be solved by restarting the retriever 
> (if it has been canceled) when the onContainerRestart callback is received, 
> but we'll have to do some testing to make sure this works.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472839#comment-16472839
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] Thank you for the great reviews and commit.
[~shaneku...@gmail.com] [~Jim_Brennan] [~ebadger] Thank you for the reviews.

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch, 
> YARN-7654.024.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-10 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471387#comment-16471387
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] Patch 22 contains all requested changes except the refactoring of 
AbstractProviderService and DockerProviderService.  I tried to refactor that 
code, but I don't have a working implementation yet.  Due to time constraints, 
I am uploading the latest revision for your review first.

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-10 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7654:

Attachment: YARN-7654.022.patch

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-11 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8108:

Target Version/s: 3.2.0, 3.1.1, 3.0.3

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8108) RM metrics rest API throws GSSException in kerberized environment

2018-05-11 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8108:

Priority: Blocker  (was: Major)

> RM metrics rest API throws GSSException in kerberized environment
> -
>
> Key: YARN-8108
> URL: https://issues.apache.org/jira/browse/YARN-8108
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.0.0
>Reporter: Kshitij Badani
>Assignee: Eric Yang
>Priority: Blocker
> Attachments: YARN-8108.001.patch
>
>
> Test is trying to pull up metrics data from SHS after kiniting as 'test_user'
> It is throwing GSSException as follows
> {code:java}
> b2b460b80713|RUNNING: curl --silent -k -X GET -D 
> /hwqe/hadoopqe/artifacts/tmp-94845 --negotiate -u : 
> http://rm_host:8088/proxy/application_1518674952153_0070/metrics/json2018-02-15
>  07:15:48,757|INFO|MainThread|machine.py:194 - 
> run()||GUID=fc5a3266-28f8-4eed-bae2-b2b460b80713|Exit Code: 0
> 2018-02-15 07:15:48,758|INFO|MainThread|spark.py:1757 - 
> getMetricsJsonData()|metrics:
> 
> 
> 
> Error 403 GSSException: Failure unspecified at GSS-API level 
> (Mechanism level: Request is a replay (34))
> 
> HTTP ERROR 403
> Problem accessing /proxy/application_1518674952153_0070/metrics/json. 
> Reason:
>  GSSException: Failure unspecified at GSS-API level (Mechanism level: 
> Request is a replay (34))
> 
> 
> {code}
> Rootcausing : proxyserver on RM can't be supported for Kerberos enabled 
> cluster because AuthenticationFilter is applied twice in Hadoop code (once in 
> httpServer2 for RM, and another instance from AmFilterInitializer for proxy 
> server). This will require code changes to hadoop-yarn-server-web-proxy 
> project



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472407#comment-16472407
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] Patch 23 includes all your suggestions.

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8274) Docker command error during container relaunch

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472392#comment-16472392
 ] 

Eric Yang commented on YARN-8274:
-

Sorry, this code was missed during the refactoring.

+1, the change looks good.

> Docker command error during container relaunch
> --
>
> Key: YARN-8274
> URL: https://issues.apache.org/jira/browse/YARN-8274
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Billie Rinaldi
>Assignee: Jason Lowe
>Priority: Critical
> Attachments: YARN-8274.001.patch, YARN-8274.002.patch
>
>
> I initiated container relaunch with a "sleep 60; exit 1" launch command and 
> saw a "not a docker command" error on relaunch. Haven't figured out why this 
> is happening, but it seems like it has been introduced recently to 
> trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger]
> {noformat}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Relaunch container failed
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1525897486447_0003_01_02
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception 
> message: Relaunch container failed
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error 
> output: docker: 'container_1525897486447_0003_01_02' is not a docker 
> command.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7654:

Attachment: YARN-7654.023.patch

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8274) Docker command error during container relaunch

2018-05-11 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8274:

Fix Version/s: 3.1.1
   3.2.0

> Docker command error during container relaunch
> --
>
> Key: YARN-8274
> URL: https://issues.apache.org/jira/browse/YARN-8274
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Billie Rinaldi
>Assignee: Jason Lowe
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8274.001.patch, YARN-8274.002.patch
>
>
> I initiated container relaunch with a "sleep 60; exit 1" launch command and 
> saw a "not a docker command" error on relaunch. Haven't figured out why this 
> is happening, but it seems like it has been introduced recently to 
> trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger]
> {noformat}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Relaunch container failed
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1525897486447_0003_01_02
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception 
> message: Relaunch container failed
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error 
> output: docker: 'container_1525897486447_0003_01_02' is not a docker 
> command.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472318#comment-16472318
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] Thanks for the reply.  Some answers:

{quote}
In DockerProviderService#buildContainerLaunchContext it's calling 
processArtifact then super.buildContainerLaunchContext, but the parent's 
buildContainerLaunchContext calls processArtifact as well. Is the double-call 
intentional?
{quote}

Not intentional, this is fixed in patch 23.

{quote}Note I'm not sure if we really need to rebuild tokensForSubtitution in 
DockerProviderService, I'm just preserving what the patch was doing. AFAICT the 
only difference between what the patch had DockerProviderService build for 
tokens and what AbstractProviderService builds is the latter is doing a pass 
adding ${env} forms of every env var to the map. If DockerProviderService is 
supposed to be doing that as well then it can just use the tokenProviderService 
arg directly rather than building it from scratch.{quote}

I was able to do the refactoring this morning with a clear head.  Patch 23 is 
more readable without the repetition.



> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8274) Docker command error during container relaunch

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472771#comment-16472771
 ] 

Eric Yang commented on YARN-8274:
-

[~ebadger] Your earnest advocacy is not going unheard.  I am sorry that I 
introduced bugs during the rebase.  There is no excuse for making mistakes when a 
patch is snowballing.  It won't happen again.

[~jlowe] Nits: It would be nice if the code were refactored to add docker_binary 
once in construct_docker_command, to avoid the duplicated add_to_args call for 
docker_binary in every get_docker_*_command, but the priority is to get to a good 
stable state for the release.  Hence, I am sorry that I committed this 
prematurely without listening to my inner voice.
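
For what it's worth, here is a minimal sketch of the kind of refactoring I mean.  
It uses simplified stand-in types and signatures (the struct layout, MAX_ARGS, 
and the inspect sub-command here are illustrative and do not match the real 
container-executor code); the point is that the docker binary is added exactly 
once in construct_docker_command instead of in every get_docker_*_command:

{code}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 64

/* Simplified stand-in for the container-executor args buffer. */
struct args {
  int length;
  char *data[MAX_ARGS];
};

static int add_to_args(struct args *a, const char *value) {
  if (value == NULL || a->length >= MAX_ARGS - 1) {
    return -1;                       /* NULL value or buffer full */
  }
  a->data[a->length] = strdup(value);
  if (a->data[a->length] == NULL) {
    return -1;                       /* allocation failure */
  }
  a->length++;
  a->data[a->length] = NULL;         /* keep the array NULL terminated for execv */
  return 0;
}

/* Each get_docker_*_command appends only its sub-command and options... */
static int get_docker_inspect_command(struct args *a, const char *container_id) {
  return add_to_args(a, "inspect") || add_to_args(a, container_id);
}

/* ...because construct_docker_command adds the docker binary exactly once. */
static int construct_docker_command(struct args *a, const char *docker_binary,
                                    const char *container_id) {
  if (add_to_args(a, docker_binary) != 0) {
    return -1;
  }
  return get_docker_inspect_command(a, container_id);
}

int main(void) {
  struct args buffer = { 0 };
  if (construct_docker_command(&buffer, "/usr/bin/docker",
                               "container_1525897486447_0003_01_02") != 0) {
    fprintf(stderr, "failed to build docker command\n");
    return 1;
  }
  for (int i = 0; i < buffer.length; i++) {
    printf("argv[%d] = %s\n", i, buffer.data[i]);
    free(buffer.data[i]);
  }
  return 0;
}
{code}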

> Docker command error during container relaunch
> --
>
> Key: YARN-8274
> URL: https://issues.apache.org/jira/browse/YARN-8274
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Billie Rinaldi
>Assignee: Jason Lowe
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8274.001.patch, YARN-8274.002.patch
>
>
> I initiated container relaunch with a "sleep 60; exit 1" launch command and 
> saw a "not a docker command" error on relaunch. Haven't figured out why this 
> is happening, but it seems like it has been introduced recently to 
> trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger]
> {noformat}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Relaunch container failed
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1525897486447_0003_01_02
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception 
> message: Relaunch container failed
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error 
> output: docker: 'container_1525897486447_0003_01_02' is not a docker 
> command.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472778#comment-16472778
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] All 5 scenarios passed in my local Kerberos-enabled cluster tests.

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch, 
> YARN-7654.024.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8274) Docker command error during container relaunch

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472780#comment-16472780
 ] 

Eric Yang commented on YARN-8274:
-

[~jlowe] Thank you for all your efforts.  It is greatly appreciated.

> Docker command error during container relaunch
> --
>
> Key: YARN-8274
> URL: https://issues.apache.org/jira/browse/YARN-8274
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Billie Rinaldi
>Assignee: Jason Lowe
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8274.001.patch, YARN-8274.002.patch
>
>
> I initiated container relaunch with a "sleep 60; exit 1" launch command and 
> saw a "not a docker command" error on relaunch. Haven't figured out why this 
> is happening, but it seems like it has been introduced recently to 
> trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger]
> {noformat}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Relaunch container failed
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1525897486447_0003_01_02
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception 
> message: Relaunch container failed
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error 
> output: docker: 'container_1525897486447_0003_01_02' is not a docker 
> command.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8274) Docker command error during container relaunch

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472548#comment-16472548
 ] 

Eric Yang commented on YARN-8274:
-

[~ebadger] Sorry, my mistake; I thought the report was for the second patch.  
With the 3.1.1 code freeze on Saturday, it is easy to make mistakes, and I would 
like to get YARN-7654 committed before the end of today.  YARN-7654 and YARN-8207 
have probably been left uncommitted for too long, and it is easy to make mistakes 
when rebasing changes that include logic from other patches, including YARN-7973, 
YARN-8209, YARN-8261, and YARN-8064.  I recommend going through YARN-7654 to make 
sure the rebase was done correctly for those patches.

> Docker command error during container relaunch
> --
>
> Key: YARN-8274
> URL: https://issues.apache.org/jira/browse/YARN-8274
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Billie Rinaldi
>Assignee: Jason Lowe
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8274.001.patch, YARN-8274.002.patch
>
>
> I initiated container relaunch with a "sleep 60; exit 1" launch command and 
> saw a "not a docker command" error on relaunch. Haven't figured out why this 
> is happening, but it seems like it has been introduced recently to 
> trunk/branch-3.1. cc [~shaneku...@gmail.com] [~ebadger]
> {noformat}
> org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException:
>  Relaunch container failed
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.relaunchContainer(DockerLinuxContainerRuntime.java:954)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.relaunchContainer(DelegatingLinuxContainerRuntime.java:150)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:562)
> at 
> org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.relaunchContainer(LinuxContainerExecutor.java:486)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.relaunchContainer(ContainerLaunch.java:504)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:111)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerRelaunch.call(ContainerRelaunch.java:47)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from 
> container-launch.
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: 
> container_1525897486447_0003_01_02
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 7
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception 
> message: Relaunch container failed
> 2018-05-09 21:41:46,631 INFO 
> org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Shell error 
> output: docker: 'container_1525897486447_0003_01_02' is not a docker 
> command.
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7654:

Attachment: YARN-7654.024.patch

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch, 
> YARN-7654.024.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-11 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472661#comment-16472661
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] Patch 24 fixed the issues above.  I still need time to test all 5 
scenarios to make sure that the command doesn't get pre-processed by mistake.  
The 5 scenarios are:

# Mapreduce
# LLAP app
# Docker app with command override
# Docker app with entry point
# Docker app with entry point and no launch command

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch, YARN-7654.022.patch, YARN-7654.023.patch, 
> YARN-7654.024.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-10 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471073#comment-16471073
 ] 

Eric Yang commented on YARN-7654:
-

[~jlowe] I am struggling with the following problems:
{quote}AbstractProviderService#buildContainerLaunchContext so the pieces needed 
by DockerProviderService can be reused without requiring the launcher command 
to be clobbered afterwards?{quote}

The launch command is overridden to {{bash -c 'launch-command'}} in 
DockerLinuxContainerRuntime, the log redirection '2> /stderr.txt 1> /stdout.txt' 
is subsequently appended, and it is then rewritten with the actual container 
logging directory.  The number of preprocessing steps to go through before 
writing the .cmd file complicates refactoring the code base without breaking 
things.  This is the reason setCommand was created: to flush out the override 
commands and ensure the command is not tampered with incorrectly during the hand 
off from DockerLinuxContainerRuntime to DockerClient to container-executor.  For 
safety reasons, I keep setCommand to ensure the command is not tampered with by 
string substitutions and the YARN v2 API is not broken.

{quote}The instance checking and downcasting in writeCommandToTempFile looks 
pretty ugly. It would be cleaner to encapsulate this in the DockerCommand 
abstraction. One example way to do this is to move the logic of writing a 
docker command file into the DockerCommand abstract class. DockerRunCommand can 
then override that method to call the parent method and then separately write 
the env file. Worst case we can add a getEnv method to DockerCommand that 
returns the collection of environment variables to write out for a command. 
DockerCommand would return null or an empty collection while DockerRunCommand 
can return its environment.{quote}

DockerCommand is a data structure class.  It does not handle IO operations.  If 
we moved IO operations into this class, it would no longer be a clean data 
structure representing the docker command.  I think it is more self-explanatory 
that for DockerRunCommand we also write out the environment file.  With the 
changes in YARN-8261, we want to ensure that the directory is created, the cmd 
file is created, and the env file is created.  For safety reasons, I think we 
should not make stylistic changes in this area at this time, because we are out 
of time to thoroughly retest what has already been tested in the previous patch 
set.



> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch, 
> YARN-7654.021.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-05 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: YARN-8207.006.patch

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-05 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: (was: YARN-8207.006.patch)

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467673#comment-16467673
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe]
{quote}At a bare minimum there should be a utility method, e.g: 
extract_execv_args(args* args){quote}
I agree with this point and will do this.

{quote}Please create an init function, e.g.: init_args(args* args), or a macro 
to encapsulate initialization of the structure.{quote}
init_args would only be assigning 0 to the length.  I prefer to write it as:
{code}
struct args buffer = { 0 };
{code}

Instead of:

{code}
struct args *buffer = malloc(sizeof(struct args));
init_args(buffer);
{code}

I understand the desire, even the obsession, for code perfection, but I am trying 
to restrain myself from making more of a mess during crunch time.
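
For completeness, an aggregate zero initializer like the one above 
zero-initializes the whole struct, not just the first member, so it leaves the 
buffer in the same state an init_args helper would.  A tiny self-contained sketch 
(the struct shape and MAX_ARGS are illustrative stand-ins, not the real 
container-executor definition):

{code}
#include <assert.h>

#define MAX_ARGS 64

/* Illustrative stand-in for the real args struct. */
struct args {
  int length;
  char *data[MAX_ARGS];
};

int main(void) {
  /* Aggregate zero initialization: length becomes 0 and every pointer in
   * data becomes NULL, which is exactly the state an init_args(&buffer)
   * helper would have to set up by hand. */
  struct args buffer = { 0 };
  assert(buffer.length == 0);
  for (int i = 0; i < MAX_ARGS; i++) {
    assert(buffer.data[i] == NULL);
  }
  return 0;
}
{code}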

{quote}As add_to_args works today, the lack of a NULL check on the make_string 
result will cause the program to crash. {quote}

Sorry, I thought I had a NULL check, but it was changed to a length check.  This 
will be fixed.

{quote}would be safer and easier to understand written with strdup/strndup, 
e.g.:
{code}
  dst = strndup(values[i], tmp_ptr - values[i]);
  pattern = strdup(permitted_values[j] + 6);
{code}
{quote}
This will be optimized.

{quote}make_string is still not checking for vsnprintf failure. If the first 
vsnprintf fails and returns -1, the code will allocate a 0-byte buffer.{quote}

No, it doesn't.  malloc(-1) will return NULL instead of a 0-byte buffer, so the 
second check will not succeed.
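
That said, the intent can be made explicit by checking the vsnprintf return value 
directly instead of relying on malloc failing for a negative size.  This is only 
a sketch of that pattern, not the actual container-executor make_string 
implementation:

{code}
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch of a make_string-style helper that rejects vsnprintf failure
 * explicitly before allocating. */
char *make_string(const char *fmt, ...) {
  va_list vargs;
  va_start(vargs, fmt);
  int ret = vsnprintf(NULL, 0, fmt, vargs);   /* measure the needed length */
  va_end(vargs);
  if (ret < 0) {
    return NULL;                              /* formatting failed, bail out */
  }
  char *buf = malloc((size_t) ret + 1);
  if (buf == NULL) {
    return NULL;                              /* allocation failed */
  }
  va_start(vargs, fmt);
  vsnprintf(buf, (size_t) ret + 1, fmt, vargs);
  va_end(vargs);
  return buf;
}

int main(void) {
  char *s = make_string("--name=%s", "container_1525897486447_0003_01_02");
  if (s != NULL) {
    printf("%s\n", s);
    free(s);
  }
  return 0;
}
{code}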


> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-07 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-7654:

Attachment: YARN-7654.020.patch

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7654) Support ENTRY_POINT for docker container

2018-05-07 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466258#comment-16466258
 ] 

Eric Yang commented on YARN-7654:
-

Rebased patch 20 to be based on YARN-8207 patch 007.

> Support ENTRY_POINT for docker container
> 
>
> Key: YARN-7654
> URL: https://issues.apache.org/jira/browse/YARN-7654
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn
>Affects Versions: 3.1.0
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-7654.001.patch, YARN-7654.002.patch, 
> YARN-7654.003.patch, YARN-7654.004.patch, YARN-7654.005.patch, 
> YARN-7654.006.patch, YARN-7654.007.patch, YARN-7654.008.patch, 
> YARN-7654.009.patch, YARN-7654.010.patch, YARN-7654.011.patch, 
> YARN-7654.012.patch, YARN-7654.013.patch, YARN-7654.014.patch, 
> YARN-7654.015.patch, YARN-7654.016.patch, YARN-7654.017.patch, 
> YARN-7654.018.patch, YARN-7654.019.patch, YARN-7654.020.patch
>
>
> Docker image may have ENTRY_POINT predefined, but this is not supported in 
> the current implementation.  It would be nice if we can detect existence of 
> {{launch_command}} and base on this variable launch docker container in 
> different ways:
> h3. Launch command exists
> {code}
> docker run [image]:[version]
> docker exec [container_id] [launch_command]
> {code}
> h3. Use ENTRY_POINT
> {code}
> docker run [image]:[version]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467727#comment-16467727
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Patch 9 fixes most of the nits from your comments except init_args.  I 
did not write init_args, to prevent myself from making a mess.  If you have 
strong feelings about the initialization, please open a separate issue for it.  
Thanks

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, 
> YARN-8207.009.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467762#comment-16467762
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] I see what you mean now, and patch 10 is updated accordingly for the 
args initialization and the make_string check.  One concern about the shallow 
copy: the struct args buffer is supposed to disappear after 
construct_docker_command.  That was the reason I used a deep copy to extract the 
data.  Now I am retaining pointer references to the strings internal to the 
struct args buffer instead of making a deep copy.  Wouldn't those strings get 
overwritten at some point, or will they be preserved until the copy is freed up?
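
For reference, a small self-contained illustration of the lifetime in question 
(the extract_execv_args shape and the struct layout here are simplified 
stand-ins, not the real code): the struct args variable itself can be stack 
storage that disappears, but the strings it points to are separate heap 
allocations, so shallow-copied pointers stay valid until those strings are freed 
exactly once.

{code}
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ARGS 8

struct args {
  int length;
  char *data[MAX_ARGS];
};

/* Shallow copy: only the pointers move into the heap-allocated argv array.
 * The pointed-to strings are the same heap blocks that strdup created, so
 * they remain valid after the stack 'struct args' goes out of scope. */
static char **extract_execv_args(struct args *a) {
  char **argv = calloc((size_t) a->length + 1, sizeof(char *));
  if (argv == NULL) {
    return NULL;
  }
  memcpy(argv, a->data, (size_t) a->length * sizeof(char *));
  return argv;                          /* argv[length] is already NULL */
}

static char **build_command(void) {
  struct args buffer = { 0 };           /* stack storage for the pointer array */
  buffer.data[buffer.length++] = strdup("/usr/bin/docker");
  buffer.data[buffer.length++] = strdup("inspect");
  return extract_execv_args(&buffer);   /* buffer disappears here, strings do not */
}

int main(void) {
  char **argv = build_command();
  for (int i = 0; argv != NULL && argv[i] != NULL; i++) {
    printf("argv[%d] = %s\n", i, argv[i]);
    free(argv[i]);                      /* each string must be freed exactly once */
  }
  free(argv);
  return 0;
}
{code}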

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, 
> YARN-8207.009.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: YARN-8207.009.patch

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, 
> YARN-8207.009.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-08 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: YARN-8207.010.patch

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Blocker
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch, YARN-8207.007.patch, YARN-8207.008.patch, 
> YARN-8207.009.patch, YARN-8207.010.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-05 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8207:

Attachment: YARN-8207.006.patch

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8207) Docker container launch use popen have risk of shell expansion

2018-05-05 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16464873#comment-16464873
 ] 

Eric Yang commented on YARN-8207:
-

[~jlowe] Patch 006 contains all style fixes from your recommendations.

> Docker container launch use popen have risk of shell expansion
> --
>
> Key: YARN-8207
> URL: https://issues.apache.org/jira/browse/YARN-8207
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: yarn-native-services
>Affects Versions: 3.0.0, 3.1.0, 3.0.1, 3.0.2
>Reporter: Eric Yang
>Assignee: Eric Yang
>Priority: Major
>  Labels: Docker
> Attachments: YARN-8207.001.patch, YARN-8207.002.patch, 
> YARN-8207.003.patch, YARN-8207.004.patch, YARN-8207.005.patch, 
> YARN-8207.006.patch
>
>
> Container-executor code utilize a string buffer to construct docker run 
> command, and pass the string buffer to popen for execution.  Popen spawn a 
> shell to run the command.  Some arguments for docker run are still vulnerable 
> to shell expansion.  The possible solution is to convert from char * buffer 
> to string array for execv to avoid shell expansion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8079) Support static and archive unmodified local resources in service AM

2018-05-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482076#comment-16482076
 ] 

Eric Yang commented on YARN-8079:
-

[~leftnoteasy] When the files are placed in the resources directory, the patch 10 
implementation prevents mistakes that overwrite system-level generated files, 
such as the .token file and launch_container.sh.  However, this design can create 
inconvenience for some users, because existing Hadoop workloads may already be 
using the top-level localized directory instead of the resources directory.  We 
may not need to worry about launch_container.sh getting overwritten, because 
container-executor generates that file after the static files are localized.  
Apps will try to avoid overwriting .token files, because they cannot contact HDFS 
from the containers if they overwrite the token files.  In summary, it is likely 
safe to remove the requirement of the "resources" directory, from my point of 
view.

> Support static and archive unmodified local resources in service AM
> ---
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, 
> YARN-8079.009.patch, YARN-8079.010.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile, instead it always construct {{remoteFile}} by using 
> componentDir and fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case which services have some files existed in HDFS 
> and need to be localized when components get launched. (For example, if we 
> want to serve a Tensorflow model, we need to localize Tensorflow model 
> (typically not huge, less than GB) to local disk. Otherwise launched docker 
> container has to access HDFS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8079) Support static and archive unmodified local resources in service AM

2018-05-20 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16482076#comment-16482076
 ] 

Eric Yang edited comment on YARN-8079 at 5/21/18 12:20 AM:
---

[~leftnoteasy] When the files are placed in the resources directory, the patch 10 
implementation prevents mistakes that overwrite system-level generated files, 
such as the .token file and launch_container.sh.  However, this design can create 
inconvenience for some users, because existing Hadoop workloads may already be 
using the top-level localized directory instead of the resources directory.  We 
may not need to worry about launch_container.sh getting overwritten, because 
container-executor generates that file after the static files are localized.  
Apps will try to avoid overwriting .token files, because they cannot contact HDFS 
from the containers if they overwrite the token files.

With a resources directory, it may be easier for the end user to specify a single 
relative directory to bind-mount, instead of specifying individual files to 
bind-mount in the yarnfile.  By removing the resources directory, the user will 
need to think a bit more about how to manage the bind-mount directories to reduce 
wordy syntax.

With both approaches considered, it all comes down to usability: which approach 
is easiest to use while not creating too much clutter.  In summary, it might be 
safe to remove the requirement of the "resources" directory, from my point of 
view.


was (Author: eyang):
[~leftnoteasy] When the files are placed in resources directory, patch 10 
implementation prevents mistake to overwrite system level generated files, such 
as .token file, and launch_container.sh.  However, this design can created 
inconvenience for some users because existing Hadoop workload may already be 
using the top level localized directory instead of resource directory.  We may 
not need to worry about launch_container.sh getting overwritten because 
container-executor generates the file after static files are localized.  Apps 
will try to avoid .token files because they can not contact HDFS from 
containers, if they overwrites the token files.  In summary, it is likely safe 
to remove the requirement of "resources" directory from my point of view.

> Support static and archive unmodified local resources in service AM
> ---
>
> Key: YARN-8079
> URL: https://issues.apache.org/jira/browse/YARN-8079
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8079.001.patch, YARN-8079.002.patch, 
> YARN-8079.003.patch, YARN-8079.004.patch, YARN-8079.005.patch, 
> YARN-8079.006.patch, YARN-8079.007.patch, YARN-8079.008.patch, 
> YARN-8079.009.patch, YARN-8079.010.patch
>
>
> Currently, {{srcFile}} is not respected. {{ProviderUtils}} doesn't properly 
> read srcFile, instead it always construct {{remoteFile}} by using 
> componentDir and fileName of {{destFile}}:
> {code}
> Path remoteFile = new Path(compInstanceDir, fileName);
> {code} 
> To me it is a common use case which services have some files existed in HDFS 
> and need to be localized when components get launched. (For example, if we 
> want to serve a Tensorflow model, we need to localize Tensorflow model 
> (typically not huge, less than GB) to local disk. Otherwise launched docker 
> container has to access HDFS.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart

2018-05-17 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8290:

Attachment: YARN-8290.002.patch

> Yarn application failed to recover with "Error Launching job : User is not 
> set in the application report" error after RM restart
> 
>
> Key: YARN-8290
> URL: https://issues.apache.org/jira/browse/YARN-8290
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8290.001.patch, YARN-8290.002.patch
>
>
> Scenario:
> 1) Start 5 streaming application in background
> 2) Kill Active RM and cause RM failover
> After RM failover, The application failed with below error.
> {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception on [rm2] : 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1517520038847_0003' doesn't exist in RM. Please check 
> that the job submission was successful.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> , so propagating back to caller.
> 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application 
> application_1517520038847_0003
> 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
> /user/hrt_qa/.staging/job_1517520038847_0003
> 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is 
> not set in the application report
> Streaming Command Failed!{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart

2018-05-17 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479255#comment-16479255
 ] 

Eric Yang commented on YARN-8290:
-

- Patch 002 fixed whitespace.

> Yarn application failed to recover with "Error Launching job : User is not 
> set in the application report" error after RM restart
> 
>
> Key: YARN-8290
> URL: https://issues.apache.org/jira/browse/YARN-8290
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8290.001.patch, YARN-8290.002.patch
>
>
> Scenario:
> 1) Start 5 streaming application in background
> 2) Kill Active RM and cause RM failover
> After RM failover, The application failed with below error.
> {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception on [rm2] : 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1517520038847_0003' doesn't exist in RM. Please check 
> that the job submission was successful.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> , so propagating back to caller.
> 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application 
> application_1517520038847_0003
> 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
> /user/hrt_qa/.staging/job_1517520038847_0003
> 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is 
> not set in the application report
> Streaming Command Failed!{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8080) YARN native service should support component restart policy

2018-05-16 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477665#comment-16477665
 ] 

Eric Yang commented on YARN-8080:
-

[~suma.shivaprasad] {quote}
{quote}
restart_policy=ON_FAILURE, and each component instance failed 3 times, and 
application goes into FINISHED state instead of FAILED state. Is this 
expected?{quote}

Can you please explain which part of code you are referring to? Or was it found 
during testing?{quote}

This was found during testing and code review.  The decision-making process is 
based on

{code}
nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers()
{code}

Suppose a user specifies 2 containers and purposely fails them.  The first failed 
container triggers one retry, and then the second container fails.  The total 
count of failed containers is 3 (first container failure + second container 
failure + first container retry failure), which is greater than the number of 
containers.  This triggers the program to terminate and report FINISHED.  This 
almost works for restart_policy=NEVER, but it should report FAILED if the number 
of failed containers is greater than 50% of the total containers.

For restart_policy=ON_FAILURE, we will want to check whether the total number of 
succeeded containers equals getNumberOfContainers(), and otherwise continue to 
retry.  This lets the measurement count toward success while retrying on a best 
effort basis.

For restart_policy=ALWAYS, shouldTerminate is always false.
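
Putting the three policies together, here is a rough sketch of the termination 
and final-state logic described above.  It is written as standalone C purely for 
illustration (the real logic lives in the Java service AM and uses different 
names), and the greater-than-50% FAILED rule is the suggestion above:

{code}
#include <stdbool.h>
#include <stdio.h>

enum restart_policy { ALWAYS, ON_FAILURE, NEVER };
enum final_state { FINISHED, FAILED };

/* Should the component stop scheduling new container attempts? */
static bool should_terminate(enum restart_policy policy, long n_succeeded,
                             long n_failed, long n_containers) {
  switch (policy) {
  case ALWAYS:
    return false;                        /* always restart, never terminate */
  case ON_FAILURE:
    return n_succeeded >= n_containers;  /* keep retrying until all succeed */
  case NEVER:
  default:
    /* every instance has finished exactly once, success or failure */
    return n_succeeded + n_failed >= n_containers;
  }
}

/* Outcome to report once the component terminates. */
static enum final_state report_state(long n_failed, long n_containers) {
  /* suggested rule: FAILED when more than 50% of the instances failed */
  return (n_failed * 2 > n_containers) ? FAILED : FINISHED;
}

int main(void) {
  long n_containers = 2, n_succeeded = 0, n_failed = 2;  /* both instances fail */
  if (should_terminate(NEVER, n_succeeded, n_failed, n_containers)) {
    printf("terminate with state %s\n",
           report_state(n_failed, n_containers) == FAILED ? "FAILED" : "FINISHED");
  }
  return 0;
}
{code}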

Checkstyle still reports indentation and unused-import problems.  It would be 
good to automate the cleanup using IDE features.

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, 
> YARN-8080.007.patch, YARN-8080.009.patch, YARN-8080.010.patch, 
> YARN-8080.011.patch, YARN-8080.012.patch, YARN-8080.013.patch, 
> YARN-8080.014.patch, YARN-8080.015.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support boarder use cases, we need to allow restart policy of component 
> specified by users. Propose to have following policies:
> 1) Always: containers always restarted by framework regardless of container 
> exit status. This is existing/default behavior.
> 2) Never: Do not restart containers in any cases after container finishes: To 
> support job-like workload (for example Tensorflow training job). If a task 
> exit with code == 0, we should not restart the task. This can be used by 
> services which is not restart/recovery-able.
> 3) On-failure: Similar to above, only restart task with exitcode != 0. 
> Behaviors after component *instance* finalize (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For single component, single instance: complete service.
> 2) For single component, multiple instance: other running instances from the 
> same component won't be affected by the finalized component instance. Service 
> will be terminated once all instances finalized. 
> 3) For multiple components: Service will be terminated once all components 
> finalized.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8293) In YARN Services UI, "User Name for service" should be completely removed in secure clusters

2018-05-16 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477594#comment-16477594
 ] 

Eric Yang commented on YARN-8293:
-

[~sunilg] The changes in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-ui/src/main/webapp/app/templates/components/deploy-service.hbs
 will hide the username column from the displayed table.  Does this imply that 
the user interface can only display jobs for the logged-in user, and not all the 
jobs from all users for the yarn admin?  This seems to be a usability limitation 
for yarn admin users.  We might need follow-up JIRAs to make sure that we can 
support the case where the yarn admin looks at all the jobs from all users.  
Other than this nitpick, I think this patch is ready.

> In YARN Services UI, "User Name for service" should be completely removed in 
> secure clusters
> 
>
> Key: YARN-8293
> URL: https://issues.apache.org/jira/browse/YARN-8293
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
> Attachments: YARN-8293.001.patch
>
>
> "User Name for service" should be completely removed in secure clusters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-8293) In YARN Services UI, "User Name for service" should be completely removed in secure clusters

2018-05-15 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476034#comment-16476034
 ] 

Eric Yang commented on YARN-8293:
-

YARN services can have duplicate application names across users.  If the user name 
field is removed, this will cause confusion for an administrator who is looking at 
all jobs from all users.

> In YARN Services UI, "User Name for service" should be completely removed in 
> secure clusters
> 
>
> Key: YARN-8293
> URL: https://issues.apache.org/jira/browse/YARN-8293
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn-ui-v2
>Reporter: Sunil G
>Assignee: Sunil G
>Priority: Major
> Attachments: YARN-8293.001.patch
>
>
> "User Name for service" should be completely removed in secure clusters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8080) YARN native service should support component restart policy

2018-05-16 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477665#comment-16477665
 ] 

Eric Yang edited comment on YARN-8080 at 5/16/18 4:19 PM:
--

Thank you for the patch, [~suma.shivaprasad].

{quote}
Can you please explain which part of code you are referring to? Or was it found 
during testing?{quote}

This was found during testing and code review.  The decision-making process 
is based on 

{code}
nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers()
{code}

If a user specifies 2 containers and purposely fails them, the first 
failed container will trigger one retry.  The second container then fails.  The 
total failed containers are 3: the first container failed, the second container 
failed, and the first container's retry failed, which is greater than the number 
of containers.  This triggers the program to terminate and report FINISHED.  This 
is almost working for restart_policy=NEVER, but it should report FAILED when the 
number of failed containers is greater than 50% of the total containers.

For restart_policy=ON_FAILURE, we will want to check that the total number of 
succeeded containers equals getNumberOfContainers(); otherwise, continue to retry.  
This counts completed containers toward success while still retrying failed ones 
on a best-effort basis.

For restart_policy=ALWAYS, shouldTerminate is always false.

Checkstyle still reports indentation and unused import problems.  It would be 
good to automate the cleanup using IDE features.
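
To make the intent concrete, here is a rough, self-contained sketch of the 
termination decision described above.  It is illustrative only; the names and 
the 50% FAILED threshold are assumptions, not the actual Component or 
ServiceScheduler code in the patch:

{code}
/** Illustrative sketch only; names and the FAILED threshold are hypothetical. */
public class TerminationPolicySketch {

  enum RestartPolicy { ALWAYS, NEVER, ON_FAILURE }

  enum Outcome { CONTINUE, FINISHED_SUCCEEDED, FINISHED_FAILED }

  static Outcome evaluate(RestartPolicy policy, long nSucceeded, long nFailed,
      long numberOfContainers) {
    switch (policy) {
      case ALWAYS:
        // shouldTerminate is always false: keep restarting containers.
        return Outcome.CONTINUE;
      case NEVER:
        // Terminate once every container has reached a terminal state,
        // and report FAILED when failures exceed 50% of the containers.
        if (nSucceeded + nFailed >= numberOfContainers) {
          return nFailed * 2 > numberOfContainers
              ? Outcome.FINISHED_FAILED : Outcome.FINISHED_SUCCEEDED;
        }
        return Outcome.CONTINUE;
      case ON_FAILURE:
      default:
        // Only finish when every container has succeeded; otherwise keep retrying.
        return nSucceeded >= numberOfContainers
            ? Outcome.FINISHED_SUCCEEDED : Outcome.CONTINUE;
    }
  }

  public static void main(String[] args) {
    // Scenario above: 2 containers requested, 3 failures observed, NEVER policy.
    System.out.println(evaluate(RestartPolicy.NEVER, 0, 3, 2));      // FINISHED_FAILED
    System.out.println(evaluate(RestartPolicy.ON_FAILURE, 2, 0, 2)); // FINISHED_SUCCEEDED
  }
}
{code}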


was (Author: eyang):
[~suma.shivaprasad] {quote}
{quote}
restart_policy=ON_FAILURE, and each component instance failed 3 times, and 
application goes into FINISHED state instead of FAILED state. Is this 
expected?{quote}

Can you please explain which part of code you are referring to? Or was it found 
during testing?{quote}

This was found during testing and code review.  The decision-making process 
is based on 

{code}
nSucceeded + nFailed < comp.getComponentSpec().getNumberOfContainers()
{code}

If a user specifies 2 containers and purposely fails them, the first 
failed container will trigger one retry.  The second container then fails.  The 
total failed containers are 3: the first container failed, the second container 
failed, and the first container's retry failed, which is greater than the number 
of containers.  This triggers the program to terminate and report FINISHED.  This 
is almost working for restart_policy=NEVER, but it should report FAILED when the 
number of failed containers is greater than 50% of the total containers.

For restart_policy=ON_FAILURE, we will want to check that the total number of 
succeeded containers equals getNumberOfContainers(); otherwise, continue to retry.  
This counts completed containers toward success while still retrying failed ones 
on a best-effort basis.

For restart_policy=ALWAYS, shouldTerminate is always false.

Checkstyle still reports indentation and unused import problems.  It would be 
good to automate the cleanup using IDE features.

> YARN native service should support component restart policy
> ---
>
> Key: YARN-8080
> URL: https://issues.apache.org/jira/browse/YARN-8080
> Project: Hadoop YARN
>  Issue Type: Task
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Critical
> Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, 
> YARN-8080.007.patch, YARN-8080.009.patch, YARN-8080.010.patch, 
> YARN-8080.011.patch, YARN-8080.012.patch, YARN-8080.013.patch, 
> YARN-8080.014.patch, YARN-8080.015.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow the restart policy of a 
> component to be specified by users. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: Do not restart containers in any case after a container finishes: to 
> support job-like workloads (for example a TensorFlow training job). If a task 
> exits with code == 0, we should not restart the task. This can be used by 
> services which are not restartable/recoverable.
> 3) On-failure: Similar to the above, only restart tasks with exit code != 0. 
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For a single component, single instance: complete the service.
> 2) For a single component, multiple instances: other running instances from the 
> same component won't be affected by the finalized component instance. The 
> service will be terminated once all instances finalize. 
> 3) For multiple components: The service will be terminated once all components 
> finalize.



--
This message was sent by 

[jira] [Commented] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder

2018-05-16 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477697#comment-16477697
 ] 

Eric Yang commented on YARN-8300:
-

[~giovanni.fumarola] Patch 003 looks good to me.  I can help with the commit.

> Fix NPE in DefaultUpgradeComponentsFinder 
> --
>
> Key: YARN-8300
> URL: https://issues.apache.org/jira/browse/YARN-8300
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.1.1
>
> Attachments: YARN-8300.1.patch, YARN-8300.2.patch, YARN-8300.3.patch
>
>
> In current upgrades for Yarn native services, we do not support 
> addition/deletion of components during upgrade. On trying to upgrade with the 
> same number of components in the target spec as the current service spec, but 
> with one of the components having a new target spec and name, we see the 
> following NPE in the service AM logs
> {noformat}
> 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR 
> service.ClientAMService - Error while trying to upgrade service {} 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103)
>   at java.util.ArrayList.forEach(ArrayList.java:1257)
>   at 
> org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100)
>   at 
> org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259)
>   at 
> org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163)
>   at 
> org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81)
>   at 
> org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {noformat}
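
For readers following along: the lambda at UpgradeComponentsFinder.java:103 
presumably looks up the current component that matches a target component by 
name, so a renamed component yields no match and the later dereference throws 
the NPE.  A simplified, hypothetical illustration of that failure mode with a 
defensive fail-fast check (not the actual Hadoop classes):

{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified illustration only -- not the actual DefaultUpgradeComponentsFinder. */
public class UpgradeFinderSketch {

  static final class Component {
    final String name;
    final String artifact;
    Component(String name, String artifact) {
      this.name = name;
      this.artifact = artifact;
    }
  }

  static List<Component> findChangedComponents(List<Component> current,
      List<Component> target) {
    Map<String, Component> currentByName = new HashMap<>();
    current.forEach(c -> currentByName.put(c.name, c));

    List<Component> changed = new ArrayList<>();
    target.forEach(t -> {
      Component existing = currentByName.get(t.name);
      if (existing == null) {
        // Fail fast with a clear message instead of dereferencing null:
        // adding or renaming components during upgrade is not supported.
        throw new IllegalArgumentException("Component " + t.name
            + " does not exist in the current service spec");
      }
      if (!existing.artifact.equals(t.artifact)) {
        changed.add(t);
      }
    });
    return changed;
  }

  public static void main(String[] args) {
    List<Component> current = Arrays.asList(new Component("sleeper", "v1"));
    List<Component> target = Arrays.asList(new Component("sleeper-renamed", "v2"));
    // Throws IllegalArgumentException instead of the NPE reported above.
    findChangedComponents(current, target);
  }
}
{code}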



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7960) Add no-new-privileges flag to docker run

2018-05-16 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477861#comment-16477861
 ] 

Eric Yang commented on YARN-7960:
-

[~ebadger] You are right.  SELinux presence is not a good indicator of whether the 
option should be enabled or not.  no-new-privileges can work with SELinux on 
CentOS 7.5 and newer.  A config knob for this feature is the better choice. 
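
As a rough sketch of the config-knob idea (the property name and classes below 
are illustrative assumptions only; the real flag handling lives in the native 
container-executor):

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Illustrative sketch: gate --security-opt=no-new-privileges behind a config knob. */
public class NoNewPrivilegesSketch {

  // Hypothetical property name, for illustration only.
  static final String NO_NEW_PRIVILEGES_KEY = "docker.no-new-privileges.enabled";

  static List<String> buildDockerRunArgs(Properties conf, boolean privileged,
      String image) {
    List<String> args = new ArrayList<>();
    args.add("docker");
    args.add("run");
    boolean enabled = Boolean.parseBoolean(
        conf.getProperty(NO_NEW_PRIVILEGES_KEY, "false"));
    // Only unprivileged containers get the flag; privileged containers omit it.
    if (enabled && !privileged) {
      args.add("--security-opt=no-new-privileges");
    }
    args.add(image);
    return args;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(NO_NEW_PRIVILEGES_KEY, "true");
    System.out.println(buildDockerRunArgs(conf, false, "centos:7"));
    // Prints: [docker, run, --security-opt=no-new-privileges, centos:7]
  }
}
{code}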

> Add no-new-privileges flag to docker run
> 
>
> Key: YARN-7960
> URL: https://issues.apache.org/jira/browse/YARN-7960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>  Labels: Docker
> Attachments: YARN-7960.001.patch
>
>
> Minimally, this should be used for unprivileged containers. It's a cheap way 
> to add an extra layer of security to the docker model. For privileged 
> containers, it might be appropriate to omit this flag
> https://github.com/moby/moby/pull/20727



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder

2018-05-16 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8300:

Affects Version/s: 3.1.1
 Target Version/s: 3.2.0, 3.1.1
Fix Version/s: 3.2.0
  Description: 
In current upgrades for Yarn native services, we do not support 
addition/deletion of components during upgrade. On trying to upgrade with the 
same number of components in the target spec as the current service spec, but 
with one of the components having a new target spec and name, we see the 
following NPE in the service AM logs

{noformat}
2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR 
service.ClientAMService - Error while trying to upgrade service {} 
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 
org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100)
at 
org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259)
at 
org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163)
at 
org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81)
at 
org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
{noformat}

  was:

In current upgrades for Yarn native services, we do not support 
addition/deletion of components during upgrade. On trying to upgrade with the 
same number of components in the target spec as the current service spec, but 
with one of the components having a new target spec and name, we see the 
following NPE in the service AM logs

{noformat}
2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR 
service.ClientAMService - Error while trying to upgrade service {} 
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103)
at java.util.ArrayList.forEach(ArrayList.java:1257)
at 
org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100)
at 
org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259)
at 
org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163)
at 
org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81)
at 
org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
{noformat}


> Fix NPE in DefaultUpgradeComponentsFinder 
> --
>
> Key: YARN-8300
> URL: https://issues.apache.org/jira/browse/YARN-8300
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8300.1.patch, YARN-8300.2.patch, YARN-8300.3.patch
>
>
> In current upgrades for Yarn native services, we do not support 
> addition/deletion of components during upgrade. On trying to upgrade with the 
> same number of components in target 

[jira] [Comment Edited] (YARN-8300) Fix NPE in DefaultUpgradeComponentsFinder

2018-05-16 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-8300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16477708#comment-16477708
 ] 

Eric Yang edited comment on YARN-8300 at 5/16/18 4:44 PM:
--

Thank you [~suma.shivaprasad] for the patch.
Thank you [~giovanni.fumarola] for the review.

+1 I committed this to branch 3.1 and trunk.


was (Author: eyang):
Thank you [~suma.shivaprasad] for the patch.
Thank you [~giovanni.fumarola] for the review.

I committed this to branch 3.1 and trunk.

> Fix NPE in DefaultUpgradeComponentsFinder 
> --
>
> Key: YARN-8300
> URL: https://issues.apache.org/jira/browse/YARN-8300
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 3.1.1
>Reporter: Suma Shivaprasad
>Assignee: Suma Shivaprasad
>Priority: Major
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-8300.1.patch, YARN-8300.2.patch, YARN-8300.3.patch
>
>
> In current upgrades for Yarn native services, we do not support 
> addition/deletion of components during upgrade. On trying to upgrade with the 
> same number of components in the target spec as the current service spec, but 
> with one of the components having a new target spec and name, we see the 
> following NPE in the service AM logs
> {noformat}
> 2018-05-15 00:10:41,489 [IPC Server handler 0 on 37488] ERROR 
> service.ClientAMService - Error while trying to upgrade service {} 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.lambda$findTargetComponentSpecs$0(UpgradeComponentsFinder.java:103)
>   at java.util.ArrayList.forEach(ArrayList.java:1257)
>   at 
> org.apache.hadoop.yarn.service.UpgradeComponentsFinder$DefaultUpgradeComponentsFinder.findTargetComponentSpecs(UpgradeComponentsFinder.java:100)
>   at 
> org.apache.hadoop.yarn.service.ServiceManager.processUpgradeRequest(ServiceManager.java:259)
>   at 
> org.apache.hadoop.yarn.service.ClientAMService.upgrade(ClientAMService.java:163)
>   at 
> org.apache.hadoop.yarn.service.impl.pb.service.ClientAMProtocolPBServiceImpl.upgradeService(ClientAMProtocolPBServiceImpl.java:81)
>   at 
> org.apache.hadoop.yarn.proto.ClientAMProtocol$ClientAMProtocolService$2.callBlockingMethod(ClientAMProtocol.java:5972)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7960) Add no-new-privileges flag to docker run

2018-05-15 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476416#comment-16476416
 ] 

Eric Yang commented on YARN-7960:
-

[~ebadger] Can we run sestatus to check instead of depending on config values?  
If sestatus is not found, then the no-new-privileges option is enabled.  As you 
said, SELinux auditing is the exception.  I am ok with this option being 
enabled by default in the absence of SELinux.  This can prevent configuration 
mistakes made by system administrators.  
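
A sketch of how that probe might look (illustrative only; the real check would 
belong in the native container-executor, and the "missing sestatus means no 
SELinux" heuristic is an assumption):

{code}
import java.io.IOException;

/** Sketch: if sestatus cannot be run, default no-new-privileges on. */
public class SeLinuxProbeSketch {

  // Returns true when the sestatus binary is absent, i.e. SELinux tooling is not installed.
  static boolean seLinuxToolingAbsent() {
    try {
      Process p = new ProcessBuilder("sestatus").start();
      p.waitFor();
      return false; // sestatus ran; SELinux auditing may be in use, so respect the config.
    } catch (IOException e) {
      return true;  // command not found: safe to enable no-new-privileges by default.
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(seLinuxToolingAbsent()
        ? "append --security-opt=no-new-privileges by default"
        : "sestatus present; leave the decision to the configured knob");
  }
}
{code}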

> Add no-new-privileges flag to docker run
> 
>
> Key: YARN-7960
> URL: https://issues.apache.org/jira/browse/YARN-7960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>  Labels: Docker
> Attachments: YARN-7960.001.patch
>
>
> Minimally, this should be used for unprivileged containers. It's a cheap way 
> to add an extra layer of security to the docker model. For privileged 
> containers, it might be appropriate to omit this flag
> https://github.com/moby/moby/pull/20727



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-7960) Add no-new-privileges flag to docker run

2018-05-15 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16476330#comment-16476330
 ] 

Eric Yang commented on YARN-7960:
-

[~ebadger] The no-new-privileges option will block [SELinux 
auditing|https://github.com/projectatomic/container-selinux/issues/51].  This 
would prevent enterprise customers from auditing security inside the 
container.  Some effort has been put in place to ensure SELinux auditing is 
unblocked for CentOS 7.5 and newer.  It might be a good idea to check whether the 
Hadoop cluster has SELinux enforcing before this option is appended to 
non-privileged containers.

> Add no-new-privileges flag to docker run
> 
>
> Key: YARN-7960
> URL: https://issues.apache.org/jira/browse/YARN-7960
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Eric Badger
>Assignee: Eric Badger
>Priority: Major
>  Labels: Docker
> Attachments: YARN-7960.001.patch
>
>
> Minimally, this should be used for unprivileged containers. It's a cheap way 
> to add an extra layer of security to the docker model. For privileged 
> containers, it might be appropriate to omit this flag
> https://github.com/moby/moby/pull/20727



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Updated] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart

2018-05-16 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang updated YARN-8290:

Attachment: YARN-8290.001.patch

> Yarn application failed to recover with "Error Launching job : User is not 
> set in the application report" error after RM restart
> 
>
> Key: YARN-8290
> URL: https://issues.apache.org/jira/browse/YARN-8290
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yesha Vora
>Priority: Major
> Attachments: YARN-8290.001.patch
>
>
> Scenario:
> 1) Start 5 streaming application in background
> 2) Kill Active RM and cause RM failover
> After RM failover, The application failed with below error.
> {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception on [rm2] : 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1517520038847_0003' doesn't exist in RM. Please check 
> that the job submission was successful.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> , so propagating back to caller.
> 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application 
> application_1517520038847_0003
> 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
> /user/hrt_qa/.staging/job_1517520038847_0003
> 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is 
> not set in the application report
> Streaming Command Failed!{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Assigned] (YARN-8290) Yarn application failed to recover with "Error Launching job : User is not set in the application report" error after RM restart

2018-05-16 Thread Eric Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Yang reassigned YARN-8290:
---

 Assignee: Eric Yang
Affects Version/s: 3.1.1

[~leftnoteasy] According to your suggestion, the ACL information is set too 
late, and killing the AM before the ACL information is propagated can cause RM 
recovery to load a partial application record.  The suggested change is to move 
the ACL setup into ApplicationToSchedulerTransition.  The patch moves the block 
of code accordingly.  Let me know if this is the correct fix.  Thanks
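
Conceptually, the ordering change looks like the toy sketch below (class and 
method names are entirely hypothetical, not RMAppImpl internals): populate the 
user/ACL information inside the transition that hands the app to the scheduler, 
so a kill and recovery in between no longer observes a partial record.

{code}
/** Toy sketch of the ordering issue; all names are hypothetical. */
public class AclOrderingSketch {

  static class AppRecord {
    String user;           // must be present before the record can be recovered
    boolean inScheduler;
  }

  // Previous ordering (conceptually): the app reaches the scheduler first, and
  // the user/ACL info is filled in later, leaving a window where a kill plus an
  // RM restart recovers a record with user == null.
  static void addToSchedulerThenSetUser(AppRecord app, String user) {
    app.inScheduler = true;
    app.user = user;
  }

  // Proposed ordering: set the user/ACLs inside the add-to-scheduler transition,
  // before anything else can snapshot the application report.
  static void setUserThenAddToScheduler(AppRecord app, String user) {
    app.user = user;
    app.inScheduler = true;
  }

  public static void main(String[] args) {
    AppRecord app = new AppRecord();
    setUserThenAddToScheduler(app, "hrt_qa");
    System.out.println("user=" + app.user + ", inScheduler=" + app.inScheduler);
  }
}
{code}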

> Yarn application failed to recover with "Error Launching job : User is not 
> set in the application report" error after RM restart
> 
>
> Key: YARN-8290
> URL: https://issues.apache.org/jira/browse/YARN-8290
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: Yesha Vora
>Assignee: Eric Yang
>Priority: Major
> Attachments: YARN-8290.001.patch
>
>
> Scenario:
> 1) Start 5 streaming application in background
> 2) Kill Active RM and cause RM failover
> After RM failover, The application failed with below error.
> {code}18/02/01 21:24:29 WARN client.RequestHedgingRMFailoverProxyProvider: 
> Invocation returned exception on [rm2] : 
> org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application 
> with id 'application_1517520038847_0003' doesn't exist in RM. Please check 
> that the job submission was successful.
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:338)
>   at 
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:175)
>   at 
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:417)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)
> , so propagating back to caller.
> 18/02/01 21:24:29 INFO impl.YarnClientImpl: Submitted application 
> application_1517520038847_0003
> 18/02/01 21:24:30 INFO mapreduce.JobSubmitter: Cleaning up the staging area 
> /user/hrt_qa/.staging/job_1517520038847_0003
> 18/02/01 21:24:30 ERROR streaming.StreamJob: Error Launching job : User is 
> not set in the application report
> Streaming Command Failed!{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org


