[jira] [Commented] (YARN-867) Isolation of failures in aux services

2016-07-12 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373658#comment-15373658
 ] 

Ming Ma commented on YARN-867:
--

Will this be simplified if we have YARN-1593?

> Isolation of failures in aux services 
> --
>
> Key: YARN-867
> URL: https://issues.apache.org/jira/browse/YARN-867
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: Hitesh Shah
>Assignee: Xuan Gong
> Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
> YARN-867.4.patch, YARN-867.5.patch, YARN-867.6.patch, 
> YARN-867.sampleCode.2.patch
>
>
> Today, a malicious application can bring down the NM by sending bad data to a 
> service. For example, sending data to the ShuffleService such that it results 
> in any non-IOException will cause the NM's async dispatcher to exit, as the 
> service's INIT APP event is not handled properly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2015-07-10 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14622691#comment-14622691
 ] 

Hitesh Shah commented on YARN-867:
--

[~vinodkv] [~xgong] Is this still open or addressed elsewhere? 

 Isolation of failures in aux services 
 --

 Key: YARN-867
 URL: https://issues.apache.org/jira/browse/YARN-867
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Hitesh Shah
Assignee: Xuan Gong
Priority: Critical
 Attachments: YARN-867.1.sampleCode.patch, YARN-867.3.patch, 
 YARN-867.4.patch, YARN-867.5.patch, YARN-867.6.patch, 
 YARN-867.sampleCode.2.patch


 Today, a malicious application can bring down the NM by sending bad data to a 
 service. For example, sending data to the ShuffleService such that it results 
 in any non-IOException will cause the NM's async dispatcher to exit, as the 
 service's INIT APP event is not handled properly. 





[jira] [Commented] (YARN-867) Isolation of failures in aux services

2015-05-01 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523648#comment-14523648
 ] 

Hadoop QA commented on YARN-867:


\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:red}-1{color} | patch |   0m  0s | The patch command could not apply 
the patch during dryrun. |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | 
http://issues.apache.org/jira/secure/attachment/12606599/YARN-867.6.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 2d7363b |
| Console output | 
https://builds.apache.org/job/PreCommit-YARN-Build/7572/console |


This message was automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2014-01-29 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13885767#comment-13885767
 ] 

Hadoop QA commented on YARN-867:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12606599/YARN-867.6.patch
  against trunk revision .

{color:red}-1 patch{color}.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2961//console

This message is automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-04 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786288#comment-13786288
 ] 

Alejandro Abdelnur commented on YARN-867:
-

Patch 6 does not look good to me. The try/catch blocks are not correct: an 
exception in ANY auxiliary service will halt delivery to the other auxiliary 
services. The try/catch should wrap each call to the auxiliary service 
interface methods, as done in patch 4.
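
The per-call guard described above can be sketched roughly as follows. This is only an illustration, not the code from any of the attached patches: `AuxServiceSketch`, `IsolatedAuxDispatch`, and `deliverAppInit` are hypothetical names standing in for the NM's auxiliary service interface and its dispatch loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an auxiliary service; not the real YARN interface.
interface AuxServiceSketch {
  void initializeApplication(String appId);
}

class IsolatedAuxDispatch {
  /**
   * Delivers the event to every service, guarding each call separately.
   * Returns the names of the services that threw.
   */
  static List<String> deliverAppInit(Map<String, AuxServiceSketch> services,
                                     String appId) {
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, AuxServiceSketch> entry : services.entrySet()) {
      try {
        entry.getValue().initializeApplication(appId);
      } catch (Throwable t) {
        // One service failing must not stop delivery to the others.
        failed.add(entry.getKey());
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    Map<String, AuxServiceSketch> services = new LinkedHashMap<>();
    List<String> delivered = new ArrayList<>();
    services.put("bad", appId -> { throw new IllegalStateException("bad data"); });
    services.put("good", delivered::add);
    System.out.println("failed=" + deliverAppInit(services, "app_1")
        + " delivered=" + delivered);
  }
}
```

With a single try/catch around the whole loop, the exception from the first service would skip the second one; with the guard inside the loop, the second service still receives the event.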



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-04 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786317#comment-13786317
 ] 

Bikas Saha commented on YARN-867:
-

tucu, your comments were addressed in YARN-1256. This jira is now targeted for 
more elaborate changes.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-04 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786402#comment-13786402
 ] 

Alejandro Abdelnur commented on YARN-867:
-

[~bikassaha] Got it, I missed that it was moved to another jira, thx.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-03 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785306#comment-13785306
 ] 

Hitesh Shah commented on YARN-867:
--

[~xgong] [~bikassaha] [~vinodkv] It seems like this fix is getting quite 
complex and the introduction of container failure on service event handling has 
a possibility of introducing a lot of different race conditions.

I propose the following:

   - Add code to catch Throwable whenever an aux service is invoked to handle 
the container-related events (app init, container start, container stop, app 
cleanup), and do not fail the container if an exception is thrown. 
   - Add a simpler check that matches the service metadata from the 
ContainerLaunchContext and ensures that the service is configured on the NM in 
question. 

Using the above, at the very least, we can catch issues related to 
mis-configured NMs where the shuffle service is not configured. This is way 
simpler, as it could be done as a simple synchronous check when handling the 
startContainers rpc call. This could be targeted to 2.1.2/2.2.0.
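
As a rough illustration of the synchronous check, the comparison could be as simple as a set difference between the services a launch request names and the services configured on the node. The class and method names below are hypothetical, the service names are plain strings chosen for the example, and nothing here is taken from the patches.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper; the names are assumptions made for this sketch.
class AuxServiceConfigCheck {
  /**
   * Returns the requested service names that are not configured on this node.
   * An empty result means the start request can proceed.
   */
  static Set<String> missingServices(Set<String> requested, Set<String> configured) {
    Set<String> missing = new TreeSet<>(requested);
    missing.removeAll(configured);
    return missing;
  }

  public static void main(String[] args) {
    // Example service names only; real names depend on the NM configuration.
    Set<String> configured = new HashSet<>(Arrays.asList("mapreduce_shuffle"));
    Set<String> requested =
        new HashSet<>(Arrays.asList("mapreduce_shuffle", "my_monitor"));
    Set<String> missing = missingServices(requested, configured);
    if (!missing.isEmpty()) {
      // Reject synchronously, before any container state machine is involved.
      System.out.println("Rejecting start request; services not configured: " + missing);
    }
  }
}
```

Because the check runs while the request is being handled, a mis-configured node is reported to the caller directly instead of surfacing later as an asynchronous container failure.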

As for the failing containers, I propose that we target fixing the feedback of 
failed containers back to the AM on service handling errors in 2.3.0. For the 
2.3.0 targeted jira, I would prefer to increase the scope of this to design for 
differentiating critical vs non-critical services so as to have the framework 
in place to understand which service's errors result in failed containers. 

Comments? 
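
One way to picture the critical vs non-critical distinction proposed for 2.3.0 is a per-service flag consulted when a service throws. This is purely a sketch under assumptions: the thread does not say how criticality would be configured, and `AuxServiceCriticality`, `register`, and `onServiceError` are invented names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical policy holder; not part of YARN.
class AuxServiceCriticality {
  enum Outcome { CONTINUE, FAIL_CONTAINER }

  // Assumed per-service flag: true means an error in the service fails the container.
  private final Map<String, Boolean> critical = new HashMap<>();

  void register(String serviceName, boolean isCritical) {
    critical.put(serviceName, isCritical);
  }

  /**
   * Decides what a service error means for the container.
   * Unregistered services are treated as critical (a cautious default
   * chosen for this sketch).
   */
  Outcome onServiceError(String serviceName) {
    return critical.getOrDefault(serviceName, true)
        ? Outcome.FAIL_CONTAINER
        : Outcome.CONTINUE;
  }

  public static void main(String[] args) {
    AuxServiceCriticality policy = new AuxServiceCriticality();
    policy.register("shuffle", true);
    policy.register("node_monitor", false);
    System.out.println("shuffle error -> " + policy.onServiceError("shuffle"));
    System.out.println("node_monitor error -> " + policy.onServiceError("node_monitor"));
  }
}
```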






[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-03 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785364#comment-13785364
 ] 

Alejandro Abdelnur commented on YARN-867:
-

The try/catch should be around each aux service method invocation, so that a 
failure of a given service does not affect delivery to other services.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-03 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785373#comment-13785373
 ] 

Bikas Saha commented on YARN-867:
-

bq. Using the above, at the very least, we can catch issues related to 
mis-configured NMs where the shuffle service is not configured. This is way 
simpler, as it could be done as a simple synchronous check when handling the 
startContainers rpc call. This could be targeted to 2.1.2/2.2.0.

@hitesh, I agree. In that case, shall we re-target this jira to 2.3 and use 
YARN-1256 to fix the misconfigured service and exception logging?




[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785371#comment-13785371
 ] 

Hadoop QA commented on YARN-867:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12606599/YARN-867.6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2077//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2077//console

This message is automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-03 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785380#comment-13785380
 ] 

Hitesh Shah commented on YARN-867:
--

+1 to Bikas's suggestion.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-03 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13785412#comment-13785412
 ] 

Vinod Kumar Vavilapalli commented on YARN-867:
--

bq. @hitesh, I agree. In that case, shall we re-target this jira to 2.3 and 
use YARN-1256 to fix the misconfigured service and exception logging?
+1.

+1 also to the earlier suggestion - it is too late to put more state machine 
changes into 2.1.2.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-02 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784503#comment-13784503
 ] 

Bikas Saha commented on YARN-867:
-

Probably we can ignore the error here since the container has already failed.
{code}
 // From LOCALIZATION_FAILED State
 .addTransition(ContainerState.LOCALIZATION_FAILED,
@@ -180,6 +184,9 @@ public ContainerImpl(Configuration conf, Dispatcher 
dispatcher,
 .addTransition(ContainerState.LOCALIZATION_FAILED,
 ContainerState.LOCALIZATION_FAILED,
 ContainerEventType.RESOURCE_FAILED)
+.addTransition(ContainerState.LOCALIZATION_FAILED, 
ContainerState.EXITED_WITH_FAILURE,
+ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+new ExitedWithFailureTransition(false))
{code}

Probably have one try/catch instead of multiple.

Can we rename AUXSERVICE_FAIL to AUXSERVICE_ERROR, since the service probably 
hasn't failed?

TestAuxService needs an addition for the new code.

TestContainer - the new test can be made simpler by not mocking 
AuxServiceHandler and instead sending the failed event directly, like it's done 
for other tests there.

In AuxService.handle(APPLICATION_INIT) and other places like that, if the 
service does not exist then we should fail too.

Zhijie, we should err on the side of caution here and fail the container. If we 
see real use cases where failure can be ignored then we can make that 
improvement.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784717#comment-13784717
 ] 

Xuan Gong commented on YARN-867:


bq.Probably have 1 try catch instead of multiple.

Fixed. Use only one big try catch block

bq.Can we rename AUXSERVICE_FAIL to AUXSERVICE_ERROR since the service probably 
hasnt failed.

Done

bq.TestAuxService needs an addition for the new code

Added a new test case in TestAuxService

bq.TestContainer - new test can be made simpler by not mocking 
AuxServiceHandler and instead sending the failed event directly like its done 
for other tests there.

Fixed

bq.In AuxService.handle(APPLICATION_INIT) and other places like that, where the 
service does not exist then we should fail too.

Done

bq.Probably we can ignore the error here since the container has already failed.

I think we still need this transition. The container can go to 
ContainerState.LOCALIZATION_FAILED from the NEW state, and AuxService is 
triggered to do the APPLICATION_INIT at that time. If there is any exception, 
we will send a ContainerExitEvent with 
ContainerEventType.CONTAINER_EXITED_WITH_FAILURE to the container. It is quite 
possible that the container will start to process this event while it is in the 
LOCALIZATION_FAILED state, so we should handle it.
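
The race described above can be illustrated with a toy transition table. This is a simplified stand-in, not the real ContainerImpl state machine: it has only three states and two events, and it assumes for the sketch that localization fails while INIT_CONTAINER is being handled. An event with no registered arc is rejected, which is what the extra transition avoids.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified stand-in for a container state machine; not the real ContainerImpl.
class ContainerStateSketch {
  enum State { NEW, LOCALIZATION_FAILED, EXITED_WITH_FAILURE }
  enum Event { INIT_CONTAINER, CONTAINER_EXITED_WITH_FAILURE }

  private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
  private State current = State.NEW;

  void addTransition(State from, State to, Event on) {
    table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
  }

  /** Applies the event, or throws if no arc is registered for the current state. */
  State handle(Event event) {
    Map<Event, State> row = table.get(current);
    if (row == null || !row.containsKey(event)) {
      throw new IllegalStateException("Invalid event " + event + " at " + current);
    }
    current = row.get(event);
    return current;
  }

  public static void main(String[] args) {
    ContainerStateSketch container = new ContainerStateSketch();
    // Assumed for the sketch: localization fails while handling INIT_CONTAINER.
    container.addTransition(State.NEW, State.LOCALIZATION_FAILED, Event.INIT_CONTAINER);
    // The arc under discussion: accept a late aux-service failure event.
    container.addTransition(State.LOCALIZATION_FAILED, State.EXITED_WITH_FAILURE,
        Event.CONTAINER_EXITED_WITH_FAILURE);
    container.handle(Event.INIT_CONTAINER);
    System.out.println(container.handle(Event.CONTAINER_EXITED_WITH_FAILURE));
  }
}
```

Whether the late event should move the container to a new state or simply be ignored in place is the open question in this exchange; the sketch only shows that some arc has to exist for the event to be accepted.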



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784732#comment-13784732
 ] 

Hadoop QA commented on YARN-867:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12606498/YARN-867.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/2071//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/2071//console

This message is automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-10-02 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13784745#comment-13784745
 ] 

Bikas Saha commented on YARN-867:
-

Why is this check needed?
{code}
+  private void handleAuxServiceFail(AuxServicesEvent event, Throwable th) {
+if (event.getType() instanceof AuxServicesEventType) {
+  Container container = event.getContainer();
{code}

If the container has already failed, then why do we need to change its state 
again?
{code}
+.addTransition(ContainerState.LOCALIZATION_FAILED, 
ContainerState.EXITED_WITH_FAILURE,
+ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+new ExitedWithFailureTransition(false))
{code}
{code}
+.addTransition(ContainerState.CONTAINER_CLEANEDUP_AFTER_KILL,
+ContainerState.EXITED_WITH_FAILURE,
+ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+new ExitedWithFailureTransition(false))
{code}

Why is CONTAINER_EXITED_WITH_FAILURE not being handled while container state is 
localized/running?

Why are extra events being ignored in addition to 
ContainerEventType.CONTAINER_EXITED_WITH_FAILURE?
{code}
+ContainerState.EXITED_WITH_FAILURE,
+EnumSet.of(
+ContainerEventType.KILL_CONTAINER,
+ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+ContainerEventType.RESOURCE_LOCALIZED,
+ContainerEventType.RESOURCE_FAILED,
+ContainerEventType.CONTAINER_LAUNCHED,
+ContainerEventType.CONTAINER_EXITED_WITH_SUCCESS,
+ContainerEventType.CONTAINER_KILLED_ON_REQUEST))
{code}
{code}
+.addTransition(ContainerState.DONE, ContainerState.DONE,
+EnumSet.of(
+ContainerEventType.RESOURCE_LOCALIZED,
+ContainerEventType.CONTAINER_LAUNCHED,
+ContainerEventType.CONTAINER_EXITED_WITH_FAILURE,
+ContainerEventType.CONTAINER_RESOURCES_CLEANEDUP,
+ContainerEventType.CONTAINER_EXITED_WITH_SUCCESS,
+ContainerEventType.CONTAINER_KILLED_ON_REQUEST))
{code}

Can you please check if ExitedWithFailureTransition(true) needs to be called in 
places where the patch is adding ExitedWithFailureTransition(false). Is cleanup 
required?

Do the new tests fail without the changes?



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765878#comment-13765878
 ] 

Zhijie Shen commented on YARN-867:
--

Thinking about the problem again: essentially, the problem is that an 
implementation of AuxiliaryService may throw a RuntimeException (or another 
Throwable) and kill the NM dispatcher thread. Wrapping the calling statements 
with try/catch can basically prevent NM failure.

The next task is to handle the Throwable from the AuxiliaryService. In the 
previous discussion, the plan was to fail the container directly and let the AM 
know that the container failed due to AUXSERVICE_FAILED. For MR this may be 
okay, because without ShuffleHandler, MR jobs cannot run properly. However, 
should the NM always make the decision to fail the container? I'm concerned that:
1. The NM doesn't know what the AuxiliaryService does for the application or 
how important it is.
2. The NM doesn't know how critical the exception is, or whether it is 
transient or reproducible.
Therefore, if the application can toleran


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-13 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765884#comment-13765884
 ] 

Zhijie Shen commented on YARN-867:
--

Sorry for posting the broken comment earlier.

Thinking about the problem again: essentially, the problem is that an 
implementation of AuxiliaryService may throw a RuntimeException (or another 
Throwable) and kill the NM dispatcher thread. Wrapping the calling statements 
in try/catch basically prevents the NM failure.
The next task is to handle the Throwable from the AuxiliaryService. In the 
previous thread, the plan was to fail the container directly and let the AM 
know that the container failed due to AUXSERVICE_FAILED. For MR that may be 
okay, because MR jobs cannot run properly without the ShuffleHandler. However, 
should the NM always make the decision to fail the container? I'm concerned that:
1. The NM doesn't know what the AuxiliaryService does for the application or 
how important it is.
2. The NM doesn't know how critical the exception is, or whether it is 
transient or reproducible.
So what if the application can tolerate the AuxiliaryService failure? For 
example, if the AuxiliaryService just does some node-local monitoring work, the 
application can complete with the AuxiliaryService not working. Therefore, I'm 
wondering whether we should leave the decision to the AM. The application knows 
best how to handle the exception. The NM just needs to expose the failure of 
the AuxiliaryService to the application in some way. Thoughts?
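
The try/catch isolation described above can be sketched roughly as follows. 
This is a minimal illustration, not the code from any attached patch: 
SketchAuxService, AuxServiceFailure and IsolatingAuxServiceCaller are 
hypothetical stand-ins, not Hadoop classes. Each call into a service is 
wrapped so that a Throwable from one service is recorded (so it could later be 
exposed to the AM) instead of escaping into the calling thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an auxiliary service; not the real Hadoop interface.
interface SketchAuxService {
    String name();
    void initializeApplication(String appId);
}

// Hypothetical record of a failure that the NM could expose to the AM.
final class AuxServiceFailure {
    final String serviceName;
    final String appId;
    final Throwable cause;

    AuxServiceFailure(String serviceName, String appId, Throwable cause) {
        this.serviceName = serviceName;
        this.appId = appId;
        this.cause = cause;
    }
}

final class IsolatingAuxServiceCaller {
    private final List<AuxServiceFailure> failures = new ArrayList<>();

    // Calls every service. A Throwable from one service is recorded instead of
    // propagating to (and terminating) the calling thread, and the remaining
    // services are still invoked.
    void initApplication(List<SketchAuxService> services, String appId) {
        for (SketchAuxService service : services) {
            try {
                service.initializeApplication(appId);
            } catch (Throwable t) {
                failures.add(new AuxServiceFailure(service.name(), appId, t));
            }
        }
    }

    // Failures collected so far; what to do with them is a policy decision.
    List<AuxServiceFailure> failures() {
        return failures;
    }
}
```

Catching Throwable also swallows Errors such as OutOfMemoryError, so a real 
implementation would probably want to log the failure and decide separately 
how to treat fatal errors.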



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765696#comment-13765696
 ] 

Xuan Gong commented on YARN-867:


The new patch adds more transitions for ContainerState.EXITED_WITH_FAILURE and 
ContainerState.DONE. This patch still handles 
AuxServicesEventType.APPLICATION_INIT and handles exceptions at the container 
level. 

I thought about moving AuxServicesEventType.APPLICATION_INIT into the 
application, but I do not think we would get any benefit. The reasons are:
1. There are two new events: AuxServicesEvent.CONTAINER_INIT and 
AuxServicesEvent.CONTAINER_STOP. We need to handle them at the container level.
2. Even if we move AuxServicesEventType.APPLICATION_INIT into the application, 
we have two options:
   a. We do not start any containers until all the AuxServices finish their 
APPLICATION_INIT. Choosing this definitely simplifies the problem: when any 
AuxService throws an exception from APPLICATION_INIT, we simply kill the 
application. But does it make sense to block all the containers?
   b. We let the AuxServices do APPLICATION_INIT while the containers start at 
the same time. In this case we end up with the same process as now, because 
when the container receives the CONTAINER_EXITED_WITH_FAILURE event we cannot 
guarantee which state the container is in: it may be in the KILLING state, the 
LOCALIZED state, etc. Any state is possible.
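
For comparison, option (a) could look roughly like the sketch below. It is 
only an illustration under assumptions: AppInitGate and its methods are 
hypothetical names, not part of the patch or of the NM code. Containers that 
ask to start while any service is still initializing are queued, they are 
released once every service has finished, and a single failure kills the 
application.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical gate for option (a): hold containers until every aux service
// has finished its APPLICATION_INIT for this application.
final class AppInitGate {
    enum Status { WAITING, READY, KILLED }

    private final Set<String> pendingServices;
    private final List<String> queuedContainers = new ArrayList<>();
    private Status status;

    AppInitGate(Set<String> services) {
        this.pendingServices = new HashSet<>(services);
        this.status = pendingServices.isEmpty() ? Status.READY : Status.WAITING;
    }

    // Returns true if the container may start right away. While WAITING the
    // container is queued; once KILLED it is rejected.
    boolean requestStart(String containerId) {
        if (status == Status.READY) {
            return true;
        }
        if (status == Status.WAITING) {
            queuedContainers.add(containerId);
        }
        return false;
    }

    // Marks one service as initialized and returns the containers released by
    // this call (empty unless this was the last pending service).
    List<String> serviceInitSucceeded(String service) {
        List<String> released = new ArrayList<>();
        if (status != Status.WAITING) {
            return released;
        }
        pendingServices.remove(service);
        if (pendingServices.isEmpty()) {
            status = Status.READY;
            released.addAll(queuedContainers);
            queuedContainers.clear();
        }
        return released;
    }

    // Any init failure kills the whole application in this option.
    void serviceInitFailed(String service) {
        status = Status.KILLED;
        queuedContainers.clear();
    }

    Status status() {
        return status;
    }
}
```

The cost raised in the comment is visible here: no container can start until 
the slowest service has finished its APPLICATION_INIT.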




[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765730#comment-13765730
 ] 

Hadoop QA commented on YARN-867:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602838/YARN-867.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1902//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1902//console

This message is automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-11 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13765060#comment-13765060
 ] 

Xuan Gong commented on YARN-867:


We need to handle AuxServicesEvent.CONTAINER_INIT and 
AuxServicesEvent.CONTAINER_STOP. Those need to be handled at the container 
level.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-10 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763452#comment-13763452
 ] 

Hadoop QA commented on YARN-867:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12602396/YARN-867.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1889//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1889//console

This message is automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763423#comment-13763423
 ] 

Xuan Gong commented on YARN-867:


Recreated the patch based on the latest trunk and added a new test case to test 
the logic.
Removed the onAuxServiceFailure API; we already have onContainersCompleted() to 
take care of it.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-10 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13763749#comment-13763749
 ] 

Zhijie Shen commented on YARN-867:
--

How about issuing a KILL_CONTAINER event instead of 
CONTAINER_EXITED_WITH_FAILURE? KILL_CONTAINER is already handled in all 
container states. Otherwise, we need to add transitions from a number of states 
to EXITED_WITH_FAILURE, and I'm not sure it is easy to ensure that those 
transitions are correct.
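
The trade-off can be made concrete with a toy transition table. The states, 
events and transitions below are a simplified assumption for illustration and 
do not reproduce the NM's actual container state machine; 
ContainerTransitionsSketch is a hypothetical class. KILL_CONTAINER is 
registered for every state, while CONTAINER_EXITED_WITH_FAILURE is registered 
only for RUNNING, so delivering it in any other state is invalid until a 
transition is added for that state.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Toy model of a container state machine; simplified and hypothetical.
final class ContainerTransitionsSketch {
    enum State { NEW, LOCALIZING, LOCALIZED, RUNNING, KILLING, EXITED_WITH_FAILURE, DONE }
    enum Event { KILL_CONTAINER, CONTAINER_EXITED_WITH_FAILURE }

    // (state, event) -> next state. A missing entry models an invalid transition.
    private final Map<State, Map<Event, State>> table =
            new EnumMap<State, Map<Event, State>>(State.class);

    ContainerTransitionsSketch() {
        for (State s : State.values()) {
            Map<Event, State> row = new EnumMap<Event, State>(Event.class);
            // Assumption: KILL_CONTAINER is accepted everywhere; finished
            // states ignore it, all other states move to KILLING.
            boolean finished = (s == State.EXITED_WITH_FAILURE || s == State.DONE);
            row.put(Event.KILL_CONTAINER, finished ? s : State.KILLING);
            table.put(s, row);
        }
        // Assumption: the failure event is only registered for RUNNING.
        table.get(State.RUNNING)
             .put(Event.CONTAINER_EXITED_WITH_FAILURE, State.EXITED_WITH_FAILURE);
    }

    // Applies an event, rejecting it if no transition is registered.
    State fire(State current, Event event) {
        State next = table.get(current).get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        return next;
    }

    // States that would need a new transition before they can accept the event.
    List<State> statesMissing(Event event) {
        List<State> missing = new ArrayList<State>();
        for (State s : State.values()) {
            if (!table.get(s).containsKey(event)) {
                missing.add(s);
            }
        }
        return missing;
    }
}
```

In this toy table, reusing KILL_CONTAINER needs no new transitions, whereas 
CONTAINER_EXITED_WITH_FAILURE would need one for every state that 
statesMissing reports.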



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-10 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13764007#comment-13764007
 ] 

Xuan Gong commented on YARN-867:


bq. I think we should handle AuxServicesEventType.APPLICATION_INIT and the stop 
event in Application and not container. That should simplify THIS patch a lot.

I do not see the benefit. 
When any AuxService fails for a container, we need to fail that container. If 
we handle the AuxServicesEventType in Application, the Application eventually 
has to tell that particular container (not all the containers) to exit with 
failure. That ends up as the same process as handling it from the container 
directly. If there is no difference, why increase the traffic (more events) for 
the application?



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-09-07 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760933#comment-13760933
 ] 

Vinod Kumar Vavilapalli commented on YARN-867:
--

bq. Vinod Kumar Vavilapalli Are we making the call that an issue in service 
handling is considered a container failure? For the MR AM, it may be critical 
for the shuffle to work but this is not necessarily true for all applications 
and all services that they interact with.

Yeah, I think that is a reasonable assumption for now. I haven't seen any 
aux-services besides shuffle. In the future, we could make it specifiable per 
container, along with the concept of optional aux-services for containers 
(today everything is implicitly a required aux-service). And we can do that in 
a compatible manner.
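
One way the required/optional idea might look is sketched below. It is purely 
hypothetical: nothing like AuxFailurePolicy exists in the thread or the 
patches, and how a container would declare its required services is left 
open. Treating every configured service as required reproduces today's 
implicit behavior.

```java
import java.util.Set;

// Hypothetical per-container policy: only failures of services the container
// declared as required fail the container; other failures are tolerated.
final class AuxFailurePolicy {
    enum Decision { FAIL_CONTAINER, CONTINUE }

    static Decision onServiceFailure(String failedService, Set<String> requiredServices) {
        return requiredServices.contains(failedService)
                ? Decision.FAIL_CONTAINER
                : Decision.CONTINUE;
    }
}
```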



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-08-25 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13749573#comment-13749573
 ] 

Hadoop QA commented on YARN-867:


{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12599071/YARN-867.sampleCode.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1765//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1765//console

This message is automatically generated.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-08-21 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746503#comment-13746503
 ] 

Hitesh Shah commented on YARN-867:
--

[~vinodkv] Are we making the call that an issue in service handling is 
considered a container failure? For the MR AM, it may be critical for the 
shuffle to work but this is not necessarily true for all applications and all 
services that they interact with.



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-08-20 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13745794#comment-13745794
 ] 

Xuan Gong commented on YARN-867:


bq. Let's just handle the NM crash scenario here. And for informing the AM, 
instead of adding more protocol changes, we can fail the container setting a 
proper diagnostic and may be a custom exit-code.

I agree. I think we can solve this issue in an easier way. This is the 
proposal:
If the auxServices throw exceptions, we still need to catch them. After that, 
we can fail the related container by sending a containerExitEvent with 
ContainerEventType.CONTAINER_EXITED_WITH_FAILURE, along with a proper 
diagnostic and a custom exit-code. Eventually, this container will transition 
to the Completed state, and we can then inform the RM through the node 
heartbeat. The related RMContainer will get the diagnostic info and custom 
exit-code and will also go to the completed state. So, when the AM does its 
heartbeat, it will get the list of completed containerStatus entries. After 
that, the AM simply needs to check the exit code to find out whether any 
auxService failed.

Attached is the sample code for this proposal.
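
The AM-side check at the end of this proposal might look like the sketch 
below. It is not the attached sample code: CompletedContainerSketch stands in 
for the container status the AM receives, AuxFailureScanner is a hypothetical 
helper, and the exit-code value is made up for illustration, since the actual 
code is whatever the patch defines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a completed container's status as seen by the AM.
final class CompletedContainerSketch {
    final String containerId;
    final int exitCode;
    final String diagnostics;

    CompletedContainerSketch(String containerId, int exitCode, String diagnostics) {
        this.containerId = containerId;
        this.exitCode = exitCode;
        this.diagnostics = diagnostics;
    }
}

final class AuxFailureScanner {
    // Made-up value for illustration only; not a real YARN exit code.
    static final int AUX_SERVICE_FAILED_EXIT_CODE = -2001;

    // Picks out the completed containers whose exit code marks an
    // aux-service failure, so the AM can decide how to react to each.
    static List<CompletedContainerSketch> failedByAuxService(
            List<CompletedContainerSketch> completed) {
        List<CompletedContainerSketch> failed = new ArrayList<>();
        for (CompletedContainerSketch c : completed) {
            if (c.exitCode == AUX_SERVICE_FAILED_EXIT_CODE) {
                failed.add(c);
            }
        }
        return failed;
    }
}
```

The diagnostics string carries the detail (for example, which service failed 
and why), while the exit code lets the AM tell this case apart from an 
ordinary task failure.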



[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-08-14 Thread Hitesh Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740042#comment-13740042
 ] 

Hitesh Shah commented on YARN-867:
--

It might be good to break this down into a set of JIRAs: the first (this JIRA 
itself) to just ensure that the NM does not crash, and the second to address 
the proposed changes in the protocol and potential changes in the MR AM to use 
the new APIs and handle failures as needed.




[jira] [Commented] (YARN-867) Isolation of failures in aux services

2013-08-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13740060#comment-13740060
 ] 

Hadoop QA commented on YARN-867:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12598022/YARN-867.1.sampleCode.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 9 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/1714//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/1714//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1714//console

This message is automatically generated.
