[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-14 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-196207275
  
Reopen it and change the title.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-13 Thread jerryshao
Github user jerryshao closed the pull request at:

https://github.com/apache/spark/pull/11512


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-13 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-196165073
  
Sure, thanks @tgravescs and @vanzin for your comments. I will close this 
create new PRs accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-11 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-195370827
  
Yes its only if they explicitly System.exit.  Or someone puts a System.exit 
back into Spark but that shouldn't happen except on failure.

ok @jerryshao  can you update pr for master to just be to change default 
behavior of shutdown hook assume failures.  Then a PR to 1.6 with this signal 
handler


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-10 Thread vanzin
Github user vanzin commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-195137045
  
The signal handler change looks fine to me; I agree with Saisai that the 
race would happen very rarely and shouldn't be a problem.

Regarding the change in behavior...

> I'm wondering if we just don't change the behavior for 2.x to report 
failed by default

So this would only apply to applications that do `System.exit` explicitly, 
right? At this point in time I hope people have stopped doing that, so it 
should be fine. A quick grep in Spark code shows that `System.exit` seems to be 
used only for error cases, so the change would actually help with that (not 
that Spark should be calling `System.exit` itself, but that's a different 
problem).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-09 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-194475748
  
So if they catch exceptions themselves we have to assume succeeded which is 
what we do now.  I just hate to add more logic and not really solve the 
underlying problem.  But unfortunately its really hard for us to know since its 
any arbitrary user code.  We can't just ask the scheduler if its done.

So I'm thinking about this some more and I'm wondering if we just don't 
change the behavior for 2.x to report failed by default.  The reason we do it 
this way is the SPARK code itself and examples used to System.exit().  We have 
removed those quite a few releases back now.  People shouldn't be calling 
System.exit and should just be calling sc.stop().   2.x gives us an opportunity 
to change that behavior.

@vanzin @pwendell  Do you guys have any concerns about this?

if we want to pull this into 1.6.x line I would be ok adding in the signal 
handler to fix this one issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-08 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-194097212
  
Thanks a lot for your comment @tgravescs , I think the scenario you 
mentioned though has race condition, it seldom happens, and currently we don't 
have solutions to handle that, since driver and AM are independent.

Yes, normally we should expect scheduler throw all the exceptions, but you 
cannot expect any user code will starting a job immediately after `sc` is 
created, or exception will definitely be thrown to AM, what if main  function 
catch all the exceptions?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-08 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-193855548
  
So I think the problem with this is still that we don't know the real state 
of the application.  There is still a race condition here where the users 
application and sparkcontext could be succeeded (ie wrote output and was about 
to finish) but if we get a SIGTERM at the wrong point we mark it as failed 
rather then succeeded.

Normally if Spark was actually running anything the DAGScheduler would 
throw an exception. I think your use case here is the driver is sleeping and 
not running anything distributed so the scheduler doesn't throw, it just calls 
sc.stop().  I would rather explore having the scheduler throw if it has any 
jobs, including inactive ones or allowing the application master to know the 
state of the Spark jobs.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-08 Thread tgravescs
Github user tgravescs commented on a diff in the pull request:

https://github.com/apache/spark/pull/11512#discussion_r55375068
  
--- Diff: 
yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---
@@ -116,6 +118,21 @@ private[spark] class ApplicationMaster(
 
   private var delegationTokenRenewerOption: 
Option[AMDelegationTokenRenewer] = None
 
+  if (SystemUtils.IS_OS_UNIX) {
+// Register signal handler for signal "TERM", "INT" and "HUP". For the 
cases where AM receive a
+// signal and stop, from RM's aspect this application needs to be 
reattempted, rather than mark
+// as success.
+// Replace this signal handler with SignalLogger in AM side.
+val signalHandler = new SignalHandler() {
+  override def handle(sig: Signal): Unit = {
+
+logInfo(s"received signal: ${sig.getName}")
--- End diff --

I'd rather see this have same information as message in the signallogger:
log.error("RECEIVED SIGNAL " + signal.getNumber() + ": SIG" + 
signal.getName())

You can add something to indicate from AM and not signal logger if you want


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-192131877
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-192131879
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52448/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-192131732
  
**[Test build #52448 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52448/consoleFull)**
 for PR 11512 at commit 
[`b461b71`](https://github.com/apache/spark/commit/b461b717ed51b532f823615bcb79f66b17635c4d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11512#issuecomment-192125457
  
**[Test build #52448 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52448/consoleFull)**
 for PR 11512 at commit 
[`b461b71`](https://github.com/apache/spark/commit/b461b717ed51b532f823615bcb79f66b17635c4d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kil...

2016-03-03 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/11512

[SPARK-13642][Yarn] Properly handle signal kill in ApplicationMaster

## What changes were proposed in this pull request?

This patch is fixing the race condition in ApplicationMaster when receiving 
a signal. In the current implementation, if signal is received and with no any 
exception, this application will be finished with successful state in Yarn, and 
there's no another attempt. Actually the application is killed by signal in the 
runtime, so another attempt is expected.

This patch adds a signal handler to handle the signal things, if signal is 
received, marking this application finished with failure, rather than success.

## How was this patch tested?

This patch is tested with following situations:

1. Application is finished normally.
2. Application is finished by calling `System.exit(n)`.
3. Application is killed by yarn command.
4. ApplicationMaster is killed by "SIGTERM" send by `kill pid` command.
5. ApplicationMaster is killed by NM with "SIGTERM" in case of NM failure.

All the scenarios return the expected states.

CC @tgravescs , please help to review this fix, thanks a lot.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-13642

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11512.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11512


commit b461b717ed51b532f823615bcb79f66b17635c4d
Author: jerryshao 
Date:   2016-03-04T05:52:18Z

Properly handle signal kill in ApplicationMaster




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org