[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-08 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/11205
  
I guess the issue still exists; let me verify it again, and if it does I will 
bring the PR up to date with the latest code. Thanks!


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-08 Thread rustagi
Github user rustagi commented on the issue:

https://github.com/apache/spark/pull/11205
  
Sorry, I haven't been able to confirm this patch because I have not seen the 
issue in production for quite some time.
It was much more persistent with 2.0 than with 2.1.
Not sure of the cause.


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-23 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/11205
  
This PR is pretty old and a lot has changed since, but it looks like this 
can be fixed now by just changing the code to look at `stageIdToTaskIndices` instead 
of keeping `numRunningTasks` around? (Or maybe use `numRunningTasks` as a cache 
for `stageIdToTaskIndices.values.sum`.)
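
A minimal sketch of that idea (the class and field names here are assumptions 
for illustration, not the actual `ExecutorAllocationManager` code; the follow-up 
below explains why the existing structure does not behave this way):

```scala
import scala.collection.mutable

// Hypothetical sketch: derive the running-task count from the per-stage index
// sets instead of maintaining a separate numRunningTasks counter.
class AllocationStateSketch {
  // stageId -> indices of tasks currently tracked for that stage
  val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]

  // Recompute on demand; alternatively, keep numRunningTasks around only as a
  // cached copy of this sum and refresh it on task start/end events.
  def numRunningTasks: Int = stageIdToTaskIndices.values.map(_.size).sum
}
```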

Also, doesn't `isExecutorIdle` take care of the second bullet in your 
description?


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11205
  
**[Test build #83067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83067/testReport)** for PR 11205 at commit [`59f9c15`](https://github.com/apache/spark/commit/59f9c156c3ad746f84f385bcf277685c9c329286).


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/11205
  
@vanzin, in the current code `stageIdToTaskIndices` cannot be used to 
track the number of running tasks, because this structure does not remove a task's 
index when the task finishes successfully.

Yes, `isExecutorIdle` is meant to take care of executor idleness, but the way it 
identifies whether an executor is idle is not robust enough. In this scenario, when 
a stage is aborted because of max task failures, some task end events will be 
missing, so tracking the number of tasks per executor leaves residual data and 
makes the executor always appear busy.
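
To illustrate the failure mode being described, here is a simplified, hypothetical 
model of per-executor task tracking (not the actual Spark listener code):

```scala
import scala.collection.mutable

object IdleTrackingSketch {
  // executorId -> ids of tasks believed to be running on that executor
  private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]

  def onTaskStart(executorId: String, taskId: Long): Unit =
    executorIdToTaskIds.getOrElseUpdate(executorId, new mutable.HashSet[Long]) += taskId

  def onTaskEnd(executorId: String, taskId: Long): Unit =
    executorIdToTaskIds.get(executorId).foreach(_ -= taskId)

  // An executor is treated as idle only when it has no tracked tasks. If a stage
  // is aborted after max task failures and some onTaskEnd events never arrive,
  // the stale task ids stay in the set and the executor never looks idle.
  def isExecutorIdle(executorId: String): Boolean =
    executorIdToTaskIds.get(executorId).forall(_.isEmpty)
}
```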




[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread jerryshao
Github user jerryshao commented on the issue:

https://github.com/apache/spark/pull/11205
  
Verified again; it looks like the 2nd bullet is no longer valid. I cannot 
reproduce it on the latest master branch, so it might already have been fixed by 
SPARK-13054.

So only the first issue still exists, and I think @sitalkedia's PR is enough to 
handle it. I'm going to close this one. @sitalkedia, would you please reopen your 
PR? Sorry for bringing in noise.


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/11205
  
**[Test build #83067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83067/testReport)** for PR 11205 at commit [`59f9c15`](https://github.com/apache/spark/commit/59f9c156c3ad746f84f385bcf277685c9c329286).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11205
  
Merged build finished. Test PASSed.


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-10-25 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/11205
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83067/


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-07-23 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11205
  
gentle ping @rustagi 


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2017-06-18 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/11205
  
gentle ping @rustagi, have you maybe had some time to confirm this patch? 
It sounds like the only thing we need here is the confirmation.


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2018-04-26 Thread sadhen
Github user sadhen commented on the issue:

https://github.com/apache/spark/pull/11205
  
@jerryshao I think the 2nd bullet has not been fixed in SPARK-13054.

I use Spark 2.1.1, and I still find that finished tasks remain in 
`private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]`.

But `numRunningTasks` equals 0 because of:

```scala
  // Reset the counter whenever no stages are running; any stale task ids left
  // behind in executorIdToTaskIds are not cleaned up here.
  if (numRunningTasks != 0) {
    logWarning("No stages are running, but numRunningTasks != 0")
    numRunningTasks = 0
  }
```


[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2016-09-03 Thread rustagi
Github user rustagi commented on the issue:

https://github.com/apache/spark/pull/11205
  
I am seeing this issue quite frequently. Not sure what is causing it, but 
frequently we will get an onTaskEnd event after a stage has ended. This causes 
`numRunningTasks` to become negative. If the executor count is then updated, the 
number of required executors (`maxNumExecutorsNeeded`) becomes negative, which 
causes issues in executor allocation and deallocation. In the best case you get 
executors that cannot be deallocated, and over time Spark does not allocate new 
executors even if there are tasks pending.
There is a simple hacky patch here: 
https://github.com/apache/spark/pull/9288, and this PR is an attempt to correct 
it with more accountability.
I am seeing this issue so frequently that I am not sure it's possible to run 
Spark with dynamic allocation successfully for a long duration without fixing it. 
I'll try the hacky patch and confirm.
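
For reference, a minimal sketch of the kind of defensive clamp the "hacky" 
approach implies (hypothetical code, not the actual diff in PR #9288):

```scala
object CounterClampSketch {
  private var numRunningTasks = 0

  def onTaskStart(): Unit = numRunningTasks += 1

  // Clamp at zero so a task-end event that arrives after its stage has already
  // ended cannot drive the counter, and hence maxNumExecutorsNeeded, negative.
  def onTaskEnd(): Unit = {
    if (numRunningTasks > 0) numRunningTasks -= 1
    else println("WARN: task end received after stage completion; ignoring")
  }
}
```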



[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...

2016-09-06 Thread rustagi
Github user rustagi commented on the issue:

https://github.com/apache/spark/pull/11205
  
I can confirm that disabling speculation and setting the max task failures to 1 
eliminates this problem. Will try the patch and confirm.
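
For anyone wanting to reproduce that workaround, this is roughly the 
configuration it corresponds to (standard Spark settings, shown here only as a 
sketch of the workaround, not a recommended production setup):

```scala
import org.apache.spark.SparkConf

// Disable speculative execution and allow only a single task attempt,
// the workaround described above.
val conf = new SparkConf()
  .set("spark.speculation", "false")
  .set("spark.task.maxFailures", "1")
```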

