[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-11 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20888
  
**[Test build #89184 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89184/testReport)**
 for PR 20888 at commit 
[`4e7be70`](https://github.com/apache/spark/commit/4e7be70e6d35f290a8ddd6278a9feccbc06791aa).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20888
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89131/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20888
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20888
  
**[Test build #89131 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89131/testReport)**
 for PR 20888 at commit 
[`16525b3`](https://github.com/apache/spark/commit/16525b33bbf879845d50d5b966ea67201cb2ed31).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-10 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20888
  
**[Test build #89131 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89131/testReport)**
 for PR 20888 at commit 
[`16525b3`](https://github.com/apache/spark/commit/16525b33bbf879845d50d5b966ea67201cb2ed31).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-10 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
Thank for the hints. I've taken a deeper look at the possible solutions and 
the suggested test. The problem is similar but not the same so I would solve it 
a different way. So here is my proposal. `cancelStage` sets `reasonIfKilled` in 
`TaskContext` normally but the executor thread will run untouched at this 
timestamp. The thread will be killed later triggered by `killTaskIfInterrupted` 
which throws `TaskKilledException`. When `isInterrupted` checked all the time 
when `DataFrameRangeSuite.stageToKill` will be set then the race can be avoided.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-09 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20888
  
Sounds like this approach is just racy by nature. There's a similar test in 
SparkContextSuite ("Cancelling stages/jobs with custom reasons") where I fixed 
a similar issue using a semaphore. Sounds like you need something similar here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-04-09 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
`SparkStatusTracker` states the following:
```
 * These APIs intentionally provide very weak consistency semantics; 
consumers of these APIs should
 * be prepared to handle empty / missing information.  For example, a job's 
stage ids may be known
 * but the status API may not have any information about the details of 
those stages, so
 * `getStageInfo` could potentially return `None` for a valid stage id.
```
This is reflected in the additional logs. I've wrapped `cancelStage`, 
`DataFrameRangeSuite.stageToKill = DataFrameRangeSuite.INVALID_STAGE_ID` and 
`assert(sparkContext.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum
 == 0)` statements with additional log entries and reproduced the issue. (The 
number after the colon is `stageToKill`)

The log shows the following flow:

1. Stage 88 was successfully cancelled
```
07:17:50.862 Executor task launch worker for task 699 INFO Executor: 
Running task 0.0 in stage 88.0 (TID 699)
07:17:50.865 spark-listener-group-shared INFO DataFrameRangeSuite: BEFORE 
CANCELLED: 88
07:17:50.866 spark-listener-group-shared INFO DataFrameRangeSuite: AFTER 
CANCELLED: 88
```
2. Waiting on the task count drop to zero on executors
```
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO 
DataFrameRangeSuite: BEFORE NO TASKS
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO 
DataFrameRangeSuite: AFTER NO TASKS
```
3. Resetting the `stageToKill` to -1
```
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO 
DataFrameRangeSuite: BEFORE RESET: 88
07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO 
DataFrameRangeSuite: AFTER RESET: -1
```
4. Only after that the executor thread killed
```
07:17:50.870 Executor task launch worker for task 699 INFO Executor: 
Executor killed task 0.0 in stage 88.0 (TID 699), reason: Stage cancelled
07:17:50.870 task-result-getter-0 WARN TaskSetManager: Lost task 0.0 in 
stage 88.0 (TID 699, localhost, executor driver): TaskKilled (Stage cancelled)
07:17:50.870 task-result-getter-0 INFO TaskSchedulerImpl: Removed TaskSet 
88.0, whose tasks have all completed, from pool
```
In this situation the old thread had the possibility to overwrite the 
`stageToKill` variable after the reset. The new task thread not yet running to 
update the `stageToKill` to 90 so the `onJobStart` hook see the old 88 value.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-23 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
@vanzin @squito yeah, there is an issue with threading as well. I'm just 
taking a look at it because it's not obvious.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-23 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/20888
  
I had sort of the same doubt as Imran, because I thought scalasuite ran 
tests in order... but I just ran a suite where the tests were not run in the 
order declared in the source file. So it sounds possible that this bug would be 
hit even when not running that test in isolation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-23 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20888
  
ah ok, yes when run in isolation, the stage will be 0, so your change makes 
sense.  But that is not what is making it flaky in a full test run


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20888
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88536/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20888
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-23 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20888
  
**[Test build #88536 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88536/testReport)**
 for PR 20888 at commit 
[`42c930d`](https://github.com/apache/spark/commit/42c930d694e0bbc66974516b6719a698d664f681).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
Uploaded logs for the jira. You're right when you pointed out a second 
issue with the `stageToKill`. The `onJobStart` tries to cancel the same ID 
twice.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
I mean on my machine the stage ID is zero for long long time here:
```
DataFrameRangeSuite.stageToKill = TaskContext.get().stageId()
```
and after 200 seconds the other thread still stands on this (increased the 
timeout to play):
```
eventually(timeout(300.seconds), interval(1.millis)) {
  assert(DataFrameRangeSuite.stageToKill != 
DataFrameRangeSuite.INVALID_STAGE_ID)
}
```
and `cancelStage` never called. Based on that the first non-invalid stage 
ID on my machine is 0. The actual code miss that edge case.

When I start the whole suite then other tests are pre-initializing 
something and the ID is not 0 anymore.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20888
  
> if I execute the test on my machine alone it never pass.

you mean it never fails on your machine, right?  its only flaky when you 
run everything on jenkins?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20888
  
hmm you're right, I was looking at a different branch in my editor and 
didn't pay attention that it was reset in the code I linked to on master, oops.

I still dont' understand your proposed solution, though -- how is checking 
`stageToKill != -1` better than checking `stageToKill > 0` in this case?




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
Just an additional info if I execute the test on my machine alone it never 
pass.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread gaborgsomogyi
Github user gaborgsomogyi commented on the issue:

https://github.com/apache/spark/pull/20888
  
Where do you think the reset should happen? There is already one inside 
`withSQLConf` which makes a reset before job submit.

Related the ID I've just taken a look at the original implementation and I 
see to kill the ID 0 
[here](https://github.com/apache/spark/commit/4064574d031215fcfdf899a1ee9f3b6fecb1bfc9).
 What have I missed?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/20888
  
I think you're right about killing the wrong stage, but I don't think its 
exactly what you've outlined.  The original code doesn't try to kill a stage 
with ID == 0 -- instead its just waiting until that volatile is set to 
something > 0, and then proceeds.  that seems to work fine, we do see that the 
stage does get canceled OK once.

However, I think the problem is because the test [runs twice, with and 
without 
codegen](https://github.com/apache/spark/blob/4d37008c78d7d6b8f8a649b375ecc090700eca4f/sql/core/src/test/scala/org/apache/spark/sql/DataFrameRangeSuite.scala#L165).
  The first time, it'll always wait to till the stage Id is set, because of 
that `eventually { ... stageToKill > 0}`.

however, on the second iteration, that `stageToKill` may still be > 0 based 
on the first iteration, not because its been set by the second iteration.  So I 
think you just need to reset the value to -1 between iterations.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/20888
  
Can one of the admins verify this patch?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...

2018-03-22 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/20888
  
**[Test build #88536 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88536/testReport)**
 for PR 20888 at commit 
[`42c930d`](https://github.com/apache/spark/commit/42c930d694e0bbc66974516b6719a698d664f681).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org