[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20888 **[Test build #89184 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89184/testReport)** for PR 20888 at commit [`4e7be70`](https://github.com/apache/spark/commit/4e7be70e6d35f290a8ddd6278a9feccbc06791aa). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20888 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89131/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20888 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20888 **[Test build #89131 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89131/testReport)** for PR 20888 at commit [`16525b3`](https://github.com/apache/spark/commit/16525b33bbf879845d50d5b966ea67201cb2ed31). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20888 **[Test build #89131 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89131/testReport)** for PR 20888 at commit [`16525b3`](https://github.com/apache/spark/commit/16525b33bbf879845d50d5b966ea67201cb2ed31). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 Thank for the hints. I've taken a deeper look at the possible solutions and the suggested test. The problem is similar but not the same so I would solve it a different way. So here is my proposal. `cancelStage` sets `reasonIfKilled` in `TaskContext` normally but the executor thread will run untouched at this timestamp. The thread will be killed later triggered by `killTaskIfInterrupted` which throws `TaskKilledException`. When `isInterrupted` checked all the time when `DataFrameRangeSuite.stageToKill` will be set then the race can be avoided. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20888 Sounds like this approach is just racy by nature. There's a similar test in SparkContextSuite ("Cancelling stages/jobs with custom reasons") where I fixed a similar issue using a semaphore. Sounds like you need something similar here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 `SparkStatusTracker` states the following: ``` * These APIs intentionally provide very weak consistency semantics; consumers of these APIs should * be prepared to handle empty / missing information. For example, a job's stage ids may be known * but the status API may not have any information about the details of those stages, so * `getStageInfo` could potentially return `None` for a valid stage id. ``` This is reflected in the additional logs. I've wrapped `cancelStage`, `DataFrameRangeSuite.stageToKill = DataFrameRangeSuite.INVALID_STAGE_ID` and `assert(sparkContext.statusTracker.getExecutorInfos.map(_.numRunningTasks()).sum == 0)` statements with additional log entries and reproduced the issue. (The number after the colon is `stageToKill`) The log shows the following flow: 1. Stage 88 was successfully cancelled ``` 07:17:50.862 Executor task launch worker for task 699 INFO Executor: Running task 0.0 in stage 88.0 (TID 699) 07:17:50.865 spark-listener-group-shared INFO DataFrameRangeSuite: BEFORE CANCELLED: 88 07:17:50.866 spark-listener-group-shared INFO DataFrameRangeSuite: AFTER CANCELLED: 88 ``` 2. Waiting on the task count drop to zero on executors ``` 07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: BEFORE NO TASKS 07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: AFTER NO TASKS ``` 3. Resetting the `stageToKill` to -1 ``` 07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: BEFORE RESET: 88 07:17:50.869 ScalaTest-main-running-DataFrameRangeSuite INFO DataFrameRangeSuite: AFTER RESET: -1 ``` 4. Only after that the executor thread killed ``` 07:17:50.870 Executor task launch worker for task 699 INFO Executor: Executor killed task 0.0 in stage 88.0 (TID 699), reason: Stage cancelled 07:17:50.870 task-result-getter-0 WARN TaskSetManager: Lost task 0.0 in stage 88.0 (TID 699, localhost, executor driver): TaskKilled (Stage cancelled) 07:17:50.870 task-result-getter-0 INFO TaskSchedulerImpl: Removed TaskSet 88.0, whose tasks have all completed, from pool ``` In this situation the old thread had the possibility to overwrite the `stageToKill` variable after the reset. The new task thread not yet running to update the `stageToKill` to 90 so the `onJobStart` hook see the old 88 value. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 @vanzin @squito yeah, there is an issue with threading as well. I'm just taking a look at it because it's not obvious. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/20888 I had sort of the same doubt as Imran, because I thought scalasuite ran tests in order... but I just ran a suite where the tests were not run in the order declared in the source file. So it sounds possible that this bug would be hit even when not running that test in isolation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20888 ah ok, yes when run in isolation, the stage will be 0, so your change makes sense. But that is not what is making it flaky in a full test run --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20888 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/88536/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20888 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20888 **[Test build #88536 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88536/testReport)** for PR 20888 at commit [`42c930d`](https://github.com/apache/spark/commit/42c930d694e0bbc66974516b6719a698d664f681). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 Uploaded logs for the jira. You're right when you pointed out a second issue with the `stageToKill`. The `onJobStart` tries to cancel the same ID twice. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 I mean on my machine the stage ID is zero for long long time here: ``` DataFrameRangeSuite.stageToKill = TaskContext.get().stageId() ``` and after 200 seconds the other thread still stands on this (increased the timeout to play): ``` eventually(timeout(300.seconds), interval(1.millis)) { assert(DataFrameRangeSuite.stageToKill != DataFrameRangeSuite.INVALID_STAGE_ID) } ``` and `cancelStage` never called. Based on that the first non-invalid stage ID on my machine is 0. The actual code miss that edge case. When I start the whole suite then other tests are pre-initializing something and the ID is not 0 anymore. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20888 > if I execute the test on my machine alone it never pass. you mean it never fails on your machine, right? its only flaky when you run everything on jenkins? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20888 hmm you're right, I was looking at a different branch in my editor and didn't pay attention that it was reset in the code I linked to on master, oops. I still dont' understand your proposed solution, though -- how is checking `stageToKill != -1` better than checking `stageToKill > 0` in this case? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 Just an additional info if I execute the test on my machine alone it never pass. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user gaborgsomogyi commented on the issue: https://github.com/apache/spark/pull/20888 Where do you think the reset should happen? There is already one inside `withSQLConf` which makes a reset before job submit. Related the ID I've just taken a look at the original implementation and I see to kill the ID 0 [here](https://github.com/apache/spark/commit/4064574d031215fcfdf899a1ee9f3b6fecb1bfc9). What have I missed? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user squito commented on the issue: https://github.com/apache/spark/pull/20888 I think you're right about killing the wrong stage, but I don't think its exactly what you've outlined. The original code doesn't try to kill a stage with ID == 0 -- instead its just waiting until that volatile is set to something > 0, and then proceeds. that seems to work fine, we do see that the stage does get canceled OK once. However, I think the problem is because the test [runs twice, with and without codegen](https://github.com/apache/spark/blob/4d37008c78d7d6b8f8a649b375ecc090700eca4f/sql/core/src/test/scala/org/apache/spark/sql/DataFrameRangeSuite.scala#L165). The first time, it'll always wait to till the stage Id is set, because of that `eventually { ... stageToKill > 0}`. however, on the second iteration, that `stageToKill` may still be > 0 based on the first iteration, not because its been set by the second iteration. So I think you just need to reset the value to -1 between iterations. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20888 Can one of the admins verify this patch? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] DataFrameRangeSuite should wait for ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20888 **[Test build #88536 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/88536/testReport)** for PR 20888 at commit [`42c930d`](https://github.com/apache/spark/commit/42c930d694e0bbc66974516b6719a698d664f681). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org