[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 @vanzin, thanks for working on this. I was out most of this week at a conference and I'm still on just half time, which is why I was delayed. Sorry to leave you all waiting. I'll make comments on your PR. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 See link above for the updated PR.
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 If you are working on this, I'll merge the other one, wait for you, and continue to investigate in parallel.
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 Ah ok, I was looking at my own version as well. There are other things we should update for v2 as well: other functions with the variable names, the description in DataWriterFactory.java, etc.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 FYI, I'm preparing my own version of this PR with the remaining feedback addressed. Ryan was on paternity leave and I don't know whether he's done yet, so he may not be that responsive. This will conflict with the output commit coordinator change in any case, so one of them needs to wait (and that one is further along).
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 ping @rdblue ^ . If I don't hear back tomorrow, I'll file a separate jira.
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 @rdblue would you want to update for v1 and the hadoop committers? It should be very similar to this change: createTaskAttemptContext should take the actual task attempt id (type Long) instead of the attempt number as an argument. I would need to look at the v1 writers to make sure there is nothing else, but perhaps you are more familiar with them. If not we can split that into a separate jira.
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 Correct, TaskId is unique. I agree with @squito, can we rename to tid. Hmm, good point @rdblue, we need to make sure the output directories/files and such would be unique. That problem may actually exist in the current committers as well; will comment on the other pr.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92050/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21558 **[Test build #92050 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92050/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 Actually I didn't mean speculation but something like this:

```scala
sc.parallelize(1 to 10).foreach { i =>
  if (TaskContext.get().attemptNumber() == 0) {
    throw new Exception("Fail")
  } else {
    println(i)
  }
}
```

Anyway I ran that and the behavior is the same (different attempt = different task ID) so it's all good.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 Yes, I just checked and speculative attempts do get a different TID. Just turn on speculation, run a large stage, and sort tasks in a stage by TID. There aren't duplicates.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 > the ID that this uses is the TID, which I thought was always unique. It appears to be a one-up counter If that's the case, and a different attempt for the same partition will get a different task ID, that's fine. I was under the impression it would have the same task ID.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21558 **[Test build #92050 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92050/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/295/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4189/
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 Retest this please.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92043/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test FAILed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21558 **[Test build #92043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92043/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 @vanzin, the ID that this uses is the TID, which I thought was always unique. It appears to be a one-up counter. Also, I noted on your PR that both are needed because even if we only commit one of the attempts, the writers may use this ID to track and clean up data.
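To make the distinction concrete, here is a minimal sketch (illustrative only, not Spark's actual scheduler code; all names are hypothetical) of the behavior described above: task IDs come from a single one-up counter, while attempt numbers are assigned per partition and can repeat.

```python
from itertools import count

# Global one-up counter, modeling the scheduler's TID allocation.
_next_tid = count()

def launch_task(partition, attempt_number):
    """Model launching a task attempt: returns (tid, partition, attempt_number)."""
    return (next(_next_tid), partition, attempt_number)

first = launch_task(partition=0, attempt_number=0)
retry = launch_task(partition=0, attempt_number=1)  # retry of partition 0
spec = launch_task(partition=0, attempt_number=1)   # speculative copy of the retry

# Attempt numbers can collide across concurrent attempts,
# but every launched attempt gets a distinct TID.
assert retry[2] == spec[2]
assert len({first[0], retry[0], spec[0]}) == 3
```

This is why using the TID gives writers a unique identifier even when attempt numbers repeat.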
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 But note that the task ID is also not necessarily unique (since you can have multiple attempts of the same task). So perhaps you should consider Imran's suggestion of mixing stage and task attempt numbers.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 I'm fine with this; my PR should fix the underlying issue but this still seems like a good idea to me.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21558 **[Test build #92043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92043/testReport)** for PR 21558 at commit [`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4182/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/288/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 I think the right thing to do for this commit is to use the task ID instead of the stage-local attempt number. I've updated the PR with the change so I think this is ready to commit. @vanzin, are you okay committing this? cc @cloud-fan
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 @tgravescs, that's exactly what we're seeing. I think it might just be misleading to have a stage-local attempt ID, although it is more friendly for users and matches what MR does. @jiangxb1987, we see SPARK-24492 occasionally (it has gotten better with later fixes to the coordinator) and haven't tracked down the cause yet. If this were the underlying cause that would be great, but I'm not sure how it could be the cause. If the same attempt number is reused, then two tasks in different stage attempts may both be authorized to commit. That wouldn't cause the retries because it wouldn't cause extra commit denials.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 I started with some ideas in #21577; I haven't tested that PR aside from the modified unit test, but I think it's in the right direction. I plan to work more on it Monday, but leaving it there in case people are around over the weekend and want to comment.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 (Or, alternatively, the output committer could track task IDs instead of attempt numbers. That should result in the same behavior but I haven't really looked at that.)
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 I'm not sure, but I'm starting to think the part of the fix for SPARK-18113 that added the "Authorizing duplicate request..." stuff should be removed. The rest of that change replaces `askWithRetry` in the RPC calls with `ask`, so now there will only be one request per task. So there's no need to handle duplicate requests that way, because the only time you'd have a duplicate is when this situation happens (same task attempt number running concurrently on two stage attempts).
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21558 I guess https://issues.apache.org/jira/browse/SPARK-24492 is potentially caused by the output committer issue?
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 yeah that is the code that I was still looking at to verify if it can actually happen. On a fetch failure the dag scheduler removes the stage from the output committer, but when the new stage attempt starts it just puts the stage back with the stage id (not attempt id).
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 Actually, scratch that, there is really a problem. The output committer only checks the stage ID, not the stage attempt ID, so it will still allow tasks from the failed attempt to commit...
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 > If you have one stage running that gets a fetch failure, if it leaves any tasks running I took a look at the output coordinator code and, depending on how the scheduler behaves, it might be ok. The coordinator will deny commits for finished stages; so it depends on the order of things. If the failed attempt is marked as "failed" before the next attempt starts, then it's ok, even if tasks for the failed attempt are still running. Looking at the code handling `FetchFailed` failures in `DAGScheduler`, that seems to be the case.
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 so I looked through the code and it certainly appears to be a bug in the existing code (not just the v2 datasource api). If you have one stage running that gets a fetch failure, and it leaves any tasks running with attempt 0, those could conflict with the restarted stage since its tasks would all start with attempt 0 as well. When I say "could" I mean it would be a race if they go to commit at about the same time. It's probably more of an issue if one task commits and the job commit starts while the other task then starts to commit its output; you could end up with an incomplete/corrupt file. We should see the warning "Authorizing duplicate request to commit" in the logs if this occurs, though. @rdblue does this match what you are seeing?
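A minimal sketch of the race described above (illustrative only, not the real OutputCommitCoordinator; the class and method names are hypothetical): if commit authorization is keyed only by stage id, partition, and attempt number, a leftover attempt-0 task from a failed stage attempt is indistinguishable from attempt 0 of the retried stage.

```python
class CoordinatorSketch:
    """Models authorization keyed by (stage id, partition) -> attempt number."""

    def __init__(self):
        self.authorized = {}

    def can_commit(self, stage, partition, attempt):
        key = (stage, partition)
        if key not in self.authorized:
            self.authorized[key] = attempt
            return True
        # The "Authorizing duplicate request to commit" case: a second task
        # with the same attempt number is also allowed to commit.
        return self.authorized[key] == attempt

c = CoordinatorSketch()
assert c.can_commit(stage=5, partition=0, attempt=0)      # task from failed stage attempt
assert c.can_commit(stage=5, partition=0, attempt=0)      # same-numbered task from the retry: also allowed
assert not c.can_commit(stage=5, partition=0, attempt=1)  # a genuinely different attempt is denied
```

Both of the first two tasks believe they hold the commit authorization, which is the window for the incomplete/corrupt output described above.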
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/21558 sorry, trying to catch up on this thread. Are we saying this is a bug in the existing output committer as well when we have a fetch failure?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91794/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21558 **[Test build #91794 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91794/testReport)** for PR 21558 at commit [`e9e776a`](https://github.com/apache/spark/commit/e9e776a097f5dca1dccdd6e50b3790e6a91873d8). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 > IMO your change is the right fix, not just a workaround @squito, part of the problem is that the output commit coordinator -- which ensures only one attempt of a task commits -- relies on the attempt number to allow or deny commits. If this is the correct fix, should we also pass the TID as the attempt id to the commit coordinator?
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 > So the problem here is, when we retry a stage, Spark doesn't kill the tasks of the old stage and just launch tasks for the new stage I think that's something that should be fixed, but it wouldn't entirely fix the problem unless we were very careful about ordering in the driver: the stage would have to fail, then stop allowing commits, then wait for all of the tasks that were allowed to commit to finish, and account for the coordination messages being in flight. Not an easy problem. I'd like to see a fix that makes the attempt number unique within a job and partition, i.e., no two tasks should have the same (job id, partition id, attempt number) triple as Wenchen said.
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/21558 As @squito suggested, we can either use taskAttemptId or combine stageAttemptId and taskAttemptNumber together; both shall be able to represent a unique task attempt.
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21558 for a write job, the writing only happens at the last stage, so stage id doesn't matter, and the data source v2 API assumes `(job id, partition id, task attempt id)` can uniquely define a write task, even counting the failure cases. So the problem here is, when we retry a stage, Spark doesn't kill the tasks of the old stage and just launches tasks for the new stage. We may have running write tasks that have the same `(job id, partition id, task attempt id)`. One solution is, we can just use `(job id, task id)` to uniquely define a write task. But we may lose information like the index of this task and how many times it has been retried. Or we can use `(job id, partition id, stage attempt id, task attempt id)`, which is a little verbose, but has all the information.
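The identity collision described above can be sketched in a few lines (illustrative only, not Spark code; the field values are made up): two write tasks left running across a stage retry can share the `(job id, partition id, task attempt id)` triple, while adding the stage attempt id, or using the global task id, disambiguates them.

```python
# Two concurrent attempts for the same partition, one from each stage attempt.
old_attempt = {"job": 1, "partition": 3, "stage_attempt": 0, "task_attempt": 0, "tid": 7}
new_attempt = {"job": 1, "partition": 3, "stage_attempt": 1, "task_attempt": 0, "tid": 12}

def triple(t):
    return (t["job"], t["partition"], t["task_attempt"])

def quad(t):
    return (t["job"], t["partition"], t["stage_attempt"], t["task_attempt"])

assert triple(old_attempt) == triple(new_attempt)  # the v2 API's assumed-unique key collides
assert quad(old_attempt) != quad(new_attempt)      # adding stage attempt id restores uniqueness
assert old_attempt["tid"] != new_attempt["tid"]    # (job id, task id) is also unique
```

The trade-off in the comment above is visible here: the quad keeps the partition index and retry count, while `(job id, task id)` is compact but drops that information.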
Github user squito commented on the issue: https://github.com/apache/spark/pull/21558 IMO your change is the right fix, not just a workaround. I don't think it's a scheduler bug (though it's definitely unclear). I'll move that discussion to the jira. An alternative would be using `context.stageAttemptNumber()` and `context.attemptNumber()` together. > the stage ID can change for a different execution of the same stage, IIRC, and that would reset the attempt id. hmm, the only place I could imagine that happening is with a shared shuffle dependency between jobs, which gets renumbered and then skipped, but then perhaps re-executed on a fetch failure. That isn't relevant here, though, since it would only be shuffle map stages, not result stages.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/21558 It's a little weird to use the attempt id here anyway; the stage ID can change for a different execution of the same stage, IIRC, and that would reset the attempt id. @squito probably remembers a lot more about this off the top of his head.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3998/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21558 **[Test build #91794 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91794/testReport)** for PR 21558 at commit [`e9e776a`](https://github.com/apache/spark/commit/e9e776a097f5dca1dccdd6e50b3790e6a91873d8).
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/108/
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21558 Merged build finished. Test PASSed.
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 @cloud-fan, this is a work-around for SPARK-24552. I'm not sure of the right way to fix this besides fixing the scheduler so that it doesn't use task attempt numbers twice, but I think this works.