[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-22 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
@vanzin, thanks for working on this. I was out most of this week at a 
conference and I'm still only working half time, which is why I was delayed. 
Sorry to leave you all waiting. I'll make comments on your PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
See link above for the updated PR.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
If you are working on this, I'll merge the other one, wait for you, and 
continue to investigate in parallel.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
Ah, OK, I was looking at my own version as well. There are other things we 
should update for v2 too: other functions with the same variable names, the 
description in DataWriterFactory.java, etc.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-21 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
FYI, I'm preparing my own version of this PR with the remaining feedback 
addressed. Ryan was on paternity leave and I don't know whether he's done yet, 
so he may not be that responsive.

This will conflict with the output commit coordinator change in any case, 
so one of them needs to wait (and that one is further along).


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-20 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
ping @rdblue ^. If I don't hear back tomorrow, I'll file a separate JIRA.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-19 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
@rdblue, would you want to update this for the v1 and Hadoop committers? It 
should be very similar to this change: createTaskAttemptContext should take the 
actual task attempt ID (type Long) instead of the attempt number as an argument.

I would need to look at the v1 writers to make sure there is nothing else, 
but perhaps you are more familiar with them.

If not, we can split that into a separate JIRA.
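The proposed v1 change can be sketched with a pure-Java model of a 
Hadoop-style attempt ID string; the format string and method name below are 
illustrative assumptions, not Spark's actual createTaskAttemptContext code:

```java
// Illustrative only: how a Hadoop-style attempt ID could embed Spark's
// globally unique task ID (a Long) instead of the per-stage attempt number.
class AttemptIdSketch {
    // Hadoop attempt IDs look like: attempt_<jobTrackerId>_<jobId>_m_<taskId>_<attempt>
    static String hadoopAttemptId(String jobTrackerId, int jobId,
                                  int partitionId, long sparkTaskId) {
        // Putting the Spark task ID in the attempt slot keeps the string
        // unique across stage attempts, since task IDs are never reused.
        return String.format("attempt_%s_%04d_m_%06d_%d",
                jobTrackerId, jobId, partitionId, sparkTaskId);
    }
}
```

Because the last slot is a Long rather than a small per-stage counter, two 
concurrent attempts of the same partition can no longer produce the same 
attempt string.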


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-19 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
Correct, the task ID is unique. I agree with @squito: can we rename it to tid?

Hmm, good point @rdblue, we need to make sure the output directories/files 
and such would be unique. That problem may actually exist in the current 
committers as well; I'll comment on the other PR.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92050/
Test FAILed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21558
  
**[Test build #92050 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92050/testReport)**
 for PR 21558 at commit 
[`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
Actually I didn't mean speculation but something like this:

```
sc.parallelize(1 to 10).foreach { i =>
  if (TaskContext.get().attemptNumber() == 0) throw new Exception("Fail")
  else println(i)
}
```

Anyway I ran that and the behavior is the same (different attempt = 
different task ID) so it's all good.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
Yes, I just checked and speculative attempts do get a different TID. Just 
turn on speculation, run a large stage, and sort tasks in a stage by TID. There 
aren't duplicates.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
> the ID that this uses is the TID, which I thought was always unique. It 
appears to be a one-up counter

If that's the case, and a different attempt for the same partition will get 
a different task ID, that's fine. I was under the impression it would have the 
same task ID.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21558
  
**[Test build #92050 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92050/testReport)**
 for PR 21558 at commit 
[`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/295/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4189/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
Retest this please.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92043/
Test FAILed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21558
  
**[Test build #92043 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92043/testReport)**
 for PR 21558 at commit 
[`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
@vanzin, the ID that this uses is the TID, which I thought was always 
unique. It appears to be a one-up counter. Also, I noted on your PR that both 
are needed because even if we only commit one of the attempts, the writers may 
use this ID to track and clean up data.
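The "one-up counter" behavior can be modeled in a few lines; this is a sketch 
of the idea, not Spark's scheduler code, and the names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

// A minimal model of the scheduler's task ID allocation: TIDs come from a
// single one-up counter, so a retry of the same partition (even in a new
// stage attempt, or a speculative copy) always gets a fresh ID.
class TidCounter {
    private final AtomicLong nextTaskId = new AtomicLong(0);

    long launchTask(int partitionId) {
        // The partition and attempt number are ignored for ID purposes;
        // every launch increments the shared counter.
        return nextTaskId.getAndIncrement();
    }
}
```

Two launches of the same partition therefore always produce distinct TIDs, 
which is the property the writers rely on for tracking and cleanup.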


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
But note that the task ID is also not necessarily unique (since you can 
have multiple attempts of the same task). So perhaps you should consider 
Imran's suggestion of mixing stage and task attempt numbers.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
I'm fine with this; my PR should fix the underlying issue but this still 
seems like a good idea to me.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21558
  
**[Test build #92043 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92043/testReport)**
 for PR 21558 at commit 
[`6c60d14`](https://github.com/apache/spark/commit/6c60d1462c34f01610ada50c989832775b6fd117).


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/4182/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/288/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
I think the right thing to do for this commit is to use the task ID instead 
of the stage-local attempt number. I've updated the PR with the change so I 
think this is ready to commit. @vanzin, are you okay committing this?

cc @cloud-fan


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-18 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
@tgravescs, that's exactly what we're seeing. I think it might just be 
misleading to have a stage-local attempt ID, although it is more user-friendly 
and matches what MR does.

@jiangxb1987, we see SPARK-24492 occasionally (it has gotten better with 
later fixes to the coordinator) and haven't tracked down the cause yet. If this 
were the underlying cause that would be great, but I'm not sure how it could be 
the cause. If the same attempt number is reused, then two tasks in different 
stage attempts may both be authorized to commit. That wouldn't cause the 
retries because it wouldn't cause extra commit denials.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
I started with some ideas in #21577 ; I haven't tested that PR aside from 
the modified unit test, but I think it's in the right direction. I plan to work 
more on it Monday, but leaving it there in case people are around over the 
weekend and want to comment.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
(Or, alternatively, the output committer could track task IDs instead of 
attempt numbers. That should result in the same behavior but haven't really 
looked at that.)


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
I'm not sure, but I'm starting to think the part of the fix for SPARK-18113 
that added the "Authorizing duplicate request..." stuff should be removed.

The rest of that change replaces `askWithRetry` in the RPC calls with 
`ask`, so now there will only be one request per task. So there's no need to 
handle duplicate requests that way, because the only time you'd have a 
duplicate is when this situation happens (same task attempt number running 
concurrently on two stage attempts).


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21558
  
I guess https://issues.apache.org/jira/browse/SPARK-24492 is potentially 
caused by the output committer issue?


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
Yeah, that is the code I was still looking at to verify whether it can 
actually happen. On a fetch failure the DAG scheduler removes the stage from 
the output committer, but when the new stage attempt starts it just puts the 
stage back with the stage ID (not the attempt ID).


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
Actually, scratch that, there is really a problem. The output committer 
only checks the stage ID, not the stage attempt ID, so it will still allow 
tasks from the failed attempt to commit...


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
> If you have one stage running that gets a fetch failure, if it leaves any 
tasks running

I took a look at the output coordinator code and, depending on how the 
scheduler behaves, it might be ok.

The coordinator will deny commits for finished stages; so it depends on the 
order of things. If the failed attempt is marked as "failed" before the next 
attempt starts, then it's ok, even if tasks for the failed attempt are still 
running. Looking at the code handling `FetchFailed` failures in `DAGScheduler`, 
that seems to be the case.



---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-15 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
So I looked through the code and it certainly appears to be a bug in the 
existing code (not just the v2 data source API). If you have one stage running 
that gets a fetch failure and it leaves any tasks running with attempt 0, those 
could conflict with the restarted stage, since its tasks would all start with 
attempt 0 as well. By "could" I mean it would be a race if they go to commit at 
about the same time. It's probably more of an issue if one task commits, the 
job commit starts, and then the other task starts to commit its output; you 
could end up with an incomplete/corrupt file. We should see the warning 
"Authorizing duplicate request to commit" in the logs if this occurs, though.

@rdblue does this match what you are seeing?



---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-14 Thread tgravescs
Github user tgravescs commented on the issue:

https://github.com/apache/spark/pull/21558
  
Sorry, trying to catch up on this thread. Are we saying this is a bug in 
the existing output committer as well when we have a fetch failure?


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/91794/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21558
  
**[Test build #91794 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91794/testReport)**
 for PR 21558 at commit 
[`e9e776a`](https://github.com/apache/spark/commit/e9e776a097f5dca1dccdd6e50b3790e6a91873d8).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
> IMO your change is the right fix, not just a workaround

@squito, part of the problem is that the output commit coordinator -- which 
ensures only one attempt of a task commits -- relies on the attempt number to 
allow or deny commits. If this is the correct fix, should we also pass the TID 
as the attempt ID to the commit coordinator?


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
> So the problem here is, when we retry a stage, Spark doesn't kill the 
tasks of the old stage and just launch tasks for the new stage

I think that's something that should be fixed, but it wouldn't entirely fix 
the problem unless we were very careful about ordering in the driver: the stage 
would have to fail, then stop allowing commits, then wait for all of the tasks 
that were allowed to commit to finish, and account for the coordination 
messages being in flight. Not an easy problem.

I'd like to see a fix that makes the attempt number unique within a job and 
partition, i.e., no two tasks should have the same (job id, partition id, 
attempt number) triple as Wenchen said.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread jiangxb1987
Github user jiangxb1987 commented on the issue:

https://github.com/apache/spark/pull/21558
  
As @squito suggested, we can either use the taskAttemptId or combine the 
stageAttemptId and taskAttemptNumber together; either should be able to 
represent a unique task attempt.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/21558
  
For a write job, the writing only happens at the last stage, so the stage ID 
doesn't matter, and the data source v2 API assumes `(job id, partition id, task 
attempt id)` can uniquely identify a write task, even counting the failure cases.

So the problem here is, when we retry a stage, Spark doesn't kill the tasks 
of the old stage and just launches tasks for the new stage. We may have running 
write tasks that have the same `(job id, partition id, task attempt id)`.

One solution is to just use `(job id, task id)` to uniquely identify a 
write task. But we may lose information like the index of this task and how 
many times it has been retried.

Or we can use `(job id, partition id, stage attempt id, task attempt id)`, 
which is a little verbose but has all the information.
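The collision described above can be sketched as follows; the key formats are 
illustrative, not the actual v2 API:

```java
// Two attempts of the same partition, launched in different stage attempts,
// can share the same (job id, partition id, task attempt number) triple,
// but never the same (job id, task id) pair, because task IDs are allocated
// from a global one-up counter.
class WriteTaskKeys {
    static String v2Key(int jobId, int partitionId, int attemptNumber) {
        return jobId + "/" + partitionId + "/" + attemptNumber;
    }

    static String tidKey(int jobId, long taskId) {
        return jobId + "/" + taskId;
    }
}
```

For example, a task left running from stage attempt 0 and its replacement in 
stage attempt 1 both carry attempt number 0, so their v2 keys collide while 
their TID-based keys do not.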


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread squito
Github user squito commented on the issue:

https://github.com/apache/spark/pull/21558
  
IMO your change is the right fix, not just a workaround. I don't think it's 
a scheduler bug (though it's definitely unclear). I'll move that discussion to 
the JIRA.

An alternative would be using `context.stageAttemptNumber()` and 
`context.attemptNumber()` together.


> the stage ID can change for a different execution of the same stage, 
IIRC, and that would reset the attempt id.

hmm, the only place I could imagine that happening is with a shared shuffle 
dependency between jobs, which gets renumbered and then skipped, but then 
perhaps re-executed on a fetch-failure.  That isn't relevant here, though, 
since it would only be shuffle map stages, not result stages.
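The alternative of combining the two numbers could look like this; the 
bit-packing scheme is an assumption for illustration (the PR ended up using 
the task ID instead):

```java
// Imran's alternative, sketched: pack the stage attempt number and the
// task's attempt number into a single long, so the pair stays unique even
// when a stage is retried and attempt numbers restart from 0.
class CombinedAttempt {
    static long combined(int stageAttemptNumber, int taskAttemptNumber) {
        // High 32 bits: stage attempt; low 32 bits: task attempt number.
        return ((long) stageAttemptNumber << 32) | (taskAttemptNumber & 0xFFFFFFFFL);
    }
}
```

With this packing, attempt 0 of a task in stage attempt 0 and attempt 0 of 
the retried task in stage attempt 1 map to different values.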


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread vanzin
Github user vanzin commented on the issue:

https://github.com/apache/spark/pull/21558
  
It's a little weird to use the attempt id here anyway; the stage ID can 
change for a different execution of the same stage, IIRC, and that would reset 
the attempt id.

@squito probably remembers a lot more about this off the top of his head.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/3998/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/21558
  
**[Test build #91794 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/91794/testReport)**
 for PR 21558 at commit 
[`e9e776a`](https://github.com/apache/spark/commit/e9e776a097f5dca1dccdd6e50b3790e6a91873d8).


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/108/
Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/21558
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue:

https://github.com/apache/spark/pull/21558
  
@cloud-fan, this is a workaround for SPARK-24552. I'm not sure of the right 
way to fix this besides fixing the scheduler so that it doesn't use task 
attempt numbers twice, but I think this works.


---
