GitHub user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21606#discussion_r197309248
  
    --- Diff: core/src/main/scala/org/apache/spark/internal/io/SparkHadoopWriter.scala ---
    @@ -76,13 +76,29 @@ object SparkHadoopWriter extends Logging {
         // Try to write all RDD partitions as a Hadoop OutputFormat.
         try {
           val ret = sparkContext.runJob(rdd, (context: TaskContext, iter: Iterator[(K, V)]) => {
    +        // Generate a positive integer task ID that is unique for the current stage. This makes a
    +        // few assumptions:
    +        // - the task ID is always positive
    +        // - stages cannot have more than Int.MaxValue tasks
    +        // - the sum of task counts of all active stages doesn't exceed Int.MaxValue
    +        //
    +        // The first two are currently the case in Spark, while the last one is very unlikely to
    +        // occur. If it does, two task IDs on a single stage could have a clashing integer value,
    +        // which could lead to code that generates clashing file names for different tasks. Still,
    +        // if the commit coordinator is enabled, only one task would be allowed to commit.
    --- End diff --
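
    For illustration only (hypothetical values, not from the PR): if that last assumption were ever violated, two distinct `Long` task IDs could truncate to the same `Int`:
    ```
    // Two distinct Long task IDs that collide after truncation to Int,
    // because toInt keeps only the lower 32 bits.
    val a: Long = 1L
    val b: Long = 1L + (1L << 32)
    assert(a != b)
    assert(a.toInt == b.toInt) // both evaluate to 1
    ```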
    
    since it's not a simple `toInt` anymore, how about we combine the stage and task attempt numbers?
    ```
    val stageAttemptNumber = ...
    val taskAttemptNumber = ...
    assert(stageAttemptNumber <= Short.MaxValue)
    assert(taskAttemptNumber <= Short.MaxValue)
    val sparkAttemptNumber = (stageAttemptNumber << 16) | taskAttemptNumber
    ```
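
    As a quick sanity check of this packing (a minimal sketch with hypothetical values, not part of the suggestion above): the stage attempt occupies the upper 16 bits and the task attempt the lower 16, so both components can be recovered and distinct pairs never collide:
    ```
    // Round-trip check for the 16-bit packing scheme (hypothetical values).
    val stageAttemptNumber = 3
    val taskAttemptNumber = 7
    val sparkAttemptNumber = (stageAttemptNumber << 16) | taskAttemptNumber
    assert((sparkAttemptNumber >>> 16) == stageAttemptNumber) // upper 16 bits
    assert((sparkAttemptNumber & 0xFFFF) == taskAttemptNumber) // lower 16 bits
    ```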

