Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22112
  
    > 1. force a sort on these operations
    
    I think this is the most obvious fix, and it's how we fixed 
`Dataset#repartition`. Like we discussed before, it's hard to apply it to RDD, 
because the `T` may not be orderable. Even if it's orderable, the sorter can't 
guarantee the same output order across retries (e.g. when sorting by key only, 
ties can come out in a different order). I think forcing a sort is only 
applicable in Spark SQL, where the data has a deterministic serialization 
format that we can sort by, so the sorter always produces the same output 
order.
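    To make the problem concrete, here is a hypothetical sketch in plain 
Python (illustrative only, not Spark code): round-robin assignment is unstable 
when a retried task sees its input in a different order, and sorting by a 
deterministic byte encoding first restores stability.

    ```python
    # Hypothetical sketch: why round-robin repartition is unstable under
    # retries, and how a sort by deterministic serialized bytes fixes it.
    def round_robin(partition, num_partitions, sort_first=False):
        """Assign one input partition's elements to output partitions."""
        if sort_first:
            # Sort by a deterministic byte encoding of each element, so the
            # result no longer depends on the order the elements arrived in.
            partition = sorted(partition, key=lambda x: str(x).encode("utf-8"))
        out = {p: [] for p in range(num_partitions)}
        for i, elem in enumerate(partition):
            out[i % num_partitions].append(elem)
        return out

    # A task retry may see the same elements in a different order:
    run1 = round_robin(["a", "c", "b"], 2)
    run2 = round_robin(["b", "a", "c"], 2)
    assert run1 != run2  # unstable: elements land in different partitions

    # With the sort, both runs produce identical output partitions:
    assert round_robin(["a", "c", "b"], 2, sort_first=True) == \
           round_robin(["b", "a", "c"], 2, sort_first=True)
    ```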
    
    > 2. do nothing and require users to sort or handle someway (checkpoint) if 
they care.
    
    IIRC @mridulm didn't agree with it. One problem is that it's hard for 
users to realize that Spark returned wrong results, so they don't know when to 
handle it.
    
    I'm proposing an option 3:
    Retry all the tasks of all the succeeding stages if a stage with 
repartition/zip fails. Every RDD action should tell Spark whether it's 
"rollback-able", which becomes a property of the result stage. When we retry a 
result stage that already has some tasks finished: if the stage is 
"rollback-able" (e.g. `collect`), retry all of its tasks; if it's not 
"rollback-able", fail the job with an error message asking users to either 
turn on the config that forces a sort (e.g. 
`spark.sql.execution.sortBeforeRepartition`), or checkpoint the RDD before 
repartition/zip.
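    The decision rule above can be sketched as follows (a hypothetical plain-
Python sketch; the names are illustrative and not Spark's DAGScheduler API):

    ```python
    # Hypothetical sketch of the proposed option 3: when a parent
    # repartition/zip stage was recomputed, either retry every task of the
    # result stage or fail the job with actionable advice.
    def on_parent_recomputed(rollbackable, finished_tasks):
        if finished_tasks == 0 or rollbackable:
            # e.g. `collect`: partial results can be discarded and recomputed
            return "retry-all-tasks"
        # Output was already partially committed and can't be rolled back.
        return ("fail: enable spark.sql.execution.sortBeforeRepartition "
                "or checkpoint the RDD before repartition/zip")
    ```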
    
    
    This PR doesn't add an API to let RDD actions mark themselves as 
"rollback-able", and assumes none of them are. I'm OK with it if you insist 
that it should be done here.

