[GitHub] spark issue #22112: [WIP][SPARK-23243][Core] Fix RDD.repartition() data corr...

mridulm Wed, 15 Aug 2018 23:15:28 -0700

Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22112
  
    I am not sure what the definition of `isIdempotent` here is.
    
    For example, from MapPartitionsRDD :
    ```
    override private[spark] def isIdempotent = {
      if (inputOrderSensitive) {
        prev.isIdempotent
      } else {
        true
      }
    }
    ```
    
    Consider:
    `val rdd1 = rdd.groupBy().map(...).repartition(...).filter(...)`.
    By the definition above, `rdd1` would be idempotent.
    Whether that is correct depends on how idempotent is defined (at the partition level, the record level, etc.).
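
    To make the concern concrete, here is a minimal toy model of the quoted rule. It does not use Spark's classes: `Node`, `Source`, and `MapPartitions` are hypothetical stand-ins, and treating the `repartition` output as non-idempotent is an assumption made purely for illustration.

    ```scala
    // Toy model, not Spark's classes: each node records only what the quoted rule needs.
    sealed trait Node { def isIdempotent: Boolean }

    // Hypothetical stand-in for an upstream RDD whose idempotency is given directly.
    final case class Source(isIdempotent: Boolean) extends Node

    // Mirrors the quoted MapPartitionsRDD rule.
    final case class MapPartitions(prev: Node, inputOrderSensitive: Boolean) extends Node {
      def isIdempotent: Boolean = if (inputOrderSensitive) prev.isIdempotent else true
    }

    object IdempotentDemo {
      // Assumption for illustration: the repartition output is not idempotent.
      val repartitioned: Node = Source(isIdempotent = false)

      // filter is not input-order sensitive, so the rule returns true
      // without ever consulting `prev`.
      val filtered: Node = MapPartitions(repartitioned, inputOrderSensitive = false)
    }
    ```

    In this model `IdempotentDemo.filtered.isIdempotent` is `true` even though its parent reports `false`, so the answer for `rdd1` is decided by the last operator alone.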
    
    
    Similarly, I am not sure why idempotency or ordering depends on `Partitioner`.
    IMO we should traverse the dependency graph and rely on how `ShuffledRDD` is configured: whether a key ordering is specified (this applies to both global sort and per-partition sort), whether it comes from a checkpoint or is marked for checkpointing, whether it comes from a stable input source, etc.
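
    A rough sketch of such a traversal, over a toy lineage model rather than Spark's actual RDD and dependency classes. `LineageNode` and all of its fields are hypothetical names. The sketch also assumes that a shuffle with a key ordering yields a repeatable record order, which ignores ties between equal keys.

    ```scala
    // Toy lineage model; the names and fields are hypothetical, not Spark's API.
    final case class LineageNode(
        name: String,
        parents: List[LineageNode] = Nil,
        isShuffle: Boolean = false,
        hasKeyOrdering: Boolean = false,  // shuffle output is sorted by key
        isCheckpointed: Boolean = false,  // checkpointed or marked for checkpoint
        isStableSource: Boolean = false)  // input that replays in the same order

    object LineageCheck {
      // True when recomputing `node` is assumed to yield records in the same order.
      def hasDeterministicOrder(node: LineageNode): Boolean =
        if (node.isCheckpointed || node.isStableSource) true
        else if (node.isShuffle) node.hasKeyOrdering
        else node.parents.nonEmpty && node.parents.forall(hasDeterministicOrder)

      // The chain from the example: groupBy -> map -> repartition -> filter.
      val input = LineageNode("input", isStableSource = true)
      val grouped = LineageNode("groupBy", List(input), isShuffle = true)
      val mapped = LineageNode("map", List(grouped))
      val repartitioned = LineageNode("repartition", List(mapped), isShuffle = true)
      val rdd1 = LineageNode("filter", List(repartitioned))

      // Variants that the traversal treats as repeatable.
      val sorted = LineageNode("sortByKey", List(mapped), isShuffle = true, hasKeyOrdering = true)
      val checkpointed = rdd1.copy(isCheckpointed = true)
    }
    ```

    Here `hasDeterministicOrder(rdd1)` is `false` because the nearest shuffle has no key ordering, while the sorted and checkpointed variants return `true`. The decision comes from the lineage, not from the `Partitioner`.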




