Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/22112
  
    
    @tgravescs:
    > The shuffle simply transfers the bytes its supposed to. Sparks shuffle of 
those bytes is not consistent in that the order it fetches from can change and 
without the sort happening on that data the order can be different on rerun. I 
guess maybe you mean the ShuffledRDD as a whole or do you mean something else 
here?
    
    
    By shuffle, I am referring to the output of the shuffle which is consumed 
by an RDD with a `ShuffleDependency` as input.
    More specifically, the output of 
`SparkEnv.get.shuffleManager.getReader(...).read()`, which RDDs (both user 
code and Spark's own implementations) use to fetch the output of the shuffle 
machinery.
    This output is not just the deserialized shuffle bytes; it also has 
aggregation applied (if specified) and ordering imposed (if specified).
    
    ShuffledRDD is one such usage within Spark core, but others exist both 
elsewhere in Spark core and in user code.
    
    > All I'm saying is zip is just another variant of this, you could document 
it as such and do nothing internal to spark to "fix it".
    
    I agree; repartition + shuffle, zip, sample, and the MLlib usages are all 
variants of the same problem: shuffle output order being inconsistent.
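    A small sketch (again plain Python, not Spark code) of why repartition in 
particular turns this into data loss or duplication on a partial rerun: 
round-robin repartition assigns each record to a target partition based on its 
position in the incoming stream, so a different arrival order on a retried 
task sends records to different partitions. The `round_robin` helper is 
hypothetical.

```python
# Hypothetical illustration: position-based (round-robin) partitioning.
def round_robin(records, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

first_run = round_robin([1, 2, 3, 4, 5, 6], 2)  # order seen on first attempt
rerun     = round_robin([2, 1, 4, 3, 5, 6], 2)  # different order on retry

# Partition 0 holds a different set of records across attempts; if only
# partition 0 is recomputed after a failure, some records are silently
# dropped and others duplicated.
assert set(first_run[0]) != set(rerun[0])
```

    This is why failing the stage (as this PR does) is a safe stopgap: a 
partial recomputation cannot be trusted to reproduce the same 
record-to-partition assignment.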
    
    > I guess we can separate out these 2 discussions. I think the point of 
this pr is to temporarily workaround the data loss/corruption issue with 
repartition by failing. So if everyone agrees on that lets move the discussion 
to a jira about what to do with the rest of the operators and fix repartition 
here. thoughts?
    
    Sounds good to me.

