Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/22112

@cloud-fan I think you summarized it nicely, but you keep forgetting about the ResultTask. It's one thing to say this is a temporary workaround and that if a ResultTask fails where we can't rerun all the reduces we just FAIL, but I don't think that should be our long-term solution. That basically means Spark is not resilient to failures.

Basically what I'm saying is: case 3 is handled. Case 2 we fix (however we decide to do that, and eventually it gets a resilient option rather than just failing the entire job, which to me is the short-term workaround). Case 1 is what I think we need clarification on, because to me it is an API-level decision and I don't want to change our minds later and confuse users. My opinion is we either say yes, it's OK to have random data, with the caveat that if things retry you might get different results, or we force it into case 2 or 3. If we say we don't support random operations, do we then have to put the same logic on any operation where user code could make the data random? That seems like a lot of operations and thus could be a big performance penalty when handling failures. I guess there is a third option: we handle it for the APIs that explicitly do this, like zip, but ignore cases where user code could cause it (map, mapPartitions, etc.).

@mridulm thanks for clarifying. I agree that if we have a solution for case 2 it could be used to help with case 1, but is that essentially saying we don't support case 1? I think what you are saying is that for now we document that random data is OK, with the caveat that retries could return different results, but down the road we essentially say it's not supported because we are going to force a sort. I just think that could be confusing to the user.
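To make case 1 concrete, here is a minimal sketch (names and structure are mine, not code from this PR) of the kind of user code whose results can change across task retries:

```scala
import scala.util.Random
import org.apache.spark.sql.SparkSession

// Hypothetical example, not code from this PR: a user-defined
// mapPartitions whose output depends on an unseeded Random. If any
// downstream task fails and this stage is partially recomputed, the
// retried partitions produce different values than the original run,
// so the overall job result is not reproducible.
object RandomRetryExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("random-retry-example").getOrCreate()
    val sc = spark.sparkContext

    val randomized = sc.parallelize(1 to 1000, 8).mapPartitions { iter =>
      val rng = new Random() // fresh, unseeded: differs on every (re)attempt
      iter.map(x => (rng.nextInt(10), x))
    }

    // After the shuffle below, a retry that recomputes only some map
    // partitions mixes "old" and "new" random keys, which is exactly the
    // correctness hazard in case 1.
    val counts = randomized.map { case (k, _) => (k, 1L) }.reduceByKey(_ + _)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```

Nothing in the zip/map/mapPartitions APIs flags this as non-deterministic, which is why detecting it generically would touch so many operations.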