Github user nchammas commented on the issue:

https://github.com/apache/spark/pull/14579

Ah, I see. I don't fully understand how `PipelinedRDD` works or how it is used, so I'll have to defer to y'all on this. Does the `cached()` utility method have this same problem?

> We could possibly work around it with some type checking etc but it then starts to feel like adding more complexity than the feature is worth...

Agreed. At this point, actually, I'm beginning to feel this feature is not worth it.

Context managers seem to work best when the objects they operate on have clear open/close-style semantics. File handles, network connections, and the like fit this pattern well. In fact, the [doc for `with`](https://docs.python.org/3/reference/compound_stmts.html#the-with-statement) says:

> This allows common `try...except...finally` usage patterns to be encapsulated for convenient reuse.

RDDs and DataFrames, on the other hand, don't have a simple open/close or `try...except...finally` pattern. When we try to map one onto persist and unpersist, we get the various side effects we've been discussing here.
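To illustrate the mismatch being discussed: a persist/unpersist context manager would have to look roughly like the sketch below. The `persisted` helper and the `FakeRDD` stand-in are hypothetical (not PySpark API); real RDDs expose `persist()`/`unpersist()` with the same shape, but unpersisting on exit silently affects anything derived from the RDD that is consumed after the `with` block.

```python
from contextlib import contextmanager

# Hypothetical stand-in for an RDD-like object, so this sketch runs
# without Spark. Real PySpark RDDs have persist()/unpersist() too.
class FakeRDD:
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self

@contextmanager
def persisted(rdd):
    """Hypothetical context manager: persist on entry, unpersist on exit."""
    try:
        yield rdd.persist()
    finally:
        rdd.unpersist()

rdd = FakeRDD()
with persisted(rdd) as r:
    assert r.persisted  # cached while inside the block
# The side effect under discussion: once the block exits, anything
# built on `r` but evaluated later has lost its cached parent.
assert not rdd.persisted
```

The `try...finally` here is exactly the pattern the `with` docs describe; the trouble is that "exit the block" and "done using the cached data" don't coincide for lazily evaluated RDDs.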