Nicholas Chammas created SPARK-16921:
----------------------------------------

             Summary: RDD/DataFrame persist() and cache() should return Python context managers
                 Key: SPARK-16921
                 URL: https://issues.apache.org/jira/browse/SPARK-16921
             Project: Spark
          Issue Type: New Feature
          Components: PySpark, Spark Core, SQL
            Reporter: Nicholas Chammas
            Priority: Minor


[Context managers|https://docs.python.org/3/reference/datamodel.html#context-managers] are a natural way to capture closely related setup and teardown code in Python. For example, they are commonly used for file I/O:

{code}
with open('/path/to/file') as f:
    contents = f.read()
    ...
{code}

Once the program exits the {{with}} block, {{f}} is automatically closed.

I think it makes sense to apply this pattern to persisting and unpersisting DataFrames and RDDs. There are many cases where you want to persist a DataFrame for a specific set of operations and then unpersist it immediately afterwards. Take model training, for example. Today, you might do something like this:

{code}
labeled_data.persist()
model = pipeline.fit(labeled_data)
labeled_data.unpersist()
{code}

If {{persist()}} returned a context manager, you could rewrite this as follows:

{code}
with labeled_data.persist():
    model = pipeline.fit(labeled_data)
{code}

Upon exiting the {{with}} block, {{labeled_data}} would automatically be unpersisted. This can be done in a backwards-compatible way, since {{persist()}} would still return the parent DataFrame or RDD as it does today, while also adding two methods to the object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
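A minimal sketch of the proposed protocol, using a hypothetical stand-in class rather than the real PySpark {{RDD}}/{{DataFrame}} (the actual change would live on those classes); it shows how {{persist()}} can keep returning the parent object while also making it usable in a {{with}} statement:

```python
class Persistable:
    """Hypothetical stand-in for an RDD/DataFrame under this proposal."""

    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self  # still returns the parent object, as today

    def unpersist(self):
        self.persisted = False
        return self

    # The two methods the proposal would add to the object:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False  # do not suppress exceptions raised in the block


labeled_data = Persistable()
with labeled_data.persist():
    # the data stays persisted for the duration of the block
    assert labeled_data.persisted
# on exit, __exit__() called unpersist() automatically
assert not labeled_data.persisted
```

Because {{\_\_exit\_\_()}} runs even when the block raises, the data is unpersisted on both the success and failure paths, which the manual persist/unpersist pattern does not guarantee.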