Nicholas Chammas created SPARK-16921:
----------------------------------------

             Summary: RDD/DataFrame persist() and cache() should return Python context managers
                 Key: SPARK-16921
                 URL: https://issues.apache.org/jira/browse/SPARK-16921
             Project: Spark
          Issue Type: New Feature
          Components: PySpark, Spark Core, SQL
            Reporter: Nicholas Chammas
            Priority: Minor


[Context managers|https://docs.python.org/3/reference/datamodel.html#context-managers] are a natural way to capture closely related setup and teardown code in Python.

For example, they are commonly used when doing file I/O:

{code}
with open('/path/to/file') as f:
    contents = f.read()
    ...
{code}

Once the program exits the {{with}} block, {{f}} is automatically closed.

I think it makes sense to apply this pattern to persisting and unpersisting DataFrames and RDDs. There are many cases where you want to persist a DataFrame for a specific set of operations and then unpersist it immediately afterwards.

For example, take model training. Today, you might do something like this:

{code}
labeled_data.persist()
model = pipeline.fit(labeled_data)
labeled_data.unpersist()
{code}

If {{persist()}} returned a context manager, you could rewrite this as follows:

{code}
with labeled_data.persist():
    model = pipeline.fit(labeled_data)
{code}

Upon exiting the {{with}} block, whether normally or via an exception, {{labeled_data}} would automatically be unpersisted.

This can be done in a backwards-compatible way, since {{persist()}} would still return the parent DataFrame or RDD as it does today, while also adding two methods to the returned object: {{\_\_enter\_\_()}} and {{\_\_exit\_\_()}}.
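
For illustration, here is a minimal, self-contained sketch of what those protocol methods might look like. This is a toy {{DataFrame}} class, not the actual PySpark implementation; the {{print}} calls stand in for the real caching logic:

{code}
# Toy sketch only, not the real PySpark internals.
# persist() returns self as it does today; __enter__/__exit__ add
# the context manager protocol on top of that.

class DataFrame(object):
    def persist(self):
        print('persisting')    # stand-in for the real caching logic
        return self            # unchanged: still returns the parent object

    def unpersist(self):
        print('unpersisting')  # stand-in for the real uncaching logic
        return self

    def __enter__(self):
        # persist() has already run by the time the with block is
        # entered, so there is nothing left to set up.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called whether the block exits normally or via an exception.
        self.unpersist()
        return False  # don't suppress exceptions

labeled_data = DataFrame()
with labeled_data.persist():
    pass  # e.g. model = pipeline.fit(labeled_data)
# labeled_data has been unpersisted at this point
{code}

Returning {{self}} from {{\_\_enter\_\_()}} would also allow the {{with ... as}} form, e.g. {{with labeled_data.persist() as df:}}.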


