Re: PySpark: Make persist() return a context manager

Koert Kuipers Fri, 05 Aug 2016 15:51:15 -0700

The tricky part is that the action needs to be inside the with block, not
just the transformation that uses the persisted data.
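
To illustrate why, here is a minimal sketch using pure-Python stand-ins rather than real Spark objects. FakeLazyData, FakePendingResult, and the persisted() helper are all hypothetical names made up for this example; they only mimic the fact that transformations are lazy and that the cache is consulted when an action runs.

```python
from contextlib import contextmanager


class FakeLazyData:
    """Hypothetical stand-in for a DataFrame/RDD; not a Spark class."""

    def __init__(self, rows):
        self.rows = rows
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        return self

    def unpersist(self):
        self.is_cached = False
        return self

    def map(self, fn):
        # Transformation: only records the work, nothing is computed yet.
        return FakePendingResult(self, fn)


class FakePendingResult:
    """Hypothetical stand-in for the lazy result of a transformation."""

    def __init__(self, parent, fn):
        self.parent = parent
        self.fn = fn

    def collect(self):
        # Action: the parent is read now, so its cache state at this
        # moment is what counts.
        used_cache = self.parent.is_cached
        return [self.fn(row) for row in self.parent.rows], used_cache


@contextmanager
def persisted(data):
    # Assumed helper that mimics the proposed behaviour of persist().
    data.persist()
    try:
        yield data
    finally:
        data.unpersist()


data = FakeLazyData([1, 2, 3])

# Transformation inside the block, action outside: the cache is gone
# by the time the work actually runs.
with persisted(data):
    doubled = data.map(lambda x: x * 2)
late_rows, late_used_cache = doubled.collect()

# Transformation and action both inside the block: the cache is used.
with persisted(data):
    early_rows, early_used_cache = data.map(lambda x: x * 2).collect()
```

Both calls return the same rows, but only the second one runs while the data is still persisted.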


On Aug 5, 2016 1:44 PM, "Nicholas Chammas" <[email protected]>
wrote:

Okie doke, I've filed a JIRA for this here:
https://issues.apache.org/jira/browse/SPARK-16921

On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin <[email protected]> wrote:

> Sounds like a great idea!
>
> On Friday, August 5, 2016, Nicholas Chammas <[email protected]>
> wrote:
>
>> Context managers
>> <https://docs.python.org/3/reference/datamodel.html#context-managers>
>> are a natural way to capture closely related setup and teardown code in
>> Python.
>>
>> For example, they are commonly used when doing file I/O:
>>
>> with open('/path/to/file') as f:
>>     contents = f.read()
>>     ...
>>
>> Once the program exits the with block, f is automatically closed.
>>
>> Does it make sense to apply this pattern to persisting and unpersisting
>> DataFrames and RDDs? I feel like there are many cases when you want to
>> persist a DataFrame for a specific set of operations and then unpersist it
>> immediately afterwards.
>>
>> For example, take model training. Today, you might do something like this:
>>
>> labeled_data.persist()
>> model = pipeline.fit(labeled_data)
>> labeled_data.unpersist()
>>
>> If persist() returned a context manager, you could rewrite this as
>> follows:
>>
>> with labeled_data.persist():
>>     model = pipeline.fit(labeled_data)
>>
>> Upon exiting the with block, labeled_data would automatically be
>> unpersisted.
>>
>> This can be done in a backwards-compatible way since persist() would
>> still return the parent DataFrame or RDD as it does today, but add two
>> methods to the object: __enter__() and __exit__().
>>
>> Does this make sense? Is it attractive?
>>
>> Nick
>> 
>>
>
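
For reference, a rough sketch of what the two extra methods from the original proposal could look like. This is not the PySpark implementation; ContextManagedPersistence and FakeDataFrame are hypothetical names, and the fake class only tracks a flag so the example runs without Spark.

```python
class ContextManagedPersistence:
    """Hypothetical mixin adding with-statement support to an object
    that already has an unpersist() method."""

    def __enter__(self):
        # persist() has already been called by the time the with
        # statement invokes __enter__, so there is nothing to set up.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.unpersist()
        return False  # never swallow exceptions raised in the block


class FakeDataFrame(ContextManagedPersistence):
    """Stand-in for a DataFrame; only records whether it is cached."""

    def __init__(self):
        self.is_cached = False

    def persist(self):
        self.is_cached = True
        return self  # same return value as today, so old code still works

    def unpersist(self):
        self.is_cached = False
        return self


labeled_data = FakeDataFrame()

# Proposed style: unpersist happens automatically on exit.
with labeled_data.persist() as cached:
    inside_block = cached.is_cached
after_block = labeled_data.is_cached

# Existing style keeps working unchanged.
labeled_data.persist()
labeled_data.unpersist()
```

One thing this sketch makes visible: because persist() returns the object itself, writing "with labeled_data:" without calling persist() would also trigger unpersist() on exit. Whether that matters is a design question for the real change.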
