I think it limits the usability of the with statement, and it could be somewhat
confusing because of this, so I would mention it in the docs.
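
For example (a hypothetical sketch of the proposed API), the caching below is
wasted because no action runs inside the with block; the data is unpersisted
before anything is ever materialized:

with labeled_data.persist():
    # transformations are lazy, so nothing is computed or cached here
    features = labeled_data.select('features')

# labeled_data is already unpersisted, so this action computes from scratch
features.count()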

I like the idea, though.

On Fri, Aug 5, 2016 at 7:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Good point.
>
> Do you think it's sufficient to note this somewhere in the documentation
> (or simply assume that user understanding of transformations vs. actions
> means they know this), or are there other implications that need to be
> considered?
>
> On Fri, Aug 5, 2016 at 6:50 PM Koert Kuipers <ko...@tresata.com> wrote:
>
>> The tricky part is that the action needs to be inside the with block, not
>> just the transformation that uses the persisted data.
>>
>> On Aug 5, 2016 1:44 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com>
>> wrote:
>>
>> Okie doke, I've filed a JIRA for this here:
>> https://issues.apache.org/jira/browse/SPARK-16921
>>
>> On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Sounds like a great idea!
>>>
>>> On Friday, August 5, 2016, Nicholas Chammas <nicholas.cham...@gmail.com>
>>> wrote:
>>>
>>>> Context managers
>>>> <https://docs.python.org/3/reference/datamodel.html#context-managers>
>>>> are a natural way to capture closely related setup and teardown code in
>>>> Python.
>>>>
>>>> For example, they are commonly used when doing file I/O:
>>>>
>>>> with open('/path/to/file') as f:
>>>>     contents = f.read()
>>>>     ...
>>>>
>>>> Once the program exits the with block, f is automatically closed.
>>>>
>>>> Does it make sense to apply this pattern to persisting and unpersisting
>>>> DataFrames and RDDs? I feel like there are many cases when you want to
>>>> persist a DataFrame for a specific set of operations and then unpersist it
>>>> immediately afterwards.
>>>>
>>>> For example, take model training. Today, you might do something like
>>>> this:
>>>>
>>>> labeled_data.persist()
>>>> model = pipeline.fit(labeled_data)
>>>> labeled_data.unpersist()
>>>>
>>>> If persist() returned a context manager, you could rewrite this as
>>>> follows:
>>>>
>>>> with labeled_data.persist():
>>>>     model = pipeline.fit(labeled_data)
>>>>
>>>> Upon exiting the with block, labeled_data would automatically be
>>>> unpersisted.
>>>>
>>>> This can be done in a backwards-compatible way, since persist() would
>>>> still return the parent DataFrame or RDD as it does today, but would
>>>> also add two methods to the object: __enter__() and __exit__().
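>>>>
>>>> A minimal sketch of what those two methods might look like (hypothetical
>>>> illustration only, not actual Spark code):
>>>>
>>>> def __enter__(self):
>>>>     # persist() has already been called, so just return the object
>>>>     return self
>>>>
>>>> def __exit__(self, exc_type, exc_value, traceback):
>>>>     # always unpersist on exit, even if the block raised an exception
>>>>     self.unpersist()
>>>>     return False  # do not suppress exceptions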
>>>>
>>>> Does this make sense? Is it attractive?
>>>>
>>>> Nick
>>>>
>>>
>>
