I think it limits the usability of the with statement, and it could be somewhat confusing because of this, so I would mention it in the docs.
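For example, under the proposal something like this looks right but never actually benefits from the cache, because the action only runs after the block has already unpersisted the data (assuming the proposed __enter__/__exit__ support; the column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    labeled_data = spark.range(1000).withColumnRenamed("id", "label")

    with labeled_data.persist():
        # Only a transformation happens inside the block; Spark is lazy,
        # so no job runs and nothing is materialized into the cache yet.
        doubled = labeled_data.selectExpr("label * 2 AS doubled")
    # __exit__ has already called labeled_data.unpersist() at this point.

    doubled.count()  # the action runs here, after the cache is gone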
I like the idea though.

On Fri, Aug 5, 2016 at 7:04 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Good point.
>
> Do you think it's sufficient to note this somewhere in the documentation
> (or simply assume that user understanding of transformations vs. actions
> means they know this), or are there other implications that need to be
> considered?
>
> On Fri, Aug 5, 2016 at 6:50 PM Koert Kuipers <ko...@tresata.com> wrote:
>
>> The tricky part is that the action needs to be inside the with block, not
>> just the transformation that uses the persisted data.
>>
>> On Aug 5, 2016 1:44 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com> wrote:
>>
>> Okie doke, I've filed a JIRA for this here:
>> https://issues.apache.org/jira/browse/SPARK-16921
>>
>> On Fri, Aug 5, 2016 at 2:08 AM Reynold Xin <r...@databricks.com> wrote:
>>
>>> Sounds like a great idea!
>>>
>>> On Friday, August 5, 2016, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Context managers
>>>> <https://docs.python.org/3/reference/datamodel.html#context-managers>
>>>> are a natural way to capture closely related setup and teardown code in
>>>> Python.
>>>>
>>>> For example, they are commonly used when doing file I/O:
>>>>
>>>>     with open('/path/to/file') as f:
>>>>         contents = f.read()
>>>>         ...
>>>>
>>>> Once the program exits the with block, f is automatically closed.
>>>>
>>>> Does it make sense to apply this pattern to persisting and unpersisting
>>>> DataFrames and RDDs? I feel like there are many cases when you want to
>>>> persist a DataFrame for a specific set of operations and then unpersist
>>>> it immediately afterwards.
>>>>
>>>> For example, take model training. Today, you might do something like
>>>> this:
>>>>
>>>>     labeled_data.persist()
>>>>     model = pipeline.fit(labeled_data)
>>>>     labeled_data.unpersist()
>>>>
>>>> If persist() returned a context manager, you could rewrite this as
>>>> follows:
>>>>
>>>>     with labeled_data.persist():
>>>>         model = pipeline.fit(labeled_data)
>>>>
>>>> Upon exiting the with block, labeled_data would automatically be
>>>> unpersisted.
>>>>
>>>> This can be done in a backwards-compatible way since persist() would
>>>> still return the parent DataFrame or RDD as it does today, but add two
>>>> methods to the object: __enter__() and __exit__().
>>>>
>>>> Does this make sense? Is it attractive?
>>>>
>>>> Nick
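For anyone who wants to play with the idea before it lands, here is a minimal sketch of the pattern as a user-level monkey-patch (illustrative only, not the actual pyspark change; the real version would add __enter__/__exit__ to DataFrame and RDD directly):

    from pyspark.sql import DataFrame

    def _df_enter(self):
        # `with df.persist():` has already called persist() by the time
        # __enter__ runs, so there is nothing left to set up.
        return self

    def _df_exit(self, exc_type, exc_value, traceback):
        # Always unpersist on the way out, even if the block raised.
        self.unpersist()
        return False  # do not suppress exceptions

    DataFrame.__enter__ = _df_enter
    DataFrame.__exit__ = _df_exit

With that in place, labeled_data.persist() can be used directly in a with statement as in the example above, and the unpersist happens even if pipeline.fit() throws.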