Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
That’s correct. I probably would have done better to title this thread something like “How to effectively track and release persisted DataFrames”. I jumped the gun in my initial email by referencing getPersistentRDDs() as a potential solution, but in theory the desired API is something like spark.…
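
For illustration, a minimal sketch (Scala, names hypothetical, not a Spark API) of the kind of track-and-release plumbing that has to be written by hand today: persist through a small registry, then release everything except a keep-set.

    import org.apache.spark.sql.DataFrame
    import scala.collection.mutable

    // Hypothetical helper, not part of Spark: keep our own registry of the
    // DataFrames we persist, so we can later release everything except a
    // chosen few. DataFrame comparisons here are reference equality, which
    // is what we want since we hold the exact objects we persisted.
    object PersistRegistry {
      private val persisted = mutable.Set.empty[DataFrame]

      def persistTracked(df: DataFrame): DataFrame = {
        df.persist()
        persisted += df
        df
      }

      def unpersistAllExcept(keep: Set[DataFrame]): Unit = {
        val toRelease = persisted.toSet.diff(keep)
        toRelease.foreach(_.unpersist())
        persisted --= toRelease
      }
    }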

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Mark Hamstra
If I am understanding you correctly, you're just saying that the problem is that you know what you want to keep, not what you want to throw away, and that there is no unpersist DataFrames call based on that what-to-keep information. …

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-08 Thread Nicholas Chammas
I certainly can, but the problem I’m facing is how best to track all the DataFrames I no longer want to persist. I create and persist various DataFrames throughout my pipeline. Spark is already tracking all this for me, and exposing some of that tracking information via getPersistentRDDs()…
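
For reference, a sketch of what that RDD-level tracking exposes (Scala, assuming a SparkSession in scope as spark). The entries behind cached DataFrames are Spark's internal RDDs, which is why tying them back to specific DataFrame variables is not straightforward.

    // List the RDDs the SparkContext currently considers persisted. For a
    // cached DataFrame this surfaces the internal RDD behind its cached plan,
    // not the DataFrame itself.
    spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"id=$id name=${rdd.name} storage=${rdd.getStorageLevel.description}")
    }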

Re: Identifying specific persisted DataFrames via getPersistentRDDs()

2018-05-03 Thread Reynold Xin
Why do you need the underlying RDDs? Can't you just unpersist the DataFrames that you don't need? …
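
A sketch of that suggestion as it applies when you still hold references to the DataFrames in question (stagingDf is a hypothetical name):

    // Release a specific cached DataFrame you still have a handle on.
    stagingDf.unpersist()

    // Or, when nothing needs to stay cached at all, clear every cached
    // table and DataFrame in one call.
    spark.catalog.clearCache()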

Identifying specific persisted DataFrames via getPersistentRDDs()

2018-04-30 Thread Nicholas Chammas
This seems to be an underexposed part of the API. My use case is this: I want to unpersist all DataFrames except a specific few. I want to do this because I know at a specific point in my pipeline that I have a handful of DataFrames that I need, and everything else is no longer needed. The problem…
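
The closest blunt instrument getPersistentRDDs() offers is sketched below (Scala); it releases everything, and "everything except these few DataFrames" is the part that isn't readily expressible, since the persisted entries don't map back cleanly to the DataFrames they came from.

    // Unpersist every RDD the SparkContext is tracking. Fine when nothing
    // needs to survive, but there is no obvious, supported way to carve out
    // exceptions for a handful of DataFrames you still need.
    spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist())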