That’s correct. I probably would have done better to title this thread
something like “How to effectively track and release persisted DataFrames”.
I jumped the gun in my initial email by referencing getPersistentRDDs() as
a potential solution, but in theory the desired API is something like
spark.
If I am understanding you correctly, you're just saying that the problem is
that you know what you want to keep, not what you want to throw away, and
that there is no unpersist DataFrames call based on that what-to-keep
information.
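
For illustration, here is a minimal sketch of the what-to-keep approach in Python. It is an assumption-laden workaround, not an existing Spark API: PersistTracker and unpersist_all_except are hypothetical names, and the only calls made on each DataFrame are persist() and unpersist(). It tracks DataFrames by Python object identity, so a DataFrame persisted outside the tracker is invisible to it.

```python
class PersistTracker:
    """Hypothetical helper: remembers every DataFrame persisted through it."""

    def __init__(self):
        # Keep a list and compare by identity, rather than relying on
        # DataFrame objects being hashable or comparable.
        self._tracked = []

    def persist(self, df):
        """Persist df, remember it, and return it so calls can be chained."""
        df.persist()
        if not any(df is tracked for tracked in self._tracked):
            self._tracked.append(df)
        return df

    def unpersist_all_except(self, keep=()):
        """Unpersist every tracked DataFrame that is not in keep.

        Returns the list of DataFrames that were released.
        """
        keep_ids = {id(df) for df in keep}
        released = [df for df in self._tracked if id(df) not in keep_ids]
        for df in released:
            df.unpersist()
        self._tracked = [df for df in self._tracked if id(df) in keep_ids]
        return released


# Usage sketch (assumes an existing SparkSession named `spark`):
#   tracker = PersistTracker()
#   a = tracker.persist(spark.range(10))
#   b = tracker.persist(spark.range(20))
#   tracker.unpersist_all_except(keep=[b])   # releases a, keeps b
```

The trade-off is that every persist call in the pipeline has to go through the tracker, which is exactly the bookkeeping Spark already does internally.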
On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas wrote:
I certainly can, but the problem I’m facing is that of how best to track
all the DataFrames I no longer want to persist.
I create and persist various DataFrames throughout my pipeline. Spark is
already tracking all this for me, and exposing some of that tracking
information via getPersistentRDDs()
Why do you need the underlying RDDs? Can't you just unpersist the
dataframes that you don't need?
On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas
wrote:
This seems to be an underexposed part of the API. My use case is this: I
want to unpersist all DataFrames except a specific few. I want to do this
because I know at a specific point in my pipeline that I have a handful of
DataFrames that I need, and everything else is no longer needed.
The problem