If I'm understanding you correctly, you're saying that the problem is that
you know which DataFrames you want to keep, not which ones you want to throw
away, and that there is no unpersist call that works from that what-to-keep
information.
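
The closest approximation I can think of with just the public API (assuming
the usual spark session object is in scope) is to clear the whole SQL cache
and then re-mark the DataFrames you still need. A rough sketch, not an
existing Spark call, and note the kept DataFrames lose their materialized
data and get recomputed the next time an action touches them:

    def unpersist_all_except(spark, keep):
        # Drop every cached DataFrame/table the SQL catalog knows about.
        spark.catalog.clearCache()
        # Re-mark only the DataFrames we still care about; they will be
        # recomputed and re-cached on their next action.
        for df in keep:
            df.persist()

    # e.g. unpersist_all_except(spark, [important_df])
    # (important_df is just a placeholder name)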

On Tue, May 8, 2018 at 6:00 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> I certainly can, but the problem I’m facing is that of how best to track
> all the DataFrames I no longer want to persist.
>
> I create and persist various DataFrames throughout my pipeline. Spark is
> already tracking all this for me, and exposing some of that tracking
> information via getPersistentRDDs(). So when I arrive at a point in my
> program where I know, “I only need this DataFrame going forward”, I want to
> be able to tell Spark “Please unpersist everything except this one
> DataFrame”. If I cannot leverage the information about persisted DataFrames
> that Spark is already tracking, then the alternative is for me to carefully
> track and unpersist DataFrames when I no longer need them.
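>
> A minimal sketch of that manual bookkeeping, using only the public
> persist()/unpersist() calls (the helper names here are made up, nothing
> below is a Spark API):
>
>     _persisted = []
>
>     def persist_tracked(df):
>         # Persist as usual, but remember the DataFrame so it can be
>         # freed later.
>         df.persist()
>         _persisted.append(df)
>         return df
>
>     def unpersist_except(keep):
>         # Unpersist every tracked DataFrame that is not in the keep
>         # list, comparing by Python object identity.
>         keep_ids = {id(df) for df in keep}
>         for df in list(_persisted):
>             if id(df) not in keep_ids:
>                 df.unpersist()
>                 _persisted.remove(df)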
>
> I suppose the problem is similar at a high level to garbage collection.
> Tracking and freeing DataFrames manually is analogous to malloc and free;
> and full automation would be Spark automatically unpersisting DataFrames
> when they were no longer referenced or needed. I’m looking for an
> in-between solution that lets me leverage some of the persistence tracking
> in Spark so I don’t have to do it all myself.
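>
> For what it's worth, the map that getPersistentRDDs() returns can also be
> driven from PySpark through py4j, but only bluntly. The untested sketch
> below (it relies on the private _jsc handle) unpersists every RDD Spark is
> tracking, and as far as I can tell it bypasses the SQL-side cache
> bookkeeping, so it still isn't the keep-these-only call I'm asking for:
>
>     # Free every persisted RDD the SparkContext is currently tracking,
>     # regardless of which DataFrame it backs.
>     for rdd_id, java_rdd in spark.sparkContext._jsc.getPersistentRDDs().items():
>         java_rdd.unpersist()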
>
> Does this make more sense now, from a use case perspective as well as from
> a desired API perspective?
>
> On Thu, May 3, 2018 at 10:26 PM Reynold Xin <r...@databricks.com> wrote:
>
>> Why do you need the underlying RDDs? Can't you just unpersist the
>> dataframes that you don't need?
>>
>>
>> On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> This seems to be an underexposed part of the API. My use case is this: I
>>> want to unpersist all DataFrames except a specific few. I want to do this
>>> because I know at a specific point in my pipeline that I have a handful of
>>> DataFrames that I need, and everything else is no longer needed.
>>>
>>> The problem is that there doesn’t appear to be a way to identify
>>> specific DataFrames (or rather, their underlying RDDs) via
>>> getPersistentRDDs(), which is the only way I’m aware of to ask Spark
>>> for all currently persisted RDDs:
>>>
>>> >>> a = spark.range(10).persist()
>>> >>> a.rdd.id()
>>> 8
>>> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
>>> [(3, JavaObject id=o36)]
>>>
>>> As you can see, the id of the persisted RDD, 8, doesn’t match the id
>>> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
>>> returned by getPersistentRDDs() and know which ones I want to keep.
>>>
>>> id() itself appears to be an undocumented method of the RDD API, and in
>>> PySpark getPersistentRDDs() is buried behind the Java sub-objects
>>> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
>>> reaching here. But is there a way to do what I want in PySpark without
>>> manually tracking everything I’ve persisted myself?
>>>
>>> And more broadly speaking, do we want to add additional APIs, or
>>> formalize currently undocumented APIs like id(), to make this use case
>>> possible?
>>>
>>> Nick
>>>
>>
