Why do you need the underlying RDDs? Can't you just unpersist the dataframes that you don't need?
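For illustration only, here is a minimal PySpark sketch of that direct approach, assuming the pipeline still holds references to the DataFrames it persisted (all names below are hypothetical):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical pipeline stages; only `features` is still needed downstream.
    raw = spark.range(100).persist()
    cleaned = raw.filter("id % 2 = 0").persist()
    features = cleaned.selectExpr("id", "id * 2 AS doubled").persist()

    features.count()  # an action is needed before anything is actually cached

    # Drop the caches that are no longer needed, by reference, without ever
    # looking at RDD ids or getPersistentRDDs().
    raw.unpersist()
    cleaned.unpersist()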
On Mon, Apr 30, 2018 at 8:17 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> This seems to be an underexposed part of the API. My use case is this: I
> want to unpersist all DataFrames except a specific few. I want to do this
> because I know at a specific point in my pipeline that I have a handful of
> DataFrames that I need, and everything else is no longer needed.
>
> The problem is that there doesn’t appear to be a way to identify specific
> DataFrames (or rather, their underlying RDDs) via getPersistentRDDs(),
> which is the only way I’m aware of to ask Spark for all currently
> persisted RDDs:
>
> >>> a = spark.range(10).persist()
> >>> a.rdd.id()
> 8
> >>> list(spark.sparkContext._jsc.getPersistentRDDs().items())
> [(3, JavaObject id=o36)]
>
> As you can see, the id of the persisted RDD, 8, doesn’t match the id
> returned by getPersistentRDDs(), 3. So I can’t go through the RDDs
> returned by getPersistentRDDs() and know which ones I want to keep.
>
> id() itself appears to be an undocumented method of the RDD API, and in
> PySpark getPersistentRDDs() is buried behind the Java sub-objects
> <https://issues.apache.org/jira/browse/SPARK-2141>, so I know I’m
> reaching here. But is there a way to do what I want in PySpark without
> manually tracking everything I’ve persisted myself?
>
> And more broadly speaking, do we want to add additional APIs, or
> formalize currently undocumented APIs like id(), to make this use case
> possible?
>
> Nick
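If manual tracking does turn out to be the only way to express "unpersist everything except these few", one possible shape for it is a small registry wrapper around persist(). This is only a sketch, and every name in it is hypothetical rather than an existing Spark API:

    # Hypothetical helper: persist DataFrames through a registry so they can
    # later be unpersisted selectively, without touching getPersistentRDDs().
    class PersistRegistry:
        def __init__(self):
            self._dfs = {}

        def persist(self, name, df):
            self._dfs[name] = df.persist()
            return self._dfs[name]

        def unpersist_all_except(self, keep):
            for name, df in list(self._dfs.items()):
                if name not in keep:
                    df.unpersist()
                    del self._dfs[name]

    # Usage sketch:
    #   registry = PersistRegistry()
    #   raw = registry.persist("raw", spark.range(100))
    #   features = registry.persist("features", raw.selectExpr("id * 3 AS tripled"))
    #   registry.unpersist_all_except({"features"})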