Sweet! Does this cover DataFrame#rdd also using the cached query from DataFrame#cache? I think the ticket 9141 is mainly concerned with whether a derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A, not whether the rdd from A.rdd or B.rdd uses the cached query of A. On Tue, Jul 28, 2015 at 11:33 PM Joseph Bradley <jos...@databricks.com> wrote:
> Thanks for bringing this up! I talked with Michael Armbrust, and it > sounds like this is a from a bug in DataFrame caching: > https://issues.apache.org/jira/browse/SPARK-9141 > It's marked as a blocker for 1.5. > Joseph > > On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang <justin.u...@gmail.com> > wrote: > >> Hey guys, >> >> I'm running into some pretty bad performance issues when it comes to >> using a CrossValidator, because of caching behavior of DataFrames. >> >> The root of the problem is that while I have cached my DataFrame >> representing the features and labels, it is caching at the DataFrame level, >> while CrossValidator/LogisticRegression both drop down to the dataset.rdd >> level, which ignores the caching that I have previously done. This is >> worsened by the fact that for each combination of a fold and a param set >> from the grid, it recomputes my entire input dataset because the caching >> was lost. >> >> My current solution is to force the input DataFrame to be based off of a >> cached RDD, which I did with this horrible hack (had to drop down to java >> from the pyspark because of something to do with vectors not be inferred >> correctly): >> >> def checkpoint_dataframe_caching(df): >> return >> DataFrame(sqlContext._ssql_ctx.createDataFrame(df._jdf.rdd().cache(), >> train_data._jdf.schema()), sqlContext) >> >> before I pass it into the CrossValidator.fit(). If I do this, I still >> have to cache the underlying rdd once more than necessary (in addition to >> DataFrame#cache()), but at least in cross validation, it doesn't recompute >> the RDD graph anymore. >> >> Note, that input_df.rdd.cache() doesn't work because the python >> CrossValidator implementation applies some more dataframe transformations >> like filter, which then causes filtered_df.rdd to return a completely >> different rdd that recomputes the entire graph. >> >> Is it the intention of Spark SQL that calling DataFrame#rdd removes any >> caching that was done for the query? Is the fix as simple as getting the >> DataFrame#rdd to reference the cached query, or is there something more >> subtle going on. >> >> Best, >> >> Justin >> > >