We will try to address this before Spark 1.5 is released: https://issues.apache.org/jira/browse/SPARK-9141
On Tue, Jul 28, 2015 at 11:50 AM, Kristina Rogale Plazonic <kpl...@gmail.com> wrote:
> Hi,
>
> I'm puzzling over the following problem: when I cache a small sample of a
> big dataframe, the small dataframe is recomputed when selecting a column
> (but not if show() or count() is invoked).
>
> Why is that so, and how can I avoid recomputation of the small sample
> dataframe?
>
> More details:
>
> - I have a big dataframe "df" of ~190 million rows and ~10 columns,
> obtained via 3 different joins; I cache it and invoke count() to make sure
> it really is in memory, and confirm this in the web UI.
>
> - val sdf = df.sample(false, 1e-6); sdf.cache(); sdf.count()  // 170
> rows; caching is also confirmed in the web UI, size in memory is 150kB
>
> *- sdf.select("colname").show()  // this triggers a complete
> recomputation of sdf with 3 joins!*
>
> - show(), count() or take() do not trigger the recomputation of the 3
> joins, but select(), collect() or withColumn() do.
>
> I have --executor-memory 30G --driver-memory 10g, so memory is not a
> problem. I'm using Spark 1.4.0. Could anybody shed some light on this, or
> point me to where I can find more info?
>
> Many thanks,
> Kristina
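
The steps Kristina describes can be sketched as follows (a minimal sketch only: it assumes a Spark 1.4-era session and an existing DataFrame `df` built from 3 joins; the column name "colname" is a placeholder from her mail, and the recomputation comment restates her observation, not guaranteed behavior on other versions):

```scala
// Minimal reproduction sketch of the reported behavior (Spark 1.4.x).
// Assumes `df` is a ~190M-row DataFrame produced by 3 joins.
import org.apache.spark.sql.DataFrame

def reproduce(df: DataFrame): Unit = {
  df.cache()
  df.count()  // materialize the big DataFrame in the cache

  // Take a tiny sample (fraction 1e-6, without replacement) and cache it.
  val sdf = df.sample(withReplacement = false, fraction = 1e-6)
  sdf.cache()
  sdf.count() // materialize the small sample (~170 rows, ~150kB in memory)

  sdf.show()                    // reported: served from cache, no joins re-run
  sdf.select("colname").show()  // reported: re-runs the 3 joins on Spark 1.4
                                // (tracked as SPARK-9141)
}
```

The distinction in the report is that actions called directly on the cached `sdf` (show(), count(), take()) hit the cache, while operations that derive a new DataFrame from it (select(), withColumn()) produced a plan that did not reuse the cached data on Spark 1.4.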