Not until an action completes (e.g. save/count/reduce), or you explicitly truncate the DAG by checkpointing.
Spark needs to keep all shuffle files because, if some task/stage/node fails, it then only needs to recompute the missing partitions by reusing the parts that were already computed.

On Tue, Dec 19, 2017 at 10:08 AM, Mihai Iacob <mia...@ca.ibm.com> wrote:

> When does spark remove them?
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics
>
> ----- Original message -----
> From: Vadim Semenov <vadim.seme...@datadoghq.com>
> To: Mihai Iacob <mia...@ca.ibm.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: /tmp fills up to 100GB when using a window function
> Date: Tue, Dec 19, 2017 9:46 AM
>
> Spark doesn't remove intermediate shuffle files if they're part of the
> same job.
>
> On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob <mia...@ca.ibm.com> wrote:
>
> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Anything wrong with the code below? Or are there any known issues with
> spark not cleaning up /tmp files?
>
> window = Window.\
>     partitionBy('***', 'date_str').\
>     orderBy(sqlDf['***'])
>
> sqlDf = sqlDf.withColumn("***", rank().over(window))
> df_w_least = sqlDf.filter("***=1")
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org