[ https://issues.apache.org/jira/browse/SPARK-20598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-20598.
----------------------------------
    Resolution: Incomplete

> Iterative checkpoints do not get removed from HDFS
> --------------------------------------------------
>
>                 Key: SPARK-20598
>                 URL: https://issues.apache.org/jira/browse/SPARK-20598
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, Spark Core, YARN
>    Affects Versions: 2.1.0
>            Reporter: Guillem Palou
>            Priority: Major
>              Labels: bulk-closed
>
> I am running a PySpark application that uses dataframe.checkpoint() because,
> without it, Spark needed exponential time to compute the plan and I eventually
> had to stop the job. Using {{checkpoint}} allowed the application to proceed
> with the computation, but I noticed that the HDFS cluster was filling up with
> RDD files. Spark is running in YARN client mode.
> I managed to reproduce the problem in a toy example as below:
> {code}
> import gc
> from pyspark.sql import functions as F, types as T
>
> # `spark` and `sc` are assumed to be the session and context of the PySpark shell
> df = spark.createDataFrame([T.Row(a=1, b=2)]).checkpoint()
> for i in range(4):
>     # either line of the following 2 will produce the error
>     df = df.select('*', F.concat(*df.columns)).cache().checkpoint()
>     df = df.join(df, on='a').cache().checkpoint()
>     # the following two lines do not seem to have an effect
>     gc.collect()
>     sc._jvm.System.gc()
> {code}
> After running the code and calling {{sc.stop()}}, I can still see the
> checkpointed RDDs in HDFS:
> {quote}
> guillem@ip-10-9-94-0:~$ hdfs dfs -du -h $CHECKPOINT_PATH
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
> 5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
> {quote}
> The config flag {{spark.cleaner.referenceTracking.cleanCheckpoints}} is set
> to {{true}}. I would expect Spark to clean up all RDDs that can no longer be
> accessed.
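
For anyone trying to confirm the same symptom, the listing above can be summarized with a small helper. This is only an illustrative sketch and not part of the original report: the function name leftover_checkpoints and the SAMPLE constant are hypothetical, and it assumes the listing lines end in "<uuid>/rdd-<N>" exactly as in the quoted `hdfs dfs -du -h` output. It uses only the Python standard library and does not talk to Spark or HDFS.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: each relevant line ends with "<36-char uuid>/rdd-<N>",
# matching the layout of the listing quoted in the report.
_CHECKPOINT = re.compile(r"/([0-9a-f-]{36})/rdd-(\d+)\s*$")


def leftover_checkpoints(listing: str) -> Dict[str, List[int]]:
    """Group the rdd-N ids found in a du listing by checkpoint UUID directory."""
    found = defaultdict(list)
    for line in listing.splitlines():
        match = _CHECKPOINT.search(line)
        if match:
            found[match.group(1)].append(int(match.group(2)))
    # Sort numerically so rdd-6 comes before rdd-12, unlike the raw listing.
    return {uuid: sorted(ids) for uuid, ids in found.items()}


# Listing copied from the report; $CHECKPOINT_PATH is left unexpanded on purpose.
SAMPLE = """\
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-12
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-18
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-24
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-30
5.2 K  $CHECKPOINT_PATH/53e54099-3f50-4aeb-aee2-d817bfe57d77/rdd-6
"""

if __name__ == "__main__":
    for uuid, ids in leftover_checkpoints(SAMPLE).items():
        print(uuid, ids)
```

Run against the listing in the report, this groups five leftover checkpoints (rdd-6 through rdd-30) under a single UUID directory, which is consistent with one checkpoint surviving per call to checkpoint() in the toy example.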