Hi all,

We are running a class with PySpark notebooks for data analysis. Some of the notebooks are fairly long and involve a lot of operations. Over the course of a notebook, the shuffle storage grows considerably and often exceeds our quota (e.g. 1.5GB of input expands to 24GB of shuffle files). Closing and reopening the notebook doesn't clean out the shuffle directory.
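As a stopgap between notebook sessions, leftover scratch directories can be removed by hand. A minimal sketch (the directory-name prefixes and the age threshold are assumptions; Spark's scratch files live under spark.local.dir, /tmp by default, in directories named like "spark-<uuid>" or "blockmgr-<uuid>" — only run this when no Spark application is using that directory):

```python
# Hypothetical cleanup helper -- NOT part of Spark itself.
import os
import shutil
import time


def clean_stale_spark_dirs(local_dir, max_age_hours=24,
                           prefixes=("spark-", "blockmgr-")):
    """Remove Spark scratch directories under local_dir older than
    max_age_hours. Only safe when no Spark app is using local_dir.
    Returns the list of directories removed."""
    removed = []
    cutoff = time.time() - max_age_hours * 3600
    for name in os.listdir(local_dir):
        path = os.path.join(local_dir, name)
        if (name.startswith(prefixes)
                and os.path.isdir(path)
                and os.path.getmtime(path) < cutoff):
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path)
    return removed
```

We run something like this from cron between class sessions; it does nothing for a live notebook, but it keeps abandoned shuffle directories from accumulating.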
FWIW, the shuffle storage really explodes when we use ALS. There is a ticket to make sure this is well documented, but there are also suggestions that the problem should have gone away with Spark 1.0: https://issues.apache.org/jira/browse/SPARK-5836

Yours,
Ewan

On Tue, 2015-09-29 at 01:18 -0700, ramibatal wrote:
> Hi all,
>
> I am applying MLlib LDA for topic modelling. I am setting up the LDA
> parameters as follows:
>
>     lda.setOptimizer(optimizer)
>       .setK(params.k)
>       .setMaxIterations(params.maxIterations)
>       .setDocConcentration(params.docConcentration)
>       .setTopicConcentration(params.topicConcentration)
>       .setCheckpointInterval(params.checkpointInterval)
>     if (params.checkpointDir.nonEmpty) {
>       sc.setCheckpointDir(params.checkpointDir.get)
>     }
>
> I am running the LDA algorithm on my local macOS machine, on a corpus
> of 800,000 English text documents (9GB in total), and my machine has
> 8 cores with 16GB of RAM and a 500GB hard disk.
>
> Here are my Spark configurations:
>
>     val conf = new SparkConf().setMaster("local[6]").setAppName("LDAExample")
>     val sc = new SparkContext(conf)
>
> When calling LDA with a large number of iterations (100) (i.e. by
> calling val ldaModel = lda.run(corpus)), the algorithm starts to
> create shuffle files on my disk to the point that it fills up until
> there is no space left.
>
> I am using spark-submit to run my program as follows:
>
>     spark-submit --driver-memory 14G \
>       --class com.heystaks.spark.ml.topicmodelling.LDAExample \
>       ./target/scala-2.10/lda-assembly-1.0.jar path/to/corpus/file \
>       --k 100 --maxIterations 100 \
>       --checkpointDir /Users/ramialbatal/checkpoints \
>       --checkpointInterval 1
>
> where 'k' is the number of topics to extract. When the number of
> iterations and topics is small everything is fine, but with a large
> iteration count like 100, no matter what the value of
> --checkpointInterval is, the phenomenon is the same: the disk fills
> up after about 25 iterations.
> Everything seems to run correctly and the checkpoint files are
> created on my disk, but the shuffle files are not removed at all.
>
> I am using Spark and MLlib 1.5.0, and my machine is a Mac running
> Yosemite 10.10.5.
>
> Any help is highly appreciated. Thanks
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Checkpointing-not-removing-shuffle-files-from-local-disk-tp24857.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> ---------------------------------------------------------------------
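For anyone else hitting this: shuffle files are removed by the driver's ContextCleaner only once the corresponding RDDs and shuffle dependencies are garbage-collected on the driver, so a long lineage can keep them pinned even when checkpointing is enabled. A sketch of settings worth experimenting with (the path is a placeholder; verify both settings against the configuration docs for your Spark version):

```
# spark-defaults.conf sketch -- suggestions to try, not a verified fix.

# Point shuffle scratch space at a volume with room to spare.
spark.local.dir     /path/to/large/volume/spark-tmp

# Spark 1.x only: time-based cleanup of old metadata and shuffle data
# (seconds). Risky for long-running jobs, since data that is still
# needed can be deleted -- use with care.
spark.cleaner.ttl   3600
```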