Hi all,
We are running a class with PySpark notebooks for data analysis. Some of
the notebooks are fairly long and have a lot of operations. Over the
course of a notebook, the shuffle storage expands considerably and often
exceeds our quota (e.g. 1.5GB of input expands to 24GB of shuffle
files). Closing and reopening the notebook doesn't clean out the shuffle
directory.

FWIW, the shuffle memory really explodes when we use ALS.
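
In case it's useful to compare notes, here is a minimal Scala sketch of
the checkpointing hooks MLlib's ALS exposes; the checkpoint directory
and the rank/iteration/interval values are purely illustrative, and I
haven't confirmed that this keeps the shuffle directory in check:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Sketch only: the checkpoint directory is a placeholder and the
// parameter values are illustrative, not recommendations.
def trainWithCheckpoints(sc: SparkContext, ratings: RDD[Rating]) = {
  sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path

  new ALS()
    .setRank(10)
    .setIterations(20)
    .setCheckpointInterval(5)  // checkpoint every few iterations to cut lineage
    .run(ratings)
}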

There is a ticket open to make sure this behaviour is well documented,
but there are also suggestions that the problem should have gone away as
of Spark 1.0:

https://issues.apache.org/jira/browse/SPARK-5836
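
In the meantime, for reference, these are the cleanup knobs that look
relevant on our side; the TTL value and local directory below are purely
illustrative, and I haven't verified that they make the growth go away:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: spark.cleaner.ttl (Spark 1.x) periodically drops old
// metadata and shuffle data, and spark.local.dir controls where the
// shuffle files are written. Neither is a confirmed fix for this issue.
val conf = new SparkConf()
  .setAppName("NotebookSession")
  .set("spark.cleaner.ttl", "3600")             // seconds; illustrative value
  .set("spark.local.dir", "/scratch/spark-tmp") // placeholder path
val sc = new SparkContext(conf)

// ... notebook work ...

// Stopping the context when a session ends should let Spark delete its
// temporary directories, which reopening the notebook alone does not do.
sc.stop()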

Yours,
Ewan

On Tue, 2015-09-29 at 01:18 -0700, ramibatal wrote:
> Hi all,
> 
> I am applying MLlib LDA for topic modelling. I am setting up the LDA
> parameters as follows:
> 
> lda.setOptimizer(optimizer)
>   .setK(params.k)
>   .setMaxIterations(params.maxIterations)
>   .setDocConcentration(params.docConcentration)
>   .setTopicConcentration(params.topicConcentration)
>   .setCheckpointInterval(params.checkpointInterval)
> 
> if (params.checkpointDir.nonEmpty) {
>   sc.setCheckpointDir(params.checkpointDir.get)
> }
> 
> 
> I am running the LDA algorithm on my local macOS machine, on a corpus
> of 800,000 English text documents (9GB in total). My machine has 8
> cores, 16GB of RAM and a 500GB hard disk.
> 
> Here is my Spark configuration:
> 
> val conf = new SparkConf().setMaster("local[6]").setAppName("LDAExample")
> val sc = new SparkContext(conf)
> 
> 
> When I call LDA with a large number of iterations (100), i.e. by
> calling val ldaModel = lda.run(corpus), the algorithm starts to create
> shuffle files on my disk to the point that it fills up until there is
> no space left.
> 
> I am using spark-submit to run my program as follows:
> 
> spark-submit --driver-memory 14G \
>   --class com.heystaks.spark.ml.topicmodelling.LDAExample \
>   ./target/scala-2.10/lda-assembly-1.0.jar path/to/corpus/file \
>   --k 100 --maxIterations 100 \
>   --checkpointDir /Users/ramialbatal/checkpoints --checkpointInterval 1
> 
> 
> Where 'k' is the number of topics to extract. When the number of
> iterations and topics is small everything is fine, but with a large
> number of iterations like 100, no matter what value I pass for
> --checkpointInterval the result is the same: the disk fills up after
> about 25 iterations.
> 
> Everything seems to run correctly and the checkpoint files are created
> on my disk, but the shuffle files are not removed at all.
> 
> I am using Spark and MLlib 1.5.0, and my machine runs OS X Yosemite
> 10.10.5.
> 
> Any help is highly appreciated. Thanks
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
