Guoqiang Li created SPARK-3623: ---------------------------------- Summary: GraphX does not support the checkpoint operation Key: SPARK-3623 URL: https://issues.apache.org/jira/browse/SPARK-3623 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.1.0, 1.0.2 Reporter: Guoqiang Li Priority: Critical
Consider the following code: {code} for (i <- 0 until totalIter) { val previousCorpus = corpus logInfo("Start Gibbs sampling (Iteration %d/%d)".format(i, totalIter)) val corpusTopicDist = collectTermTopicDist(corpus, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) val corpusSampleTopics = sampleTopics(corpusTopicDist, globalTopicCounter, sumTerms, numTerms, numTopics, alpha, beta).persist(storageLevel) corpus = updateCounter(corpusSampleTopics, numTopics).persist(storageLevel) globalTopicCounter = collectGlobalCounter(corpus, numTopics) assert(bsum(globalTopicCounter) == sumTerms) previousCorpus.unpersistVertices() corpusTopicDist.unpersistVertices() corpusSampleTopics.unpersistVertices() } {code} If there is no checkpoint operation will appear the following problems. 1. The RDD of corpus dependencies are too deep 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org