[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21183 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r188779768 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala --- @@ -253,6 +253,14 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead val lda = new LDA() testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings, LDASuite.allParamSettings, checkModelData) + +def checkModelDataWithDataset(model: LDAModel, model2: LDAModel, dataset: Dataset[_]): Unit = { --- End diff -- * We don't need this inline method. * Leave a comment to explain why we are doing this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r188081089 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging { None } -val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex --- End diff -- fix scala style: ``` val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex { (index, docs) => val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0) val stat = BDM.zeros[Double](k, vocabSize) val logphatPartOption = logphatPartOptionBase() var nonEmptyDocCount: Long = 0L nonEmptyDocs.foreach { case (_, termCounts: Vector) => nonEmptyDocCount += 1 val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference( termCounts, expElogbetaBc.value, alpha, gammaShape, k, seed + index) stat(::, ids) := stat(::, ids) + sstats logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad)) } Iterator((stat, logphatPartOption, nonEmptyDocCount)) } ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r187217859 --- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala --- @@ -622,11 +623,11 @@ object LocalLDAModel extends MLReadable[LocalLDAModel] { val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration") val matrixConverted = MLUtils.convertMatrixColumnsToML(vectorConverted, "topicsMatrix") val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector, - topicConcentration: Double, gammaShape: Double) = + topicConcentration: Double, gammaShape: Double, seed: Long) = --- End diff -- This will break backwards compatibility of ML persistence (when users try to load LDAModels saved using past versions of Spark). Could you please test this manually by saving a LocalLDAModel using Spark 2.3 and loading it with a build of your PR? You can fix this by checking for the Spark version (in the `metadata`) and loading the seed for Spark >= 2.4. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r187216371 --- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala --- @@ -252,6 +252,15 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead val lda = new LDA() testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings, LDASuite.allParamSettings, checkModelData) + +def checkModelDataWithDataset(model: LDAModel, model2: LDAModel, --- End diff -- style: Please fix this to match other multi-line method headers. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r184753197 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging { None } -val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs => +val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndexInternal --- End diff -- Let's not use mapPartitionsWithIndexInternal; I don't think closure cleaning is expensive enough for us to worry about here. Use mapPartitionsWithIndex instead. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21183#discussion_r185049467 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala --- @@ -605,14 +609,16 @@ private[clustering] object OnlineLDAOptimizer { expElogbeta: BDM[Double], alpha: breeze.linalg.Vector[Double], gammaShape: Double, - k: Int): (BDV[Double], BDM[Double], List[Int]) = { + k: Int, + seed: Long): (BDV[Double], BDM[Double], List[Int]) = { val (ids: List[Int], cts: Array[Double]) = termCounts match { case v: DenseVector => ((0 until v.size).toList, v.values) case v: SparseVector => (v.indices.toList, v.values) } // Initialize the variational distribution q(theta|gamma) for the mini-batch +val randBasis = new RandBasis(new org.apache.commons.math3.random.MersenneTwister(seed)) val gammad: BDV[Double] = - new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(k) // K + new Gamma(gammaShape, 1.0 / gammaShape)(randBasis).samplesVector(k) // K --- End diff -- nit: Note that the original spacing with the comment was intentional to match lines below. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...
GitHub user ludatabricks opened a pull request: https://github.com/apache/spark/pull/21183 [SPARK-22210][ML] Add seed for LDA variationalTopicInference ## What changes were proposed in this pull request? - Add seed parameter for variationalTopicInference - Add seed for calling variationalTopicInference in submitMiniBatch - Add var seed in LDAModel so that it can take the seed from LDA and use it for the function call of variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution. ## How was this patch tested? Check the test result in mllib.clustering.LDASuite to make sure the result is repeatable with the seed. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ludatabricks/spark-1 SPARK-22210 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21183.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21183 commit e647e85e21c63546611c50e45bc57d232b0cbe83 Author: Lu WANG Date: 2018-04-27T16:46:45Z Add seed for LDA variationalTopicInference --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org