[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA
Github user mpjlu closed the pull request at: https://github.com/apache/spark/pull/19337
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r144060858

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
    * For Online optimizer only: [[optimizer]] = "online".
    *
+  * A (positive) learning parameter that controls the convergence of variational inference.
+  * Smaller value will lead to more accuracy model and longer training time.
--- End diff --

Thanks, I will update the doc.
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r143818776

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
    * For Online optimizer only: [[optimizer]] = "online".
    *
+  * A (positive) learning parameter that controls the convergence of variational inference.
+  * Smaller value will lead to more accuracy model and longer training time.
--- End diff --

Sorry to be troublesome here. I'm not sure a smaller value for the convergence tolerance will always lead to a more accurate model. A smaller value will make the statistics for each batch fit the training data more closely (possibly even overfit it), so I'm not sure the overall model accuracy will be better, especially for data not in the training dataset. Maybe just "Smaller value will lead to a more converged model and longer training time"? Please also add doc for the default value.
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r143723408

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
       expElogbeta: BDM[Double],
       alpha: breeze.linalg.Vector[Double],
       gammaShape: Double,
-      k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+      k: Int,
+      epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

Maybe it does not need to be set there. We could add epsilon as a parameter in the model to save it, but since most training parameters are not stored in the model, we don't add epsilon to the model here. Thanks.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r143720272

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
       expElogbeta: BDM[Double],
       alpha: breeze.linalg.Vector[Double],
       gammaShape: Double,
-      k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+      k: Int,
+      epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

Shouldn't epsilon be set in LDAModel as well, to reflect the value that was used while training the model?
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r143705705

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
     assert(lda.getLearningDecay === 0.53)
     lda.setLearningOffset(1027)
     assert(lda.getLearningOffset === 1027)
+    lda.setEpsilon(1e-3)
--- End diff --

Thanks.
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r143705102

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
     assert(lda.getLearningDecay === 0.53)
     lda.setLearningOffset(1027)
     assert(lda.getLearningOffset === 1027)
+    lda.setEpsilon(1e-3)
--- End diff --

Thanks @mgaido91, I will update the code.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r143704472

--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
     assert(lda.getLearningDecay === 0.53)
     lda.setLearningOffset(1027)
     assert(lda.getLearningOffset === 1027)
+    lda.setEpsilon(1e-3)
--- End diff --

Here you should test a value different from the default; otherwise this tests nothing.
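To illustrate the review point above, here is a minimal, self-contained Scala sketch (illustrative names only, not the real LDASuite or Spark's param machinery): setting the parameter to its default value cannot distinguish a working setter from a broken one, so the test should use a non-default value.

```scala
// Stand-in for a params trait with a default value (hypothetical class).
class LdaParamsSketch {
  private var epsilon: Double = 1e-3 // assumed default, as in the PR

  def setEpsilon(value: Double): this.type = {
    epsilon = value
    this
  }

  def getEpsilon: Double = epsilon
}

// setEpsilon(1e-3) followed by getEpsilon == 1e-3 would pass even if the
// setter silently did nothing; a non-default value actually exercises it.
val lda = new LdaParamsSketch
lda.setEpsilon(5e-4)
```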
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r142854372

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -322,6 +326,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
     this
   }

+  @Since("2.3.0")
+  def setEpsilon(epsilon: Double): this.type = {
+    require(epsilon> 0, s"LDA epsilon must be positive, but was set to $epsilon")
--- End diff --

Add a space after epsilon.
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r142853109

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
    * For Online optimizer only: [[optimizer]] = "online".
    *
+  * @group expertParam
--- End diff --

Please add parameter comments.
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r142853643

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
    * For Online optimizer only: [[optimizer]] = "online".
    *
+  * @group expertParam
+  */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" +
+    " A (positive) learning parameter that controls the convergence of variational inference.",
--- End diff --

The parameter description here cannot really help a user without knowledge of the LDA implementation. Please add more description of the effect of tuning the parameter, such as "Smaller value will lead to higher accuracy with the cost of more iterations."
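A hedged sketch of what a more user-oriented description plus a positivity check could look like, using plain Scala stand-ins (the names and wording are assumptions, not Spark's actual DoubleParam/ParamValidators API):

```scala
// Illustrative stand-in for a validated parameter with a tuning-oriented
// doc string; not the real Spark ML Param classes.
object EpsilonParamSketch {
  val name: String = "epsilon"
  val doc: String =
    "(For online optimizer) A positive convergence tolerance for variational inference. " +
      "Smaller values lead to a more converged model at the cost of more iterations."

  // Mirrors the intent of ParamValidators.gt(0.0): strictly positive values only.
  def isValid(value: Double): Boolean = value > 0.0

  def validated(value: Double): Double = {
    require(isValid(value), s"$name must be positive, but was set to $value")
    value
  }
}
```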
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r141272887

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
    * For Online optimizer only: [[optimizer]] = "online".
    *
+  * @group expertParam
+  */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" +
+    " A (positive) learning parameter that controls the convergence of variational inference.",
+    ParamValidators.gt(0))
--- End diff --

Yes, using 0.0 is good. Since the other Double parameters in this class use 0, I used 0 here to keep the code style consistent. It is fine to change all 0 to 0.0.
Github user mgaido91 commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r141265768

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
    * For Online optimizer only: [[optimizer]] = "online".
    *
+  * @group expertParam
+  */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" +
+    " A (positive) learning parameter that controls the convergence of variational inference.",
+    ParamValidators.gt(0))
--- End diff --

What about using `0.0` to make it clear that this is a Double?
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r140727204

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
       expElogbeta: BDM[Double],
       alpha: breeze.linalg.Vector[Double],
       gammaShape: Double,
-      k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+      k: Int,
+      epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

This is for other code, like LDAModel, which calls this function and does not need to set epsilon.
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/19337#discussion_r140725930

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
       expElogbeta: BDM[Double],
       alpha: breeze.linalg.Vector[Double],
       gammaShape: Double,
-      k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+      k: Int,
+      epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

This duplicates the default value.
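One way to address the duplication srowen points out, sketched under assumed names (this is not the actual Spark code): keep the default in a single named constant that both the instance field and the helper's default argument reference, so the two defaults cannot drift apart.

```scala
// Single source of truth for the default tolerance (hypothetical names).
object OnlineLdaDefaults {
  val DefaultEpsilon: Double = 1e-3
}

class OptimizerSketch {
  // Instance-level default shares the constant below.
  private var epsilon: Double = OnlineLdaDefaults.DefaultEpsilon

  def setEpsilon(value: Double): this.type = {
    require(value > 0, s"LDA epsilon must be positive, but was set to $value")
    epsilon = value
    this
  }

  def getEpsilon: Double = epsilon
}

object OptimizerSketch {
  // Stand-in for the static helper; its default argument reuses the shared
  // constant instead of repeating the literal 1e-3.
  def variationalTolerance(epsilon: Double = OnlineLdaDefaults.DefaultEpsilon): Double = epsilon
}
```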
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/19337

[SPARK-22114][ML][MLLIB] Add epsilon for LDA

## What changes were proposed in this pull request?

The current convergence condition of OnlineLDAOptimizer is:

    while (meanGammaChange > 1e-3)

This condition is critical for the performance and accuracy of LDA, so we should keep it configurable, as it is in Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc, line 638.

## How was this patch tested?

The existing unit tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mpjlu/spark setLDA

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19337.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19337

commit a6c3c79efe4d77ef4beba3f431fd9c2527735875
Author: Peng Meng
Date: 2017-09-25T09:12:47Z

    add epsilon for LDA
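The convergence condition described in the PR can be sketched as a self-contained loop in plain Scala (not Spark's actual OnlineLDAOptimizer code; names and the step function are illustrative): iterate until the mean absolute change of the gamma vector is no longer above a configurable epsilon, instead of a hard-coded 1e-3.

```scala
// Run a fixed-point style update until the mean absolute change of gamma
// drops to epsilon or below, with epsilon configurable.
def runUntilConverged(
    init: Array[Double],
    step: Array[Double] => Array[Double],
    epsilon: Double = 1e-3,
    maxIter: Int = 1000): (Array[Double], Int) = {
  var gamma = init
  var meanGammaChange = Double.MaxValue
  var iter = 0
  while (meanGammaChange > epsilon && iter < maxIter) {
    val next = step(gamma)
    meanGammaChange =
      gamma.zip(next).map { case (a, b) => math.abs(a - b) }.sum / gamma.length
    gamma = next
    iter += 1
  }
  (gamma, iter)
}
```

A tighter epsilon means more iterations: with a step that halves gamma each time, the change after iteration n is 2^-n, so epsilon = 1e-3 stops after 10 iterations while epsilon = 1e-6 needs 20.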