[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2018-01-15 Thread mpjlu
Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/19337


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-11 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r144060858
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * A (positive) learning parameter that controls the convergence of variational inference.
+   * Smaller value will lead to more accuracy model and longer training time.
--- End diff --

Thanks, I will update the doc.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143818776
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,24 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * A (positive) learning parameter that controls the convergence of variational inference.
+   * Smaller value will lead to more accuracy model and longer training time.
--- End diff --

Sorry to be troublesome here. I'm not sure a smaller value for the convergence tolerance will always lead to a more accurate model. A smaller value will make the statistics for each batch fit the training data more closely (even overfit it), but I'm not sure the overall model accuracy will be better, especially for data outside the training dataset.

Maybe just "Smaller value will lead to a more converged model and longer training time"?

Please also add documentation for the default value.



---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143723408
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

Maybe it does not need to be set. We could add epsilon as a parameter in the model to save it, but since most training parameters are not stored in the model, we don't add epsilon to the model here. Thanks.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143720272
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

shouldn't epsilon be set in LDAModel as well, to reflect the value that was used while training the model?


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143705705
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
 assert(lda.getLearningDecay === 0.53)
 lda.setLearningOffset(1027)
 assert(lda.getLearningOffset === 1027)
+lda.setEpsilon(1e-3)
--- End diff --

Thanks.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143705102
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
 assert(lda.getLearningDecay === 0.53)
 lda.setLearningOffset(1027)
 assert(lda.getLearningOffset === 1027)
+lda.setEpsilon(1e-3)
--- End diff --

Thanks @mgaido91 , I will update the code.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-10 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r143704472
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -119,6 +121,8 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
 assert(lda.getLearningDecay === 0.53)
 lda.setLearningOffset(1027)
 assert(lda.getLearningOffset === 1027)
+lda.setEpsilon(1e-3)
--- End diff --

Here you should test a value different from the default; otherwise this tests nothing.
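To make the reviewer's point concrete, here is a small self-contained Scala sketch (the `SketchParams` class is hypothetical, not the real LDAParams trait): setting the param to its default and reading it back would pass even if the setter silently dropped the value, whereas a non-default value exposes the bug.

```scala
// Hypothetical param holder; illustrates why tests should use non-default values.
class SketchParams {
  private var epsilon: Double = 1e-3  // default, mirrors the PR's 1e-3

  def setEpsilon(value: Double): this.type = {
    require(value > 0, s"epsilon must be positive, but was set to $value")
    epsilon = value
    this
  }

  def getEpsilon: Double = epsilon
}
```

A broken `setEpsilon` that ignored its argument would still satisfy `setEpsilon(1e-3)` followed by `getEpsilon == 1e-3`, but it would fail `setEpsilon(0.5)` followed by `getEpsilon == 0.5`.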


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r142854372
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -322,6 +326,13 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
 this
   }
 
+  @Since("2.3.0")
+  def setEpsilon(epsilon: Double): this.type = {
+require(epsilon> 0, s"LDA epsilon must be positive, but was set to $epsilon")
--- End diff --

There should be a space after `epsilon`.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r142853109
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
--- End diff --

Please add parameter comments.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-10-05 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r142853643
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" +
+" A (positive) learning parameter that controls the convergence of variational inference.",
--- End diff --

The parameter introduction here cannot really help a user without knowledge of the LDA implementation. Please add more description of the effect when a user wants to tune the parameter, such as "Smaller value will lead to higher accuracy at the cost of more iterations."


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-27 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r141272887
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" +
+" A (positive) learning parameter that controls the convergence of variational inference.",
+ParamValidators.gt(0))
--- End diff --

Yes, using 0.0 is good. Because the other Double parameters in this class use 0, I also used 0 here to keep the code style consistent. It is OK to change all 0 to 0.0.


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-27 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r141265768
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -224,6 +224,20 @@ private[clustering] trait LDAParams extends Params with HasFeaturesCol with HasM
   /**
* For Online optimizer only: [[optimizer]] = "online".
*
+   * @group expertParam
+   */
+  @Since("2.3.0")
+  final val epsilon = new DoubleParam(this, "epsilon", "(For online optimizer)" +
+" A (positive) learning parameter that controls the convergence of variational inference.",
+ParamValidators.gt(0))
--- End diff --

what about using `0.0` to make it clear that this is a Double?
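For what it's worth, the two spellings are equivalent at the call site, since Scala widens an Int literal to Double automatically; `0.0` is purely a readability choice. A tiny sketch, where `gt` is a stand-in for ParamValidators.gt rather than the Spark method itself:

```scala
// Stand-in validator factory: returns a predicate checking x > lowerBound.
// The Int literal 0 in gt(0) is widened to the Double 0.0 by the compiler.
object ValidatorSketch {
  def gt(lowerBound: Double): Double => Boolean = (x: Double) => x > lowerBound
}
```

`ValidatorSketch.gt(0)` and `ValidatorSketch.gt(0.0)` compile to the same call; both reject 0.0 and accept any positive epsilon.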


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-25 Thread mpjlu
Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r140727204
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

This is for other code, such as LDAModel, which calls this function and does not need to set epsilon.



---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19337#discussion_r140725930
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -573,7 +584,8 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  epsilon: Double = 1e-3): (BDV[Double], BDM[Double], List[Int]) = {
--- End diff --

This duplicates the default value.
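One common way to avoid repeating the literal, sketched here in plain Scala with illustrative names (not the actual Spark code): hoist the default into a single constant and reference it from both the parameter default and the helper's signature.

```scala
// Single source of truth for the default threshold.
object LdaDefaults {
  val DefaultEpsilon: Double = 1e-3
}

object VariationalHelper {
  // Placeholder helper: returns the threshold it would actually use,
  // defaulting to the shared constant instead of a repeated literal.
  def effectiveEpsilon(epsilon: Double = LdaDefaults.DefaultEpsilon): Double = epsilon
}
```

With this layout, changing the default in one place updates every call site that relies on it.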


---




[GitHub] spark pull request #19337: [SPARK-22114][ML][MLLIB]add epsilon for LDA

2017-09-25 Thread mpjlu
GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/19337

[SPARK-22114][ML][MLLIB]add epsilon for LDA

## What changes were proposed in this pull request?

The current convergence condition of OnlineLDAOptimizer is:
while (meanGammaChange > 1e-3)
This condition is critical for the performance and accuracy of LDA.
We should make it configurable, as it is in Vowpal Wabbit:

https://github.com/JohnLangford/vowpal_wabbit/blob/430f69453bc4876a39351fba1f18771bdbdb7122/vowpalwabbit/lda_core.cc (line 638)
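As a toy illustration of the change, here is a minimal, self-contained Scala sketch of an inference loop whose convergence threshold is a parameter instead of a hard-coded 1e-3. The update rule and all names (`step`, `inferUntilConverged`) are illustrative stand-ins, not the actual OnlineLDAOptimizer internals.

```scala
object ConvergenceSketch {
  // One simplified "inference" step: average the current gamma with a fixed
  // target, so the mean change shrinks by half on every pass.
  def step(gamma: Array[Double], target: Array[Double]): Array[Double] =
    gamma.zip(target).map { case (g, t) => (g + t) / 2.0 }

  // Iterate until the mean absolute change drops below the configurable
  // epsilon; returns the final gamma and the number of iterations taken.
  def inferUntilConverged(init: Array[Double],
                          target: Array[Double],
                          epsilon: Double = 1e-3): (Array[Double], Int) = {
    var gamma = init
    var meanGammaChange = Double.MaxValue
    var iters = 0
    while (meanGammaChange > epsilon) {
      val next = step(gamma, target)
      meanGammaChange =
        gamma.zip(next).map { case (a, b) => math.abs(a - b) }.sum / gamma.length
      gamma = next
      iters += 1
    }
    (gamma, iters)
  }
}
```

Because the per-pass change halves each iteration in this toy, tightening epsilon (say from 1e-2 to 1e-6) strictly increases the iteration count, which mirrors the accuracy/run-time trade-off the PR description mentions.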

## How was this patch tested?

The existing unit tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark setLDA

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19337.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19337


commit a6c3c79efe4d77ef4beba3f431fd9c2527735875
Author: Peng Meng 
Date:   2017-09-25T09:12:47Z

add epsilon for LDA




---
