[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-05-16 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21183


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-05-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21183#discussion_r188779768
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -253,6 +253,14 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
 val lda = new LDA()
 testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings,
   LDASuite.allParamSettings, checkModelData)
+
+def checkModelDataWithDataset(model: LDAModel, model2: LDAModel, dataset: Dataset[_]): Unit = {
--- End diff --

* We don't need this inline method.
* Leave a comment to explain why we are doing this.


---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-05-14 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21183#discussion_r188081089
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
 None
   }
 
-val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
+val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex
--- End diff --

fix scala style:
```scala
val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndex {
  (index, docs) =>
    val nonEmptyDocs = docs.filter(_._2.numNonzeros > 0)

    val stat = BDM.zeros[Double](k, vocabSize)
    val logphatPartOption = logphatPartOptionBase()
    var nonEmptyDocCount: Long = 0L
    nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
      nonEmptyDocCount += 1
      val (gammad, sstats, ids) = OnlineLDAOptimizer.variationalTopicInference(
        termCounts, expElogbetaBc.value, alpha, gammaShape, k, seed + index)
      stat(::, ids) := stat(::, ids) + sstats
      logphatPartOption.foreach(_ += LDAUtils.dirichletExpectation(gammad))
    }
    Iterator((stat, logphatPartOption, nonEmptyDocCount))
}
```


---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-05-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21183#discussion_r187217859
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala ---
@@ -622,11 +623,11 @@ object LocalLDAModel extends MLReadable[LocalLDAModel] {
   val vectorConverted = MLUtils.convertVectorColumnsToML(data, "docConcentration")
   val matrixConverted = MLUtils.convertMatrixColumnsToML(vectorConverted, "topicsMatrix")
   val Row(vocabSize: Int, topicsMatrix: Matrix, docConcentration: Vector,
-  topicConcentration: Double, gammaShape: Double) =
+  topicConcentration: Double, gammaShape: Double, seed: Long) =
--- End diff --

This will break backwards compatibility of ML persistence (when users try 
to load LDAModels saved using past versions of Spark).  Could you please test 
this manually by saving a LocalLDAModel using Spark 2.3 and loading it with a 
build of your PR?  You can fix this by checking for the Spark version (in the 
`metadata`) and loading the seed for Spark >= 2.4.
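
The version check the reviewer describes could look roughly like the following. This is a hypothetical, simplified sketch in plain Scala (not Spark's actual DefaultParamsReader/metadata API): models saved by Spark < 2.4 carry no seed field, so the loader falls back to a default instead of failing on the missing value.

```scala
// Hypothetical sketch of version-gated loading; `SeedCompat`, `loadSeed`,
// and `savedSeed` are illustrative names, not Spark APIs.
object SeedCompat {
  // Parse "major.minor" out of a version string such as "2.3.0".
  private def majorMinor(version: String): (Int, Int) = {
    val parts = version.split("\\.")
    (parts(0).toInt, parts(1).toInt)
  }

  // `savedSeed` is None when the persisted metadata predates the seed field.
  def loadSeed(sparkVersion: String, savedSeed: Option[Long], default: Long): Long = {
    val (major, minor) = majorMinor(sparkVersion)
    if (major > 2 || (major == 2 && minor >= 4)) savedSeed.getOrElse(default)
    else default // pre-2.4 models never stored a seed
  }
}
```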


---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-05-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21183#discussion_r187216371
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala ---
@@ -252,6 +252,15 @@ class LDASuite extends SparkFunSuite with MLlibTestSparkContext with DefaultRead
 val lda = new LDA()
 testEstimatorAndModelReadWrite(lda, dataset, LDASuite.allParamSettings,
   LDASuite.allParamSettings, checkModelData)
+
+def checkModelDataWithDataset(model: LDAModel, model2: LDAModel,
--- End diff --

style: Please fix this to match other multi-line method headers.


---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-04-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21183#discussion_r184753197
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -473,7 +475,8 @@ final class OnlineLDAOptimizer extends LDAOptimizer with Logging {
 None
   }
 
-val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitions { docs =>
+val stats: RDD[(BDM[Double], Option[BDV[Double]], Long)] = batch.mapPartitionsWithIndexInternal
--- End diff --

Let's not use mapPartitionsWithIndexInternal; I don't think closure 
cleaning is expensive enough for us to worry about here.  Use 
mapPartitionsWithIndex instead.
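
A toy illustration of the point above, with plain Scala collections standing in for an RDD (names here are illustrative, not Spark code): deriving each partition's generator from `seed + partitionIndex`, which mapPartitionsWithIndex makes possible, yields draws that are reproducible regardless of which executor processes the partition.

```scala
import scala.util.Random

// Illustrative sketch: one seeded RNG per "partition", keyed by index.
object PartitionSeeding {
  def perPartitionDraws(partitions: Seq[Seq[Int]], seed: Long): Seq[Seq[Double]] =
    partitions.zipWithIndex.map { case (docs, index) =>
      val rng = new Random(seed + index) // deterministic per partition
      docs.map(_ => rng.nextDouble())
    }
}
```

Running this twice with the same seed gives identical output, which is the repeatability the PR is after.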


---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-04-30 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21183#discussion_r185049467
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -605,14 +609,16 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double], List[Int]) = {
+  k: Int,
+  seed: Long): (BDV[Double], BDM[Double], List[Int]) = {
 val (ids: List[Int], cts: Array[Double]) = termCounts match {
   case v: DenseVector => ((0 until v.size).toList, v.values)
   case v: SparseVector => (v.indices.toList, v.values)
 }
 // Initialize the variational distribution q(theta|gamma) for the mini-batch
+val randBasis = new RandBasis(new org.apache.commons.math3.random.MersenneTwister(seed))
 val gammad: BDV[Double] =
-  new Gamma(gammaShape, 1.0 / gammaShape).samplesVector(k)    // K
+  new Gamma(gammaShape, 1.0 / gammaShape)(randBasis).samplesVector(k)  // K
--- End diff --

nit: Note that the original spacing with the comment was intentional to 
match lines below.


---




[GitHub] spark pull request #21183: [SPARK-22210][ML] Add seed for LDA variationalTop...

2018-04-27 Thread ludatabricks
GitHub user ludatabricks opened a pull request:

https://github.com/apache/spark/pull/21183

[SPARK-22210][ML] Add seed for LDA variationalTopicInference

## What changes were proposed in this pull request?

- Add a seed parameter to variationalTopicInference

- Pass a seed when calling variationalTopicInference in submitMiniBatch

- Add a seed var in LDAModel so that it can take the seed from LDA and use it in the calls to variationalTopicInference in logLikelihoodBound, topicDistributions, getTopicDistributionMethod, and topicDistribution.
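
The seed-threading described above can be sketched as follows. This is a simplified, hypothetical stand-in in plain Scala (breeze and Spark types omitted; `SeededInference` and `initGamma` are illustrative names): with an explicit seed, the variational initialization of q(theta|gamma) becomes reproducible for a fixed seed.

```scala
import scala.util.Random

object SeededInference {
  // Stand-in for the Gamma(gammaShape, 1/gammaShape) initialization of gammad;
  // the draws here are placeholders, seeded so repeated runs agree.
  def initGamma(k: Int, gammaShape: Double, seed: Long): Vector[Double] = {
    val rng = new Random(seed)
    Vector.fill(k)(gammaShape * rng.nextDouble())
  }
}
```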


## How was this patch tested?

Checked the test results in mllib.clustering.LDASuite to make sure the results are repeatable with the seed.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ludatabricks/spark-1 SPARK-22210

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21183.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21183


commit e647e85e21c63546611c50e45bc57d232b0cbe83
Author: Lu WANG 
Date:   2018-04-27T16:46:45Z

Add seed for LDA variationalTopicInference




---
