[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/12176


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-210439965
  
Merged to master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-210433724
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-210433732
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55921/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-210433327
  
**[Test build #55921 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55921/consoleFull)**
 for PR 12176 at commit 
[`8107dc2`](https://github.com/apache/spark/commit/8107dc29f17996128cfe2942baa15d371bcef9dd).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-210419520
  
**[Test build #55921 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55921/consoleFull)**
 for PR 12176 at commit 
[`8107dc2`](https://github.com/apache/spark/commit/8107dc29f17996128cfe2942baa15d371bcef9dd).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-210419431
  
Jenkins retest this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread pravingadakh
Github user pravingadakh commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r59852468
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -536,13 +532,16 @@ private[clustering] object OnlineLDAOptimizer {
* An optimization (Lee, Seung: Algorithms for non-negative matrix 
factorization, NIPS 2001)
* avoids explicit computation of variational parameter `phi`.
* @see 
[[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.7566]]
+   *
+   * @return A tuple of topic distribution `gammad`, `sstatsd` and `ids`. 
Latter two are the
--- End diff --

@srowen Updated the return doc accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r59848898
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -536,13 +532,16 @@ private[clustering] object OnlineLDAOptimizer {
* An optimization (Lee, Seung: Algorithms for non-negative matrix 
factorization, NIPS 2001)
* avoids explicit computation of variational parameter `phi`.
* @see 
[[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.7566]]
+   *
+   * @return A tuple of topic distribution `gammad`, `sstatsd` and `ids`. 
Latter two are the
--- End diff --

Sounds fine to me


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-15 Thread pravingadakh
Github user pravingadakh commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r59844091
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -536,13 +532,16 @@ private[clustering] object OnlineLDAOptimizer {
* An optimization (Lee, Seung: Algorithms for non-negative matrix 
factorization, NIPS 2001)
* avoids explicit computation of variational parameter `phi`.
* @see 
[[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.7566]]
+   *
+   * @return A tuple of topic distribution `gammad`, `sstatsd` and `ids`. 
Latter two are the
--- End diff --

@srowen How about this return doc:

> Returns a tuple of `gammad` - estimate of gamma, the topic distribution, 
`sstatsd` - statistics for updating lambda and `ids` - list of termCounts 
vector indices


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-13 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-209677538
  
LGTM too pending the doc update (thanks for that)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-13 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-209493926
  
@pravingadakh if you'll augment that doc a little bit more I'll merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-10 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r59135877
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -536,13 +532,16 @@ private[clustering] object OnlineLDAOptimizer {
* An optimization (Lee, Seung: Algorithms for non-negative matrix 
factorization, NIPS 2001)
* avoids explicit computation of variational parameter `phi`.
* @see 
[[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.7566]]
+   *
+   * @return A tuple of topic distribution `gammad`, `sstatsd` and `ids`. 
Latter two are the
--- End diff --

This doesn't really describe what they are, but simply one place they're 
called. It doesn't need to be elaborate; see the explanations of what the firs 
two are above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-10 Thread pravingadakh
Github user pravingadakh commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-207929631
  
@srowen Added `@return` for the method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-09 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-207894385
  
@pravingadakh I'll merge if you'll doc the return value of this method. 
That further justifies the change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-06 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206674436
  
Sean is right about the meanings of the return values. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206410312
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206410316
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55113/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206410132
  
**[Test build #55113 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55113/consoleFull)**
 for PR 12176 at commit 
[`fdc8bb0`](https://github.com/apache/spark/commit/fdc8bb02e5b75e1b9691f072e70a0e478c24ee84).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206388932
  
**[Test build #55113 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55113/consoleFull)**
 for PR 12176 at commit 
[`fdc8bb0`](https://github.com/apache/spark/commit/fdc8bb02e5b75e1b9691f072e70a0e478c24ee84).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-06 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206388117
  
Jenkins test this please


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread pravingadakh
Github user pravingadakh commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-206119313
  
@jkbradley Could you please flag this PR for test build?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-205952381
  
Ha fair enough. From poking around the internet I'm pretty sure the first 
is an estimate of gamma, the topic distribution (as the comments say) and the 
second are sufficient statistics for updating lambda.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread pravingadakh
Github user pravingadakh commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-205918384
  
I could document the returned values, but frankly I have no idea what those 
values are. I can see the documentation of first returned value `gammad` in the 
comments but none for others. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-205882669
  
A-ha, understood about `.values`. This looks pretty reasonable. My only 
question is, does it make sense conceptually that this method also returns a 
list of IDs? it doesn't hurt much in practice, and it seems like there's a 
reasonable argument for it logically. We have one case where the caller needs 
it after all. Maybe finish this by documenting the three things returned from 
this method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread pravingadakh
Github user pravingadakh commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r58553918
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -542,10 +539,11 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double]) = {
-val (ids: List[Int], cts: Array[Double]) = termCounts match {
-  case v: DenseVector => ((0 until v.size).toList, v.values)
-  case v: SparseVector => (v.indices.toList, v.values)
+  k: Int,
+  ids: List[Int]): (BDV[Double], BDM[Double]) = {
+val cts: Array[Double] = termCounts match {
--- End diff --

Yes it looks redundant, but `values` is not a member of parent trait 
`Vector`, it is defined at individual implementation level (i.e.  `DenseVector` 
and `SparseVector`). 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread pravingadakh
Github user pravingadakh commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r58553056
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -440,12 +440,9 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
   val stat = BDM.zeros[Double](k, vocabSize)
   var gammaPart = List[BDV[Double]]()
   nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
-val ids: List[Int] = termCounts match {
-  case v: DenseVector => (0 until v.size).toList
-  case v: SparseVector => v.indices.toList
-}
+val ids: List[Int] = LDAUtils.vectorAsList(termCounts)
--- End diff --

Yes I agree. It is better to simply return `ids` rather than having callers 
to compute it. I will make that change.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r58519101
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -440,12 +440,9 @@ final class OnlineLDAOptimizer extends LDAOptimizer {
   val stat = BDM.zeros[Double](k, vocabSize)
   var gammaPart = List[BDV[Double]]()
   nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
-val ids: List[Int] = termCounts match {
-  case v: DenseVector => (0 until v.size).toList
-  case v: SparseVector => v.indices.toList
-}
+val ids: List[Int] = LDAUtils.vectorAsList(termCounts)
--- End diff --

I see what you're optimizing here, but the cost is making all the other 
callers compute this. Is it enough of a performance bottleneck to bother? Is it 
at all sensible to simply return `ids` from the existing method call instead?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/12176#discussion_r58518958
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala ---
@@ -542,10 +539,11 @@ private[clustering] object OnlineLDAOptimizer {
   expElogbeta: BDM[Double],
   alpha: breeze.linalg.Vector[Double],
   gammaShape: Double,
-  k: Int): (BDV[Double], BDM[Double]) = {
-val (ids: List[Int], cts: Array[Double]) = termCounts match {
-  case v: DenseVector => ((0 until v.size).toList, v.values)
-  case v: SparseVector => (v.indices.toList, v.values)
+  k: Int,
+  ids: List[Int]): (BDV[Double], BDM[Double]) = {
+val cts: Array[Double] = termCounts match {
--- End diff --

The `match` here becomes redundant right? both cases return `v.values`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/12176#issuecomment-205741735
  
Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-14370][MLLIB]removed duplicate generati...

2016-04-05 Thread pravingadakh
GitHub user pravingadakh opened a pull request:

https://github.com/apache/spark/pull/12176

[SPARK-14370][MLLIB]removed duplicate generation of ids in 
OnlineLDAOptimizer

## What changes were proposed in this pull request?

Removed duplicated generation of `ids` in OnlineLDAOptimizer.


## How was this patch tested?

tested with existing unit tests.


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pravingadakh/spark SPARK-14370

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/12176.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12176


commit 12acf536bc920eeef7518c183b61e38c06ecacd2
Author: Pravin Gadakh 
Date:   2016-04-05T10:04:03Z

removed duplicate generation of ids in OnlineLDAOptimizer




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org