[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/5748


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124739083
  
Merging with master: the first test run passed, and the second one's
failure was unrelated.
Thanks!





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124730905
  
[Test build #38395 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38395/console) for PR 5748 at commit [`e308913`](https://github.com/apache/spark/commit/e308913423c4c6019b21bcb05630268bc381fa1a).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124730979
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124709981
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124709822
  
[Test build #94 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/94/console) for PR 5748 at commit [`e308913`](https://github.com/apache/spark/commit/e308913423c4c6019b21bcb05630268bc381fa1a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124667713
  
[Test build #38395 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38395/consoleFull) for PR 5748 at commit [`e308913`](https://github.com/apache/spark/commit/e308913423c4c6019b21bcb05630268bc381fa1a).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124667127
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124666886
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124665624
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124666314
  
[Test build #94 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/94/consoleFull) for PR 5748 at commit [`e308913`](https://github.com/apache/spark/commit/e308913423c4c6019b21bcb05630268bc381fa1a).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124665813
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124665485
  
test this please





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124606115
  
LGTM pending tests.

Test this please





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124600961
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124600685
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124600634
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124600254
  
done





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124599284
  
I just checked, and the docs for the private vals won't show up.  (I 
checked the current docs for KMeansModel, which exposes uid but hides 
parentModel.)  Would you mind moving that doc, just to keep things 
well-organized?  That should be it.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124597737
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124597508
  
[Test build #90 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/90/console) for PR 5748 at commit [`5703116`](https://github.com/apache/spark/commit/5703116acea0f3e885061e191cb1956b7d4b2ca7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124595722
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124595446
  
[Test build #38375 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38375/console) for PR 5748 at commit [`5703116`](https://github.com/apache/spark/commit/5703116acea0f3e885061e191cb1956b7d4b2ca7).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35445290
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -484,8 +480,9 @@ class Word2VecModel private[spark] (
* @return vector representation of word
*/
   def transform(word: String): Vector = {
-model.get(word) match {
-  case Some(vec) =>
+wordIndex.get(word) match {
+  case Some(ind) =>
+val vec = wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)
--- End diff --

You're right, for this one, we have to make a copy anyways.
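
The copy discussed here comes from `slice`, which on a Scala `Array` returns a new array rather than a view. A minimal sketch of the lookup in the diff, assuming made-up values for `vectorSize`, `wordIndex` and `wordVectors` (the object name `SliceCopySketch` and the helper `lookup` are hypothetical, not part of the PR):

```scala
object SliceCopySketch {
  // Hypothetical stand-ins for the model's fields; the values are invented.
  val vectorSize = 3
  val wordIndex: Map[String, Int] = Map("a" -> 0, "b" -> 1)
  val wordVectors: Array[Float] = Array(1f, 2f, 3f, 4f, 5f, 6f)

  // Look up a word's vector by slicing the flat array.
  // Array.slice allocates a fresh array, so the caller gets a copy.
  def lookup(word: String): Option[Array[Float]] =
    wordIndex.get(word).map { ind =>
      wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)
    }
}
```

Because the result is a copy, mutating the returned array leaves `wordVectors` untouched.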





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124588668
  
Thanks for the updates!  It LGTM.  I'm just waiting for the docs to compile 
to check the param doc question.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35445443
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -37,6 +37,12 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext {
 assert(syms.length == 2)
 assert(syms(0)._1 == "b")
 assert(syms(1)._1 == "c")
+
+// Test that model built using Word2Vec, i.e wordVectors and wordIndec
+// and a Word2VecMap give the same values.
+val word2VecMap = model.getVectors
+val newModel = new Word2VecModel(word2VecMap)
+assert(newModel.getVectors.mapValues(_.toSeq) == word2VecMap.mapValues(_.toSeq))
--- End diff --

Right you are





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35445108
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -431,36 +422,41 @@ class Word2Vec extends Serializable with Logging {
  * Word2Vec model
  */
 @Experimental
-class Word2VecModel private[spark] (
-model: Map[String, Array[Float]]) extends Serializable with Saveable {
-
-  // wordList: Ordered list of words obtained from model.
-  private val wordList: Array[String] = model.keys.toArray
+class Word2VecModel private[mllib] (
+private val wordIndex: Map[String, Int],
+private val wordVectors: Array[Float]) extends Serializable with Saveable {
 
   // wordIndex: Maps each word to an index, which can retrieve the corresponding
   //vector from wordVectors (see below).
-  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
+  // wordVectors: Array of length numWords * vectorSize, vector corresponding
--- End diff --

Good question.  Do you know if it shows up in the API docs, even though 
it's private?  (I'll check, but it may take a little while since I need to 
compile them.)
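
The layout described in the diff comments (a word-to-index map plus one flat array of length numWords * vectorSize) can be sketched as follows. This is an illustration only, not the PR's code: the object `FlatLayoutSketch` and the helpers `flatten` and `unflatten` are hypothetical names, and the sketch assumes every vector has the same length.

```scala
object FlatLayoutSketch {
  // Flatten a word -> vector map into (wordIndex, wordVectors).
  // Assumption: all vectors share one length (checked below).
  def flatten(model: Map[String, Array[Float]]): (Map[String, Int], Array[Float]) = {
    require(model.nonEmpty, "model must contain at least one word")
    val vectorSize = model.head._2.length
    require(model.values.forall(_.length == vectorSize), "vectors must have equal length")
    val words = model.keys.toArray
    val wordIndex = words.zipWithIndex.toMap
    val wordVectors = new Array[Float](words.length * vectorSize)
    // Copy each word's vector into its slot of the flat array.
    words.zipWithIndex.foreach { case (w, i) =>
      Array.copy(model(w), 0, wordVectors, i * vectorSize, vectorSize)
    }
    (wordIndex, wordVectors)
  }

  // Rebuild the map view from the flat layout.
  def unflatten(wordIndex: Map[String, Int], wordVectors: Array[Float]): Map[String, Array[Float]] = {
    require(wordIndex.nonEmpty, "wordIndex must not be empty")
    val vectorSize = wordVectors.length / wordIndex.size
    wordIndex.map { case (w, i) =>
      w -> wordVectors.slice(i * vectorSize, (i + 1) * vectorSize)
    }
  }
}
```

Storing one `Array[Float]` avoids holding a separate array object per word; the trade-off is that each lookup slices (and therefore copies) its vector.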





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124582475
  
[Test build #38375 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38375/consoleFull) for PR 5748 at commit [`5703116`](https://github.com/apache/spark/commit/5703116acea0f3e885061e191cb1956b7d4b2ca7).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124581994
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124581789
  
[Test build #90 has started](https://amplab.cs.berkeley.edu/jenkins/job/SlowSparkPullRequestBuilder/90/consoleFull) for PR 5748 at commit [`5703116`](https://github.com/apache/spark/commit/5703116acea0f3e885061e191cb1956b7d4b2ca7).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124581955
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124581586
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124581521
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124580801
  
jenkins my friend. retest this please





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124580045
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124576729
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35440919
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -484,8 +480,9 @@ class Word2VecModel private[spark] (
* @return vector representation of word
*/
   def transform(word: String): Vector = {
-model.get(word) match {
-  case Some(vec) =>
+wordIndex.get(word) match {
+  case Some(ind) =>
+val vec = wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)
--- End diff --

It gives me a compilation error, so that also works in favor of not 
changing it :p 





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124576755
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35439718
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -37,6 +37,12 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext {
 assert(syms.length == 2)
 assert(syms(0)._1 == "b")
 assert(syms(1)._1 == "c")
+
+// Test that model built using Word2Vec, i.e wordVectors and wordIndec
+// and a Word2VecMap give the same values.
+val word2VecMap = model.getVectors
+val newModel = new Word2VecModel(word2VecMap)
+assert(newModel.getVectors.mapValues(_.toSeq) == word2VecMap.mapValues(_.toSeq))
--- End diff --

The (word, vector) pairs are actually what is being compared: `getVectors` 
returns a map from each word to its vector. Sorry if the name `getVectors` 
sounds misleading, but I did not write that either :p 





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35439383
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -484,8 +480,9 @@ class Word2VecModel private[spark] (
* @return vector representation of word
*/
   def transform(word: String): Vector = {
-model.get(word) match {
-  case Some(vec) =>
+wordIndex.get(word) match {
+  case Some(ind) =>
+val vec = wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)
--- End diff --

Are you sure? I think a copy will be produced anyway. It seems that the copy 
is avoided only as long as the result stays a collection.view; converting it 
back to an array copies the elements.

Ref: (http://stackoverflow.com/a/6799739/1170730)





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-24 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35437614
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -431,36 +422,41 @@ class Word2Vec extends Serializable with Logging {
  * Word2Vec model
  */
 @Experimental
-class Word2VecModel private[spark] (
-model: Map[String, Array[Float]]) extends Serializable with Saveable {
-
-  // wordList: Ordered list of words obtained from model.
-  private val wordList: Array[String] = model.keys.toArray
+class Word2VecModel private[mllib] (
+private val wordIndex: Map[String, Int],
+private val wordVectors: Array[Float]) extends Serializable with Saveable {
 
   // wordIndex: Maps each word to an index, which can retrieve the corresponding
   //vector from wordVectors (see below).
-  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
+  // wordVectors: Array of length numWords * vectorSize, vector corresponding
--- End diff --

But this is not meant to be public at any point in time. Is that okay?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124298189
  
It looks good, just tiny comments.  We can make sure this gets into 1.5.

> However, if the user provides a Word2Vec map by himself to construct the 
Word2Vec model (in the future, since Word2Vec model is marked as 
private[mllib]), it creates a huge array of size numWords * numDims. Are we 
okay with that?

I think that's OK, though we could make that constructor public in the 
future.  I think it would only be useful if someone wanted to load a model 
(created by another library) into MLlib.
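
To put the size concern in numbers: a flat `Array[Float]` of numWords * numDims elements needs 4 bytes per element, and a JVM array is indexed by an `Int`, so it cannot hold more than `Int.MaxValue` elements. The helper below is only a back-of-the-envelope sketch; `FlatArraySize` and its methods are hypothetical names that do not exist in MLlib, and the vocabulary sizes in the note after it are made-up figures.

```scala
object FlatArraySize {
  // Number of Float elements in the flat array (Long avoids Int overflow).
  def numElements(numWords: Int, vectorSize: Int): Long =
    numWords.toLong * vectorSize

  // Payload size in bytes, ignoring the small JVM array header.
  def payloadBytes(numWords: Int, vectorSize: Int): Long =
    numElements(numWords, vectorSize) * 4L

  // Upper bound only: a JVM array index is an Int.
  def fitsInOneArray(numWords: Int, vectorSize: Int): Boolean =
    numElements(numWords, vectorSize) <= Int.MaxValue
}
```

For instance, 3,000,000 words with 300 dimensions come to 3,600,000,000 bytes (about 3.6 GB) and still fit in one array, whereas 10,000,000 words with 300 dimensions would exceed the index range.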





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35392030
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -431,36 +422,41 @@ class Word2Vec extends Serializable with Logging {
  * Word2Vec model
  */
 @Experimental
-class Word2VecModel private[spark] (
-model: Map[String, Array[Float]]) extends Serializable with Saveable {
-
-  // wordList: Ordered list of words obtained from model.
-  private val wordList: Array[String] = model.keys.toArray
+class Word2VecModel private[mllib] (
+private val wordIndex: Map[String, Int],
+private val wordVectors: Array[Float]) extends Serializable with Saveable {
 
   // wordIndex: Maps each word to an index, which can retrieve the corresponding
   //vector from wordVectors (see below).
-  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
+  // wordVectors: Array of length numWords * vectorSize, vector corresponding
--- End diff --

This doc for wordIndex and wordVectors can go in the class Scala doc and 
use ```@param```.
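
As an illustration of the suggested layout, the field descriptions can move from inline `//` comments into `@param` tags on the class Scaladoc. This is a sketch only: `FlatWordVectors` is a hypothetical stand-in class, the wording of the tags is mine, and deriving `vectorSize` from the two fields is an assumption for the example rather than something shown in the diff.

```scala
/**
 * Sketch of a model backed by one flat vector array (hypothetical class).
 *
 * @param wordIndex   maps each word to the index of its vector
 * @param wordVectors array of length numWords * vectorSize; the vector for
 *                    index `i` starts at position `i * vectorSize` and ends
 *                    before position `(i + 1) * vectorSize`
 */
class FlatWordVectors(
    val wordIndex: Map[String, Int],
    val wordVectors: Array[Float]) extends Serializable {

  /** Length of a single word vector (assumed derivable from the two fields). */
  val vectorSize: Int =
    if (wordIndex.isEmpty) 0 else wordVectors.length / wordIndex.size
}
```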





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35392033
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -508,7 +507,7 @@ class Word2VecModel private[mllib] (
*/
   def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
 require(num > 0, "Number of similar words should > 0")
-
+// TODO: optimize top-k
--- End diff --

I see.  Can you please make a JIRA and add its number to the comment here?
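
For readers wondering what "optimize top-k" could mean here: one common option is to keep only the best `k` candidates in a bounded heap instead of sorting every score. The sketch below shows that general technique under my own assumptions; `TopK.topK` is a hypothetical helper, not code from this PR, and the thread does not say which approach the follow-up JIRA should take.

```scala
import scala.collection.mutable

object TopK {
  // Returns the k highest-scoring entries, best first, in O(n log k) time.
  def topK(scores: Array[(String, Double)], k: Int): Array[(String, Double)] = {
    require(k > 0, "k should be > 0")
    // Ordering under which a lower score is "greater", so the queue head
    // is always the weakest candidate currently retained.
    val weakestFirst = Ordering.fromLessThan[(String, Double)](_._2 > _._2)
    val heap = mutable.PriorityQueue.empty[(String, Double)](weakestFirst)
    scores.foreach { s =>
      if (heap.size < k) {
        heap.enqueue(s)
      } else if (s._2 > heap.head._2) {
        heap.dequeue() // drop the weakest retained candidate
        heap.enqueue(s)
      }
    }
    // Dequeue yields the weakest first, so fill the result from the back.
    val out = new Array[(String, Double)](heap.size)
    var i = out.length - 1
    while (heap.nonEmpty) {
      out(i) = heap.dequeue()
      i -= 1
    }
    out
  }
}
```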





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35392037
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -548,6 +545,24 @@ class Word2VecModel private[spark] (
 @Experimental
 object Word2VecModel extends Loader[Word2VecModel] {
 
+  private def buildWordIndex(model: Map[String, Array[Float]]): Map[String, Int] = {
+model.keys.zipWithIndex.toMap
+  }
+
+  private def buildWordVectors(model: Map[String, Array[Float]]): Array[Float] = {
+require(!model.isEmpty, "Word2VecMap should be non-empty")
+val (vectorSize, numWords) = (model.head._2.size, model.size)
+val wordList = model.keys.toArray
+val wordVectors = new Array[Float](vectorSize * numWords)
+var i = 0
+while (i < numWords) {
+  val vec = model.get(wordList(i)).get
--- End diff --

style: Use ```model(wordList(i))``` rather than "get"
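
A self-contained sketch of the flattening step, with the direct `model(word)` lookup in place of `model.get(word).get`, might look like the following. `FlattenWord2VecMap.flatten` is a hypothetical name, the use of `Array.copy` and the extra length check are my additions, and the body of the loop is not visible in the quoted diff, so this is an assumption about the general shape rather than the PR's code.

```scala
object FlattenWord2VecMap {
  // Turns a word -> vector map into (word -> row index, flat row-major array).
  def flatten(model: Map[String, Array[Float]]): (Map[String, Int], Array[Float]) = {
    require(model.nonEmpty, "Word2VecMap should be non-empty")
    val vectorSize = model.head._2.length
    val wordList = model.keys.toArray
    // Build the index from the same ordered word list used to fill the array.
    val wordIndex = wordList.zipWithIndex.toMap
    val wordVectors = new Array[Float](vectorSize * wordList.length)
    var i = 0
    while (i < wordList.length) {
      val vec = model(wordList(i)) // apply, rather than get(...).get
      require(vec.length == vectorSize, "all vectors must have the same length")
      Array.copy(vec, 0, wordVectors, i * vectorSize, vectorSize)
      i += 1
    }
    (wordIndex, wordVectors)
  }
}
```

Deriving both results from a single `wordList` keeps row `i` of the flat array aligned with the word that maps to index `i`.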





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35392038
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -37,6 +37,12 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext {
 assert(syms.length == 2)
 assert(syms(0)._1 == "b")
 assert(syms(1)._1 == "c")
+
+// Test that model built using Word2Vec, i.e wordVectors and wordIndec
+// and a Word2VecMap give the same values.
+val word2VecMap = model.getVectors
+val newModel = new Word2VecModel(word2VecMap)
+assert(newModel.getVectors.mapValues(_.toSeq) == word2VecMap.mapValues(_.toSeq))
--- End diff --

Could you change this to compare (word, vector) pairs, rather than just the 
vectors?
(Also use triple equals ```===```)
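
Some background on why the test converts the values with `_.toSeq`: Scala arrays compare by reference, so two maps holding equal-content but distinct `Array[Float]` values are not `==`. The sketch below shows a content-based comparison of (word, vector) pairs; `CompareVectorMaps` is a hypothetical helper written for illustration, not part of the test suite.

```scala
object CompareVectorMaps {
  // Convert each Array value to a Seq so that == compares element by element.
  private def byValue(m: Map[String, Array[Float]]): Map[String, Seq[Float]] =
    m.map { case (word, vec) => word -> vec.toSeq }

  // True when both maps contain the same words with the same vector contents.
  def sameWordVectors(
      a: Map[String, Array[Float]],
      b: Map[String, Array[Float]]): Boolean =
    byValue(a) == byValue(b)
}
```

Inside a ScalaTest suite, the two converted maps can be compared with `===` in the `assert`, as requested above.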





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35392035
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -548,6 +545,24 @@ class Word2VecModel private[spark] (
 @Experimental
 object Word2VecModel extends Loader[Word2VecModel] {
 
+  private def buildWordIndex(model: Map[String, Array[Float]]): Map[String, Int] = {
+model.keys.zipWithIndex.toMap
+  }
+
+  private def buildWordVectors(model: Map[String, Array[Float]]): Array[Float] = {
+require(!model.isEmpty, "Word2VecMap should be non-empty")
--- End diff --

nit: Use ```model.nonEmpty```





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r35392031
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -484,8 +480,9 @@ class Word2VecModel private[spark] (
* @return vector representation of word
*/
   def transform(word: String): Vector = {
-model.get(word) match {
-  case Some(vec) =>
+wordIndex.get(word) match {
+  case Some(ind) =>
+val vec = wordVectors.slice(ind * vectorSize, ind * vectorSize + vectorSize)
--- End diff --

Does this work if you call ```wordVectors.view.slice(...)``` instead?  I 
think "view" will tell Scala not to physically create a copy of the slice.
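
The difference between the two calls can be observed in isolation. In the sketch below (hypothetical `SliceVsView` helper, made-up numbers), `slice` on an array allocates a new array, while `view.slice` returns a lazy view that reads through to the backing array. Note that the view is not an `Array[Float]`, so any caller that needs an array still has to materialize it, which copies the elements at that point.

```scala
object SliceVsView {
  // Copying variant: Array.slice allocates a fresh array for the row.
  def copyRow(flat: Array[Float], ind: Int, vectorSize: Int): Array[Float] =
    flat.slice(ind * vectorSize, ind * vectorSize + vectorSize)

  // Lazy variant: the view reads through to `flat`; no row array is
  // allocated until the view is forced (for example with toArray).
  def viewRow(flat: Array[Float], ind: Int, vectorSize: Int): Iterable[Float] =
    flat.view.slice(ind * vectorSize, ind * vectorSize + vectorSize)

  // Returns (copy contents, view contents) after mutating the backing array.
  def demo(): (List[Float], List[Float]) = {
    val flat = Array(1f, 2f, 3f, 4f, 5f, 6f) // two words, vectorSize = 3
    val copied = copyRow(flat, 1, 3)
    val viewed = viewRow(flat, 1, 3)
    flat(3) = 40f // visible through the view, not through the copy
    (copied.toList, viewed.toList)
  }
}
```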





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-07-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-124294661
  
I'm sorry about the long delay!  I'll take a look now.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-25 Thread sujkh85
Github user sujkh85 commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-115255982
  

NAVER - http://www.naver.com/


The mail you sent to su...@naver.com could not be delivered for the 
following reason:

The recipient has blocked your mail.




[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-25 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-115255530
  
ping @jkbradley Can you have a look? I think it is one pass away from a 
merge?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113629287
  
  [Test build #35309 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35309/console)
 for   PR 5748 at commit 
[`fa04313`](https://github.com/apache/spark/commit/fa043131902fd5633a2ecaf5651b3414bd728669).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  // class ParentClass(parentField: Int)`
  * `  // class ChildClass(childField: Int) extends ParentClass(1)`
  * `  // If the class type corresponding to current slot has 
writeObject() defined,`
  * `  // then its not obvious which fields of the class will be 
serialized as the writeObject()`
  * `case class Md5(child: Expression)`






[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113629316
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113608869
  
@jkbradley I just had a proper look at this after a long time.

I think this PR succeeds in preventing the huge Word2Vec map while 
constructing the Word2Vec model.

However, if the user provides a Word2Vec map by himself to construct the 
Word2Vec model (in the future,  since Word2Vec model is marked as 
private[mllib]), it creates a huge array of size numWords * numDims. Are we 
okay with that?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113608349
  
  [Test build #35309 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35309/consoleFull)
 for   PR 5748 at commit 
[`fa04313`](https://github.com/apache/spark/commit/fa043131902fd5633a2ecaf5651b3414bd728669).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113607327
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113607350
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113596922
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113596920
  
  [Test build #35302 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35302/console)
 for   PR 5748 at commit 
[`b1d61c4`](https://github.com/apache/spark/commit/b1d61c4e441d423782805dcadb017d723d812b79).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113596462
  
  [Test build #35302 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/35302/consoleFull)
 for   PR 5748 at commit 
[`b1d61c4`](https://github.com/apache/spark/commit/b1d61c4e441d423782805dcadb017d723d812b79).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113596253
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-113596278
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110255663
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110255651
  
  [Test build #34486 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34486/console)
 for   PR 5748 at commit 
[`14ee596`](https://github.com/apache/spark/commit/14ee5960ced3079231543dfe103075ae12e40e05).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110228546
  
  [Test build #34486 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/34486/consoleFull)
 for   PR 5748 at commit 
[`14ee596`](https://github.com/apache/spark/commit/14ee5960ced3079231543dfe103075ae12e40e05).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110228363
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110228369
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-08 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110228137
  
@jkbradley ping?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-08 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-110228122
  
jenkins retest this please





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-108029957
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-108029940
  
  [Test build #33999 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33999/consoleFull) for PR 5748 at commit [`14ee596`](https://github.com/apache/spark/commit/14ee5960ced3079231543dfe103075ae12e40e05).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-108007641
  
  [Test build #33999 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33999/consoleFull) for PR 5748 at commit [`14ee596`](https://github.com/apache/spark/commit/14ee5960ced3079231543dfe103075ae12e40e05).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-108007522
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-108007496
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-108007247
  
@jkbradley fixed!





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31537727
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -508,7 +507,7 @@ class Word2VecModel private[mllib] (
*/
   def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
 require(num > 0, "Number of similar words should > 0")
-
+// TODO: optimize top-k
--- End diff --

https://github.com/apache/spark/pull/5467#discussion_r29032366





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31537671
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -426,38 +422,40 @@ class Word2Vec extends Serializable with Logging {
 /**
  * :: Experimental ::
  * Word2Vec model
+ *
+ * @param wordIndex: Maps each word to an index, which can retrieve the corresponding
+ *   vector from wordVectors (see below).
+ * @param wordVectors: Array of length numWords * vectorSize, vector corresponding
+ * to the word mapped with index i can be retrieved by the slice
+ * (i * vectorSize, i * vectorSize + vectorSize)
  */
 @Experimental
 class Word2VecModel private[mllib] (
-model: Map[String, Array[Float]]) extends Serializable with Saveable {
-
-  // wordList: Ordered list of words obtained from model.
-  private val wordList: Array[String] = model.keys.toArray
+wordIndex: Map[String, Int],
+wordVectors: Array[Float]) extends Serializable with Saveable {
 
-  // wordIndex: Maps each word to an index, which can retrieve the corresponding
-  //vector from wordVectors (see below).
-  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
-
-  // vectorSize: Dimension of each word's vector.
-  private val vectorSize = model.head._2.size
   private val numWords = wordIndex.size
+  // vectorSize: Dimension of each word's vector.
+  private val vectorSize = wordVectors.length / numWords
+
+  // wordList: Ordered list of words obtained from wordIndex.
+  private val wordList: Array[String] = wordIndex.keys.toArray
--- End diff --

I hope all this sorting does not cause regressions :P





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-02 Thread MechCoder
Github user MechCoder commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31533274
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -400,17 +400,13 @@ class Word2Vec extends Serializable with Logging {
 }
 newSentences.unpersist()
 
-val word2VecMap = mutable.HashMap.empty[String, Array[Float]]
+val wordArray = new Array[String](vocabSize)
 var i = 0
 while (i < vocabSize) {
-  val word = bcVocab.value(i).word
-  val vector = new Array[Float](vectorSize)
-  Array.copy(syn0Global, i * vectorSize, vector, 0, vectorSize)
-  word2VecMap += word -> vector
+  wordArray(i) = bcVocab.value(i).word
--- End diff --

Hmm. I just followed the convention used before.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31455502
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -400,17 +400,13 @@ class Word2Vec extends Serializable with Logging {
 }
 newSentences.unpersist()
 
-val word2VecMap = mutable.HashMap.empty[String, Array[Float]]
+val wordArray = new Array[String](vocabSize)
 var i = 0
 while (i < vocabSize) {
-  val word = bcVocab.value(i).word
-  val vector = new Array[Float](vectorSize)
-  Array.copy(syn0Global, i * vectorSize, vector, 0, vectorSize)
-  word2VecMap += word -> vector
+  wordArray(i) = bcVocab.value(i).word
--- End diff --

This is executing on the driver, so it should not use broadcast variables.
Use ```vocab``` instead.  It could be shorter to do:
```
val wordArray = vocab.map(_.word)
```
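
For illustration only, a self-contained sketch of that suggestion. The `VocabWord` case class and the sample data here are simplified, hypothetical stand-ins and not the actual definitions in Word2Vec.scala; the point is just that a local array on the driver can be mapped directly and that the result keeps the vocabulary's index order.

```scala
// Hypothetical, simplified stand-in for the vocabulary entries in Word2Vec.scala.
case class VocabWord(word: String, count: Int)

object WordArrayExample {
  // Local (driver-side) vocabulary; no broadcast variable is needed to read it here.
  val vocab: Array[VocabWord] =
    Array(VocabWord("a", 3), VocabWord("b", 2), VocabWord("c", 1))

  // map preserves order, so wordArray(i) == vocab(i).word for every i.
  val wordArray: Array[String] = vocab.map(_.word)
}
```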





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31455507
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -426,38 +422,40 @@ class Word2Vec extends Serializable with Logging {
 /**
  * :: Experimental ::
  * Word2Vec model
+ *
+ * @param wordIndex: Maps each word to an index, which can retrieve the corresponding
+ *   vector from wordVectors (see below).
+ * @param wordVectors: Array of length numWords * vectorSize, vector corresponding
+ * to the word mapped with index i can be retrieved by the slice
+ * (i * vectorSize, i * vectorSize + vectorSize)
  */
 @Experimental
 class Word2VecModel private[mllib] (
-model: Map[String, Array[Float]]) extends Serializable with Saveable {
-
-  // wordList: Ordered list of words obtained from model.
-  private val wordList: Array[String] = model.keys.toArray
+wordIndex: Map[String, Int],
+wordVectors: Array[Float]) extends Serializable with Saveable {
 
-  // wordIndex: Maps each word to an index, which can retrieve the corresponding
-  //vector from wordVectors (see below).
-  private val wordIndex: Map[String, Int] = wordList.zip(0 until model.size).toMap
-
-  // vectorSize: Dimension of each word's vector.
-  private val vectorSize = model.head._2.size
   private val numWords = wordIndex.size
+  // vectorSize: Dimension of each word's vector.
+  private val vectorSize = wordVectors.length / numWords
+
+  // wordList: Ordered list of words obtained from wordIndex.
+  private val wordList: Array[String] = wordIndex.keys.toArray
--- End diff --

This should sort by the index (the ```_2``` of each ```wordIndex``` entry) to make
sure the order matches wordVectors.
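
A minimal sketch of what such a sort could look like, assuming a small hypothetical map (the names `WordListExample` and the sample entries are made up for illustration). A `Map`'s key iteration order is not guaranteed to follow the stored indices, so the entries are ordered by index before the words are extracted.

```scala
object WordListExample {
  // Hypothetical word-to-index map; iteration order need not follow the indices.
  val wordIndex: Map[String, Int] = Map("c" -> 2, "a" -> 0, "b" -> 1)

  // Sort entries by index so that wordList(i) is the word whose vector
  // starts at offset i * vectorSize in the flat wordVectors array.
  val wordList: Array[String] = wordIndex.toSeq.sortBy(_._2).map(_._1).toArray
}
```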





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31455519
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -508,7 +507,7 @@ class Word2VecModel private[mllib] (
*/
   def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
 require(num > 0, "Number of similar words should > 0")
-
+// TODO: optimize top-k
--- End diff --

Is there a JIRA for this?  If so, can you please note the JIRA number here?
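
For context on the TODO itself, here is one common way to select the top k scores without sorting the whole array, using a bounded min-heap. This is only a sketch under my own assumptions (the `TopKExample` object and `topK` signature are hypothetical) and is not necessarily the optimization the TODO has in mind.

```scala
import scala.collection.mutable

object TopKExample {
  // Returns the k highest-scoring (index, score) pairs in descending score order.
  def topK(scores: Array[Double], k: Int): Array[(Int, Double)] = {
    require(k > 0, "k should be > 0")
    // Reversed ordering turns the max-heap into a min-heap on score,
    // so heap.head is always the smallest score currently kept.
    val heap = mutable.PriorityQueue.empty[(Int, Double)](
      Ordering.by[(Int, Double), Double](_._2).reverse)
    var i = 0
    while (i < scores.length) {
      if (heap.size < k) {
        heap.enqueue((i, scores(i)))
      } else if (scores(i) > heap.head._2) {
        // Replace the current smallest kept score with the larger one.
        heap.dequeue()
        heap.enqueue((i, scores(i)))
      }
      i += 1
    }
    // Only k elements remain, so this final sort is cheap.
    heap.toArray.sortBy(-_._2)
  }
}
```

This keeps at most k elements in memory and costs O(n log k) instead of the O(n log n) of a full sort.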





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31455505
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -426,38 +422,40 @@ class Word2Vec extends Serializable with Logging {
 /**
  * :: Experimental ::
  * Word2Vec model
+ *
+ * @param wordIndex: Maps each word to an index, which can retrieve the corresponding
+ *   vector from wordVectors (see below).
+ * @param wordVectors: Array of length numWords * vectorSize, vector corresponding
+ * to the word mapped with index i can be retrieved by the slice
+ * (i * vectorSize, i * vectorSize + vectorSize)
  */
 @Experimental
 class Word2VecModel private[mllib] (
-model: Map[String, Array[Float]]) extends Serializable with Saveable {
-
-  // wordList: Ordered list of words obtained from model.
-  private val wordList: Array[String] = model.keys.toArray
+wordIndex: Map[String, Int],
--- End diff --

Make this and wordVectors private vals
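
A rough sketch of the pattern, assuming a hypothetical `FlatWordModel` class rather than the real `Word2VecModel`: both constructor parameters are declared as `private val`, and a lookup reads the slice `(i * vectorSize, i * vectorSize + vectorSize)` described in the doc comment above.

```scala
// Hypothetical class; it only illustrates private-val constructor parameters
// and the flat-array layout, not the actual Word2VecModel API.
class FlatWordModel(
    private val wordIndex: Map[String, Int],
    private val wordVectors: Array[Float]) extends Serializable {

  private val numWords: Int = wordIndex.size
  require(numWords > 0 && wordVectors.length % numWords == 0,
    "wordVectors length should be a positive multiple of the number of words")

  // vectorSize: dimension of each word's vector.
  private val vectorSize: Int = wordVectors.length / numWords

  // Returns a copy of the vector for the given word, if the word is known.
  def vectorOf(word: String): Option[Array[Float]] =
    wordIndex.get(word).map { i =>
      wordVectors.slice(i * vectorSize, i * vectorSize + vectorSize)
    }
}
```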





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-06-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/5748#discussion_r31455524
  
--- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -38,6 +38,13 @@ class Word2VecSuite extends FunSuite with MLlibTestSparkContext {
 assert(syms.length == 2)
 assert(syms(0)._1 == "b")
 assert(syms(1)._1 == "c")
+
+val word2VecMap = model.getVectors
+val newModel = new Word2VecModel(word2VecMap)
+val newSyms = newModel.findSynonyms("a", 2)
--- End diff --

Instead of testing newModel like this, can you just compare the model data 
with the original model?
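
One possible shape for such a comparison, sketched with a hypothetical helper (`ModelDataComparison.sameVectors` is not part of Spark). It assumes the model data is available as a `Map[String, Array[Float]]`, as returned by `getVectors` in the diff above. Scala arrays compare by reference under `==`, so the vectors are compared element-wise.

```scala
object ModelDataComparison {
  // True when both maps hold the same words and element-wise equal vectors.
  def sameVectors(
      a: Map[String, Array[Float]],
      b: Map[String, Array[Float]]): Boolean =
    a.keySet == b.keySet &&
      a.forall { case (word, vec) => vec.sameElements(b(word)) }
}
```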





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-05-20 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-104109512
  
@jkbradley ping?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-05-08 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-100158192
  
@jkbradley can you have a look at this too, even if it won't be in this
release?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-29 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97523885
  
The code cutoff is this Friday





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-29 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97517846
  
when is the release scheduled?





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-29 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97516897
  
I'll try to review this before the code cutoff, but it might slip to 1.5.  
I think that's OK since it's an internal improvement.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97303637
  
Btw, I addressed the minor comments in this.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97193688
  
  [Test build #31150 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31150/consoleFull) for PR 5748 at commit [`a17d9c9`](https://github.com/apache/spark/commit/a17d9c9ec568bca12f884720d7685176ce07d7d6).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
 * This patch does not change any dependencies.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97193722
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97193731
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31150/
Test PASSed.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97165628
  
  [Test build #31150 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/31150/consoleFull) for PR 5748 at commit [`a17d9c9`](https://github.com/apache/spark/commit/a17d9c9ec568bca12f884720d7685176ce07d7d6).





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97165262
  
Merged build started.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97165242
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread MechCoder
Github user MechCoder commented on the pull request:

https://github.com/apache/spark/pull/5748#issuecomment-97164779
  
cc @mengxr @jkbradley





[GitHub] spark pull request: [SPARK-7045] [MLlib] Avoid intermediate repres...

2015-04-28 Thread MechCoder
GitHub user MechCoder opened a pull request:

https://github.com/apache/spark/pull/5748

[SPARK-7045] [MLlib] Avoid intermediate representation when creating model

Word2Vec used to convert from an Array[Float] representation to a 
Map[String, Array[Float]] and then back to an Array[Float] through 
Word2VecModel.

This PR avoids that round trip while still supporting the older method of 
supplying a Map.
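
To make the two representations concrete, here is a small sketch of how a supplied Map could be flattened into an index plus one contiguous array. The `FlattenExample.flatten` helper is hypothetical and only illustrates the layout; it is not the code in this patch.

```scala
object FlattenExample {
  // Converts word -> vector into (word -> index, flat array), where the vector
  // for index i occupies the slice (i * vectorSize, i * vectorSize + vectorSize).
  def flatten(model: Map[String, Array[Float]]): (Map[String, Int], Array[Float]) = {
    require(model.nonEmpty, "model should contain at least one word")
    val vectorSize = model.head._2.length
    val words = model.keys.toArray
    val wordIndex = words.zipWithIndex.toMap
    val wordVectors = new Array[Float](words.length * vectorSize)
    wordIndex.foreach { case (word, i) =>
      // Copy each word's vector into its slot in the flat array.
      Array.copy(model(word), 0, wordVectors, i * vectorSize, vectorSize)
    }
    (wordIndex, wordVectors)
  }
}
```

Training already produces the flat array, so passing it straight to the model skips building one small array per word and then copying them all back.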

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/MechCoder/spark spark-7045

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5748.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5748


commit a17d9c9ec568bca12f884720d7685176ce07d7d6
Author: MechCoder 
Date:   2015-04-28T18:23:15Z

[SPARK-7045] Avoid intermediate representation when creating model



