[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-09-08 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/17451


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-09-06 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r137368179
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
 wordVectors.findSynonyms(word, num)
   }
 
+  /**
--- End diff --

Instead of using `_call_java`, you can call 
`self._java_obj.findSynonymsArray` and then call `list()` on the result, which 
gives you something like `[JavaObject id=o86, JavaObject id=o87]`. You can then 
do something like `map(lambda st: (st._1(), st._2()), list(tuples))`, which 
gives you `[(u'b', 0.25053444504737854), (u'c', -0.6980510950088501)]`.
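
A minimal sketch of the conversion described in this comment, runnable without Spark. `FakeTuple2` and `tuples_to_pairs` are hypothetical names: `FakeTuple2` only stands in for a Py4J proxy of a `scala.Tuple2` (anything exposing `_1()` and `_2()`), and in a real session the input would presumably come from `list(self._java_obj.findSynonymsArray(word, num))`, which is not exercised here.

```python
class FakeTuple2(object):
    """Hypothetical stand-in for a Py4J proxy of scala.Tuple2."""

    def __init__(self, first, second):
        self._first = first
        self._second = second

    def _1(self):
        return self._first

    def _2(self):
        return self._second


def tuples_to_pairs(java_tuples):
    """Convert objects exposing _1()/_2() into plain (word, similarity) pairs."""
    return [(t._1(), t._2()) for t in java_tuples]


# Values taken from the example output quoted in the comment.
pairs = tuples_to_pairs([FakeTuple2("b", 0.25053444504737854),
                         FakeTuple2("c", -0.6980510950088501)])
```

A list comprehension is used instead of `map` so the result is a list on both Python 2 and Python 3.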



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-07-03 Thread keypointt

Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125339579
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
 wordVectors.findSynonyms(word, num)
   }
 
+  /**
--- End diff --

Hi Holden, I tried calling the original Scala `findSynonymsArray()` from the 
Python side:
```
>>> from pyspark.ml.feature import Word2Vec
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")
>>> df = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence",
...                     outputCol="model")
>>> model = word2Vec.fit(df)
>>> a = model.findSynonymsArray("a", 2)
```
Python gets back a list of dicts, and `_1()` and `_2()` cannot reach the 
actual data; all I get is the string `u'scala.Tuple2'`, as shown below. 

Maybe I'm missing something here? Could you please help with how to get the 
data? Thanks a lot.
```
>>> a
[{u'__class__': u'scala.Tuple2'}, {u'__class__': u'scala.Tuple2'}]
>>> len(a)
2
>>> a[0]
{u'__class__': u'scala.Tuple2'}
>>> for e in a[0]:
...     print ''.join(a[0][e])
...
scala.Tuple2
>>> for e in a[0]:
...     print a[0][e]._1()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AttributeError: 'unicode' object has no attribute '_1'
>>> for e in a[0]:
...     print a[0][e]._2()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
AttributeError: 'unicode' object has no attribute '_2'
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-07-01 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125170914
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
 wordVectors.findSynonyms(word, num)
   }
 
+  /**
--- End diff --

Yes, so as I mentioned, you could use the map function with `_1()` and 
`_2()` to do the conversion entirely on the Python side.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-07-01 Thread keypointt

Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125158303
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
 wordVectors.findSynonyms(word, num)
   }
 
+  /**
--- End diff --

Actually, I tried calling `findSynonymsArray()` from Python, but I got the 
result below, which has no data:
```
>>> model.findSynonymsArray("a", 2)
[{u'__class__': u'scala.Tuple2'}, {u'__class__': u'scala.Tuple2'}]
```
I posted this a while ago at 
https://github.com/apache/spark/pull/17451#issuecomment-290951029
which is why I switched to creating a new method in Scala.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-07-01 Thread keypointt

Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125158304
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
             word = _convert_to_vector(word)
         return self._call_java("findSynonyms", word, num)
 
+    @since("2.2.0")
+    def findSynonymsArray(self, word, num):
+        """
+        Find "num" number of words closest in similarity to "word".
+        word can be a string or vector representation.
--- End diff --

sure, will do



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-30 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125154011
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
             word = _convert_to_vector(word)
         return self._call_java("findSynonyms", word, num)
 
+    @since("2.2.0")
+    def findSynonymsArray(self, word, num):
+        """
+        Find "num" number of words closest in similarity to "word".
+        word can be a string or vector representation.
--- End diff --

can you add a test for the vector representation as well?



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-30 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125154035
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
 wordVectors.findSynonyms(word, num)
   }
 
+  /**
--- End diff --

This seems a little weird. It feels like it would be more natural to call 
`findSynonymsArray` from Python and then do the map in Python, but I guess this 
might be a little faster.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-30 Thread holdenk

Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r125154018
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
             word = _convert_to_vector(word)
         return self._call_java("findSynonyms", word, num)
 
+    @since("2.2.0")
+    def findSynonymsArray(self, word, num):
+        """
+        Find "num" number of words closest in similarity to "word".
+        word can be a string or vector representation.
+        Returns an array with two fields word and similarity (which
+        gives the cosine similarity).
+        """
+        if not isinstance(word, basestring):
+            word = _convert_to_vector(word)
+        tupleOfArray = self._call_java("findSynonymsTuple", word, num)
+        arrayOfTuple = list(zip(tupleOfArray._1(), tupleOfArray._2()))
--- End diff --

I'm glad this approach worked.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123951868
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -538,6 +538,19 @@ def test_rformula_force_index_label(self):
         transformedDF2 = model2.transform(df)
         self.assertEqual(transformedDF2.head().label, 0.0)
 
+    def test_findSynonyms(self):
--- End diff --

I think it will suffice to just add this to the doc test?



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123951212
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
--- End diff --

No since tag



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123951200
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
+    val result = findSynonymsArray(vec, num)
+    (result.map(e => e._1), result.map(e => e._2))
+  }
+
+  /**
+   * Find "num" number of words closest in similarity to the given word, not
+   * including the word itself.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
--- End diff --

No since tag is required as this won't be a public method on the Scala side



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123950944
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
--- End diff --

Package private, so `private[feature]`.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123950871
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -230,7 +230,7 @@ class Word2VecModel private[ml] (
    * Find "num" number of words closest in similarity to the given word, not
    * including the word itself.
    * @return a dataframe with columns "word" and "similarity" of the word and the cosine
-   * similarities between the synonyms and the given word vector.
+   * similarities between the synonyms and the given word string.
--- End diff --

You can just say "given word" if you're going to change the comment - drop 
the "string"



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread keypointt

Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123950209
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
--- End diff --

Hi Yanbo, I also tried making this method `private`, but that did not work, 
since `findSynonymsTuple()` is called from the Python client via 
`self._call_java("findSynonymsTuple", word, num)`.

Or am I missing something here? Should I use `private[feature]` or 
`private[ml]`? I'm not quite sure. Thanks a lot.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-26 Thread keypointt

Github user keypointt commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123940294
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
--- End diff --

what annotation should I use? `@note`?




[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-25 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123925184
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2869,18 @@ def findSynonyms(self, word, num):
             word = _convert_to_vector(word)
         return self._call_java("findSynonyms", word, num)
 
+    @since("2.2.0")
+    def findSynonymsTuple(self, word, num):
--- End diff --

`findSynonymsTuple` -> `findSynonymsArray`: we should keep the same 
function name and return type as Scala.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-25 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123925233
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2869,6 +2869,18 @@ def findSynonyms(self, word, num):
             word = _convert_to_vector(word)
         return self._call_java("findSynonyms", word, num)
 
+    @since("2.2.0")
+    def findSynonymsTuple(self, word, num):
+        """
+        Find "num" number of words closest in similarity to "word".
+        word can be a string or vector representation.
+        Returns an array with two fields word and similarity (which
+        gives the cosine similarity).
+        """
+        if not isinstance(word, basestring):
+            word = _convert_to_vector(word)
+        return self._call_java("findSynonymsTuple", word, num)
+
--- End diff --

We need to convert the result back to an array of tuples, which would be 
consistent with the Scala output.
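
A minimal sketch of that conversion, assuming the Scala helper hands back a pair of parallel sequences (words, similarities). `to_array_of_tuples` is a hypothetical name, the length check is an added safeguard rather than something from the PR, and plain Python lists stand in for the Py4J result.

```python
def to_array_of_tuples(words, similarities):
    """Pair up two parallel sequences into a list of (word, similarity) tuples."""
    if len(words) != len(similarities):
        raise ValueError("words and similarities must have the same length")
    return list(zip(words, similarities))


# Illustrative values only; they are not taken from a real model.
result = to_array_of_tuples(["b", "c"], [0.25, -0.70])
```

Wrapping `zip` in `list()` matters on Python 3, where `zip` returns a lazy iterator rather than a list.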



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-25 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123925086
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
+    val result = findSynonymsArray(vec, num)
+    (result.map(e => e._1), result.map(e => e._2))
+  }
+
+  /**
+   * Find "num" number of words closest in similarity to the given word, not
+   * including the word itself.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(word: String, num: Int): (Array[String], Array[Double]) = {
--- End diff --

Ditto, should be private.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-06-25 Thread yanboliang

Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r123925064
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
     wordVectors.findSynonyms(word, num)
   }
 
+  /**
+   * Find "num" number of words whose vector representation is most similar to the supplied vector.
+   * If the supplied vector is the vector representation of a word in the model's vocabulary,
+   * that word will be in the results.
+   * @return a tuple of the words list and the cosine similarities list between the synonyms given
+   * word vector.
+   */
+  @Since("2.2.0")
+  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
--- End diff --

This should be private. Also, add an annotation to clarify that this is only 
a Java stub for the Python bindings.



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-03-28 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r108359620
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -389,6 +389,21 @@ def test_word2vec_param(self):
         # Check windowSize is set properly
         self.assertEqual(model.getWindowSize(), 6)
 
+    def test_findSynonyms(self):
+        sent = ("a b " * 100 + "a c " * 10).split(" ")
+        doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
+        word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
+        model = word2Vec.fit(doc)
+        model.getVectors().show()
+
+        from pyspark.sql.functions import format_number as fmt
+        from pyspark.ml.linalg import Vector
+        from pyspark.ml.linalg import Vectors
+        model.findSynonyms("a", 2).select("word", fmt("similarity", 5).alias("similarity")).show()
+
+        # model.findSynonymsArray(["a"], 2).select("word", fmt("similarity", 5).alias("similarity")).show()
+        model.findSynonymsArray(Vectors.dense("a"), 2).select("word", fmt("similarity", 5).alias("similarity")).show()
--- End diff --

As per style tests this line is too long. Also why are you creating a 
vector of "a"?



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-03-28 Thread MLnick

Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/17451#discussion_r108359474
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2674,6 +2674,18 @@ def findSynonyms(self, word, num):
             word = _convert_to_vector(word)
         return self._call_java("findSynonyms", word, num)
 
+    @since("2.2.0")
+    def findSynonymsArray(self, wordVector, num):
+        """
+        Find "num" number of words closest in similarity to "word".
+        word can be a string or vector representation.
+        Returns a dataframe with two fields word and similarity (which
+        gives the cosine similarity).
+        """
+        # if not isinstance(wordVector, basestring):
--- End diff --

Why is this commented out? There are two versions of `findSynonymsArray` on 
the Scala side just like there are for `findSynonyms` - we need it to work for 
both a string and vector input.
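
A sketch of the dispatch being asked for, runnable without Spark. `normalize_query` and its `to_vector` parameter are hypothetical; in the PR the conversion is done by `_convert_to_vector` and the check is against `basestring` (Python 2), whereas this sketch checks `str` and takes the converter as an argument so it can run standalone.

```python
def normalize_query(word, to_vector):
    """Return a string query unchanged; convert anything else with to_vector."""
    if isinstance(word, str):
        return word
    return to_vector(word)


# tuple() stands in for a real vector conversion in this sketch.
as_string = normalize_query("a", tuple)
as_vector = normalize_query([0.1, 0.2], tuple)
```

With this shape, a single Python method can forward either kind of argument to the matching Scala overload.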



[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...

2017-03-28 Thread keypointt

GitHub user keypointt opened a pull request:

https://github.com/apache/spark/pull/17451

[SPARK-19866][ML][PySpark] Add local version of Word2Vec findSynonyms for 
spark.ml: Python API

https://issues.apache.org/jira/browse/SPARK-19866

## What changes were proposed in this pull request?

Add Python API for findSynonymsArray matching Scala API.

## How was this patch tested?

Manual test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/keypointt/spark SPARK-19866

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17451.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17451


commit f84be09782e74326758a4cba420b150c5d86449a
Author: Xin Ren 
Date:   2017-03-23T04:04:00Z

[SPARK-19866] expose findSynonymsArray()




