[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/17451 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r137368179

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    --- End diff --

So instead of using `_call_java` you can use `self._java_obj.findSynonymsArray` and then call `list()` on the result, which will give you something like `[JavaObject id=o86, JavaObject id=o87]`. You can then do something like `map(lambda st: (st._1(), st._2()), list(tuples))`, which gives you `[(u'b', 0.25053444504737854), (u'c', -0.6980510950088501)]`.
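The conversion described in the comment above can be tried without a running JVM. This is only a sketch: `FakeTuple2` and `tuples_to_pairs` are hypothetical names, and `FakeTuple2` merely imitates the `_1()` / `_2()` accessors that the comment says the py4j proxy objects expose. The sample values are copied from the comment.

```python
class FakeTuple2:
    """Hypothetical stand-in for a py4j proxy of scala.Tuple2 (not a Spark class)."""

    def __init__(self, first, second):
        self._first = first
        self._second = second

    def _1(self):
        # Mirrors Tuple2._1 on the JVM side: the word.
        return self._first

    def _2(self):
        # Mirrors Tuple2._2 on the JVM side: the similarity score.
        return self._second


def tuples_to_pairs(java_tuples):
    """Turn an iterable of Tuple2-like proxies into a list of (word, similarity) pairs."""
    # A list comprehension is used so the result is a real list on Python 3,
    # where map() alone would return a lazy iterator.
    return [(t._1(), t._2()) for t in list(java_tuples)]


proxies = [
    FakeTuple2("b", 0.25053444504737854),
    FakeTuple2("c", -0.6980510950088501),
]
pairs = tuples_to_pairs(proxies)
print(pairs)
```

With real proxies, the same function would be applied to whatever `self._java_obj.findSynonymsArray(...)` returns, assuming the accessors behave as the comment describes.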
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125339579

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    --- End diff --

Hi Holden, I tried calling the original Scala `findSynonymsArray()` from the Python side:

```
>>> from pyspark.ml.feature import Word2Vec
>>> sent = ("a b " * 100 + "a c " * 10).split(" ")
>>> df = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
>>> word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
>>> model = word2Vec.fit(df)
>>> a = model.findSynonymsArray("a", 2)
```

Python gets back a list of dicts, as shown below. `_1()` and `_2()` cannot reach the actual data; all I get is the string `u'scala.Tuple2'`. Maybe I'm missing something here? Could you please help with how to get the data? Thanks a lot.

```
>>> a
[{u'__class__': u'scala.Tuple2'}, {u'__class__': u'scala.Tuple2'}]
>>> len(a)
2
>>> a[0]
{u'__class__': u'scala.Tuple2'}
>>> for e in a[0]:
...     print ''.join(a[0][e])
...
scala.Tuple2
>>> for e in a[0]:
...     print a[0][e]._1()
...
Traceback (most recent call last):
  File "", line 2, in
AttributeError: 'unicode' object has no attribute '_1'
>>> for e in a[0]:
...     print a[0][e]._2()
...
Traceback (most recent call last):
  File "", line 2, in
AttributeError: 'unicode' object has no attribute '_2'
```

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125170914

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    --- End diff --

Yes, so as I mentioned, you could use the map function with `_1()` and `_2()` to do the conversion entirely on the Python side.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125158303

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    --- End diff --

Actually, I did try calling `findSynonymsArray` from Python, but the result I got back in Python contains no data:

```
>>> model.findSynonymsArray("a", 2)
[{u'__class__': u'scala.Tuple2'}, {u'__class__': u'scala.Tuple2'}]
```

I posted this a while ago at https://github.com/apache/spark/pull/17451#issuecomment-290951029, and that's why I switched to creating a new method in Scala.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125158304

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
                 word = _convert_to_vector(word)
             return self._call_java("findSynonyms", word, num)

    +    @since("2.2.0")
    +    def findSynonymsArray(self, word, num):
    +        """
    +        Find "num" number of words closest in similarity to "word".
    +        word can be a string or vector representation.
    --- End diff --

sure, will do
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125154011

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
                 word = _convert_to_vector(word)
             return self._call_java("findSynonyms", word, num)

    +    @since("2.2.0")
    +    def findSynonymsArray(self, word, num):
    +        """
    +        Find "num" number of words closest in similarity to "word".
    +        word can be a string or vector representation.
    --- End diff --

can you add a test for the vector representation as well?
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125154035

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,29 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    --- End diff --

This seems a little weird; it feels like it would be more natural to call `findSynonymsArray` from Python and then do the map in Python, but I guess this might be a little faster.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user holdenk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r125154018

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2869,6 +2871,20 @@ def findSynonyms(self, word, num):
                 word = _convert_to_vector(word)
             return self._call_java("findSynonyms", word, num)

    +    @since("2.2.0")
    +    def findSynonymsArray(self, word, num):
    +        """
    +        Find "num" number of words closest in similarity to "word".
    +        word can be a string or vector representation.
    +        Returns an array with two fields word and similarity (which
    +        gives the cosine similarity).
    +        """
    +        if not isinstance(word, basestring):
    +            word = _convert_to_vector(word)
    +        tupleOfArray = self._call_java("findSynonymsTuple", word, num)
    +        arrayOfTuple = list(zip(tupleOfArray._1(), tupleOfArray._2()))
    --- End diff --

I'm glad this approach worked.
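The round trip in the diff above (the Scala helper splits the result into two parallel arrays, and the Python wrapper zips them back into pairs) can be illustrated in plain Python. This is a sketch only: `unzip_pairs` and `zip_arrays` are hypothetical names, the sample similarities are invented, and no Spark or py4j objects are involved.

```python
def unzip_pairs(pairs):
    """Split [(word, sim), ...] into two parallel lists.

    Plays the role of the Scala helper's two map calls
    (result.map(e => e._1), result.map(e => e._2)).
    """
    words = [p[0] for p in pairs]
    sims = [p[1] for p in pairs]
    return words, sims


def zip_arrays(words, sims):
    """Rebuild the list of (word, similarity) pairs, as the wrapper does with zip()."""
    # list(...) is needed on Python 3, where zip() returns a lazy iterator.
    return list(zip(words, sims))


original = [("b", 0.25), ("c", -0.7)]   # made-up similarities for illustration
words, sims = unzip_pairs(original)
rebuilt = zip_arrays(words, sims)
print(words, sims)
print(rebuilt == original)
```

The thread reports that an array of `Tuple2` reached Python as opaque `{u'__class__': u'scala.Tuple2'}` dicts through `_call_java`, which is the stated motivation for shipping two arrays and pairing them up on the Python side.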
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123951868

    --- Diff: python/pyspark/ml/tests.py ---
    @@ -538,6 +538,19 @@ def test_rformula_force_index_label(self):
             transformedDF2 = model2.transform(df)
             self.assertEqual(transformedDF2.head().label, 0.0)

    +    def test_findSynonyms(self):
    --- End diff --

I think it will suffice to just add this to the doc test?
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123951212

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    --- End diff --

No since tag
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123951200

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
    +    val result = findSynonymsArray(vec, num)
    +    (result.map(e => e._1), result.map(e => e._2))
    +  }
    +
    +  /**
    +   * Find "num" number of words closest in similarity to the given word, not
    +   * including the word itself.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    --- End diff --

No since tag is required as this won't be a public method on the Scala side
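The two doc comments in the diff above describe different result semantics: a query by vector may return the word whose vector was supplied, while a query by word leaves that word out. A toy pure-Python sketch of that difference follows. It is not Spark's implementation; the function names, the three-word vocabulary, and the two-dimensional vectors are all made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_synonyms_by_vector(vectors, query, num):
    """Return the num (word, similarity) pairs closest to query, best first.

    If query is itself the vector of a vocabulary word, that word is included.
    """
    scored = [(word, cosine(vec, query)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


def find_synonyms_by_word(vectors, word, num):
    """Same ranking, but the query word itself is dropped from the results."""
    # Ask for one extra candidate so num results remain after filtering.
    scored = find_synonyms_by_vector(vectors, vectors[word], num + 1)
    return [(w, s) for w, s in scored if w != word][:num]


toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}  # invented vectors
by_vec = find_synonyms_by_vector(toy, toy["a"], 2)
by_word = find_synonyms_by_word(toy, "a", 2)
print([w for w, _ in by_vec])
print([w for w, _ in by_word])
```

Here the vector query ranks "a" first because its similarity to itself is 1.0, whereas the word query returns only the other two words.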
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123950944

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
    --- End diff --

package private, so `private[feature]`
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123950871

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -230,7 +230,7 @@ class Word2VecModel private[ml] (
        * Find "num" number of words closest in similarity to the given word, not
        * including the word itself.
        * @return a dataframe with columns "word" and "similarity" of the word and the cosine
    -   * similarities between the synonyms and the given word vector.
    +   * similarities between the synonyms and the given word string.
    --- End diff --

You can just say "given word" if you're going to change the comment - drop the "string"
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123950209

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
    --- End diff --

Hi Yanbo, I also tried making this method `private`, but that did not work, since `findSynonymsTuple()` is exposed to the Python client through `self._call_java("findSynonymsTuple", word, num)`. Or am I missing something here? Should I use `private[feature]` or `private[ml]`? I'm not quite sure. Thanks a lot.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user keypointt commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123940294

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
    --- End diff --

what annotation should I use? `@note`?
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123925184

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2869,6 +2869,18 @@ def findSynonyms(self, word, num):
                 word = _convert_to_vector(word)
             return self._call_java("findSynonyms", word, num)

    +    @since("2.2.0")
    +    def findSynonymsTuple(self, word, num):
    --- End diff --

`findSynonymsTuple` -> `findSynonymsArray`; we should keep the same function name and return type as Scala.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123925233

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2869,6 +2869,18 @@ def findSynonyms(self, word, num):
                 word = _convert_to_vector(word)
             return self._call_java("findSynonyms", word, num)

    +    @since("2.2.0")
    +    def findSynonymsTuple(self, word, num):
    +        """
    +        Find "num" number of words closest in similarity to "word".
    +        word can be a string or vector representation.
    +        Returns an array with two fields word and similarity (which
    +        gives the cosine similarity).
    +        """
    +        if not isinstance(word, basestring):
    +            word = _convert_to_vector(word)
    +        return self._call_java("findSynonymsTuple", word, num)
    +
    --- End diff --

We need to convert the result back to an array of tuples, which would be consistent with the Scala output.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123925086

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
    +    val result = findSynonymsArray(vec, num)
    +    (result.map(e => e._1), result.map(e => e._2))
    +  }
    +
    +  /**
    +   * Find "num" number of words closest in similarity to the given word, not
    +   * including the word itself.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(word: String, num: Int): (Array[String], Array[Double]) = {
    --- End diff --

Ditto, should be private.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r123925064

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala ---
    @@ -274,6 +274,31 @@ class Word2VecModel private[ml] (
         wordVectors.findSynonyms(word, num)
       }

    +  /**
    +   * Find "num" number of words whose vector representation is most similar to the supplied vector.
    +   * If the supplied vector is the vector representation of a word in the model's vocabulary,
    +   * that word will be in the results.
    +   * @return a tuple of the words list and the cosine similarities list between the synonyms given
    +   * word vector.
    +   */
    +  @Since("2.2.0")
    +  def findSynonymsTuple(vec: Vector, num: Int): (Array[String], Array[Double]) = {
    --- End diff --

This should be private. Meanwhile, add an annotation to clarify that this is only a Java stub for the Python bindings.
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r108359620

    --- Diff: python/pyspark/ml/tests.py ---
    @@ -389,6 +389,21 @@ def test_word2vec_param(self):
             # Check windowSize is set properly
             self.assertEqual(model.getWindowSize(), 6)

    +    def test_findSynonyms(self):
    +        sent = ("a b " * 100 + "a c " * 10).split(" ")
    +        doc = spark.createDataFrame([(sent,), (sent,)], ["sentence"])
    +        word2Vec = Word2Vec(vectorSize=5, seed=42, inputCol="sentence", outputCol="model")
    +        model = word2Vec.fit(doc)
    +        model.getVectors().show()
    +
    +        from pyspark.sql.functions import format_number as fmt
    +        from pyspark.ml.linalg import Vector
    +        from pyspark.ml.linalg import Vectors
    +        model.findSynonyms("a", 2).select("word", fmt("similarity", 5).alias("similarity")).show()
    +
    +        # model.findSynonymsArray(["a"], 2).select("word", fmt("similarity", 5).alias("similarity")).show()
    +        model.findSynonymsArray(Vectors.dense("a"), 2).select("word", fmt("similarity", 5).alias("similarity")).show()
    --- End diff --

As per style tests this line is too long. Also why are you creating a vector of "a"?
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
Github user MLnick commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17451#discussion_r108359474

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2674,6 +2674,18 @@ def findSynonyms(self, word, num):
                 word = _convert_to_vector(word)
             return self._call_java("findSynonyms", word, num)

    +    @since("2.2.0")
    +    def findSynonymsArray(self, wordVector, num):
    +        """
    +        Find "num" number of words closest in similarity to "word".
    +        word can be a string or vector representation.
    +        Returns a dataframe with two fields word and similarity (which
    +        gives the cosine similarity).
    +        """
    +        # if not isinstance(wordVector, basestring):
    --- End diff --

Why is this commented out? There are two versions of `findSynonymsArray` on the Scala side, just as there are for `findSynonyms`; we need it to work for both a string and a vector input.
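The string-or-vector dispatch that the comment above asks to restore can be sketched on its own. This is an illustration under assumptions: `convert_to_vector` is a hypothetical stand-in for PySpark's private `_convert_to_vector` helper, `normalize_query` is a made-up name, and the check uses `str` because Python 3 has no `basestring`.

```python
def convert_to_vector(value):
    """Hypothetical stand-in for PySpark's private _convert_to_vector helper."""
    return [float(x) for x in value]


def normalize_query(word):
    """Leave string queries alone; coerce anything else to a vector.

    The JVM side can then pick the String or the Vector overload
    from the type of the argument it receives.
    """
    if not isinstance(word, str):
        word = convert_to_vector(word)
    return word


string_query = normalize_query("a")
vector_query = normalize_query((1, 0, 2))
print(string_query)
print(vector_query)
```

With the check commented out, as in the reviewed diff, a string and a vector would be passed through identically, so only one of the two Scala overloads could ever be reached correctly.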
[GitHub] spark pull request #17451: [SPARK-19866][ML][PySpark] Add local version of W...
GitHub user keypointt opened a pull request:

    https://github.com/apache/spark/pull/17451

[SPARK-19866][ML][PySpark] Add local version of Word2Vec findSynonyms for spark.ml: Python API

https://issues.apache.org/jira/browse/SPARK-19866

## What changes were proposed in this pull request?

Add Python API for findSynonymsArray matching Scala API.

## How was this patch tested?

Manual test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/keypointt/spark SPARK-19866

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17451.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17451

commit f84be09782e74326758a4cba420b150c5d86449a
Author: Xin Ren
Date: 2017-03-23T04:04:00Z

    [SPARK-19866] expose findSynonymsArray()