[GitHub] spark pull request: [SPARK-12153][MLlib]add support of arbitrary l...

ygcao Tue, 15 Dec 2015 23:09:07 -0800

Github user ygcao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10152#discussion_r47742736
  
    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -469,13 +495,13 @@ class Word2VecModel private[spark] (
         this(Word2VecModel.buildWordIndex(model), Word2VecModel.buildWordVectors(model))
       }
     
    -  private def cosineSimilarity(v1: Array[Float], v2: Array[Float]): Double = {
    -    require(v1.length == v2.length, "Vectors should have the same length")
    -    val n = v1.length
    -    val norm1 = blas.snrm2(n, v1, 1)
    -    val norm2 = blas.snrm2(n, v2, 1)
    -    if (norm1 == 0 || norm2 == 0) return 0.0
    -    blas.sdot(n, v1, 1, v2, 1) / norm1 / norm2
    +  /**
    +   * get the built vocabulary from the input
    +   * this is useful for getting the whole vocabulary to join with other data or filtering other data
    +   * @return a map of word to its index
    +   */
    +  def getVocabulary: Map[String, Int] = {
    --- End diff --
    
    Never mind. Looking more carefully, I found that I had been confused by
    Scala's syntactic sugar.
    When I call getVectors("word"), it returns a vector after a couple of
    seconds, but it is actually doing two things implicitly: it first outputs
    the map for the entire vocabulary, and only then looks the word up in that
    map. So the performance issue I found was also an illusion, since the call
    really is doing heavy work.
    I have removed the unnecessary getters, which were designed to work around
    a fake problem.
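
To make the confusion concrete, here is a minimal sketch of the sugar involved. It assumes a parameterless getVectors that rebuilds a map over the whole vocabulary on every call, as described above. ToyModel, its fields, and the builds counter are hypothetical stand-ins for illustration, not Spark's actual Word2VecModel.

```scala
// Hypothetical stand-in for a model; not Spark's Word2VecModel.
class ToyModel(wordIndex: Map[String, Int], flat: Array[Float], dim: Int) {
  // Counts how often the full map is rebuilt, to make the cost visible.
  var builds = 0

  // Parameterless method: every call builds a map over the whole vocabulary.
  def getVectors: Map[String, Array[Float]] = {
    builds += 1
    wordIndex.map { case (word, i) =>
      word -> flat.slice(i * dim, (i + 1) * dim)
    }
  }
}

object SugarDemo {
  def main(args: Array[String]): Unit = {
    val model = new ToyModel(Map("a" -> 0, "b" -> 1), Array(1f, 2f, 3f, 4f), 2)

    // Reads like a keyed lookup, but desugars to model.getVectors.apply("b"),
    // so the whole map is built before the single entry is read.
    val viaSugar = model.getVectors("b")
    println(s"vector: ${viaSugar.mkString(", ")}, map builds: ${model.builds}")

    // Building the map once and reusing it avoids the repeated work.
    val vectors = model.getVectors
    val first = vectors("a")
    val second = vectors("b")
    println(s"lookups: ${first.length + second.length}, map builds: ${model.builds}")
  }
}
```

If many lookups are needed, holding on to the result of getVectors in a local value, as in the second half of the sketch, pays the construction cost only once.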


