Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15150#discussion_r79637489
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
           ind += 1
         }
     
    -    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
    +    val pq = new BoundedPriorityQueue[(String, Double)](num + 
1)(Ordering.by(_._2))
    +
    +    pq ++= wordList.zip(cosVec)
    --- End diff --
    
    Yeah, I guess I figured that since we were allocating the tuples anyway a 
single copy of the array wasn't a lot of extra overhead vs. having slightly 
cleaner code.  But I'm happy to make the change if you think it's a good idea.  
I agree that allocating an array just to iterate through it isn't ideal.
    
    (I'm ambivalent, partially because I don't have a great sense for the 
vocabulary sizes people typically use this code for in the wild.  For my 
example corpus, my patch as-is, zipping collection iterators, and explicit 
iteration over indices are all more or less equivalent in time performance.  My 
intuition is that allocating even the single array from `zip` is a bad deal if 
we're dealing with a very large vocabulary but probably not if the typical case 
is on the order of 10^5 words or less.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to