Github user willb commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15150#discussion_r79637489

    --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
    @@ -580,10 +581,14 @@ class Word2VecModel private[spark] (
           ind += 1
         }
    -    val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
    +    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
    +
    +    pq ++= wordList.zip(cosVec)
    --- End diff --

    Yeah, I guess I figured that since we were allocating the tuples anyway, a single copy of the array wasn't a lot of extra overhead compared to having slightly cleaner code. But I'm happy to make the change if you think it's a good idea; I agree that allocating an array just to iterate through it isn't ideal.

    (I'm ambivalent, partly because I don't have a great sense of the vocabulary sizes people typically use this code for in the wild. For my example corpus, my patch as-is, zipping collection iterators, and explicit iteration over indices are all more or less equivalent in running time. My intuition is that allocating even the single array from `zip` is a bad deal for a very large vocabulary, but probably not if the typical case is on the order of 10^5 words or fewer.)
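For context, here is a self-contained sketch of the bounded-heap approach under discussion, using explicit index iteration so no intermediate array is allocated by `zip`. It uses the standard library's `mutable.PriorityQueue` rather than Spark's internal `BoundedPriorityQueue`, and the object and method names are hypothetical:

```scala
import scala.collection.mutable

object TopKSketch {
  // Keep the `num` highest-scored words without sorting the whole vocabulary.
  def topK(wordList: Array[String], cosVec: Array[Double], num: Int): Seq[(String, Double)] = {
    // Min-heap on score (ordering negated), so the head is the smallest retained score.
    val pq = mutable.PriorityQueue.empty[(String, Double)](
      Ordering.by[(String, Double), Double](t => -t._2))
    var i = 0
    while (i < wordList.length) {          // explicit index iteration: no
      pq.enqueue((wordList(i), cosVec(i))) //   intermediate array from zip
      if (pq.size > num) pq.dequeue()      // evict the current minimum
      i += 1
    }
    pq.dequeueAll.reverse                  // ascending -> descending by score
  }
}
```

The heap keeps memory bounded at O(num) and runs in O(V log num) over a vocabulary of size V, versus O(V log V) time and O(V) extra space for the full sort-then-take approach.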