[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088516#comment-16088516 ]
Sean Owen commented on SPARK-21401: ----------------------------------- [~peng.m...@intel.com] on re-reading this, where is something sorted only to get the top element? word2vec seems to sort the list when it needs to produces a fully sorted list. The PR you link to isn't replacing a sort, and it's not in general faster to sort something this way. What is the benefit here? > add poll function for BoundedPriorityQueue > ------------------------------------------ > > Key: SPARK-21401 > URL: https://issues.apache.org/jira/browse/SPARK-21401 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib > Affects Versions: 2.3.0 > Reporter: Peng Meng > Priority: Minor > > The most of BoundedPriorityQueue usages in ML/MLLIB are: > Get the value of BoundedPriorityQueue, then sort it. > For example, in Word2Vec: pq.toSeq.sortBy(-_._2) > in ALS, pq.toArray.sorted() > The test results show using pq.poll() is much faster than sort the value. > For example, in PR https://github.com/apache/spark/pull/18624 > We get the sorted value of pq by the following code: > {quote} > var size = pq.size > while(size > 0) { > size -= 1 > val factor = pq.poll > }{quote} > If using the generally used methods: pq.toArray.sorted() to get the sorted > value of pq. There is about 10% performance reduction. > It is good to add the poll function for BoundedPriorityQueue, since many > usages of PQ need the sorted value. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org