[ https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089586#comment-16089586 ]
Peng Meng edited comment on SPARK-21401 at 7/17/17 10:10 AM: ------------------------------------------------------------- I have tested much about poll and toArray.sorted. If the queue is much ordered (suppose offer 2000 times for queue size 20). Use pq.toArray.sorted is faster. If the queue is much disordered (suppose offer 100 times for queue size 20), Use pq.poll is much faster. So why not keep the poll function for BoundedPriorityQueue? was (Author: peng.m...@intel.com): I have tested much about poll and toArray.sorted. If the queue is ordered. Use pq.toArray.sorted is faster. If the queue is much disordered, Use pq.poll is much faster. So why not keep the poll function for BoundedPriorityQueue? > add poll function for BoundedPriorityQueue > ------------------------------------------ > > Key: SPARK-21401 > URL: https://issues.apache.org/jira/browse/SPARK-21401 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib > Affects Versions: 2.3.0 > Reporter: Peng Meng > Priority: Minor > > The most of BoundedPriorityQueue usages in ML/MLLIB are: > Get the value of BoundedPriorityQueue, then sort it. > For example, in Word2Vec: pq.toSeq.sortBy(-_._2) > in ALS, pq.toArray.sorted() > The test results show using pq.poll() is much faster than sort the value. > For example, in PR https://github.com/apache/spark/pull/18624 > We get the sorted value of pq by the following code: > {quote} > var size = pq.size > while(size > 0) { > size -= 1 > val factor = pq.poll > }{quote} > If using the generally used methods: pq.toArray.sorted() to get the sorted > value of pq. There is about 10% performance reduction. > It is good to add the poll function for BoundedPriorityQueue, since many > usages of PQ need the sorted value. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org