[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085651#comment-16085651
 ] 

Apache Spark commented on SPARK-21401:
--

User 'mpjlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18620

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
>  size -= 1
>  val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085727#comment-16085727
 ] 

Sean Owen commented on SPARK-21401:
---

This sounds like a detail of https://issues.apache.org/jira/browse/SPARK-21389

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085738#comment-16085738
 ] 

Peng Meng commented on SPARK-21401:
---

Yes,  SPARK-21389 used pq.poll.  pq.poll is just a small part of SPARK-21389, 
and pq.poll can improve the performance of SPARK-21389 by 10% comparing with 
pq.toAarry.sorted. 
Maybe other places where use pq.toAarry.sorted also can use pq.poll to improve 
performance.  So I create this JIRA.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085869#comment-16085869
 ] 

Sean Owen commented on SPARK-21401:
---

I don't think this is worth breaking out unless you are going to add isEmpty 
and tests for these new methods. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-13 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16085874#comment-16085874
 ] 

Peng Meng commented on SPARK-21401:
---

Sure, I will add isEmpty and maybe some other functions, and tests cases. 
Thanks.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088516#comment-16088516
 ] 

Sean Owen commented on SPARK-21401:
---

[~peng.m...@intel.com] on re-reading this, where is something sorted only to 
get the top element? word2vec seems to sort the list when it needs to produces 
a fully sorted list.  The PR you link to isn't replacing a sort, and it's not 
in general faster to sort something this way. What is the benefit here?

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-16 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089209#comment-16089209
 ] 

Peng Meng commented on SPARK-21401:
---

Hi [~srowen], here we also want to get a fully sorted list by get the the top 
element N (pq.size) times.
In my test, call pq.poll  N times is much faster than pq.toArray.sorted. 


> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089415#comment-16089415
 ] 

Sean Owen commented on SPARK-21401:
---

I can't see how polling would be faster. When you poll you have to reheap and 
reassign elements of the queue. Whereas sorting can be done in-place, with 
native code and even in parallel, using quicksort. You do not show benchmarks 
of these two directly, and you're not replacing a sort anywhere. You may be 
measuring the cost of .toArray. I don't understand this and think it should be 
closed.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089424#comment-16089424
 ] 

Peng Meng commented on SPARK-21401:
---

Hi [~srowen], for ALS optimization, the difference of using poll and sorted is 
large:
When num = 20, if use sorted here, the prediction time is about 31s, if use 
poll, the prediction time is about 26s. I think this difference is large. I 
have tested many times. The result is about the same.
https://github.com/apache/spark/pull/18624.

I am testing LDA of changing sorted to poll. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089431#comment-16089431
 ] 

Sean Owen commented on SPARK-21401:
---

You're benchmarking a lot of other changes all at once. I don't see an argument 
that sorting a list is slower than polling a queue. I don't think it can be. 
What might be faster is avoiding copying elements out of the queue to sort 
them. But why use a queue here to begin with, if that's your argument?

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089437#comment-16089437
 ] 

Peng Meng commented on SPARK-21401:
---

I benchmarking just change pq.toArray.sorted. and pq.poll.
pq.poll time complexity is always log(N).
How about the time complexity of quicksort when the data is already partially 
ordered. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089448#comment-16089448
 ] 

Sean Owen commented on SPARK-21401:
---

What about the time needed to build the queue? insertions are O(log n) instead 
of O(1). quicksort should be nearly optimal when the array is already sorted. 
It just can't be faster to pay the overhead of reheaping to build the queue, 
then to reheap on every poll, vs nearly no copies in quicksort.

As I say, I think you have a bigger optimization here then, which is to avoid 
the queue entirely, if it's just there to collect elements and then traverse 
them in sorted order.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089578#comment-16089578
 ] 

Peng Meng commented on SPARK-21401:
---

Hi [~srowen], I got why my original test pq.toArray.sorted is very slow.
My original test for sorted is using: 
{quote}pq.toArray.sorted(Ordering.By[(Int, Double), Double](-_._2)) {quote},
because  {quote} pq.toArray.sorted(-_._2)  {quote} build error. Maybe there is 
boxing/unboxing, the  performance is very bad.
Now, I use  {quote}pq.toArray.sortBy(-_._2) {quote}, the performance is good 
than poll. this 25s vs poll 26s.
Thanks.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089581#comment-16089581
 ] 

Sean Owen commented on SPARK-21401:
---

Scratch that, I get the issue now. We can't remove the queue because that's the 
efficient way to keep the top k. 

While it would be most efficient to get the queue's array and sort it directly, 
there seems to be no good way to do that. Any change I try involves giving the 
type A a ClassTag in order to allocate Arrays, and that causes other changes in 
the code, like in RDD.scala. The current way this works involves inefficiently 
cloning the queue into an array, and that's why it's slow. So poll() is 
actually faster, even with the reheaps.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089586#comment-16089586
 ] 

Peng Meng commented on SPARK-21401:
---

I have tested much about poll and toArray.sorted. 
If the queue is ordered.  Use pq.toArray.sorted is faster.
If the queue is much disordered,  Use pq.poll is much faster.
So why not keep the poll function for BoundedPriorityQueue?  

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089589#comment-16089589
 ] 

Sean Owen commented on SPARK-21401:
---

Yes, I agree with you now. I find they're roughly the same, but, the overhead 
of polling is less than the overhead of trying to rebuild the array. It's 
possible to rewrite this to be even faster by copying the queue's array 
directly and sorting it, but, it requires a change to the type signature of 
BoundedPriorityQueue and it's not worth it. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089606#comment-16089606
 ] 

Peng Meng commented on SPARK-21401:
---

I think the BoundedPriorityQueue should be rewritten. there are two key 
problems:
1) If insert an element: you should poll first, then offer. reheap two times. 
Actually, we just need one if rewritten the code.
2) Now, BoundedPriorityQueue use Java PriorityQueue, there are many 
boxing/unboxing, which is very time consuming. 
How do you think about it? 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089613#comment-16089613
 ] 

Sean Owen commented on SPARK-21401:
---

I also am not clear why the implementation didn't use Scala's PriorityQueue 
internally. I think there's a quiet problem here, in that this will fail for 
primitive types. The declaration doesn't restrict A to AnyRef, but, the Java 
implementation will only work with reference types.

I don't think there's an operation in either implementation to both poll and 
offer in one step, unfortunately.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089629#comment-16089629
 ] 

Peng Meng commented on SPARK-21401:
---

Thanks @srowen.
I mean for BoundedPriorityQueue, you also can offer it directly.
No need to work like this
{quote}  
private def maybeReplaceLowest(a: A): Boolean =  {
val head = underlying.peek()
if (head != null && ord.gt(a, head))  {
  underlying.poll()
  underlying.offer(a)
} 
else  {
  false
}
}
{quote}
poll then offer. poll and offer both need to reheap. 
Actually, this doesn't need reheap two times. just replace the first element 
and reheap is ok. (We should rewrite the code of BoundedPriorityQueue to do 
that)

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089633#comment-16089633
 ] 

Sean Owen commented on SPARK-21401:
---

I'm not sure how you would remove and add an element to the _underlying_ queue. 
Not sure what you mean.
(By the way, this private method doesn't need to return a Boolean anymore)

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089638#comment-16089638
 ] 

Peng Meng commented on SPARK-21401:
---

I mean we totally rewrite the BoundedPriorityQueue, not use Java PriorityQueue 
or Scala PriorityQueue. To create a very performance one.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089641#comment-16089641
 ] 

Sean Owen commented on SPARK-21401:
---

I think it's not worth writing and maintaining a custom implementation unless 
it's very important for performance. The overhead here should be pretty small. 
I've implemented something similar in the past and not found that the two 
updates are a hotspot, mostly because you never need to do it for the vast 
majority of elements.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089648#comment-16089648
 ] 

Peng Meng commented on SPARK-21401:
---

Yes,  you don't need to do it for the vast majority of elements.
The key problem here is boxing and unboxing. My testing show boxing/unboxing 
consuming most of time. My micro test  even shows 80% time is spending on 
boxing/unboxing.
If redesign BoundedPriorityQueue  can avoid boxing/unboxing, it will be very 
useful. 

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089651#comment-16089651
 ] 

Sean Owen commented on SPARK-21401:
---

That's why using the Scala implementation may be better. It should be able to 
handle primitive types directly.  I don't think that needs a rewrite. However, 
in the use case you're working on, you're already working with objects.

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21401) add poll function for BoundedPriorityQueue

2017-07-17 Thread Peng Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16089656#comment-16089656
 ] 

Peng Meng commented on SPARK-21401:
---

Got it, thanks [~srowen]

> add poll function for BoundedPriorityQueue
> --
>
> Key: SPARK-21401
> URL: https://issues.apache.org/jira/browse/SPARK-21401
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Peng Meng
>Priority: Minor
>
> The most of BoundedPriorityQueue usages in ML/MLLIB are:
> Get the value of BoundedPriorityQueue, then sort it.
> For example, in Word2Vec: pq.toSeq.sortBy(-_._2)
> in ALS, pq.toArray.sorted()
> The test results show using pq.poll() is much faster than sort the value.
> For example, in PR https://github.com/apache/spark/pull/18624
> We get the sorted value of pq by the following code:
> {quote}
> var size = pq.size
> while(size > 0) {
> size -= 1
> val factor = pq.poll
> }{quote} 
> If using the generally used methods: pq.toArray.sorted() to get the sorted 
> value of pq. There is about 10% performance reduction. 
> It is good to add the poll function for BoundedPriorityQueue, since many 
> usages of PQ need  the sorted value.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org