[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread ScrapCodes
GitHub user ScrapCodes opened a pull request:

https://github.com/apache/spark/pull/93

SPARK-1162 Added top in python.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ScrapCodes/spark-1 
SPARK-1162/pyspark-top-takeOrdered

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #93


commit 4603399c4e7a8c6ed19d916d3a55225b4bb31af8
Author: Prashant Sharma 
Date:   2014-03-06T12:12:16Z

Added top in python.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36887773
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36887770
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36887864
  
@mateiz I am learning python while doing this, so not sure if it is going 
to make sense. 

+ I have not figured how to implement takeOrdered. Will it be fine if I 
write our own maxHeap implementation or there is a better way(I am not aware 
of).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36892161
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36892162
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13023/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/93#discussion_r10362394
  
--- Diff: python/pyspark/rdd.py ---
@@ -628,6 +669,26 @@ def mergeMaps(m1, m2):
 m1[k] += v
 return m1
 return self.mapPartitions(countPartition).reduce(mergeMaps)
+
+def top(self, num):
+"""
+Get the top N elements from a RDD.
+
+Note: It returns the list sorted in ascending order.
+"""
+def f(iterator):
+q = BoundedPriorityQueue(num)
+for k in iterator:
+q.put(k)
+return q
+
+def f2(a, b):
+a.put(b)
+return a
+q = BoundedPriorityQueue(num)
+# I can not come up with a way to avoid this step. 
+t = self.mapPartitions(f).collect()
+return [k for k in iter(reduce(f2, t, q))]
--- End diff --

This could just be ```list(reduce(f2, t, q))```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread ScrapCodes
Github user ScrapCodes commented on a diff in the pull request:

https://github.com/apache/spark/pull/93#discussion_r10370555
  
--- Diff: python/pyspark/rdd.py ---
@@ -628,6 +669,26 @@ def mergeMaps(m1, m2):
 m1[k] += v
 return m1
 return self.mapPartitions(countPartition).reduce(mergeMaps)
+
+def top(self, num):
+"""
+Get the top N elements from a RDD.
+
+Note: It returns the list sorted in ascending order.
+"""
+def f(iterator):
+q = BoundedPriorityQueue(num)
+for k in iterator:
+q.put(k)
+return q
+
+def f2(a, b):
+a.put(b)
+return a
+q = BoundedPriorityQueue(num)
+# I can not come up with a way to avoid this step. 
+t = self.mapPartitions(f).collect()
+return [k for k in iter(reduce(f2, t, q))]
--- End diff --

Thanks that is definitely nicer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36962210
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36962209
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36964796
  
One or more automated tests failed
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13034/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36964795
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36971911
  
Jenkins, test this please.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36973824
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36973825
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36976663
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-36976664
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13039/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-10 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37231853
  
Hey Prashant, I looked at this but it seems that the Queue module in Python 
is used for thread-safe queues, meaning it will have a lot of unnecessary 
overhead for what we want. Can you use "heapq" instead 
(http://docs.python.org/2.6/library/heapq.html)?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-10 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37232223
  
In particular note that you can use `heapq.heappushpop` to add each item 
and remove the smallest one when the heap reaches the required size. Before 
that, just use `heappush`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread ScrapCodes
Github user ScrapCodes commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37272574
  
Hey Matei, Thanks ! 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37273032
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37273033
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37277159
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37277160
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13112/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/93#discussion_r10486216
  
--- Diff: python/pyspark/rdd.py ---
@@ -628,6 +656,31 @@ def mergeMaps(m1, m2):
 m1[k] += v
 return m1
 return self.mapPartitions(countPartition).reduce(mergeMaps)
+
+def top(self, num):
+"""
+Get the top N elements from a RDD.
+
+Note: It returns the list sorted in ascending order.
+>>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
+[12]
+>>> sc.parallelize([2, 3, 4, 5, 6]).cache().top(2)
+[5, 6]
+"""
+def f(iterator):
+q = BoundedPriorityQueue(num)
+for k in iterator:
+q.put(k)
+return q
+
+def f2(a, b):
+a.put(b)
+return a
--- End diff --

Are you sure this is correct? It seems like f2 will put the entire queue 
into `b` as an element in queue `a`. Try it with an RDD with more than one 
partition.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread mateiz
Github user mateiz commented on a diff in the pull request:

https://github.com/apache/spark/pull/93#discussion_r10486433
  
--- Diff: python/pyspark/rdd.py ---
@@ -628,6 +656,31 @@ def mergeMaps(m1, m2):
 m1[k] += v
 return m1
 return self.mapPartitions(countPartition).reduce(mergeMaps)
+
+def top(self, num):
+"""
+Get the top N elements from a RDD.
+
+Note: It returns the list sorted in ascending order.
+>>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
+[12]
+>>> sc.parallelize([2, 3, 4, 5, 6]).cache().top(2)
+[5, 6]
+"""
+def f(iterator):
+q = BoundedPriorityQueue(num)
+for k in iterator:
+q.put(k)
+return q
+
+def f2(a, b):
+a.put(b)
+return a
--- End diff --

Ah, never mind, I see that you're merging in one element at a time, where 
`a` is a queue and `b` is an element. You should give these functions and 
parameters better names. Maybe call this `merge(queue, element)`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/93#discussion_r10490986
  
--- Diff: python/pyspark/rdd.py ---
@@ -628,6 +656,31 @@ def mergeMaps(m1, m2):
 m1[k] += v
 return m1
 return self.mapPartitions(countPartition).reduce(mergeMaps)
+
+def top(self, num):
+"""
+Get the top N elements from a RDD.
+
+Note: It returns the list sorted in ascending order.
+>>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
+[12]
+>>> sc.parallelize([2, 3, 4, 5, 6]).cache().top(2)
+[5, 6]
+"""
+def f(iterator):
+q = BoundedPriorityQueue(num)
+for k in iterator:
+q.put(k)
+return q
--- End diff --

Now that BoundedPriorityQueue is quite simple, I don't think you even need 
it any more. In fact, you can do everything in ```f``` (and I would rename 
this) as follows:

```
def topIterator(iterator):
q = []
for k in iterator:
if len(q) < num:
heappush(q, k)
else:
heappushpop(q, k)
yield q


Then your ```f2``` (merge function) can look something like

```
def merge(a, b):
return next(topIterator(a + b))
```

(The ```next``` is there only because topIterator returns a 1 element 
generator)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37377371
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37377373
  
Merged build started.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37379571
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13126/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37379570
  
Merged build finished.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-12 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/93#issuecomment-37479585
  
Looks good, thanks! I'll merge this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-1162 Added top in python.

2014-03-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/93


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---