[GitHub] spark pull request: SPARK-1162 Added top in python.
GitHub user ScrapCodes opened a pull request: https://github.com/apache/spark/pull/93 SPARK-1162 Added top in python. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ScrapCodes/spark-1 SPARK-1162/pyspark-top-takeOrdered Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/93.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #93 commit 4603399c4e7a8c6ed19d916d3a55225b4bb31af8 Author: Prashant Sharma Date: 2014-03-06T12:12:16Z Added top in python. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36887773 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36887770 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36887864 @mateiz I am learning python while doing this, so not sure if it is going to make sense. + I have not figured how to implement takeOrdered. Will it be fine if I write our own maxHeap implementation or there is a better way(I am not aware of). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36892161 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36892162 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13023/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/93#discussion_r10362394 --- Diff: python/pyspark/rdd.py --- @@ -628,6 +669,26 @@ def mergeMaps(m1, m2): m1[k] += v return m1 return self.mapPartitions(countPartition).reduce(mergeMaps) + +def top(self, num): +""" +Get the top N elements from a RDD. + +Note: It returns the list sorted in ascending order. +""" +def f(iterator): +q = BoundedPriorityQueue(num) +for k in iterator: +q.put(k) +return q + +def f2(a, b): +a.put(b) +return a +q = BoundedPriorityQueue(num) +# I can not come up with a way to avoid this step. +t = self.mapPartitions(f).collect() +return [k for k in iter(reduce(f2, t, q))] --- End diff -- This could just be ```list(reduce(f2, t, q))``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user ScrapCodes commented on a diff in the pull request: https://github.com/apache/spark/pull/93#discussion_r10370555 --- Diff: python/pyspark/rdd.py --- @@ -628,6 +669,26 @@ def mergeMaps(m1, m2): m1[k] += v return m1 return self.mapPartitions(countPartition).reduce(mergeMaps) + +def top(self, num): +""" +Get the top N elements from a RDD. + +Note: It returns the list sorted in ascending order. +""" +def f(iterator): +q = BoundedPriorityQueue(num) +for k in iterator: +q.put(k) +return q + +def f2(a, b): +a.put(b) +return a +q = BoundedPriorityQueue(num) +# I can not come up with a way to avoid this step. +t = self.mapPartitions(f).collect() +return [k for k in iter(reduce(f2, t, q))] --- End diff -- Thanks that is definitely nicer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36962210 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36962209 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36964796 One or more automated tests failed Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13034/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36964795 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36971911 Jenkins, test this please. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36973824 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36973825 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36976663 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-36976664 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13039/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37231853 Hey Prashant, I looked at this but it seems that the Queue module in Python is used for thread-safe queues, meaning it will have a lot of unnecessary overhead for what we want. Can you use "heapq" instead (http://docs.python.org/2.6/library/heapq.html)? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37232223 In particular note that you can use `heapq.heappushpop` to add each item and remove the smallest one when the heap reaches the required size. Before that, just use `heappush`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user ScrapCodes commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37272574 Hey Matei, Thanks ! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37273032 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37273033 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37277159 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37277160 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13112/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/93#discussion_r10486216 --- Diff: python/pyspark/rdd.py --- @@ -628,6 +656,31 @@ def mergeMaps(m1, m2): m1[k] += v return m1 return self.mapPartitions(countPartition).reduce(mergeMaps) + +def top(self, num): +""" +Get the top N elements from a RDD. + +Note: It returns the list sorted in ascending order. +>>> sc.parallelize([10, 4, 2, 12, 3]).top(1) +[12] +>>> sc.parallelize([2, 3, 4, 5, 6]).cache().top(2) +[5, 6] +""" +def f(iterator): +q = BoundedPriorityQueue(num) +for k in iterator: +q.put(k) +return q + +def f2(a, b): +a.put(b) +return a --- End diff -- Are you sure this is correct? It seems like f2 will put the entire queue into `b` as an element in queue `a`. Try it with an RDD with more than one partition. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/93#discussion_r10486433 --- Diff: python/pyspark/rdd.py --- @@ -628,6 +656,31 @@ def mergeMaps(m1, m2): m1[k] += v return m1 return self.mapPartitions(countPartition).reduce(mergeMaps) + +def top(self, num): +""" +Get the top N elements from a RDD. + +Note: It returns the list sorted in ascending order. +>>> sc.parallelize([10, 4, 2, 12, 3]).top(1) +[12] +>>> sc.parallelize([2, 3, 4, 5, 6]).cache().top(2) +[5, 6] +""" +def f(iterator): +q = BoundedPriorityQueue(num) +for k in iterator: +q.put(k) +return q + +def f2(a, b): +a.put(b) +return a --- End diff -- Ah, never mind, I see that you're merging in one element at a time, where `a` is a queue and `b` is an element. You should give these functions and parameters better names. Maybe call this `merge(queue, element)`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/93#discussion_r10490986 --- Diff: python/pyspark/rdd.py --- @@ -628,6 +656,31 @@ def mergeMaps(m1, m2): m1[k] += v return m1 return self.mapPartitions(countPartition).reduce(mergeMaps) + +def top(self, num): +""" +Get the top N elements from a RDD. + +Note: It returns the list sorted in ascending order. +>>> sc.parallelize([10, 4, 2, 12, 3]).top(1) +[12] +>>> sc.parallelize([2, 3, 4, 5, 6]).cache().top(2) +[5, 6] +""" +def f(iterator): +q = BoundedPriorityQueue(num) +for k in iterator: +q.put(k) +return q --- End diff -- Now that BoundedPriorityQueue is quite simple, I don't think you even need it any more. In fact, you can do everything in ```f``` (and I would rename this) as follows: ``` def topIterator(iterator): q = [] for k in iterator: if len(q) < num: heappush(q, k) else: heappushpop(q, k) yield q Then your ```f2``` (merge function) can look something like ``` def merge(a, b): return next(topIterator(a + b)) ``` (The ```next``` is there only because topIterator returns a 1 element generator) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37377371 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37377373 Merged build started. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37379571 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13126/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37379570 Merged build finished. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/93#issuecomment-37479585 Looks good, thanks! I'll merge this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-1162 Added top in python.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/93 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---