[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2508 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-57357513 Cool, thanks. Going to merge this as is then.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56841680 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20809/
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56841671 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20809/consoleFull) for PR 2508 at commit [`b7c96fd`](https://github.com/apache/spark/commit/b7c96fd68ba5816e6bcb6334bef9b5b4c1a4b15b). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56830592 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20809/consoleFull) for PR 2508 at commit [`b7c96fd`](https://github.com/apache/spark/commit/b7c96fd68ba5816e6bcb6334bef9b5b4c1a4b15b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56829748 @mateiz Yeah, there's no mention of zip methods in the programming guide, so if the groupBy method note isn't so valuable, I think there's probably no useful note to be made in the docs that I can see. I reverted that (will see if I can get git to not think there is a single whitespace change as a result).
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user mateiz commented on a diff in the pull request: https://github.com/apache/spark/pull/2508#discussion_r18012607

--- Diff: docs/programming-guide.md ---
@@ -882,7 +882,11 @@ for details.
 groupByKey([numTasks])
- When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
+ When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs.
+
+Note: The ordering of elements within each group is not guaranteed, and may even differ
+ each time the resulting RDD is evaluated.
--- End diff --

I don't think this is a good place to put this in the programming guide. If you can't find another place, maybe just leave it out. The other note here is a much more important and more common pitfall.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56669653 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20750/consoleFull) for PR 2508 at commit [`ad4aeec`](https://github.com/apache/spark/commit/ad4aeec504ad07269511a2aad843a5b815dfcf5d). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56669661 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20750/
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56661626 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20750/consoleFull) for PR 2508 at commit [`ad4aeec`](https://github.com/apache/spark/commit/ad4aeec504ad07269511a2aad843a5b815dfcf5d). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56651085 @mateiz Got it. On the zip methods, I want to capture the key point from https://issues.apache.org/jira/browse/SPARK-3098 , that the ordering is not only not guaranteed but also may change on reevaluation. I hope that wording is OK to retain and merge into yours. I'll find some place in the programming guide to note this, and remove wording about persist and/or replace with suggestion to sort the RDD.
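The point about zip and reevaluation can be illustrated without a cluster. What follows is a plain-Python sketch, not Spark code: `zip_by_position` is a hypothetical helper, and the two name lists are made-up stand-ins for two evaluations of the same unordered data set. It only shows why a positional pairing is fragile when element order may change, and why sorting first makes the pairing stable.

```python
def zip_by_position(left, right):
    """Pair elements by position, the way a positional zip does."""
    if len(left) != len(right):
        raise ValueError("both sides need the same number of elements")
    return list(zip(left, right))


ids = [1, 2, 3]

# Two hypothetical evaluations of the same unordered data set:
# the elements are identical, only their order differs.
names_run_1 = ["carol", "alice", "bob"]
names_run_2 = ["bob", "carol", "alice"]

# Pairing by position gives different pairs on each run.
pairs_run_1 = zip_by_position(ids, names_run_1)
pairs_run_2 = zip_by_position(ids, names_run_2)
print(pairs_run_1 != pairs_run_2)

# Sorting first gives every run the same order, so the pairs are stable.
stable_1 = zip_by_position(ids, sorted(names_run_1))
stable_2 = zip_by_position(ids, sorted(names_run_2))
print(stable_1 == stable_2)
```

In this sketch the sort plays the role of the "suggestion to sort the RDD" mentioned in the comment: any operation with a guaranteed order would serve the same purpose.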
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56572221 Hey Sean, I don't think it makes sense to add "the ordering of elements within each partition is not guaranteed" to all the mapPartitions and zip methods. For some RDDs, ordering is guaranteed, and these methods might use that. It's better to leave it on the group-by methods instead, and add a note on just the zip methods to say "note that some RDDs, such as those returned by groupBy, do not guarantee order of elements in a partition; in those cases you should sort the RDD with sortByKey or save it to a file". You might also consider adding a section on this in the programming guide, if there's a good spot for it. Finally, don't recommend persist as a way to preserve order, because even persist is not guaranteed to prevent recomputation if there are faults. It's better to tell them to use something with a guaranteed order.
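The advice to rely on a guaranteed order rather than on persist can be sketched in plain Python. This is an illustration only, not Spark code: `group_by_key`, `evaluate`, and `sorted_groups` are hypothetical helpers, and the seeded shuffle is an assumption standing in for whatever order grouped data happens to arrive in during one evaluation.

```python
import random
from collections import defaultdict


def group_by_key(records):
    """Group (key, value) pairs, keeping values in arrival order."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


def evaluate(records, seed):
    """Simulate one evaluation in which records arrive in arbitrary order.

    The seeded shuffle is an assumption used for illustration; it does not
    describe how Spark is implemented.
    """
    arrived = list(records)
    random.Random(seed).shuffle(arrived)
    return group_by_key(arrived)


def sorted_groups(groups):
    """Impose a guaranteed order by sorting the values of every group."""
    return {key: sorted(values) for key, values in groups.items()}


records = [("a", 3), ("a", 1), ("a", 2), ("b", 9), ("b", 7)]
first = evaluate(records, seed=1)
second = evaluate(records, seed=2)

# The same values reach each group, but possibly in a different order.
print(first)
print(second)

# After sorting, both evaluations agree regardless of arrival order.
print(sorted_groups(first) == sorted_groups(second))
```

The sketch has no analogue of persist: a cached result can be lost and recomputed, so the only robust fix shown here is to impose an explicit order on the result.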
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56525341 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20704/
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56525327 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20704/consoleFull) for PR 2508 at commit [`fce943b`](https://github.com/apache/spark/commit/fce943b3401135074ec943c56653fbb36657804c). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2508#issuecomment-56516002 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20704/consoleFull) for PR 2508 at commit [`fce943b`](https://github.com/apache/spark/commit/fce943b3401135074ec943c56653fbb36657804c). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-3356] [DOCS] Document when RDD elements...
GitHub user srowen opened a pull request: https://github.com/apache/spark/pull/2508

[SPARK-3356] [DOCS] Document when RDD elements' ordering within partitions is nondeterministic

As suggested by @mateiz, and because it came up on the mailing list again last week, this attempts to document that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods. Suggestions welcome about the wording, or other methods that need a note.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/spark SPARK-3356

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2508.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #2508

commit fce943b3401135074ec943c56653fbb36657804c
Author: Sean Owen
Date: 2014-09-23T12:57:47Z

    Note that ordering of elements is not guaranteed across RDD evaluations in groupBy, zip, and partition-wise RDD methods