[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16739 Agree with @jkbradley on this one. We should avoid adding functions that are completely new in a patch release, given that the gap between minor versions and patch releases isn't that large. As we discussed in the other thread, let's start tagging JIRAs with `backport` and also add a line in the JIRA saying why it's safe/required for backport. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user jkbradley commented on the issue: https://github.com/apache/spark/pull/16739 I've commented elsewhere, but wanted to repeat it here to make more people aware: Let's refrain from backporting new APIs into patch versions unless they are really critical. We do not do this elsewhere in Spark, and we should not in SparkR. New APIs and API changes should only happen in minor versions (and ideally changes will only happen in major ones). It's been discussed elsewhere that SparkR is more experimental than other parts of Spark, but the sooner we start treating it like a stable library, the sooner it will be a stable library. For most people, there isn't a huge difference between getting a new API in a patch version (every 1-2 months) vs. getting it in a minor version (every 4 months). Thanks all!
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16739 Thank YOU, always! :)
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 @dongjoon-hyun my apologies, thanks for bringing this to my attention. I had to hand merge and didn't realize the mismatch. Opened a new PR to fix that.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user dongjoon-hyun commented on the issue: https://github.com/apache/spark/pull/16739 Hi, @felixcheung . While backporting, https://github.com/apache/spark/commit/6c35399068f1035fec6d5f909a83a5b1683702e0#diff-3d2a6b9d2b7d84ae179d7ea0f9eca696R1232 seems to break the build of `branch-2.1`. The PR about `to_timestamp` is not backported to branch-2.1 yet. Could you backport that issue, too?
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 merged to master and branch-2.1 @gatorsmile thanks - please feel free to update or remove unneeded test cases.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72929/ Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72929 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72929/testReport)** for PR 16739 at commit [`bf2373f`](https://github.com/apache/spark/commit/bf2373f260a2af4a8841c0b440e86979de9c98e0).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72929 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72929/testReport)** for PR 16739 at commit [`bf2373f`](https://github.com/apache/spark/commit/bf2373f260a2af4a8841c0b440e86979de9c98e0).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 Jenkins, retest this please
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72925/ Test FAILed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test FAILed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72925 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72925/testReport)** for PR 16739 at commit [`bf2373f`](https://github.com/apache/spark/commit/bf2373f260a2af4a8841c0b440e86979de9c98e0).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16739 The issue is fixed in https://github.com/apache/spark/pull/16933. If this is merged first, I will fix the test case in this PR. Thanks! :)
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 great, looking forward to that. I'm going to merge this unless anyone has a concern?
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16739

Let me rewrite the test cases in Scala.

```Scala
val df = spark.range(0, 1, 1, 5)
assert(df.rdd.getNumPartitions == 5)
assert(df.coalesce(3).rdd.getNumPartitions == 3)
assert(df.coalesce(6).rdd.getNumPartitions == 5)

val df1 = df.coalesce(3)
assert(df1.rdd.getNumPartitions == 3)
assert(df1.coalesce(6).rdd.getNumPartitions == 5)
assert(df1.coalesce(4).rdd.getNumPartitions == 4)
assert(df1.coalesce(2).rdd.getNumPartitions == 2)

val df2 = df.repartition(10)
assert(df2.rdd.getNumPartitions == 10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
```

The question is why the second assertion on `df2` (`df2.coalesce(13)`) gives `5` instead of `10`. If we run explain, we get the following plan:

```
== Parsed Logical Plan ==
Repartition 13, false
+- Repartition 10, true
   +- Range (0, 1, step=1, splits=Some(5))

== Analyzed Logical Plan ==
id: bigint
Repartition 13, false
+- Repartition 10, true
   +- Range (0, 1, step=1, splits=Some(5))

== Optimized Logical Plan ==
Repartition 13, false
+- Range (0, 1, step=1, splits=Some(5))

== Physical Plan ==
Coalesce 13
+- *Range (0, 1, step=1, splits=Some(5))
```

Ok... `Repartition 10, true` is removed by our Optimizer rule `CollapseRepartition`. It is a bug, I think. Your question is valid. Let me fix it.
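The partition counts in this thread can be reproduced with a small toy model. The sketch below is not Spark code: `Plan`, `Source`, `Repartition`, `collapseNaive` and `partitionCount` are hypothetical names chosen for illustration. It assumes that a rule like `CollapseRepartition` keeps only the outer of two adjacent repartition nodes, and that a non-shuffle coalesce can never exceed its child's partition count.

```scala
object CollapseRepartitionSketch {
  // Toy logical plan with just enough structure to reason about partition counts.
  sealed trait Plan
  final case class Source(numPartitions: Int) extends Plan
  final case class Repartition(numPartitions: Int, shuffle: Boolean, child: Plan) extends Plan

  // Assumed naive rule: of two adjacent Repartition nodes, keep only the outer one.
  def collapseNaive(plan: Plan): Plan = plan match {
    case Repartition(n, s, Repartition(_, _, grandchild)) =>
      collapseNaive(Repartition(n, s, grandchild))
    case other => other
  }

  // A shuffle yields exactly n partitions; a non-shuffle coalesce is capped by its child.
  def partitionCount(plan: Plan): Int = plan match {
    case Source(n)                    => n
    case Repartition(n, true, _)      => n
    case Repartition(n, false, child) => math.min(n, partitionCount(child))
  }

  def main(args: Array[String]): Unit = {
    val df   = Source(5)
    val df2  = Repartition(10, shuffle = true, df)    // df.repartition(10)
    val plan = Repartition(13, shuffle = false, df2)  // df2.coalesce(13)
    println(partitionCount(plan))                 // 10: what one would expect
    println(partitionCount(collapseNaive(plan)))  // 5: the shuffle node was dropped
  }
}
```

Under these assumptions, collapsing `coalesce(3)` followed by `coalesce(4)` on a 5-partition source also yields 4 rather than 3, which matches the `df1` assertions above.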
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72790/ Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Build finished. Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72790/testReport)** for PR 16739 at commit [`55b99df`](https://github.com/apache/spark/commit/55b99dfefacbe549e3d48278fa391c963ac36ab7).
 * This patch passes all tests.
 * This patch **does not merge cleanly**.
 * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72791/ Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72791 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72791/testReport)** for PR 16739 at commit [`a0fe134`](https://github.com/apache/spark/commit/a0fe1344ae1030be98a37ca133ee24a40e8bc65d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72791 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72791/testReport)** for PR 16739 at commit [`a0fe134`](https://github.com/apache/spark/commit/a0fe1344ae1030be98a37ca133ee24a40e8bc65d).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72790/testReport)** for PR 16739 at commit [`55b99df`](https://github.com/apache/spark/commit/55b99dfefacbe549e3d48278fa391c963ac36ab7).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739

hmm, not as far as I can see:

```
> df2 <- repartition(df1, 10)
> getNumPartitions(df2)  # right after repartition the number of partition is greater than the original numSlices
[1] 10
> foo <- coalesce(df2, 13)
> explain(foo, extended = T)
== Parsed Logical Plan ==
Repartition 13, false
+- Repartition 10, true
   +- Repartition 3, false
      +- LogicalRDD [speed#2, dist#3]

== Analyzed Logical Plan ==
speed: double, dist: double
Repartition 13, false
+- Repartition 10, true
   +- Repartition 3, false
      +- LogicalRDD [speed#2, dist#3]

== Optimized Logical Plan ==
Repartition 13, false
+- LogicalRDD [speed#2, dist#3]

== Physical Plan ==
Coalesce 13
+- Scan ExistingRDD[speed#2,dist#3]
```
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16739 : ) This might be caused by the optimizer rule `CollapseRepartition`. Can you output the plan by `explain(true)`?
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739

@gatorsmile thanks for commenting. `coalesce` currently accepts a number even if it is larger than the current number of partitions - I guess we didn't want to throw an exception in that case? But, since you are here, do you know why we see this behavior?

```
df2 <- repartition(df1, 10)
expect_equal(getNumPartitions(df2), 10)  # right after repartition, the number of partitions is greater than the original numSlices
expect_equal(getNumPartitions(coalesce(df2, 13)), 5)  # but coalesce after repartition can't go beyond 5
```

Shouldn't I be allowed to set the number of partitions to 5 < n < 10, since I just called `repartition(10)`?
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/16739 `coalesce` is used to decrease the number of partitions in the RDD, but when you set it to a number that is larger than the current number of RDD partitions, the result is not predictable. It depends on your RDD's physical distribution. Thus, I am wondering whether we should allow users to set it to a larger number. Or are some advanced users relying on it?
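To make the difference between a non-shuffle coalesce and a full repartition concrete, here is a minimal sketch over plain Scala collections. It is a simplified model, not Spark's implementation: `coalesce` and `repartition` below are hypothetical stand-ins that ignore data locality and empty partitions, and the model assumes that a request for more partitions than currently exist simply leaves the partitioning unchanged.

```scala
object CoalesceSketch {
  // Merge adjacent partitions into at most `target` groups; individual records
  // never move between groups, so existing partitions cannot be split.
  def coalesce[A](parts: Vector[Vector[A]], target: Int): Vector[Vector[A]] = {
    require(target > 0, "target must be positive")
    if (target >= parts.length) parts
    else Vector.tabulate(target) { i =>
      val start = (i.toLong * parts.length / target).toInt
      val end   = ((i + 1).toLong * parts.length / target).toInt
      parts.slice(start, end).flatten
    }
  }

  // Stand-in for a shuffle: redistribute every record round-robin into `target` partitions.
  def repartition[A](parts: Vector[Vector[A]], target: Int): Vector[Vector[A]] = {
    require(target > 0, "target must be positive")
    val indexed = parts.flatten.zipWithIndex
    Vector.tabulate(target)(i => indexed.collect { case (x, j) if j % target == i => x })
  }

  def main(args: Array[String]): Unit = {
    val parts = Vector.tabulate(5)(i => Vector(2 * i, 2 * i + 1)) // 5 partitions, 10 records
    println(coalesce(parts, 3).length)     // 3
    println(coalesce(parts, 6).length)     // 5: cannot grow without a shuffle
    println(repartition(parts, 10).length) // 10
  }
}
```

In this model only the shuffle-based path can increase the partition count, which is why asking `coalesce` for a larger number is at best a no-op.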
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 yep, https://github.com/apache/spark/pull/16739#issuecomment-276739220 - only RDD has `coalesce(.. shuffle)`; in Dataset, it's `coalesce` and `repartition`
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user holdenk commented on the issue: https://github.com/apache/spark/pull/16739 @felixcheung I was referring to the `However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can pass shuffle = true. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).` warning, but documenting that coalesce caps out based on numSlices also sounds important (and potentially confusing).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739

And actually I find the current behavior a bit hard to explain. Could someone perhaps enlighten me on whether this is intentional and, if we are to document this behavior, how best to do so?

```
df <- as.DataFrame(cars, numPartitions = 5)  # this sets numSlices on the RDD to 5
expect_equal(getNumPartitions(df), 5)
expect_equal(getNumPartitions(coalesce(df, 3)), 3)
expect_equal(getNumPartitions(coalesce(df, 6)), 5)

df1 <- coalesce(df, 3)
expect_equal(getNumPartitions(df1), 3)
expect_equal(getNumPartitions(coalesce(df1, 6)), 5)  # even after a coalesce it can't go beyond 5
expect_equal(getNumPartitions(coalesce(df1, 4)), 4)
expect_equal(getNumPartitions(coalesce(df1, 2)), 2)

df2 <- repartition(df1, 10)
expect_equal(getNumPartitions(df2), 10)  # right after repartition, the number of partitions is greater than the original numSlices
expect_equal(getNumPartitions(coalesce(df2, 13)), 5)  # but coalesce after repartition can't go beyond 5
expect_equal(getNumPartitions(coalesce(df2, 7)), 5)
expect_equal(getNumPartitions(coalesce(df2, 3)), 3)
```
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 Surely, I think you mean https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L428. We will need to update this to say `use repartition() if you want shuffling` though, since the shuffle option is only on RDD.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/16739 Thanks @felixcheung - I think these changes look good. cc @gatorsmile / @holdenk for doc changes in SQL, Python
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72240/ Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72240 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72240/testReport)** for PR 16739 at commit [`3ed835a`](https://github.com/apache/spark/commit/3ed835ad340ea0793f8fbb93a697e09f7eb249d9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72240 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72240/testReport)** for PR 16739 at commit [`3ed835a`](https://github.com/apache/spark/commit/3ed835ad340ea0793f8fbb93a697e09f7eb249d9).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72232/ Test FAILed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test FAILed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72232/testReport)** for PR 16739 at commit [`1bd7163`](https://github.com/apache/spark/commit/1bd7163723641bfaa107c9a20974e163eaead0a4). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72232/testReport)** for PR 16739 at commit [`1bd7163`](https://github.com/apache/spark/commit/1bd7163723641bfaa107c9a20974e163eaead0a4).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72166/ Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72166 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72166/testReport)** for PR 16739 at commit [`938c2ce`](https://github.com/apache/spark/commit/938c2ce27e4e1029a646e25c053baeb304d6f217). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72166 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72166/testReport)** for PR 16739 at commit [`938c2ce`](https://github.com/apache/spark/commit/938c2ce27e4e1029a646e25c053baeb304d6f217).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72149/testReport)** for PR 16739 at commit [`50ab563`](https://github.com/apache/spark/commit/50ab5635c54074a24a03d08ed42fd94fa19e68d3). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72149/ Test PASSed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72149/testReport)** for PR 16739 at commit [`50ab563`](https://github.com/apache/spark/commit/50ab5635c54074a24a03d08ed42fd94fa19e68d3).
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/16739 Jenkins, retest this please
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Merged build finished. Test FAILed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16739 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72147/ Test FAILed.
[GitHub] spark issue #16739: [SPARK-19399][SPARKR] Add R coalesce API for DataFrame a...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16739 **[Test build #72147 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72147/testReport)** for PR 16739 at commit [`50ab563`](https://github.com/apache/spark/commit/50ab5635c54074a24a03d08ed42fd94fa19e68d3).