[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...
Github user maryannxue commented on the issue: https://github.com/apache/spark/pull/21083 cc @mgaido91 @maryannxue @KaiXinXiaoLei @gatorsmile @jiangxb1987 @gengliangwang: I do not think this is the right way to do things, @cloud-fan. Since you have been aware of my and others' work, like https://github.com/apache/spark/pull/20816, you could have, or I'd say should have, given your input/suggestions on the related PRs. People who have worked on this deserve more credit than being mentioned as "inspired" here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181966326

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
     } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

@viirya my comment was for the current code without your change.
[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21083 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2373/ Test PASSed.
[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21083 Merged build finished. Test PASSed.
[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21083 **[Test build #89432 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89432/testReport)** for PR 21083 at commit [`371c1df`](https://github.com/apache/spark/commit/371c1df8aed72ed42f74c8c4cdbc6c4c1791fcdf).
[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21083 **[Test build #89431 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89431/testReport)** for PR 21083 at commit [`b967955`](https://github.com/apache/spark/commit/b967955ec2c7d33f28845dd55a1a9b70c5c2ba03).
[GitHub] spark issue #21083: [SPARK-21479][SPARK-23564][SQL] infer additional filters...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/21083 cc @mgaido91 @maryannxue @KaiXinXiaoLei @gatorsmile @jiangxb1987 @gengliangwang
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89425/ Test PASSed.
[GitHub] spark pull request #21083: [SPARK-21479][SPARK-23564][SQL] infer additional ...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/21083 [SPARK-21479][SPARK-23564][SQL] infer additional filters from constraints for join's children

## What changes were proposed in this pull request?

The existing query constraints framework has two steps:
1. propagate constraints bottom up.
2. use constraints to infer additional filters for better data pruning.

Step 2 mostly helps with Join, because we can connect the constraints from the children to the join condition and infer powerful filters to prune the data on the join sides. E.g., if the left side has the constraint `a = 1` and the join condition is `left.a = right.a`, then we can infer `right.a = 1` for the right side and prune it a lot.

However, the current logic of inferring filters from constraints for Join is pretty weak: it infers the filters from the Join's own constraints, and some joins, like left semi/anti, exclude output from the right side, so the right-side constraints are lost there.

This PR proposes to check the left and right constraints individually, expand the constraints with the join condition, and add filters to the children of the join directly, instead of adding them to the join condition.

This reverts https://github.com/apache/spark/pull/20670 , covers https://github.com/apache/spark/pull/20717 and https://github.com/apache/spark/pull/20816

This is inspired by the original PRs and the tests are all from these PRs. Thanks to the authors @mgaido91 @maryannxue @KaiXinXiaoLei !

## How was this patch tested?
new tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark join

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21083.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21083

commit 2977a5e037eb862a530a777e349f328ffbda39bb
Author: Wenchen Fan
Date: 2018-04-16T16:15:04Z

    Revert "[SPARK-23405] Generate additional constraints for Join's children"

    This reverts commit cdcccd7b41c43d79edff2fec7a84cd00e9524f75.

commit b967955ec2c7d33f28845dd55a1a9b70c5c2ba03
Author: Wenchen Fan
Date: 2018-04-16T19:39:50Z

    fix join filter inference from constraints
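The inference described in the PR description above can be sketched with a small hypothetical model. This is NOT Spark's Catalyst code: constraints are modeled here as (column, literal) equality pairs and the equi-join condition as pairs of equated left/right columns, purely for illustration.

```python
def infer_filters(left_constraints, right_constraints, join_equalities):
    """Expand each side's constraints through the equi-join condition and
    return the extra filters inferred for (left, right).

    left_constraints / right_constraints: sets of (column, literal) pairs,
    e.g. {("a", 1)} meaning a = 1.
    join_equalities: pairs (left_col, right_col), e.g. [("a", "a")]
    for left.a = right.a.
    """
    new_left, new_right = set(), set()
    for lcol, rcol in join_equalities:
        for col, lit in left_constraints:
            if col == lcol:
                # left.a = 1 and left.a = right.a  =>  right.a = 1
                new_right.add((rcol, lit))
        for col, lit in right_constraints:
            if col == rcol:
                new_left.add((lcol, lit))
    # Keep only filters that are genuinely new for each side.
    return new_left - left_constraints, new_right - right_constraints

# The example from the description: left side has a = 1,
# join condition is left.a = right.a.
print(infer_filters({("a", 1)}, set(), [("a", "a")]))
```

With that input the sketch infers `("a", 1)` as a new filter for the right side and nothing new for the left, mirroring the `right.a = 1` example in the description.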
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Merged build finished. Test PASSed.
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20984 **[Test build #89425 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89425/testReport)** for PR 20984 at commit [`77640c4`](https://github.com/apache/spark/commit/77640c449af4f3ba0d7e5a231c24f34f090314db). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21036: [SPARK-23958][CORE] HadoopRdd filters empty files...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/21036#discussion_r181963024

--- Diff: core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -323,7 +323,7 @@ package object config {
       .internal()
       .doc("When true, HadoopRDD/NewHadoopRDD will not create partitions for empty input splits.")
       .booleanConf
-      .createWithDefault(false)
+      .createWithDefault(true)
--- End diff --

This seems to silently change the behavior. I would suggest not doing it.
[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21036 Nope, it's a radical change that affects many integrations. I wouldn't enable it by default for now. Also, this is a non-critical path; it's fine to loop twice if it's more readable.
[GitHub] spark issue #21029: [SPARK-23952] remove type parameter in DataReaderFactory
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21029 **[Test build #89430 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89430/testReport)** for PR 21029 at commit [`36031f2`](https://github.com/apache/spark/commit/36031f2bc803535ba49e058951aef3faea30380d). * This patch **fails to generate documentation**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #21029: [SPARK-23952] remove type parameter in DataReaderFactory
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21029 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89430/ Test FAILed.
[GitHub] spark issue #21029: [SPARK-23952] remove type parameter in DataReaderFactory
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21029 Merged build finished. Test FAILed.
[GitHub] spark issue #21029: [SPARK-23952] remove type parameter in DataReaderFactory
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21029 Merged build finished. Test PASSed.
[GitHub] spark issue #21029: [SPARK-23952] remove type parameter in DataReaderFactory
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21029 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2372/ Test PASSed.
[GitHub] spark issue #21029: [SPARK-23952] remove type parameter in DataReaderFactory
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21029 **[Test build #89430 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89430/testReport)** for PR 21029 at commit [`36031f2`](https://github.com/apache/spark/commit/36031f2bc803535ba49e058951aef3faea30380d).
[GitHub] spark issue #21036: [SPARK-23958][CORE] HadoopRdd filters empty files to avo...
Github user guoxiaolongzte commented on the issue: https://github.com/apache/spark/pull/21036 1. There is no need to loop twice; we can filter out splits whose length is not greater than 0 in a single pass. 2. This feature improves performance, so the switch should be enabled by default.
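Point 1 above can be illustrated with a tiny hypothetical sketch. The stand-in data is invented for illustration; in HadoopRDD the real objects are Hadoop `InputSplit`s with a length in bytes.

```python
# Stand-in splits: (name, length in bytes).
splits = [("part-0", 0), ("part-1", 128), ("part-2", 0), ("part-3", 64)]

# A single pass both checks each split's length and filters out the empty
# ones; there is no need for one loop to check and a second loop to filter.
non_empty = [s for s in splits if s[1] > 0]
print(non_empty)  # [('part-1', 128), ('part-3', 64)]
```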
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181957835

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
     } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

Ah, sorry for my misunderstanding. Changing line 581 would work well.
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Merged build finished. Test PASSed.
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2371/ Test PASSed.
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20984 **[Test build #89429 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89429/testReport)** for PR 20984 at commit [`8b5de0f`](https://github.com/apache/spark/commit/8b5de0f7d18af1d194305d7cd8c43ca007a5b980).
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Merged build finished. Test PASSed.
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2370/ Test PASSed.
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20984 **[Test build #89428 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89428/testReport)** for PR 20984 at commit [`6e8a697`](https://github.com/apache/spark/commit/6e8a697098e146ab0582df049831ec5f1237aa40).
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/20816 cc @cloud-fan
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181953816

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
--- End diff --

Hmm, wouldn't it be `a_1` -> `a_1_0`, `a` -> `a_0`, `a` -> `a_1`?
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Merged build finished. Test FAILed.
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89427/ Test FAILed.
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21078 **[Test build #89427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89427/testReport)** for PR 21078 at commit [`2d7a8ca`](https://github.com/apache/spark/commit/2d7a8cadb81eb1adf896ef566c1d27f42f8b26ba). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20633 Merged build finished. Test PASSed.
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20633 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89421/ Test PASSed.
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20633 **[Test build #89421 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89421/testReport)** for PR 20633 at commit [`80b668a`](https://github.com/apache/spark/commit/80b668afb0303b67ead8aed8d4d1f1996fa02658). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181952028

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
     } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

I think it still has a problem. For the sequence `a_1`, `a`, `a`, we get the duplicated name `a_1`. We can solve this problem by always adding the postfix.
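The collision discussed in this thread can be demonstrated with a hypothetical Python re-implementation of the freshName counter logic (the real code is Scala in `CodegenContext`); this only illustrates why `s"${fullName}_$id"` can still collide and how always appending the postfix avoids it.

```python
def fresh_name_underscore():
    """Model of the proposed change: first request for a name returns the
    bare name; later requests get _<id> appended."""
    ids = {}
    def fresh(name):
        if name in ids:
            i = ids[name]
            ids[name] = i + 1
            return f"{name}_{i}"
        ids[name] = 1
        return name
    return fresh

def fresh_name_always_postfix():
    """Model of the suggested fix: every request gets _<id> appended, so
    generated names cannot collide with each other."""
    ids = {}
    def fresh(name):
        i = ids.get(name, 0)
        ids[name] = i + 1
        return f"{name}_{i}"
    return fresh

f = fresh_name_underscore()
print([f(n) for n in ["a_1", "a", "a"]])  # ['a_1', 'a', 'a_1'] -- duplicate!

g = fresh_name_always_postfix()
print([g(n) for n in ["a_1", "a", "a"]])  # ['a_1_0', 'a_0', 'a_1'] -- all unique
```

The second run reproduces exactly the `a_1` -> `a_1_0`, `a` -> `a_0`, `a` -> `a_1` sequence mentioned elsewhere in the thread.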
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181950606

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
     } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

Isn't it changed to `s"${fullName}_$id"`? So you will get `a_0_1`.
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user WeichenXu123 commented on the issue: https://github.com/apache/spark/pull/19381 @dbtsai Good idea! Is there a related JIRA, or could you open one for it? cc @jkbradley
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2369/ Test PASSed.
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Merged build finished. Test PASSed.
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21078 **[Test build #89427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89427/testReport)** for PR 21078 at commit [`2d7a8ca`](https://github.com/apache/spark/commit/2d7a8cadb81eb1adf896ef566c1d27f42f8b26ba).
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user kiszk commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181949672

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
     } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

If I am correct, does this still have a conflict between `a_01` and `a_0$id` where `$id = 1`?
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

I am fine to accept different opinions for this specific PR. Reverting this PR is not my goal here. This is a public community. It sounds like the commit message clearly delivers to you what this PR does: `This PR proposes to add collect to a query executor as an action.`, although I still have different opinions. We always need to collect and accept different opinions. I am also glad you agree on the backport policy I proposed above. Hopefully, everyone is on the same page about avoiding unnecessary overhead.

> The minor bug fixes/improvements that have external behavior changes

I personally thought this PR fits this category. No matter whether the behavior changes are correct or not, we should still not backport it if the issue is neither critical nor a regression. That is what I emphasized multiple times in the argument above. API inconsistencies are not rare among our APIs, and we did not backport those PRs. Now, I am fine to backport this one because it is an experimental API; thus, we can say we do not guarantee backport compatibility. If it were a public API, I would insist on my original opinion.

I am also glad many community members have a lot of experience with mission-critical software development. This can help improve documentation, code quality and test coverage. Development of application/mobile software is completely different from development of system software. We are in the right direction. We need to enforce it with stricter discipline.
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20816 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89420/ Test PASSed.
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20816 Merged build finished. Test PASSed.
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20816 **[Test build #89420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89420/testReport)** for PR 20816 at commit [`194e6e7`](https://github.com/apache/spark/commit/194e6e76f3f4605c0c3026ab693124be44608715). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #20787: [MINOR][DOCS] Documenting months_between direction
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20787 Merged build finished. Test PASSed.
[GitHub] spark issue #20787: [MINOR][DOCS] Documenting months_between direction
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20787 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89419/ Test PASSed.
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21060 I do not see a problem with the commit message here. Is that really the issue? It accurately describes _what_ changes. The _why_ has always been documented in discussion, and it is here already. Sometimes the _why_ is documented in comments too; I don't see a particular need for that here, but if that's the issue, why isn't that what we're talking about? You continue to portray this as a behavior change, and I think you mean "a change in what is considered correct behavior". However, all the other comments suggest otherwise; the argument from consistency seems much stronger. Your proposed criteria for backports sort of align with accepted practice, which is to follow semver.org semantics. I think semver is reasonably clear, in general and in this case. I see broad agreement for this backport, and people simply disagree with your interpretation. It is not a failure to understand criteria. Believe me, people here have plenty of experience with software, versioning, and the impact of changes. I'd put more faith in the judgment of your peers. Your anecdotes are of a type that's familiar to many people, but I also fail to see how they're relevant here. You are adopting a 'conservative' position, and I think in this case it's out of line with normal practice. I think you should accept that people disagree and move on.
[GitHub] spark issue #20787: [MINOR][DOCS] Documenting months_between direction
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20787 **[Test build #89419 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89419/testReport)** for PR 20787 at commit [`3ca9948`](https://github.com/apache/spark/commit/3ca99489a1a88d16f371091e86e4a58962e00759). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request #20940: [SPARK-23429][CORE] Add executor memory metrics t...
Github user edwinalu commented on a diff in the pull request: https://github.com/apache/spark/pull/20940#discussion_r181943630

--- Diff: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -234,8 +244,22 @@ private[spark] class EventLoggingListener(
     }
   }

-  // No-op because logging every update would be overkill
-  override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = { }
+  /**
+   * Log if there is a new peak value for one of the memory metrics for the given executor.
+   * Metrics are cleared out when a new stage is started in onStageSubmitted, so this will
+   * log new peak memory metric values per executor per stage.
+   */
+  override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = {
--- End diff --

Thanks for your thoughts on this. The concerns are the size of the message and also logging, but it is only an extra few longs per heartbeat, and similarly for logging. Task end would help with minimizing communication for longer-running tasks. The heartbeats are only every 10 seconds, so perhaps not so bad.
[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat
Github user bersprockets commented on a diff in the pull request: https://github.com/apache/spark/pull/21073#discussion_r181943238 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala --- @@ -115,6 +116,62 @@ case class MapValues(child: Expression) override def prettyName: String = "map_values" } +/** + * Returns the union of all the given maps. + */ +@ExpressionDescription( +usage = "_FUNC_(map, ...) - Returns the union of all the given maps", +examples = """ +Examples: + > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd')); + [[1 -> "a"], [2 -> "c"], [3 -> "d"] + """) +case class MapConcat(children: Seq[Expression]) extends Expression + with CodegenFallback { + + override def checkInputDataTypes(): TypeCheckResult = { +// this check currently does not allow valueContainsNull to vary, +// and unfortunately none of the MapType toString methods include +// valueContainsNull for the error message +if (children.exists(!_.dataType.isInstanceOf[MapType])) { + TypeCheckResult.TypeCheckFailure( +s"The given input of function $prettyName should all be of type map, " + + "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]")) +} else if (children.map(_.dataType).distinct.length > 1) { + TypeCheckResult.TypeCheckFailure( +s"The given input maps of function $prettyName should all be the same type, " + + "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]")) +} else { + TypeCheckResult.TypeCheckSuccess +} + } + override def dataType: MapType = { +children.headOption.map(_.dataType.asInstanceOf[MapType]) + .getOrElse(MapType(keyType = StringType, valueType = StringType)) + } + + override def nullable: Boolean = false --- End diff -- @henryr empty map: scala> df.select(map_concat('map1, 'map2).as('newMap)).show df.select(map_concat('map1, 'map2).as('newMap)).show +--+ |newMap| +--+ |[]| |[]| +--+ Presto docs (from which the proposed spec comes) are quiet on the matter. 
Even after looking at the Presto code, I am still hard-pressed to say. I did divine from the Presto code that there should be at least two inputs (and I don't currently verify that). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
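The merge semantics implied by the example above (entries from later maps win on duplicate keys) can be modeled in plain Scala. This is only a sketch of the SQL-level behavior shown in the `ExpressionDescription` example, not Spark's actual `MapConcat` implementation, which operates on Catalyst expressions:

```scala
// Hypothetical model of map_concat's duplicate-key behavior: later maps win.
object MapConcatSketch {
  def mapConcat[K, V](maps: Seq[Map[K, V]]): Map[K, V] =
    // `++` lets entries from the right-hand map override duplicates on the left
    maps.foldLeft(Map.empty[K, V])(_ ++ _)
}
```

With the inputs from the example, `mapConcat(Seq(Map(1 -> "a", 2 -> "b"), Map(2 -> "c", 3 -> "d")))` yields the entries `1 -> "a"`, `2 -> "c"`, `3 -> "d"`; with no inputs it yields an empty map, matching the `[]` rows in the `show` output above.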
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user gatorsmile commented on the issue: https://github.com/apache/spark/pull/21060

I might not have explained it well. Sorry for the misunderstanding, and thank you @rxin for helping me clarify my points. It sounds like many of you think this backport is fine. I am not against this specific PR. We do not need to revert it, just improve the documentation. That should be fine, although I still personally prefer adding the configuration.

As I said in the original PR https://github.com/apache/spark/pull/21007 that was merged to master, let me make two points here too.

- PR descriptions become part of the commit log, so we need to be very careful before merging. In the past, I also missed a few when I did the merge. To be honest, I am not sure how native English speakers read it, but the first paragraph scared me when I read the PR commit log. @srowen WDYT?
```
This PR proposes to add collect to a query executor as an action.
```
- Document behavior changes that are visible to external users/developers. In Spark 2.3, we started to enforce this on every merged PR, and I believe many of you got similar comments on previous PRs. This PR should also update the migration guide. @HyukjinKwon Do you agree?

Before we finalize the backport policy, below are my inputs on a whitelist of what we can backport:

- Critical/important bug fixes and security fixes.
- Regression fixes.
- PRs that do not touch production code, such as test-only patches, documentation fixes, and log message fixes.

Avoid backporting PRs that contain:

- New features
- Minor bug fixes/improvements with external behavior changes
- Code refactoring
- Code changes with high/medium risk

In the OSS community, I believe no committer will be fired just because we merged/introduced a bug, right? If a user's application fails after an upgrade, we normally blame the users or say the bug was introduced accidentally.
However, this was not acceptable in my first team. Let me share what I experienced: various customer accidents in my related product teams.

- One director got demoted (and was almost fired) because of a bad release. She is a very nice lady; we really liked her. That release had many cool features, but its quality was not well controlled, and many customers were not willing to upgrade.
- There was a famous system-upgrade failure a few years ago. The whole system became very slow after the upgrade, and it took tens of hours to recover. A few days later, the GM went to the customer site and got blamed for a whole day. Multiple architects and VPs were forced to write apology letters, and the customer planned to sue us. On the customer side, the CTO was later fired, and the upgrade accident even made national TV news because it affected so many people.
- A few directors were on call with me for 10+ nights to resolve a data corruption issue for one Japanese customer. The client teams ran multiple systems in parallel to reproduce the issue. After a few weeks, it was finally resolved by reading the memory dump. The root cause was a code merge from one branch to another many years earlier.

If all the people above believe Spark is the best product in Big Data, we need to be more conservative. Our decisions could affect many people. This is not the first time I have argued with other committers/contributors about PR quality; on one previous PR, I left almost 100 comments just because the documents were not accurate.

If my comments offend anyone, I apologize. Everyone has a different understanding of software development because we have different work experience. The whole community has already done a wonderful job compared with other open source projects, but I still believe we can do better, right? Let us formalize the backport policy and enforce it in each release.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21079: [SPARK-23992][CORE] ShuffleDependency does not ne...
Github user 10110346 commented on a diff in the pull request: https://github.com/apache/spark/pull/21079#discussion_r181942230

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -113,3 +118,24 @@ private[spark] class ShuffleMapTask(
 
   override def toString: String = "ShuffleMapTask(%d, %d)".format(stageId, partitionId)
 }
+
+object ShuffleMapTask extends Logging {
+  private val cache = CacheBuilder.newBuilder()
--- End diff --

We don't need to clear it; this cache can recover itself.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
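The point that the cache "can recover itself" refers to bounded caches evicting entries on their own, so no explicit clear() is needed at application shutdown. The PR uses Guava's `CacheBuilder`; as a self-contained stand-in, the same idea can be sketched with a size-capped LRU map from the JDK. The class name and `maxEntries` cap are illustrative, not from the PR:

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// A size-bounded LRU cache: once more than maxEntries are present, the
// least-recently-accessed entry is dropped automatically on each insertion.
class LruCache[K, V](maxEntries: Int)
    extends JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
  // LinkedHashMap calls this after every put; returning true evicts `eldest`.
  override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
    size() > maxEntries
}
```

With a cap of 2, putting keys "a", "b", "c" in order silently evicts "a" — the cache bounds its own memory use without any external cleanup hook.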
[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21019 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2368/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21019 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...
Github user jinxing64 commented on the issue: https://github.com/apache/spark/pull/21019 Thanks for the comments from Imran and Xingbo. I made some changes; please take another look when you have time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21019: [SPARK-23948] Trigger mapstage's job listener in submitM...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21019 **[Test build #89426 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89426/testReport)** for PR 21019 at commit [`42a9b2e`](https://github.com/apache/spark/commit/42a9b2e2a4f23401affbf53f207a7c98d4da0bce). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20695 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20695 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89422/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20984 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2367/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20695 **[Test build #89422 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89422/testReport)** for PR 20695 at commit [`9a4a0ca`](https://github.com/apache/spark/commit/9a4a0ca43185d46800a9e29c9c3b0a139a1e29e9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20984: [SPARK-23875][SQL] Add IndexedSeq wrapper for ArrayData
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20984 **[Test build #89425 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89425/testReport)** for PR 20984 at commit [`77640c4`](https://github.com/apache/spark/commit/77640c449af4f3ba0d7e5a231c24f34f090314db). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21078 **[Test build #89423 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89423/testReport)** for PR 21078 at commit [`7d286c4`](https://github.com/apache/spark/commit/7d286c4322bca3c95f678018dc85e476d77e48da). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19868 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89423/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19868 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2366/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2365/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21078 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21079: [SPARK-23992][CORE] ShuffleDependency does not ne...
Github user Ngone51 commented on a diff in the pull request: https://github.com/apache/spark/pull/21079#discussion_r181938225

--- Diff: core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala ---
@@ -113,3 +118,24 @@ private[spark] class ShuffleMapTask(
 
   override def toString: String = "ShuffleMapTask(%d, %d)".format(stageId, partitionId)
 }
+
+object ShuffleMapTask extends Logging {
+  private val cache = CacheBuilder.newBuilder()
--- End diff --

Do we need to clear this `cache` at the end of an app?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19868: [SPARK-22676] Avoid iterating all partition paths when s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/19868 **[Test build #89424 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89424/testReport)** for PR 19868 at commit [`bf4277b`](https://github.com/apache/spark/commit/bf4277bf862000bb000d0cbf11092d8565f42afb). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181939704 --- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala --- @@ -195,6 +205,10 @@ class NewHadoopRDD[K, V]( e) finished = true null + case e: FileNotFoundException if ignoreMissingFiles => --- End diff -- Yes, I should change this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19868: [SPARK-22676] Avoid iterating all partition paths...
Github user jinxing64 commented on a diff in the pull request: https://github.com/apache/spark/pull/19868#discussion_r181939661

--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -276,6 +292,12 @@ class HadoopRDD[K, V](
       try {
         finished = !reader.next(key, value)
       } catch {
+        case e: FileNotFoundException if ignoreMissingFiles =>
+          logWarning(s"Skipped missing file: ${split.inputSplit}", e)
+          finished = true
+          null
--- End diff --

I removed this.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
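The guarded-catch pattern in the diffs above — swallow a `FileNotFoundException` only when an `ignoreMissingFiles` flag is set, propagate it otherwise — can be isolated into a small sketch. The `readOrSkip` helper and its signature are illustrative; in the real PR this logic lives inside the record-reader loops of `HadoopRDD`/`NewHadoopRDD` and uses `logWarning`:

```scala
import java.io.FileNotFoundException

object MissingFileSketch {
  // Returns Some(result) on success; returns None (after a warning) for a
  // missing file when ignoreMissingFiles is set; rethrows otherwise, because
  // the guard on the case clause makes the exception fall through uncaught.
  def readOrSkip[A](read: () => A, ignoreMissingFiles: Boolean): Option[A] =
    try Some(read())
    catch {
      case e: FileNotFoundException if ignoreMissingFiles =>
        Console.err.println(s"Skipped missing file: ${e.getMessage}")
        None
    }
}
```

The guard (`if ignoreMissingFiles`) on the `case` clause is what makes the flag opt-in: with the flag off, the same exception is simply not matched and propagates to the caller unchanged.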
[GitHub] spark issue #21078: [SPARK-23990][ML] Instruments logging improvements - ML ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21078 **[Test build #89423 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89423/testReport)** for PR 21078 at commit [`7d286c4`](https://github.com/apache/spark/commit/7d286c4322bca3c95f678018dc85e476d77e48da). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21034: [SPARK-23926][SQL] Extending reverse function to support...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/21034 LGTM. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20695 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2364/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20695 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20695: [SPARK-21741][ML][PySpark] Python API for DataFrame-base...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20695 **[Test build #89422 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89422/testReport)** for PR 20695 at commit [`9a4a0ca`](https://github.com/apache/spark/commit/9a4a0ca43185d46800a9e29c9c3b0a139a1e29e9). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20888: [SPARK-23775][TEST] Make DataFrameRangeSuite not flaky
Github user jiangxb1987 commented on the issue: https://github.com/apache/spark/pull/20888 Overall this LGTM --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181931856

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
     } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

I'd suggest something like `s"${fullName}_0"` at L581. It also solves the failed tests.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21080: [SPARK-23986][SQL] freshName can generate non-uni...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21080#discussion_r181931283

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ---
@@ -575,7 +575,7 @@ class CodegenContext {
     if (freshNameIds.contains(fullName)) {
       val id = freshNameIds(fullName)
       freshNameIds(fullName) = id + 1
-      s"$fullName$id"
+      s"${fullName}_$id"
    } else {
       freshNameIds += fullName -> 1
       fullName
--- End diff --

If the given name is something like `name1_1`, I think you can still produce a non-unique variable name.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
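The collision under discussion is easy to reproduce with a simplified model of the per-name counter. This is only a sketch of the scheme shown in the diff, not the actual `CodegenContext.freshName` (which also prefixes and sanitizes names); the `separator` parameter stands in for the `""`-versus-`"_"` difference between the old and new code:

```scala
import scala.collection.mutable

class FreshNamer(separator: String) {
  private val freshNameIds = mutable.Map.empty[String, Int]

  // First request for a name returns it verbatim; later requests append
  // separator + a counter, mirroring the logic in the diff above.
  def freshName(fullName: String): String =
    freshNameIds.get(fullName) match {
      case Some(id) =>
        freshNameIds(fullName) = id + 1
        s"$fullName$separator$id"
      case None =>
        freshNameIds += fullName -> 1
        fullName
    }
}
```

Without a separator, requesting `name11` once and `name1` twice yields `name11`, `name1`, `name11` — a collision. And as the comment above notes, adding `_` only narrows the problem: requesting `name1_1` once and `name1` twice yields `name1_1`, `name1`, `name1_1`.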
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21070 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/89415/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21070 **[Test build #89415 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89415/testReport)** for PR 21070 at commit [`27a66d8`](https://github.com/apache/spark/commit/27a66d8114552e199389c944517d42719861b9de). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user srowen commented on the issue: https://github.com/apache/spark/pull/21060 Adding a flag just in 2.3 is, at least, an unusual thing to do. By this logic lots of backports should be flag protected but we don't. Why is this special? I still don't see much argument against this backport. I count about 3-4 committers in favor and 1 against. Let's leave it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/21060

I am okay if there's a specific reason. I think this is the point: if there's a specific reason, it should be mentioned and explained ahead of time. Actually, I (and @srowen as well, IIUC) asked this many times; see above. I would have investigated, or would have just said that I am okay with reverting. I don't usually get in the way if there's a specific reason. It would be great if we could have more open talks next time.

> for the 2.3.x backport, add a config that so it is possible to turn this off in production, if somebody actually has their job failed because of this? It's a small delta from what this PR already does, and that should alleviate the concerns @gatorsmile has.

I am personally fine with reverting or adding a configuration if that's what you all feel strongly about; however, I should say it sounds unusual to have a config controlling this behaviour in branch-2.3 alone, and it seems less worthwhile. The case you mention sounds really unlikely, and I wonder whether it makes sense, though. It's also experimental, as you all said.

Also, I should note that I have been confused about the backporting policy and the bunch of configurations controlling each behaviour. If those are just concerns to be addressed, that's fine, but it sounds like something people have had to follow so far. If that is true, I feel sure this should be documented so we don't have such overhead next time. I am pretty sure this isn't the first time.

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20633 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2363/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20633 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20633 **[Test build #89421 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89421/testReport)** for PR 20633 at commit [`80b668a`](https://github.com/apache/spark/commit/80b668afb0303b67ead8aed8d4d1f1996fa02658). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/20633 @dbtsai Thanks! I've solved the conflicts. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20816 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/20816 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution/2362/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20816: [SPARK-21479][SQL] Outer join filter pushdown in null su...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20816 **[Test build #89420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89420/testReport)** for PR 20816 at commit [`194e6e7`](https://github.com/apache/spark/commit/194e6e76f3f4605c0c3026ab693124be44608715). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20633: [SPARK-23455][ML] Default Params in ML should be saved s...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/20633 Can you address the conflicts? Gonna start to review it soon. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19381: [SPARK-10884][ML] Support prediction on single instance ...
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/19381 Too late to the party! Great PR. We should also think about refactoring the predictors into `mllib-local` eventually, so people can use them in prod without depending on Spark. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #20787: [MINOR][DOCS] Documenting months_between direction
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/20787 **[Test build #89419 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89419/testReport)** for PR 20787 at commit [`3ca9948`](https://github.com/apache/spark/commit/3ca99489a1a88d16f371091e86e4a58962e00759). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21073: [SPARK-23936][SQL][WIP] Implement map_concat
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21073#discussion_r181915827

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -115,6 +116,62 @@ case class MapValues(child: Expression)
 
   override def prettyName: String = "map_values"
 }
 
+/**
+ * Returns the union of all the given maps.
+ */
+@ExpressionDescription(
+  usage = "_FUNC_(map, ...) - Returns the union of all the given maps",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(map(1, 'a', 2, 'b'), map(2, 'c', 3, 'd'));
+       [[1 -> "a"], [2 -> "c"], [3 -> "d"]]
+  """)
+case class MapConcat(children: Seq[Expression]) extends Expression
+  with CodegenFallback {
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    // this check currently does not allow valueContainsNull to vary,
+    // and unfortunately none of the MapType toString methods include
+    // valueContainsNull for the error message
+    if (children.exists(!_.dataType.isInstanceOf[MapType])) {
+      TypeCheckResult.TypeCheckFailure(
+        s"The given input of function $prettyName should all be of type map, " +
+          "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
+    } else if (children.map(_.dataType).distinct.length > 1) {
+      TypeCheckResult.TypeCheckFailure(
+        s"The given input maps of function $prettyName should all be the same type, " +
+          "but they are " + children.map(_.dataType.simpleString).mkString("[", ", ", "]"))
+    } else {
+      TypeCheckResult.TypeCheckSuccess
+    }
+  }
+
+  override def dataType: MapType = {
+    children.headOption.map(_.dataType.asInstanceOf[MapType])
+      .getOrElse(MapType(keyType = StringType, valueType = StringType))
+  }
+
+  override def nullable: Boolean = false
--- End diff --

What's the result of `map_concat(NULL, NULL)`?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21068: [SPARK-16630][YARN] Blacklist a node if executors...
Github user henryr commented on a diff in the pull request: https://github.com/apache/spark/pull/21068#discussion_r181908529

--- Diff: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala ---
@@ -0,0 +1,155 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.deploy.yarn
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable.HashMap
+
+import org.apache.hadoop.yarn.client.api.AMRMClient
+import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
+
+import org.apache.spark.SparkConf
+import org.apache.spark.deploy.yarn.config._
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.config._
+import org.apache.spark.util.{Clock, SystemClock, Utils}
+
+private[spark] class YarnAllocatorBlacklistTracker(
+    sparkConf: SparkConf,
+    amClient: AMRMClient[ContainerRequest],
+    failureWithinTimeIntervalTracker: FailureWithinTimeIntervalTracker)
+  extends Logging {
+
+  private val DEFAULT_TIMEOUT = "1h"
+
+  private val BLACKLIST_TIMEOUT_MILLIS =
+    sparkConf.get(BLACKLIST_TIMEOUT_CONF).getOrElse(Utils.timeStringAsMs(DEFAULT_TIMEOUT))
+
+  private val IS_YARN_ALLOCATION_BLACKLIST_ENABLED =
+    sparkConf.get(YARN_ALLOCATION_BLACKLIST_ENABLED).getOrElse(false)
+
+  private val BLACKLIST_MAX_FAILED_EXEC_PER_NODE = sparkConf.get(MAX_FAILED_EXEC_PER_NODE)
+
+  private val BLACKLIST_SIZE_LIMIT = sparkConf.get(YARN_BLACKLIST_SIZE_LIMIT)
+
+  private val BLACKLIST_SIZE_DEFAULT_WEIGHT = sparkConf.get(YARN_BLACKLIST_SIZE_DEFAULT_WEIGHT)
+
+  private var clock: Clock = new SystemClock
+
+  private val allocationBlacklistedNodesWithExpiry = new HashMap[String, Long]()
+
+  private var currentBlacklistedYarnNodes = Set.empty[String]
+
+  private var schedulerBlacklistedNodesWithExpiry = Map.empty[String, Long]
--- End diff --

Do you need to keep a separate data structure for the scheduler and allocator blacklisted nodes? Instead, could you add the scheduler ones into a shared map when `setSchedulerBlacklistedNodes` is called?

--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
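The shared-map alternative suggested in the review can be sketched as a pure function: fold both the scheduler- and allocator-blacklisted nodes into one node-to-expiry map and derive the current blacklist by dropping entries whose expiry has passed. All names and the explicit `now` parameter are illustrative, not from the PR:

```scala
object BlacklistSketch {
  def currentBlacklist(
      schedulerNodes: Map[String, Long],   // node -> expiry timestamp (millis)
      allocatorNodes: Map[String, Long],
      now: Long): Set[String] = {
    // Merge into one map; if both sides list a node, keep the later expiry.
    val merged = allocatorNodes ++ schedulerNodes.map { case (node, expiry) =>
      node -> math.max(expiry, allocatorNodes.getOrElse(node, Long.MinValue))
    }
    // A node is blacklisted only while its expiry lies in the future.
    merged.collect { case (node, expiry) if expiry > now => node }.toSet
  }
}
```

One design point this makes concrete: with a single map, "which blacklist is a node on?" becomes a non-question — expiry handling, size limiting, and the diff sent to YARN can all be computed from one structure.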