[GitHub] spark issue #23040: [SPARK-26068][Core]ChunkedByteBufferInputStream should h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23040 Merged build finished. Test PASSed.
[GitHub] spark issue #23040: [SPARK-26068][Core]ChunkedByteBufferInputStream should h...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23040 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98989/
[GitHub] spark issue #23040: [SPARK-26068][Core]ChunkedByteBufferInputStream should h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23040 **[Test build #98989 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98989/testReport)** for PR 23040 at commit [`3c6d349`](https://github.com/apache/spark/commit/3c6d349b26e54ead7c345e11ffacf14edcd072c1).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5132/
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Merged build finished. Test PASSed.
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23025 **[Test build #98992 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98992/testReport)** for PR 23025 at commit [`7ca4821`](https://github.com/apache/spark/commit/7ca48214cda312d78c22ad4305d2e490c46535f5).
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234512932

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
         At least one partition-by expression must be specified.
         When no explicit sort order is specified, "ascending nulls first" is assumed.
+        Note that due to performance reasons this method uses sampling to estimate the ranges.
--- End diff --

Oh right, I missed it! Pushed.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234511357

--- Diff: python/pyspark/sql/dataframe.py ---
@@ -732,6 +732,11 @@ def repartitionByRange(self, numPartitions, *cols):
         At least one partition-by expression must be specified.
         When no explicit sort order is specified, "ascending nulls first" is assumed.
+        Note that due to performance reasons this method uses sampling to estimate the ranges.
--- End diff --

Besides Python, we also have the `repartitionByRange` API in R. Can you also update it?
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5131/
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Merged build finished. Test PASSed.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user JulienPeloton commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234509708

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
    * When no explicit sort order is specified, "ascending nulls first" is assumed.
    * Note, the rows are not sorted in each partition of the resulting Dataset.
    *
+   *
+   * Note that due to performance reasons this method uses sampling to estimate the ranges.
+   * Hence, the output may not be consistent, since sampling can return different values.
+   * The sample size can be controlled by setting the value of the parameter
+   * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
--- End diff --

@cloud-fan the sentence has been changed according to your suggestion (in both Spark & PySpark).
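Since the doc change above describes behavior users can observe, here is a minimal sketch of how the sampling knob interacts with `repartitionByRange`. This is illustrative only: the default of 100 sampled rows per partition is an assumption based on the SQLConf default at the time, and the data and partition counts are arbitrary.

```scala
import org.apache.spark.sql.SparkSession

object RangeRepartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("range-repartition").getOrCreate()
    import spark.implicits._

    // Raise the per-partition sample size so the estimated range boundaries
    // are more stable across runs (assumed default: 100).
    spark.conf.set("spark.sql.execution.rangeExchange.sampleSizePerPartition", 200)

    val df = spark.range(0, 1000000).toDF("id")
    // Rows are range-partitioned by id into 8 partitions; boundaries are
    // estimated by sampling, so they may vary slightly between runs.
    val partitioned = df.repartitionByRange(8, $"id")
    println(partitioned.rdd.getNumPartitions) // 8
    spark.stop()
  }
}
```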
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23025 **[Test build #98991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98991/testReport)** for PR 23025 at commit [`f829dfe`](https://github.com/apache/spark/commit/f829dfe0ce5c4d6be68c1247102d58a99b21ad56).
[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/23079#discussion_r234508866

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala ---
@@ -298,6 +299,45 @@ class ReplaceNullWithFalseSuite extends PlanTest {
     testProjection(originalExpr = column, expectedExpr = column)
   }
+
+  test("replace nulls in lambda function of ArrayFilter") {
+    val cond = GreaterThan(UnresolvedAttribute("e"), Literal(0))
--- End diff --

Actually I intentionally made all three lambdas the same (the `MapFilter` one only differs in the lambda parameters). I can encapsulate this lambda function into a test utility function. Let me update the PR and see what you think.
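For illustration, a sketch of what such a test utility could look like. The object name `LambdaTestUtils`, the method name, and the defaults are hypothetical, not from the PR; only the `e > 0` predicate shape comes from the quoted test.

```scala
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
import org.apache.spark.sql.catalyst.expressions.{GreaterThan, LambdaFunction, Literal}

// Hypothetical test helper encapsulating the shared predicate lambda.
object LambdaTestUtils {
  // Builds the `... -> arg > 0` lambda used by the ArrayFilter/ArrayExists
  // cases; the MapFilter case would pass two parameter names (key and value).
  def positivePredicateLambda(params: Seq[String] = Seq("e")): LambdaFunction = {
    val args = params.map(UnresolvedAttribute(_))
    LambdaFunction(GreaterThan(args.head, Literal(0)), args)
  }
}
```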
[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...
Github user rednaxelafx commented on a diff in the pull request: https://github.com/apache/spark/pull/23079#discussion_r234508561

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -767,6 +767,15 @@ object ReplaceNullWithFalse extends Rule[LogicalPlan] {
           replaceNullWithFalse(cond) -> value
         }
         cw.copy(branches = newBranches)
+      case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
--- End diff --

I'm not sure if that's useful or not. First of all, the `replaceNullWithFalse` handling doesn't apply to all higher-order functions. In fact it only applies to a very narrow set: ones where a lambda function returns `BooleanType` and is immediately used as a predicate. So having a generic utility can certainly help make this PR slightly simpler, but I don't know how useful it is for other cases. I'd prefer waiting for more such transformation cases before introducing a new utility for the pattern. WDYT?
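To make the "narrow set" argument concrete, here is a self-contained sketch of the pattern being discussed. The wrapper object and the fall-through case are illustrative, and the `copy` bodies are reconstructed from the one visible match arm rather than taken from the full diff.

```scala
import org.apache.spark.sql.catalyst.expressions._

// Sketch (not the PR's exact code): push a null-to-false rewrite into the
// boolean-returning lambdas of the predicate-like higher-order functions.
object ReplaceNullInHofPredicates {
  def apply(e: Expression, replaceNullWithFalse: Expression => Expression): Expression = e match {
    case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
      af.copy(function = lf.copy(function = replaceNullWithFalse(func)))
    case ae @ ArrayExists(_, lf @ LambdaFunction(func, _, _)) =>
      ae.copy(function = lf.copy(function = replaceNullWithFalse(func)))
    case mf @ MapFilter(_, lf @ LambdaFunction(func, _, _)) =>
      mf.copy(function = lf.copy(function = replaceNullWithFalse(func)))
    case other => other // lambdas used as values, not predicates, are untouched
  }
}
```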
[GitHub] spark issue #23082: [SPARK-26112][SQL] Update since versions of new built-in...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23082 **[Test build #98990 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98990/testReport)** for PR 23082 at commit [`f26db66`](https://github.com/apache/spark/commit/f26db66986a12049e14d1b234840b66f0b96767f).
[GitHub] spark issue #23082: [SPARK-26112][SQL] Update since versions of new built-in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23082 Merged build finished. Test PASSed.
[GitHub] spark issue #23082: [SPARK-26112][SQL] Update since versions of new built-in...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23082 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5130/
[GitHub] spark issue #23082: [SPARK-26112][SQL] Update since versions of new built-in...
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/23082 cc @cloud-fan @gatorsmile @dongjoon-hyun
[GitHub] spark pull request #23082: [SPARK-26112][SQL] Update since versions of new b...
GitHub user ueshin opened a pull request: https://github.com/apache/spark/pull/23082 [SPARK-26112][SQL] Update since versions of new built-in functions.

## What changes were proposed in this pull request?

The following 5 functions were removed from branch-2.4:
- map_entries
- map_filter
- transform_values
- transform_keys
- map_zip_with

We should update the since version to 3.0.0.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ueshin/apache-spark issues/SPARK-26112/since

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23082.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23082

commit f26db66986a12049e14d1b234840b66f0b96767f
Author: Takuya UESHIN
Date: 2018-11-19T06:36:38Z

    Update since version to 3.0.0.
[GitHub] spark issue #23045: [SPARK-26071][SQL] disallow map as map key
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/23045 LGTM.
[GitHub] spark pull request #23045: [SPARK-26071][SQL] disallow map as map key
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23045#discussion_r234502854

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -521,13 +521,18 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInp
 case class MapConcat(children: Seq[Expression]) extends ComplexTypeMergingExpression {
   override def checkInputDataTypes(): TypeCheckResult = {
-    var funcName = s"function $prettyName"
+    val funcName = s"function $prettyName"
     if (children.exists(!_.dataType.isInstanceOf[MapType])) {
       TypeCheckResult.TypeCheckFailure(
         s"input to $funcName should all be of type map, but it's " +
           children.map(_.dataType.catalogString).mkString("[", ", ", "]"))
     } else {
-      TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      val sameTypeCheck = TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      if (sameTypeCheck.isFailure) {
+        sameTypeCheck
+      } else {
+        TypeUtils.checkForMapKeyType(dataType.keyType)
--- End diff --

oh, I see. thanks!
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of non-struct type unde...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23054 Merged build finished. Test PASSed.
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of non-struct type unde...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23054 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98988/
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of non-struct type unde...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23054 **[Test build #98988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98988/testReport)** for PR 23054 at commit [`b5cfda4`](https://github.com/apache/spark/commit/b5cfda40cf0939e03900e571b1642285fea9a528).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23043 Do we need to consider `GenerateSafeProjection`, too? In other words, if the generated code or runtime does not use data in `Unsafe`, this `+0.0/-0.0` problem may still exist. Am I correct?
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98987/
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Merged build finished. Test PASSed.
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23025 **[Test build #98987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98987/testReport)** for PR 23025 at commit [`654fed9`](https://github.com/apache/spark/commit/654fed90997140715d2d52578ca6e4f0661d4e69).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23076: [SPARK-26103][SQL] Added maxDepth to limit the length of...
Github user HeartSaVioR commented on the issue: https://github.com/apache/spark/pull/23076 I'm seeing both sides of the need: while I think dumping the full plan to a file is a good feature for debugging a specific issue, retaining full plans to render on the UI page has been a headache, and three related issues ([SPARK-23904](https://issues.apache.org/jira/browse/SPARK-23904), [SPARK-25380](https://issues.apache.org/jira/browse/SPARK-25380), [SPARK-26103](https://issues.apache.org/jira/browse/SPARK-26103)) have been filed within 3 months, which doesn't look like something we can tell end users to work around. One thing we should be aware of is that a huge plan is generated not only by nested joins but also by lots of columns, as in SPARK-23904. For SPARK-25380 we are not aware of which part generates the huge plan. So it might be easier and more flexible to just truncate to a specific size rather than applying conditions.
[GitHub] spark pull request #23045: [SPARK-26071][SQL] disallow map as map key
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23045#discussion_r234494542

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -521,13 +521,18 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInp
 case class MapConcat(children: Seq[Expression]) extends ComplexTypeMergingExpression {
   override def checkInputDataTypes(): TypeCheckResult = {
-    var funcName = s"function $prettyName"
+    val funcName = s"function $prettyName"
     if (children.exists(!_.dataType.isInstanceOf[MapType])) {
       TypeCheckResult.TypeCheckFailure(
         s"input to $funcName should all be of type map, but it's " +
           children.map(_.dataType.catalogString).mkString("[", ", ", "]"))
     } else {
-      TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      val sameTypeCheck = TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      if (sameTypeCheck.isFailure) {
+        sameTypeCheck
+      } else {
+        TypeUtils.checkForMapKeyType(dataType.keyType)
--- End diff --

see https://github.com/apache/spark/pull/23045/files#diff-3f19ec3d15dcd8cd42bb25dde1c5c1a9R20 . The child may be read from parquet files, so map of map is still possible.
[GitHub] spark issue #23040: [SPARK-26068][Core]ChunkedByteBufferInputStream should h...
Github user LinhongLiu commented on the issue: https://github.com/apache/spark/pull/23040 cc @cloud-fan @srowen The issues raised in review are fixed.
[GitHub] spark issue #23040: [SPARK-26068][Core]ChunkedByteBufferInputStream should h...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23040 **[Test build #98989 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98989/testReport)** for PR 23040 at commit [`3c6d349`](https://github.com/apache/spark/commit/3c6d349b26e54ead7c345e11ffacf14edcd072c1).
[GitHub] spark issue #23058: [SPARK-25905][CORE] When getting a remote block, avoid f...
Github user squito commented on the issue: https://github.com/apache/spark/pull/23058 @attilapiros can you review this please?
[GitHub] spark pull request #23027: [SPARK-26049][SQL][TEST] FilterPushdownBenchmark ...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/23027#discussion_r234482766

--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -2,669 +2,809 @@
 Pushdown for many distinct value case

-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

 Select 0 string row (value IS NULL):    Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
-Parquet Vectorized                          11405 / 11485        1.4        725.1      1.0X
-Parquet Vectorized (Pushdown)                 675 /   690       23.3         42.9     16.9X
-Native ORC Vectorized                        7127 /  7170        2.2        453.1      1.6X
-Native ORC Vectorized (Pushdown)              519 /   541       30.3         33.0     22.0X
+Parquet Vectorized                           7823 /  7996        2.0        497.4      1.0X
+Parquet Vectorized (Pushdown)                 460 /   468       34.2         29.2     17.0X
+Native ORC Vectorized                        5412 /  5550        2.9        344.1      1.4X
+Native ORC Vectorized (Pushdown)              551 /   563       28.6         35.0     14.2X
+InMemoryTable Vectorized                        6 /     6     2859.1          0.3   1422.0X
+InMemoryTable Vectorized (Pushdown)             5 /     6     3023.0          0.3   1503.6X

-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

 Select 0 string row ('7864320' < value < '7864320'):  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
-Parquet Vectorized                          11457 / 11473        1.4        728.4      1.0X
-Parquet Vectorized (Pushdown)                 656 /   686       24.0         41.7     17.5X
-Native ORC Vectorized                        7328 /  7342        2.1        465.9      1.6X
-Native ORC Vectorized (Pushdown)              539 /   565       29.2         34.2     21.3X
+Parquet Vectorized                           8322 / 11160        1.9        529.1      1.0X
+Parquet Vectorized (Pushdown)                 463 /   472       34.0         29.4     18.0X
+Native ORC Vectorized                        5622 /  5635        2.8        357.4      1.5X
+Native ORC Vectorized (Pushdown)              563 /   595       27.9         35.8     14.8X
+InMemoryTable Vectorized                     4831 /  4881        3.3        307.2      1.7X
+InMemoryTable Vectorized (Pushdown)          1980 /  2027        7.9        125.9      4.2X

-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz

 Select 1 string row (value = '7864320'):  Best/Avg Time(ms)  Rate(M/s)  Per Row(ns)  Relative
-Parquet Vectorized                          11878 / 11888        1.3        755.2      1.0X
-Parquet Vectorized (Pushdown)                 630 /   654       25.0         40.1     18.9X
-Native ORC Vectorized                        7342 /  7362        2.1        466.8      1.6X
-Native ORC Vectorized (Pushdown)              519 /   537       30.3         33.0     22.9X
+Parquet Vectorized                           8322 /  8386        1.9        529.1      1.0X
+Parquet Vectorized (Pushdown)                 434 /   441       36.2         27.6     19.2X
+Native ORC Vectorized                        5659 /  5944        2.8        359.8      1.5X
+Native ORC Vectorized (Pushdown)              535 /   567       29.4         34.0     15.6X
+InMemoryTable Vectorized                     4784 /  4879        3.3        304.1      1.7X
+InMemoryTable Vectorized (Pushdown)          1950 /  1985        8.1        124.0      4.3X

-OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
-Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
+Java HotSpot(TM) 64-Bit Server VM 1.8.
[GitHub] spark pull request #23045: [SPARK-26071][SQL] disallow map as map key
Github user ueshin commented on a diff in the pull request: https://github.com/apache/spark/pull/23045#discussion_r234481249

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala ---
@@ -521,13 +521,18 @@ case class MapEntries(child: Expression) extends UnaryExpression with ExpectsInp
 case class MapConcat(children: Seq[Expression]) extends ComplexTypeMergingExpression {
   override def checkInputDataTypes(): TypeCheckResult = {
-    var funcName = s"function $prettyName"
+    val funcName = s"function $prettyName"
     if (children.exists(!_.dataType.isInstanceOf[MapType])) {
       TypeCheckResult.TypeCheckFailure(
         s"input to $funcName should all be of type map, but it's " +
           children.map(_.dataType.catalogString).mkString("[", ", ", "]"))
     } else {
-      TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      val sameTypeCheck = TypeUtils.checkForSameTypeInputExpr(children.map(_.dataType), funcName)
+      if (sameTypeCheck.isFailure) {
+        sameTypeCheck
+      } else {
+        TypeUtils.checkForMapKeyType(dataType.keyType)
--- End diff --

I don't think we need this. Shouldn't the children already be free of map-type keys?
[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23081 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98984/
[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23081 Merged build finished. Test FAILed.
[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23081 **[Test build #98984 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98984/testReport)** for PR 23081 at commit [`131164c`](https://github.com/apache/spark/commit/131164c2104a119468e782fb1d484f2d15274e33).
* This patch **fails Spark unit tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #21363: [SPARK-19228][SQL] Migrate on Java 8 time from FastDateF...
Github user xuanyuanking commented on the issue: https://github.com/apache/spark/pull/21363 @MaxGekk Sorry for the delay, something else came up in my schedule. I plan to start on this PR this weekend; if that's too late, please just take it over. Sorry again for the delay.
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23043 Is it better to update this PR title now?
[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of non-struct ty...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23054#discussion_r234477289

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -459,7 +460,11 @@ class KeyValueGroupedDataset[K, V] private[sql](
       columns.map(_.withInputType(vExprEnc, dataAttributes).named)
     val keyColumn = if (!kExprEnc.isSerializedAsStruct) {
       assert(groupingAttributes.length == 1)
-      groupingAttributes.head
+      if (SQLConf.get.aliasNonStructGroupingKey) {
--- End diff --

hmm, don't we want to have the "key" attribute by default, and only keep the old "value" attribute when the legacy config is turned on?
[GitHub] spark issue #23043: [SPARK-26021][SQL] replace minus zero with zero in Unsaf...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/23043 @srowen #21794 is what I thought.
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23080 Ah, also, `CsvParser.beginParsing` takes an additional `Charset` argument, so it should be fairly easy to support encoding in `multiLine` mode as well. @MaxGekk, would you be able to find some time to work on it? If that change makes the current PR easier, we can merge that one first.
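A minimal sketch of the univocity API being referenced, assuming the `beginParsing(InputStream, Charset)` overload mentioned in the comment; the file name and charset here are placeholders, not anything from the PR.

```scala
import java.io.FileInputStream
import java.nio.charset.Charset

import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

object EncodingAwareParse {
  def main(args: Array[String]): Unit = {
    val settings = new CsvParserSettings()
    settings.setLineSeparatorDetectionEnabled(true)
    val parser = new CsvParser(settings)

    // Hypothetical input file in a non-UTF-8 encoding; beginParsing decodes
    // the stream with the given Charset, which is what a multiLine reader
    // honoring the `encoding` option would need.
    val in = new FileInputStream("data.csv")
    parser.beginParsing(in, Charset.forName("ISO-8859-1"))
    var row = parser.parseNext()
    while (row != null) {
      println(row.mkString("|"))
      row = parser.parseNext()
    }
    parser.stopParsing()
  }
}
```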
[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of non-struct ty...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23054#discussion_r234476607

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala ---
@@ -459,7 +460,11 @@ class KeyValueGroupedDataset[K, V] private[sql](
       columns.map(_.withInputType(vExprEnc, dataAttributes).named)
     val keyColumn = if (!kExprEnc.isSerializedAsStruct) {
       assert(groupingAttributes.length == 1)
-      groupingAttributes.head
+      if (SQLConf.get.aliasNonStructGroupingKey) {
--- End diff --

we should do the alias when the config is true...
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r234476318

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala ---
@@ -192,6 +192,20 @@ class CSVOptions(
    */
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
+
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
+    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+    require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
+    sep
+  }
+
+  val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep =>
+    lineSep.getBytes("UTF-8")
--- End diff --

@MaxGekk, CSV's multiLine mode does not support encoding, but I think normal mode supports `encoding`. It should be okay to get bytes from it. We can just throw an exception when multiLine is enabled.
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234476361

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala ---
@@ -723,4 +723,32 @@ class DataFrameAggregateSuite extends QueryTest with SharedSQLContext {
       "grouping expressions: [current_date(None)], value: [key: int, value: string], " +
         "type: GroupBy]"))
   }
+
+  test("SPARK-26021: Double and Float 0.0/-0.0 should be equal when grouping") {
+    val colName = "i"
+    def groupByCollect(df: DataFrame): Array[Row] = {
+      df.groupBy(colName).count().collect()
+    }
+    def assertResult[T](result: Array[Row], zero: T)(implicit ordering: Ordering[T]): Unit = {
+      assert(result.length == 1)
+      // using compare since 0.0 == -0.0 is true
+      assert(ordering.compare(result(0).getAs[T](0), zero) == 0)
--- End diff --

Instead of checking the result, I prefer the code snippet in the JIRA ticket, which makes it more obvious where the problem is. Let's run a group-by query with both 0.0 and -0.0 in the input, then check the number of result rows: since ideally 0.0 and -0.0 are the same, we should only have one group (one result row).
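A sketch of the end-to-end check described above (not the PR's actual test): group rows containing both 0.0 and -0.0 and assert they collapse into a single group.

```scala
import org.apache.spark.sql.SparkSession

object MinusZeroGroupBy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("minus-zero").getOrCreate()
    import spark.implicits._

    val result = Seq(0.0d, -0.0d).toDF("d").groupBy("d").count().collect()
    // Before the fix, 0.0 and -0.0 serialize to different unsafe-row bytes
    // and land in two groups; after the fix there is exactly one group.
    assert(result.length == 1, s"expected one group, got ${result.length}")
    spark.stop()
  }
}
```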
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of non-struct type unde...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23054 **[Test build #98988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98988/testReport)** for PR 23054 at commit [`b5cfda4`](https://github.com/apache/spark/commit/b5cfda40cf0939e03900e571b1642285fea9a528).
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of non-struct type unde...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23054 Merged build finished. Test PASSed.
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of non-struct type unde...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23054 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5129/
[GitHub] spark issue #21888: [SPARK-24253][SQL][WIP] Implement DeleteFrom for v2 tabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21888 **[Test build #98986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98986/testReport)** for PR 21888 at commit [`f8b178d`](https://github.com/apache/spark/commit/f8b178d34b870e779ec061175f01ba63a5adc076).
* This patch **fails to build**.
* This patch **does not merge cleanly**.
* This patch adds the following public classes _(experimental)_:
  * `case class UnresolvedRelation(table: CatalogTableIdentifier) extends LeafNode with NamedRelation`
  * `sealed trait IdentifierWithOptionalDatabaseAndCatalog`
  * `case class CatalogTableIdentifier(table: String, database: Option[String], catalog: Option[String])`
  * `class TableIdentifier(name: String, db: Option[String])`
  * `implicit class CatalogHelper(catalog: CatalogProvider)`
  * `case class ResolveCatalogV2Relations(sparkSession: SparkSession) extends Rule[LogicalPlan]`
  * `case class DeleteFromV2Exec(rel: TableV2Relation, expr: Expression)`
[GitHub] spark issue #21888: [SPARK-24253][SQL][WIP] Implement DeleteFrom for v2 tabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21888 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98986/
[GitHub] spark issue #21888: [SPARK-24253][SQL][WIP] Implement DeleteFrom for v2 tabl...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/21888 Build finished. Test FAILed.
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234475978

--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java ---
@@ -157,4 +159,15 @@ public void heapMemoryReuse() {
     Assert.assertEquals(onheap4.size(), 1024 * 1024 + 7);
     Assert.assertEquals(obj3, onheap4.getBaseObject());
   }
+
+  @Test
+  // SPARK-26021
+  public void writeMinusZeroIsReplacedWithZero() {
+    byte[] doubleBytes = new byte[Double.BYTES];
+    byte[] floatBytes = new byte[Float.BYTES];
+    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
+    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
+    Assert.assertEquals(0, Double.compare(0.0d, ByteBuffer.wrap(doubleBytes).getDouble()));
--- End diff --

are you sure this test fails before the fix? IIUC `0.0 == -0.0` is true, but they have different binary formats
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234476055

--- Diff: common/unsafe/src/test/java/org/apache/spark/unsafe/PlatformUtilSuite.java ---
@@ -157,4 +159,15 @@ public void heapMemoryReuse() {
     Assert.assertEquals(onheap4.size(), 1024 * 1024 + 7);
     Assert.assertEquals(obj3, onheap4.getBaseObject());
   }
+
+  @Test
+  // SPARK-26021
+  public void writeMinusZeroIsReplacedWithZero() {
+    byte[] doubleBytes = new byte[Double.BYTES];
+    byte[] floatBytes = new byte[Float.BYTES];
+    Platform.putDouble(doubleBytes, Platform.BYTE_ARRAY_OFFSET, -0.0d);
+    Platform.putFloat(floatBytes, Platform.BYTE_ARRAY_OFFSET, -0.0f);
+    Assert.assertEquals(0, Double.compare(0.0d, ByteBuffer.wrap(doubleBytes).getDouble()));
--- End diff --

BTW thanks for adding the unit test! It's a good complement to the end-to-end test.
[GitHub] spark pull request #23043: [SPARK-26021][SQL] replace minus zero with zero i...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23043#discussion_r234475858

--- Diff: common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java ---
@@ -120,6 +120,9 @@ public static float getFloat(Object object, long offset) {
   }

   public static void putFloat(Object object, long offset, float value) {
+    if (value == -0.0f) {
--- End diff --

I'm fine with putting this trick here; shall we also move the IsNaN logic here as well?
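For readers following along, a small sketch of why the trick is needed and what it does. The `normalize` helper mirrors the diff's `if (value == -0.0f)` guard but is not the PR's code; the point is that `==` treats the two zeros as equal while their raw bit patterns differ, so byte-wise hashing and grouping of unsafe rows see them as distinct.

```scala
object MinusZeroNormalization {
  // Same effect as the guard in the diff: -0.0f becomes +0.0f before being
  // written. (The branch also fires for +0.0f, where it is a harmless no-op,
  // because == considers the two zeros equal.)
  def normalize(value: Float): Float = if (value == -0.0f) 0.0f else value

  def main(args: Array[String]): Unit = {
    println(java.lang.Float.floatToRawIntBits(0.0f))  // 0
    println(java.lang.Float.floatToRawIntBits(-0.0f)) // -2147483648 (sign bit set)
    assert(java.lang.Float.floatToRawIntBits(normalize(-0.0f)) ==
      java.lang.Float.floatToRawIntBits(0.0f))
  }
}
```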
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23025 **[Test build #98987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98987/testReport)** for PR 23025 at commit [`654fed9`](https://github.com/apache/spark/commit/654fed90997140715d2d52578ca6e4f0661d4e69).
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5128/
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23025 Merged build finished. Test PASSed.
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r234475595

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala ---
@@ -192,6 +192,20 @@ class CSVOptions(
    */
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
+
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
+    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+    require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
--- End diff --

We could say the line separator should be 1 or 2 bytes (UTF-8) in the read path specifically.
[GitHub] spark pull request #23025: [SPARK-26024][SQL]: Update documentation for repa...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23025#discussion_r234475550

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -2789,6 +2789,12 @@ class Dataset[T] private[sql](
    * When no explicit sort order is specified, "ascending nulls first" is assumed.
    * Note, the rows are not sorted in each partition of the resulting Dataset.
    *
+   *
+   * Note that due to performance reasons this method uses sampling to estimate the ranges.
+   * Hence, the output may not be consistent, since sampling can return different values.
+   * The sample size can be controlled by setting the value of the parameter
+   * `spark.sql.execution.rangeExchange.sampleSizePerPartition`.
--- End diff --

It's not a parameter but a config. So I'd like to propose:

```
The sample size can be controlled by the config `xxx`
```
[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of primitive typ...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/23054#discussion_r234475488

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,6 +17,9 @@ displayTitle: Spark SQL Upgrading Guide

   - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

+  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is atomic type, e.g. int, string, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.atomicKeyAttributeGroupByKey` with a default value of `false`.
--- End diff --

Ok. More accurate.
[GitHub] spark issue #23025: [SPARK-26024][SQL]: Update documentation for repartition...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/23025 ok to test
[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of primitive typ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23054#discussion_r234475321

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1594,6 +1594,15 @@ object SQLConf {
         "WHERE, which does not follow SQL standard.")
       .booleanConf
       .createWithDefault(false)
+
+  val LEGACY_ATOMIC_KEY_ATTRIBUTE_GROUP_BY_KEY =
+    buildConf("spark.sql.legacy.atomicKeyAttributeGroupByKey")
--- End diff --

`spark.sql.legacy.dataset.aliasNonStructGroupingKey`?
[GitHub] spark pull request #23080: [SPARK-26108][SQL] Support custom lineSep in CSV ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/23080#discussion_r234475228

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala ---
@@ -192,6 +192,20 @@ class CSVOptions(
    */
   val emptyValueInWrite = emptyValue.getOrElse("\"\"")
+
+  /**
+   * A string between two consecutive JSON records.
+   */
+  val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
+    require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
+    require(sep.length <= 2, "'lineSep' can contain 1 or 2 characters.")
--- End diff --

@MaxGekk, might not be a super big deal but I believe this should be counted after converting it into `UTF-8`.
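A quick illustration of the character-count versus UTF-8 byte-count distinction being raised; the sample separators are arbitrary. A one-character separator can occupy more than one byte once encoded, which is what the byte-oriented read path actually sees.

```scala
import java.nio.charset.StandardCharsets

object LineSepLength {
  def main(args: Array[String]): Unit = {
    // One character, one byte.
    println("\n".getBytes(StandardCharsets.UTF_8).length)     // 1
    // Two characters, two bytes.
    println("\r\n".getBytes(StandardCharsets.UTF_8).length)   // 2
    // One character (the section sign), but two bytes in UTF-8.
    println("\u00a7".getBytes(StandardCharsets.UTF_8).length) // 2
  }
}
```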
[GitHub] spark pull request #23054: [SPARK-26085][SQL] Key attribute of primitive typ...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23054#discussion_r234475156

--- Diff: docs/sql-migration-guide-upgrade.md ---
@@ -17,6 +17,9 @@ displayTitle: Spark SQL Upgrading Guide

   - The `ADD JAR` command previously returned a result set with the single value 0. It now returns an empty result set.

+  - In Spark version 2.4 and earlier, `Dataset.groupByKey` results to a grouped dataset with key attribute wrongly named as "value", if the key is atomic type, e.g. int, string, etc. This is counterintuitive and makes the schema of aggregation queries weird. For example, the schema of `ds.groupByKey(...).count()` is `(value, count)`. Since Spark 3.0, we name the grouping attribute to "key". The old behaviour is preserved under a newly added configuration `spark.sql.legacy.atomicKeyAttributeGroupByKey` with a default value of `false`.
--- End diff --

I realized that only a struct-type key gets the `key` alias. So here we should say: `if the key is non-struct type, e.g. int, string, array, etc.`
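A sketch of the observable behavior under discussion. The printed schemas reflect what the thread describes for Spark 2.4 versus the proposed 3.0 behavior; they are not output captured from a real run, and the legacy config name is still being bikeshedded above.

```scala
import org.apache.spark.sql.SparkSession

object GroupByKeySchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("gbk-schema").getOrCreate()
    import spark.implicits._

    // Non-struct (int) grouping key.
    val counted = Seq("a", "bb", "ccc").toDS().groupByKey(_.length).count()
    counted.printSchema()
    // Spark 2.4:  value: integer, count(1): long   (key column named "value")
    // Spark 3.0:  key: integer,   count(1): long   (unless the legacy config is on)
    spark.stop()
  }
}
```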
[GitHub] spark issue #21888: [SPARK-24253][SQL][WIP] Implement DeleteFrom for v2 tabl...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/21888 **[Test build #98986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98986/testReport)** for PR 21888 at commit [`f8b178d`](https://github.com/apache/spark/commit/f8b178d34b870e779ec061175f01ba63a5adc076).
[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/23079#discussion_r234474562

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala ---
@@ -767,6 +767,15 @@ object ReplaceNullWithFalse extends Rule[LogicalPlan] {
           replaceNullWithFalse(cond) -> value
         }
         cw.copy(branches = newBranches)
+      case af @ ArrayFilter(_, lf @ LambdaFunction(func, _, _)) =>
--- End diff --

shall we add a `withNewFunctions` method in `HigherOrderFunction`? Then we can simplify this rule to:

```
case f: HigherOrderFunction => f.withNewFunctions(f.functions.map(replaceNullWithFalse))
```
[GitHub] spark pull request #23077: [SPARK-26105][PYTHON] Clean unittest2 imports up ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/23077
[GitHub] spark issue #23077: [SPARK-26105][PYTHON] Clean unittest2 imports up that we...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/23077 Merged to master. Thanks for reviewing this, @BryanCutler and @srowen.
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98985/
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23077 Merged build finished. Test PASSed.
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23077 **[Test build #98985 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98985/testReport)** for PR 23077 at commit [`a188076`](https://github.com/apache/spark/commit/a1880767041b325e4343bd6a1737cdccfe614792).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user viirya commented on the issue: https://github.com/apache/spark/pull/23054 For non-primitive types there is a struct named "key".
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23077 Merged build finished. Test PASSed.
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23077 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5127/
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23077 **[Test build #98985 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98985/testReport)** for PR 23077 at commit [`a188076`](https://github.com/apache/spark/commit/a1880767041b325e4343bd6a1737cdccfe614792).
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23077 Oh, I think the PR title should be SPARK-26105 too
[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23081 **[Test build #98984 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98984/testReport)** for PR 23081 at commit [`131164c`](https://github.com/apache/spark/commit/131164c2104a119468e782fb1d484f2d15274e33).
[GitHub] spark issue #23081: [SPARK-26109][WebUI]Duration in the task summary metrics...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23081 Can one of the admins verify this patch?
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23077

> BTW, Bryan, do you have some time to work on the has_numpy stuff

Yup, I can do that
[GitHub] spark issue #23077: [SPARK-25344][PYTHON] Clean unittest2 imports up that we...
Github user BryanCutler commented on the issue: https://github.com/apache/spark/pull/23077 Oops, actually I think there is one more here https://github.com/apache/spark/blob/master/python/pyspark/testing/mllibutils.py#L20 Other than that, looks good
[GitHub] spark pull request #23081: [SPARK-26109][WebUI]Duration in the task summary ...
GitHub user shahidki31 opened a pull request: https://github.com/apache/spark/pull/23081 [SPARK-26109][WebUI] Duration in the task summary metrics table and the task table are different

## What changes were proposed in this pull request?

The task summary displays a summary of the task table on the stage page. However, the duration metrics in the task summary and the task table do not match. The reason is that in the task summary we display executorRunTime as the duration, while the task table shows the actual duration. Apart from duration, all other metrics display properly in the task summary.

In Spark 2.2, we used to show executorRunTime as the duration in the task table; that is why the summary metrics also show executorRunTime as the duration. In Spark 2.3, the task table changed to the actual duration of the task, so the summary metrics should change accordingly.

## How was this patch tested?

Before patch: ![screenshot from 2018-11-19 04-32-06](https://user-images.githubusercontent.com/23054875/48679263-1e4fff80-ebb4-11e8-9ed5-16d892039e01.png)
After patch: ![screenshot from 2018-11-19 04-37-39](https://user-images.githubusercontent.com/23054875/48679343-e39a9700-ebb4-11e8-8df9-9dc3a28d4bce.png)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/shahidki31/spark duratinSummary

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/23081.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #23081

commit 131164c2104a119468e782fb1d484f2d15274e33
Author: Shahid
Date: 2018-11-18T22:38:21Z

    taskMetrics duration
[GitHub] spark pull request #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInP...
Github user aokolnychyi commented on a diff in the pull request: https://github.com/apache/spark/pull/23079#discussion_r234467085

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/ReplaceNullWithFalseInPredicateSuite.scala ---
@@ -298,6 +299,45 @@ class ReplaceNullWithFalseSuite extends PlanTest {
     testProjection(originalExpr = column, expectedExpr = column)
   }
+
+  test("replace nulls in lambda function of ArrayFilter") {
+    val cond = GreaterThan(UnresolvedAttribute("e"), Literal(0))
--- End diff --

Test cases for `ArrayFilter` and `ArrayExists` seem to be identical. As we have those tests anyway, would it make sense to cover different lambda functions?
[GitHub] spark issue #23079: [SPARK-26107][SQL] Extend ReplaceNullWithFalseInPredicat...
Github user aokolnychyi commented on the issue: https://github.com/apache/spark/pull/23079 @rednaxelafx I am glad the rule gets more adoption. Renaming also makes sense to me. Shall we extend `ReplaceNullWithFalseEndToEndSuite` as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98983/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23065 **[Test build #98983 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98983/testReport)** for PR 23065 at commit [`0cfcd90`](https://github.com/apache/spark/commit/0cfcd9056f4d93dfdeb447110e5e26030ad4ad3a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/23054 BTW what do the non-primitive key types look like? Do they get flattened, or is there a struct? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
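For readers following along, a hedged illustration of the question (the resulting column names are from memory and not verified against this PR): with a primitive key the grouped result carries a single key column, while a tuple or case-class key is flattened into one column per field.

```scala
// Assumes an active SparkSession named `spark`.
import spark.implicits._

case class Event(userId: Long, kind: String)
val ds = Seq(Event(1L, "click"), Event(1L, "view")).toDS()

ds.groupByKey(_.userId).count()                // single key column plus count(1)
ds.groupByKey(e => (e.userId, e.kind)).count() // key flattened into _1 and _2 plus count(1)
```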
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23054 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98981/ Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23054 Merged build finished. Test FAILed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23054: [SPARK-26085][SQL] Key attribute of primitive type under...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23054 **[Test build #98981 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98981/testReport)** for PR 23054 at commit [`6e3c37a`](https://github.com/apache/spark/commit/6e3c37ae454b83075707040d85813587cc92cccb). * This patch **fails from timeout after a configured wait of `400m`**. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/98982/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23080 Merged build finished. Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23080: [SPARK-26108][SQL] Support custom lineSep in CSV datasou...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23080 **[Test build #98982 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98982/testReport)** for PR 23080 at commit [`12022ad`](https://github.com/apache/spark/commit/12022ad1a0194a4bab9007d66145071562e066a4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
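For context, this is how the proposed option would presumably be used once merged (the option name comes from the PR title; the semantics are assumed to mirror the existing `lineSep` option of the text and JSON sources):

```scala
// Read CSV records separated by a custom delimiter instead of \n.
val df = spark.read
  .option("lineSep", "\u0001")
  .option("header", "true")
  .csv("/path/to/input")

// Write the data back out with a conventional newline separator.
df.write
  .option("lineSep", "\n")
  .csv("/path/to/output")
```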
[GitHub] spark issue #23075: [SPARK-26084][SQL] Fixes unresolved AggregateExpression....
Github user MaxGekk commented on the issue: https://github.com/apache/spark/pull/23075 ok to test --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23075: [SPARK-26084][SQL] Fixes unresolved AggregateExpression....
Github user ssimeonov commented on the issue: https://github.com/apache/spark/pull/23075 @MaxGekk done --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #23073: [SPARK-26104] [Hydrogen] expose pci info to task ...
Github user chenqin commented on a diff in the pull request: https://github.com/apache/spark/pull/23073#discussion_r234453776 --- Diff: core/src/main/scala/org/apache/spark/scheduler/cluster/ExecutorData.scala --- @@ -27,12 +27,14 @@ import org.apache.spark.rpc.{RpcAddress, RpcEndpointRef} * @param executorHost The hostname that this executor is running on * @param freeCores The current number of cores available for work on the executor * @param totalCores The total number of cores available to the executor + * @param pcis The external devices available to the executor --- End diff -- fixed --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
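The rough shape implied by this hunk, for orientation (the constructor is abbreviated and the element type of `pcis` is not visible in the diff, so `Seq[String]` is a placeholder):

```scala
class ExecutorData(
    val executorHost: String,
    val freeCores: Int,
    val totalCores: Int,
    val pcis: Seq[String])  // new field: external devices visible to this executor
```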
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/23065 **[Test build #98983 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/98983/testReport)** for PR 23065 at commit [`0cfcd90`](https://github.com/apache/spark/commit/0cfcd9056f4d93dfdeb447110e5e26030ad4ad3a). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #23065: [SPARK-26090][CORE][SQL][ML] Resolve most miscellaneous ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/23065 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/5126/ Test PASSed. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org