[GitHub] spark pull request #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22500#discussion_r219403342

--- Diff: sql/core/benchmarks/MiscBenchmark-results.txt ---
@@ -0,0 +1,132 @@
+filter & aggregate without group
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+range/filter/sum:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+range/filter/sum wholestage off              36618 / 41080         57.3          17.5       1.0X
+range/filter/sum wholestage on                 2495 / 2609        840.4           1.2      14.7X
+
+
+range/limit/sum
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+range/limit/sum:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+range/limit/sum wholestage off                   117 / 121       4477.9           0.2       1.0X
+range/limit/sum wholestage on                    178 / 187       2938.1           0.3       0.7X
+
+
+sample
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+sample with replacement:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+sample with replacement wholestage off         9142 / 9182         14.3          69.8       1.0X
+sample with replacement wholestage on          5926 / 6107         22.1          45.2       1.5X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+sample without replacement:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+sample without replacement wholestage off      1834 / 1837         71.5          14.0       1.0X
+sample without replacement wholestage on         784 / 803        167.2           6.0       2.3X
+
+
+collect
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+collect:                                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+collect 1 million                                186 / 215          5.6         177.5       1.0X
+collect 2 millions                               361 / 393          2.9         344.2       0.5X
+collect 4 millions                              884 / 1053          1.2         843.4       0.2X
+
+
+collect limit
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+collect limit:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+collect limit 1 million                          206 / 225          5.1         196.6       1.0X
+collect limit 2 millions                         407 / 419          2.6         387.8       0.5X
+
+
+generate exp
[GitHub] spark issue #22497: [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmar...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22497

Congratulations, @kiszk
[GitHub] spark issue #22513: [SPARK-25499][TEST]Refactor BenchmarkBase and Benchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22513

`KryoBenchmark` is in `core`, and `UnsafeProjectionBenchmark`, `HashByteArrayBenchmark` and `HashBenchmark` are in `catalyst`. If we move the benchmark base class to `sql`, the benchmarks mentioned above would not be able to inherit from it. What do you think?
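For context, a minimal sketch of the dependency constraint being discussed. The module layout follows the comment; the shape of the base class is an assumption, not the exact Spark code. Because `sql` depends on `catalyst`, which depends on `core` (and not the other way around), a base class placed in `sql` would be invisible to benchmarks in the lower modules:

```scala
// Lives in core, the lowest module, so every downstream module can extend it.
// A sketch only; the real BenchmarkBase has more machinery (output files, etc.).
package org.apache.spark.util

abstract class BenchmarkBase {
  /** Implemented by each concrete benchmark object. */
  def runBenchmarkSuite(): Unit

  def main(args: Array[String]): Unit = runBenchmarkSuite()
}

// core:     object KryoBenchmark           extends BenchmarkBase { ... }  // OK
// catalyst: object HashBenchmark           extends BenchmarkBase { ... }  // OK
// sql:      object FilterPushdownBenchmark extends BenchmarkBase { ... }  // OK
// If BenchmarkBase lived in sql, the first two could not compile.
```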
[GitHub] spark pull request #22513: [SPARK-25499][TEST]Refactor BenchmarkBase and Ben...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22513#discussion_r219388085

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -27,7 +27,7 @@
 import org.apache.spark.sql.functions.monotonically_increasing_id
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType
 import org.apache.spark.sql.types.{ByteType, Decimal, DecimalType, TimestampType}
-import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase, Utils}
+import org.apache.spark.util.Utils
 
 /**
  * Benchmark to measure read performance with Filter pushdown.
--- End diff --

How about changing the Scala doc to the following, to fix the **fails to generate documentation** error?

```scala
 * To run this benchmark:
 * {{{
 *   1. without sbt: bin/spark-submit --class
 *   2. build/sbt "sql/test:runMain "
 *   3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
 *      Results will be written to "benchmarks/FilterPushdownBenchmark-results.txt".
 * }}}
```

The "fails to generate documentation" error message:

```java
/home/jenkins/workspace/SparkPullRequestBuilder@2/target/javaunidoc/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.html...
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/target/java/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.java:5: error: unknown tag: this
[error]  * 1. without sbt: bin/spark-submit --class
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/target/java/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.java:5: error: unknown tag: spark
[error]  * 1. without sbt: bin/spark-submit --class
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/target/java/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.java:6: error: unknown tag: this
[error]  * 2. build/sbt "mllib/test:runMain "
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/mllib/target/java/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.java:7: error: unknown tag: this
[error]  * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "mllib/test:runMain "
[error]
```
[GitHub] spark pull request #22499: [SPARK-25489][ML][TEST] Refactor UDTSerialization...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22499#discussion_r219366799

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/linalg/UDTSerializationBenchmark.scala ---
@@ -18,52 +18,52 @@
 package org.apache.spark.mllib.linalg
 
 import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
-import org.apache.spark.util.Benchmark
+import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase}
 
 /**
  * Serialization benchmark for VectorUDT.
+ * To run this benchmark:
+ * 1. without sbt: bin/spark-submit --class
--- End diff --

I think `<` should be replaced with `[`:

```scala
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/target/java/org/apache/spark/sql/DatasetBenchmark.java:5: error: unknown tag: this
[error]  * 1. without sbt: bin/spark-submit --class
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/target/java/org/apache/spark/sql/DatasetBenchmark.java:5: error: unknown tag: spark
[error]  * 1. without sbt: bin/spark-submit --class
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/target/java/org/apache/spark/sql/DatasetBenchmark.java:6: error: unknown tag: this
[error]  * 2. build/sbt "sql/test:runMain "
[error]    ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder@2/sql/core/target/java/org/apache/spark/sql/DatasetBenchmark.java:7: error: unknown tag: this
[error]  * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
[error]
```
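For readers outside the diff: javadoc parses `<word ...>` inside a comment as an HTML tag, hence the `unknown tag: this` and `unknown tag: spark` errors. A sketch of what the fixed Scala doc could look like; the `[this class]` and `[spark mllib test jar]` placeholders are illustrative reconstructions (the archive dropped the original angle-bracket text):

```scala
/**
 * Serialization benchmark for VectorUDT.
 * To run this benchmark:
 * 1. without sbt: bin/spark-submit --class [this class] [spark mllib test jar]
 * 2. build/sbt "mllib/test:runMain [this class]"
 * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "mllib/test:runMain [this class]"
 *    Results will be written to "benchmarks/UDTSerializationBenchmark-results.txt".
 */
object UDTSerializationBenchmark { /* benchmark body elided */ }
```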
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22501

[SPARK-25492][TEST] Refactor WideSchemaBenchmark to use main method

## What changes were proposed in this pull request?

Refactor `WideSchemaBenchmark` to use main method. Generate benchmark result:

```sh
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.WideSchemaBenchmark"
```

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25492

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22501.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22501

commit f56b73223fbf765e408d9aef6565a2318f4836e3
Author: Yuming Wang
Date: 2018-09-20T16:04:30Z

    Refactor WideSchemaBenchmark
[GitHub] spark pull request #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22500#discussion_r219219972

--- Diff: sql/core/benchmarks/MiscBenchmark-results.txt ---
@@ -0,0 +1,132 @@
+filter & aggregate without group
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+range/filter/sum:                        Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+range/filter/sum wholestage off              36618 / 41080         57.3          17.5       1.0X
+range/filter/sum wholestage on                 2495 / 2609        840.4           1.2      14.7X
+
+
+range/limit/sum
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+range/limit/sum:                         Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+range/limit/sum wholestage off                   117 / 121       4477.9           0.2       1.0X
+range/limit/sum wholestage on                    178 / 187       2938.1           0.3       0.7X
+
+
+sample
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+sample with replacement:                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+sample with replacement wholestage off         9142 / 9182         14.3          69.8       1.0X
+sample with replacement wholestage on          5926 / 6107         22.1          45.2       1.5X
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+sample without replacement:              Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+sample without replacement wholestage off      1834 / 1837         71.5          14.0       1.0X
+sample without replacement wholestage on         784 / 803        167.2           6.0       2.3X
+
+
+collect
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+collect:                                 Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+collect 1 million                                186 / 215          5.6         177.5       1.0X
+collect 2 millions                               361 / 393          2.9         344.2       0.5X
+collect 4 millions                              884 / 1053          1.2         843.4       0.2X
+
+
+collect limit
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+collect limit:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+collect limit 1 million                          206 / 225          5.1         196.6       1.0X
+collect limit 2 millions                         407 / 419          2.6         387.8       0.5X
+
+
+generate exp
[GitHub] spark pull request #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22500#discussion_r219218036

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/MiscBenchmark.scala ---
@@ -17,251 +17,154 @@
 package org.apache.spark.sql.execution.benchmark
 
-import org.apache.spark.util.Benchmark
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase}
 
 /**
  * Benchmark to measure whole stage codegen performance.
- * To run this:
- *   build/sbt "sql/test-only *benchmark.MiscBenchmark"
- *
- * Benchmarks in this file are skipped in normal builds.
+ * To run this benchmark:
+ * 1. without sbt: bin/spark-submit --class
+ * 2. build/sbt "sql/test:runMain "
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
+ *    Results will be written to "benchmarks/MiscBenchmark-results.txt".
  */
-class MiscBenchmark extends BenchmarkBase {
-
-  ignore("filter & aggregate without group") {
-    val N = 500L << 22
-    runBenchmark("range/filter/sum", N) {
-      sparkSession.range(N).filter("(id & 1) = 1").groupBy().sum().collect()
+object MiscBenchmark extends FileBenchmarkBase {
+
+  lazy val sparkSession = SparkSession.builder
+    .master("local[1]")
+    .appName("microbenchmark")
+    .config("spark.sql.shuffle.partitions", 1)
+    .config("spark.sql.autoBroadcastJoinThreshold", 1)
+    .getOrCreate()
+
+  /** Runs function `f` with whole stage codegen on and off. */
+  def runMiscBenchmark(name: String, cardinality: Long)(f: => Unit): Unit = {
+    val benchmark = new Benchmark(name, cardinality, output = output)
+
+    benchmark.addCase(s"$name wholestage off", numIters = 2) { iter =>
+      sparkSession.conf.set("spark.sql.codegen.wholeStage", value = false)
+      f
     }
-    /*
-    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11
-    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
-    range/filter/sum:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    --------------------------------------------------------------------------------------------
-    range/filter/sum codegen=false           30663 / 31216         68.4          14.6       1.0X
-    range/filter/sum codegen=true              2399 / 2409        874.1           1.1      12.8X
-    */
-  }
 
-  ignore("range/limit/sum") {
-    val N = 500L << 20
-    runBenchmark("range/limit/sum", N) {
-      sparkSession.range(N).limit(100).groupBy().sum().collect()
+    benchmark.addCase(s"$name wholestage on", numIters = 5) { iter =>
+      sparkSession.conf.set("spark.sql.codegen.wholeStage", value = true)
+      f
     }
-    /*
-    Westmere E56xx/L56xx/X56xx (Nehalem-C)
-    range/limit/sum:                     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    --------------------------------------------------------------------------------------------
-    range/limit/sum codegen=false                609 / 672        861.6           1.2       1.0X
-    range/limit/sum codegen=true                 561 / 621        935.3           1.1       1.1X
-    */
-  }
 
-  ignore("sample") {
-    val N = 500 << 18
-    runBenchmark("sample with replacement", N) {
-      sparkSession.range(N).sample(withReplacement = true, 0.01).groupBy().sum().collect()
-    }
-    /*
-    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11
-    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
-    sample with replacement:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-    --------------------------------------------------------------------------------------------
-    sample with replacement codegen=false      7073 / 7227         18.5          54.0       1.0X
-    sample with replacement codegen=true       5199 / 5203         25.2          39.7       1.4X
-    */
-
-    runBenchmark("sample without replacement", N) {
-      sparkSession.range(N).sample(withReplacement = false, 0.01).groupBy().sum().collect()
-    }
-    /*
-    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.11
-    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
-
-    sample without replacement:          Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
[GitHub] spark pull request #22500: [SPARK-25488][TEST] Refactor MiscBenchmark to use...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22500

[SPARK-25488][TEST] Refactor MiscBenchmark to use main method

## What changes were proposed in this pull request?

Refactor `MiscBenchmark` to use main method. Generate benchmark result:

```sh
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.MiscBenchmark"
```

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25488

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22500.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22500

commit 6252440c1a079bbb12d41e2ae513f988fcdf5651
Author: Yuming Wang
Date: 2018-09-20T15:41:03Z

    Refactor MiscBenchmark
[GitHub] spark issue #22497: [SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmar...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22497

LGTM
[GitHub] spark pull request #22491: [SPARK-25483][TEST] Refactor UnsafeArrayDataBench...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22491

[SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to use main method

## What changes were proposed in this pull request?

Refactor `UnsafeArrayDataBenchmark` to use main method. Generate benchmark result:

```sh
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.UnsafeArrayDataBenchmark"
```

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25483

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22491.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22491

commit 5bb5f0806bce127f07eefd337bc457912e9f5075
Author: Yuming Wang
Date: 2018-09-20T12:13:52Z

    Refactor UnsafeArrayDataBenchmark
[GitHub] spark issue #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to use mai...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22488

retest this please
[GitHub] spark pull request #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark t...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22484#discussion_r219104743

--- Diff: sql/core/benchmarks/AggregateBenchmark-results.txt ---
@@ -0,0 +1,154 @@
+aggregate without grouping
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+agg w/o group:                           Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+------------------------------------------------------------------------------------------------
+agg w/o group wholestage off                 39650 / 46049         52.9          18.9       1.0X
+agg w/o group wholestage on                    1224 / 1413       1713.5           0.6      32.4X
+
+
+stat functions
+
--- End diff --

@davies Do you know how to generate these benchmarks:
https://github.com/apache/spark/blob/3c3eebc8734e36e61f4627e2c517fbbe342b3b42/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/AggregateBenchmark.scala#L70-L78
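For context, a sketch of the kind of ignored "stat functions" benchmark the linked lines contain, written against the old suite's `runBenchmark` helper; the aggregate names and the row count are approximate, not quoted from the file:

```scala
ignore("stat functions") {
  val N = 100L << 20  // illustrative row count

  // Aggregate statistics over a generated range of ids.
  runBenchmark("stddev", N) {
    sparkSession.range(N).groupBy().agg("id" -> "stddev").collect()
  }

  runBenchmark("kurtosis", N) {
    sparkSession.range(N).groupBy().agg("id" -> "kurtosis").collect()
  }
}
```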
[GitHub] spark pull request #22488: [SPARK-25479][TEST] Refactor DatasetBenchmark to ...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22488

[SPARK-25479][TEST] Refactor DatasetBenchmark to use main method

## What changes were proposed in this pull request?

Refactor `DatasetBenchmark` to use main method. Generate benchmark result:

```sh
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.DatasetBenchmark"
```

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25479

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22488.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22488

commit 21b623aad6a84cca2ab5f89f1c29d3b3b1b82d80
Author: Yuming Wang
Date: 2018-09-20T09:46:19Z

    Refactor DatasetBenchmark
[GitHub] spark pull request #22486: [SPARK-25478][TEST] Refactor CompressionSchemeBen...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22486

[SPARK-25478][TEST] Refactor CompressionSchemeBenchmark to use main method

## What changes were proposed in this pull request?

Refactor `CompressionSchemeBenchmark` to use main method. To generate benchmark result:

```sh
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.columnar.compression.CompressionSchemeBenchmark"
```

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25478

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22486.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22486

commit 4dc46ad21e32784d42ab4b052ba73e31a050efb8
Author: Yuming Wang
Date: 2018-09-20T08:38:04Z

    Refactor CompressionSchemeBenchmark
[GitHub] spark pull request #22484: [SPARK-25476][TEST] Refactor AggregateBenchmark t...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22484

[SPARK-25476][TEST] Refactor AggregateBenchmark to use main method

## What changes were proposed in this pull request?

Refactor `AggregateBenchmark` to use main method. To generate benchmark result:

```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.AggregateBenchmark"
```

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25476

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22484.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22484

commit 649f2965188efcfa0b1d2b5acb4c0f057ecd3788
Author: Yuming Wang
Date: 2018-09-20T07:23:46Z

    Refactor AggregateBenchmark
[GitHub] spark pull request #22419: [SPARK-23906][SQL] Add built-in UDF TRUNCATE(numb...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22419#discussion_r218749189

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala ---
@@ -1245,3 +1245,80 @@ case class BRound(child: Expression, scale: Expression)
   with Serializable with ImplicitCastInputTypes {
   def this(child: Expression) = this(child, Literal(0))
 }
+
+/**
+ * The number truncated to scale decimal places.
+ */
+// scalastyle:off line.size.limit
+@ExpressionDescription(
+  usage = "_FUNC_(number, scale) - Returns number truncated to scale decimal places. " +
+    "If scale is omitted, then number is truncated to 0 places. " +
+    "scale can be negative to truncate (make zero) scale digits left of the decimal point.",
+  examples = """
+    Examples:
+      > SELECT _FUNC_(1234567891.1234567891, 4);
+       1234567891.1234
+      > SELECT _FUNC_(1234567891.1234567891, -4);
+       1234560000
+      > SELECT _FUNC_(1234567891.1234567891);
+       1234567891
+  """)
+// scalastyle:on line.size.limit
+case class Truncate(number: Expression, scale: Expression)
+  extends BinaryExpression with ImplicitCastInputTypes {
+
+  def this(number: Expression) = this(number, Literal(0))
+
+  override def left: Expression = number
+  override def right: Expression = scale
+
+  override def inputTypes: Seq[AbstractDataType] =
+    Seq(TypeCollection(DoubleType, FloatType, DecimalType), IntegerType)
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    super.checkInputDataTypes() match {
+      case TypeCheckSuccess =>
+        if (scale.foldable) {
--- End diff --

Same as `RoundBase`: only a foldable scale is supported:
https://github.com/apache/spark/blob/c7156943a2a32ba57e67aa6d8fa7035a09847e07/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L1076
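For readers outside the diff: the `RoundBase` pattern being cited looks roughly like this (a simplified sketch, not the exact Spark source). The scale must fold to a constant so the result's precision and scale can be fixed once during analysis, rather than varying per row:

```scala
override def checkInputDataTypes(): TypeCheckResult = {
  super.checkInputDataTypes() match {
    case TypeCheckSuccess =>
      if (scale.foldable) {
        // The scale folds to a constant, so output precision/scale
        // can be computed up front at analysis time.
        TypeCheckSuccess
      } else {
        TypeCheckFailure("Only foldable Expression is allowed for scale arguments")
      }
    case f => f
  }
}
```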
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22443

Jenkins, retest this please.
[GitHub] spark issue #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22443

retest this please
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22461

cc @maropu
[GitHub] spark issue #22461: [SPARK-25453] OracleIntegrationSuite IllegalArgumentExce...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22461

Could you add `[TEST]` to the title? Otherwise LGTM.
[GitHub] spark pull request #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchm...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22443#discussion_r218642335

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -17,29 +17,28 @@
 package org.apache.spark.sql.execution.benchmark
 
-import java.io.{File, FileOutputStream, OutputStream}
+import java.io.File
 
 import scala.util.{Random, Try}
 
-import org.scalatest.{BeforeAndAfterEachTestData, Suite, TestData}
-
 import org.apache.spark.SparkConf
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.{DataFrame, SparkSession}
 import org.apache.spark.sql.functions.monotonically_increasing_id
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType
 import org.apache.spark.sql.types.{ByteType, Decimal, DecimalType, TimestampType}
-import org.apache.spark.util.{Benchmark, Utils}
+import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase, Utils}
 
 /**
  * Benchmark to measure read performance with Filter pushdown.
- * To run this:
- *   build/sbt "sql/test-only *FilterPushdownBenchmark"
- *
- * Results will be written to "benchmarks/FilterPushdownBenchmark-results.txt".
+ * To run this benchmark:
+ * 1. without sbt: bin/spark-submit --class
+ * 2. build/sbt "sql/test:runMain "
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
--- End diff --

Thanks
[GitHub] spark pull request #22443: [SPARK-25339][TEST] Refactor FilterPushdownBenchm...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22443#discussion_r218308258

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/FilterPushdownBenchmark.scala ---
@@ -17,29 +17,28 @@
 package org.apache.spark.sql.execution.benchmark
 
-import java.io.{File, FileOutputStream, OutputStream}
+import java.io.File
 
 import scala.util.{Random, Try}
 
-import org.scalatest.{BeforeAndAfterEachTestData, Suite, TestData}
-
 import org.apache.spark.SparkConf
-import org.apache.spark.SparkFunSuite
 import org.apache.spark.sql.{DataFrame, SparkSession}
 import org.apache.spark.sql.functions.monotonically_increasing_id
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.internal.SQLConf.ParquetOutputTimestampType
 import org.apache.spark.sql.types.{ByteType, Decimal, DecimalType, TimestampType}
-import org.apache.spark.util.{Benchmark, Utils}
+import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase, Utils}
 
 /**
  * Benchmark to measure read performance with Filter pushdown.
- * To run this:
- *   build/sbt "sql/test-only *FilterPushdownBenchmark"
- *
- * Results will be written to "benchmarks/FilterPushdownBenchmark-results.txt".
+ * To run this benchmark:
+ * 1. without sbt: bin/spark-submit --class
+ * 2. build/sbt "sql/test:runMain "
+ * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
--- End diff --

Yes, it can print the output to the console if `SPARK_GENERATE_BENCHMARK_FILES` is not set:
https://github.com/apache/spark/blob/4e8ac6edd5808ca8245b39d804c6d4f5ea9d0d36/core/src/main/scala/org/apache/spark/util/Benchmark.scala#L59-L63
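A condensed sketch of the behavior those linked lines implement; this is simplified and not the exact `Benchmark.scala` source. When no output stream is configured, which is the case when `SPARK_GENERATE_BENCHMARK_FILES` is unset, results fall back to `System.out`:

```scala
import java.io.{OutputStream, PrintStream}

// `output` is Some(fileStream) only when SPARK_GENERATE_BENCHMARK_FILES=1
// caused a results file to be created; otherwise it is None.
class BenchmarkPrinter(output: Option[OutputStream]) {
  private val out: PrintStream =
    output.map(new PrintStream(_)).getOrElse(System.out)

  def printResults(lines: Seq[String]): Unit = lines.foreach(out.println)
}
```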
[GitHub] spark issue #22446: [SPARK-19550][DOC][FOLLOW-UP] Update tuning.md to use JD...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22446

Yes. I can't find any more references to the old JDK docs either.
[GitHub] spark issue #22446: [SPARK-19550][DOC][FOLLOW-UP] Update tuning.md to use JD...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22446

Some of the references point to Java 7, some to Java 6.
[GitHub] spark issue #22446: [SPARK-19550][DOC][FOLLOW-UP] Update tuning.md to use JD...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22446

cc @srowen
[GitHub] spark pull request #22446: [SPARK-19550][DOC][FOLLOW-UP] Update tuning.md to...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22446

[SPARK-19550][DOC][FOLLOW-UP] Update tuning.md to use JDK8

## What changes were proposed in this pull request?

Update `tuning.md` and `building-spark.md` to use JDK8.

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark java8

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22446.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22446

commit f8924bb0ce876beb35309ea51f1c1c42497d26e0
Author: Yuming Wang
Date: 2018-09-18T01:14:04Z

    To java 8
[GitHub] spark issue #22443: [SPARK-25339][TESTS] Refactor FilterPushdownBenchmark
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22443

cc @dongjoon-hyun
[GitHub] spark pull request #22443: [SPARK-25339][TESTS] Refactor FilterPushdownBench...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22443

[SPARK-25339][TESTS] Refactor FilterPushdownBenchmark

## What changes were proposed in this pull request?

Refactor `FilterPushdownBenchmark` to use the `main` method. We can now run this test in three ways:

1. `bin/spark-submit --class org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark spark-sql_2.11-2.5.0-SNAPSHOT-tests.jar`
2. `build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark"`
3. `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.FilterPushdownBenchmark"`

Methods 2 and 3 do not need the `spark-sql_*-tests.jar` package to be compiled first, so they are mainly for developers to run benchmarks quickly.

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25339

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22443.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22443

commit 7b50eb5225b664c43cd5dd66a49024741d2ca19c
Author: Yuming Wang
Date: 2018-09-17T16:51:03Z

    Refactor FilterPushdownBenchmark

commit 6e7cfc85d4d7719ee31254317b0ca81173be7128
Author: Yuming Wang
Date: 2018-09-17T16:53:55Z

    Revert numRows
[GitHub] spark pull request #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSo...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22435#discussion_r217934875

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/DataSourceScanExecRedactionSuite.scala ---
@@ -83,4 +83,20 @@ class DataSourceScanExecRedactionSuite extends QueryTest with SharedSQLContext {
     }
   }
 
+  test("FileSourceScanExec metadata") {
+    withTempDir { dir =>
+      val basePath = dir.getCanonicalPath
+      spark.range(0, 10).toDF("a").write.parquet(new Path(basePath, "foo=1").toString)
+      val df = spark.read.parquet(basePath).filter("a = 1")
--- End diff --

Thanks @dongjoon-hyun, I fixed it.
[GitHub] spark pull request #22435: [SPARK-25423][SQL] Output "dataFilters" in DataSo...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22435

[SPARK-25423][SQL] Output "dataFilters" in DataSourceScanExec.metadata

## What changes were proposed in this pull request?

Output `dataFilters` in `DataSourceScanExec.metadata`.

## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25423

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22435.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22435

commit 830e1881b4ef4d9bb661d8b6635470e2596d4eaa
Author: Yuming Wang
Date: 2018-09-16T16:31:32Z

    Output "dataFilters" in DataSourceScanExec.metadata
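For context, `DataSourceScanExec.metadata` is a `Map[String, String]` that gets rendered into the physical-plan string; this change adds a `DataFilters` entry next to the existing keys. An illustrative sketch only; the values are what the test in the review comment above would plausibly produce, not verbatim Spark output:

```scala
// Key names other than "DataFilters" already exist in the metadata map;
// "DataFilters" is the entry this PR adds.
val metadata: Map[String, String] = Map(
  "Format"           -> "Parquet",
  "ReadSchema"       -> "struct<a:int>",
  "PartitionFilters" -> "[]",
  "PushedFilters"    -> "[IsNotNull(a), EqualTo(a,1)]",
  "DataFilters"      -> "[isnotnull(a#0), (a#0 = 1)]"
)
```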
[GitHub] spark pull request #22427: [SPARK-25438][SQL][TEST] Fix FilterPushdownBenchm...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22427#discussion_r217918733

--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -2,737 +2,669 @@
 Pushdown for many distinct value case
 
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
-Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
-
+OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Select 0 string row (value IS NULL):     Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                             8970 / 9122          1.8         570.3       1.0X
-Parquet Vectorized (Pushdown)                    471 / 491         33.4          30.0      19.0X
-Native ORC Vectorized                          7661 / 7853          2.1         487.0       1.2X
-Native ORC Vectorized (Pushdown)               1134 / 1161         13.9          72.1       7.9X
-
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
-Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+Parquet Vectorized                           11405 / 11485          1.4         725.1       1.0X
+Parquet Vectorized (Pushdown)                    675 / 690         23.3          42.9      16.9X
+Native ORC Vectorized                          7127 / 7170          2.2         453.1       1.6X
+Native ORC Vectorized (Pushdown)                 519 / 541         30.3          33.0      22.0X
 
+OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Select 0 string row ('7864320' < value < '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                             9246 / 9297          1.7         587.8       1.0X
-Parquet Vectorized (Pushdown)                    480 / 488         32.8          30.5      19.3X
-Native ORC Vectorized                          7838 / 7850          2.0         498.3       1.2X
-Native ORC Vectorized (Pushdown)               1054 / 1118         14.9          67.0       8.8X
-
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
-Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+Parquet Vectorized                           11457 / 11473          1.4         728.4       1.0X
+Parquet Vectorized (Pushdown)                    656 / 686         24.0          41.7      17.5X
+Native ORC Vectorized                          7328 / 7342          2.1         465.9       1.6X
+Native ORC Vectorized (Pushdown)                 539 / 565         29.2          34.2      21.3X
 
+OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Select 1 string row (value = '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                             8989 / 9100          1.7         571.5       1.0X
-Parquet Vectorized (Pushdown)                    448 / 467         35.1          28.5      20.1X
-Native ORC Vectorized                          7680 / 7768          2.0         488.3       1.2X
-Native ORC Vectorized (Pushdown)               1067 / 1118         14.7          67.8       8.4X
-
-Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
-Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+Parquet Vectorized                           11878 / 11888          1.3         755.2       1.0X
+Parquet Vectorized (Pushdown)                    630 / 654         25.0          40.1      18.9X
+Native ORC Vectorized                          7342 / 7362          2.1         466.8       1.6X
+Native ORC Vectorized (Pushdown)                 519 / 537         30.3          33.0      22.9X
 
+OpenJDK 64-Bit Server VM 1.8.0_181-b13 on Linux 3.10.0-862.3.2.el7.x86_64
+Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
 Select 1 string row (value <=> '7864320'): Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------
-Parquet Vectorized                             9115 / 9266          1.7         579.5       1.0X
-Parquet Vectorized (Pushdown)                    466 / 492         33.7          29.7      19.5X
-Native ORC Vectorized                          7800 / 7914          2.0
[GitHub] spark pull request #22426: [SPARK-25436] Bump master branch version to 2.5.0...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22426#discussion_r217876251

--- Diff: docs/_config.yml ---
@@ -14,8 +14,8 @@ include:
 
 # These allow the documentation to be updated with newer releases
 # of Spark, Scala, and Mesos.
-SPARK_VERSION: 2.4.0-SNAPSHOT
-SPARK_VERSION_SHORT: 2.4.0
+SPARK_VERSION: 2.5.0-SNAPSHOT
+SPARK_VERSION_SHORT: 2.5.0-SNAPSHOT
--- End diff --

2.5.0-SNAPSHOT -> 2.5.0?
[GitHub] spark issue #22420: SPARK-25429
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22420

Could you update the PR title? It should be of the form `[SPARK-XXXXX][COMPONENT] Title`.
[GitHub] spark pull request #22419: [SPARK-23906][SQL] Add UDF TRUNCATE(number)
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22419

[SPARK-23906][SQL] Add UDF TRUNCATE(number)

## What changes were proposed in this pull request?

Add UDF `TRUNCATE(number)`:

```sql
> SELECT TRUNCATE(1234567891.1234567891, 4);
 1234567891.1234
> SELECT TRUNCATE(1234567891.1234567891, -4);
 1234560000
> SELECT TRUNCATE(1234567891.1234567891, 0);
 1234567891
> SELECT TRUNCATE(1234567891.1234567891);
 1234567891
```

It's similar to MySQL [TRUNCATE(X, D)](https://dev.mysql.com/doc/refman/8.0/en/mathematical-functions.html#function_truncate).

## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-23906

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22419.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22419

commit b5365e28bf40448bb3cbd59668f316c7e5a3809a
Author: Yuming Wang
Date: 2018-09-14T08:23:50Z

    Support truncate number
[GitHub] spark issue #18106: [SPARK-20754][SQL] Support TRUNC (number)
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/18106

@dongjoon-hyun Actually `TRUNC (number)` is not resolved. I will fix it soon.
https://issues.apache.org/jira/browse/SPARK-23906
[GitHub] spark pull request #18106: [SPARK-20754][SQL] Support TRUNC (number)
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/18106
[GitHub] spark issue #22038: [SPARK-25056][SQL] Unify the InConversion and BinaryComp...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22038

@mgaido91 I updated the Postgres and Hive results in https://github.com/apache/spark/pull/22038#issuecomment-412737994. @gatorsmile Does this change make sense?
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r216873177

--- Diff: docs/sql-programming-guide.md ---
@@ -965,6 +965,8 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
     `parquet.compression` is specified in the table-specific options/properties, the precedence would be
     `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
     none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
+    Note that `zstd` needs to install `ZStandardCodec` before Hadoop 2.9.0, `brotli` needs to install
+    `brotliCodec`.
--- End diff --

@HyukjinKwon How about adding a link? Users may not know where to download it.

```
`brotliCodec` -> [`brotli-codec`](https://github.com/rdblue/brotli-codec)
```
[GitHub] spark pull request #22387: [SPARK-25313][SQL][FOLLOW-UP][BACKPORT-2.3] Fix I...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22387
[GitHub] spark issue #22387: [SPARK-25313][SQL][FOLLOW-UP][BACKPORT-2.3] Fix InsertIn...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22387

retest this please
[GitHub] spark pull request #20504: [SPARK-23332][SQL] Update SQLQueryTestSuite to su...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/20504
[GitHub] spark issue #22387: [SPARK-25313][SQL][FOLLOW-UP][BACKPORT-2.3] Fix InsertIn...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22387

retest this please
[GitHub] spark issue #20504: [SPARK-23332][SQL] Update SQLQueryTestSuite to support a...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/20504

I will close it now.
[GitHub] spark pull request #22387: [SPARK-25313][SQL][FOLLOW-UP][BACKPORT-2.3] Fix I...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22387

[SPARK-25313][SQL][FOLLOW-UP][BACKPORT-2.3] Fix InsertIntoHiveDirCommand output schema in Parquet issue

## What changes were proposed in this pull request?

Backport https://github.com/apache/spark/pull/22359 to branch-2.3.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25313-FOLLOW-UP-branch-2.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22387.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22387

commit a7b857c69fa20615108413d6f17a87978ca44ae2
Author: Yuming Wang
Date: 2018-09-11T02:02:55Z

    [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirCommand output schema in Parquet issue
[GitHub] spark pull request #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson v...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22372
[GitHub] spark issue #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson version ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22372

I did a simple test for 2.9.6 and it works well, but that PR is for 3.0. It means that a simple test on branch-2.4 will fail:

```scala
scala> spark.range(10).write.parquet("/tmp/spark/parquet")
com.fasterxml.jackson.databind.JsonMappingException: Incompatible Jackson version: 2.7.8
  at com.fasterxml.jackson.module.scala.JacksonModule$class.setupModule(JacksonModule.scala:64)
  at com.fasterxml.jackson.module.scala.DefaultScalaModule.setupModule(DefaultScalaModule.scala:19)
  at com.fasterxml.jackson.databind.ObjectMapper.registerModule(ObjectMapper.java:730)
  at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala:82)
  at org.apache.spark.rdd.RDDOperationScope$.(RDDOperationScope.scala)
```

How about merging this PR to branch-2.4 only?
[GitHub] spark pull request #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson v...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22372#discussion_r216208966

--- Diff: pom.xml ---
@@ -2694,6 +2694,8 @@
         3.1.0
         2.12.0
         3.4.9
+        2.7.8
+        2.7.8
--- End diff --

We should `clean` first for `package`:

```sh
build/sbt clean package -Phadoop-3.1
```

and then check `assembly/target/scala-2.11/jars/`:

![image](https://user-images.githubusercontent.com/5399861/45279062-5e369e80-b502-11e8-9f18-04e41cc060ac.png)
[GitHub] spark pull request #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson v...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22372#discussion_r216208793

--- Diff: pom.xml ---
@@ -2694,6 +2694,8 @@
         3.1.0
         2.12.0
         3.4.9
+        2.7.8
+        2.7.8
--- End diff --

```sh
build/sbt dependency-tree -Phadoop-3.1
```

![image](https://user-images.githubusercontent.com/5399861/45279695-87582e80-b504-11e8-9d24-2b21e569222d.png)
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComma...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22359

cc @cloud-fan
[GitHub] spark issue #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson version ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22372

We do not have Jenkins tests for the 3.1 profile:
https://github.com/apache/spark/blob/395860a986987886df6d60fd9b26afd818b2cb39/dev/run-tests.py#L307-L310
[GitHub] spark pull request #22372: [SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson v...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22372

[SPARK-25385][BUILD] Upgrade Hadoop 3.1 jackson version to 2.7.8

## What changes were proposed in this pull request?

Upgrade the Hadoop 3.1 jackson version to 2.7.8 to fix `JsonMappingException: Incompatible Jackson version: 2.7.8`:
https://github.com/apache/hadoop/blob/release-3.1.0-RC1/hadoop-project/pom.xml#L72

## How was this patch tested?

manual tests:

```sh
export SPARK_PREPEND_CLASSES=true
build/sbt clean package -Phadoop-3.1

spark-shell
scala> spark.range(10).write.mode("overwrite").parquet("/tmp/spark/parquet")
```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25385

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22372.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22372

commit f68083ab07df6fedf8a30c94e706a74e0c620694
Author: Yuming Wang
Date: 2018-09-09T11:51:39Z

    Upgrade jackson version to 2.7.8
[GitHub] spark issue #22368: [SPARK-25368][SQL] Incorrect predicate pushdown returns ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22368

retest this please
[GitHub] spark pull request #22368: [SPARK-25368][SQL] Incorrect predicate pushdown r...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22368

[SPARK-25368][SQL] Incorrect predicate pushdown returns wrong result

## What changes were proposed in this pull request?

How to reproduce:

```scala
val df1 = spark.createDataFrame(Seq(
  (1, 1)
)).toDF("a", "b").withColumn("c", lit(null).cast("int"))
val df2 = df1.union(df1).withColumn("d", spark_partition_id).filter($"c".isNotNull)
df2.show

+---+---+----+---+
|  a|  b|   c|  d|
+---+---+----+---+
|  1|  1|null|  0|
|  1|  1|null|  1|
+---+---+----+---+
```

`filter($"c".isNotNull)` changed to `(null <=> c#10)` before https://github.com/apache/spark/pull/19201, but it changed to `(c#10 = null)` since https://github.com/apache/spark/pull/20155. This PR reverts it to `(null <=> c#10)` to fix this issue.

## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25368

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22368.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22368

commit 86b9b7892c94be68145453f9519e35a3574fe568
Author: Yuming Wang
Date: 2018-09-09T03:46:18Z

    Fix SPARK-25368

commit 865e0af572edad7fd775c25e317055ffa0df2a08
Author: Yuming Wang
Date: 2018-09-09T04:22:29Z

    Fix InferFiltersFromConstraintsSuite test error
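The wrong result comes down to SQL three-valued logic: an ordinary equality against NULL never evaluates to true (it yields NULL), while `<=>` (null-safe equality) treats two NULLs as equal. So rewriting the inferred constraint as `(c = null)` silently changes the predicate's meaning. A quick spark-shell illustration (output formatting approximate):

```scala
// Ordinary equality with NULL yields NULL, so a `(c = null)` predicate
// filters out every row; `<=>` is the null-safe variant.
spark.sql("SELECT null = null AS eq, null <=> null AS null_safe_eq").show()
// +----+------------+
// |  eq|null_safe_eq|
// +----+------------+
// |null|        true|
// +----+------------+
```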
[GitHub] spark pull request #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveD...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22359#discussion_r216117397

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
@@ -803,6 +803,25 @@ class HiveDDLSuite
     }
   }
 
+  test("Insert overwrite directory should output correct schema") {
--- End diff --

Should it also be added here?
https://github.com/apache/spark/blob/8e60b98239be63555644e013417cda7175baf984/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L758
https://github.com/apache/spark/blob/8e60b98239be63555644e013417cda7175baf984/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala#L782
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r215897048

--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
     Sets the compression codec used when writing Parquet files. If either `compression` or
     `parquet.compression` is specified in the table-specific options/properties, the precedence would be
     `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
-    none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

`none, uncompressed, snappy, gzip, lzo, brotli(need install brotli-codec), lz4, zstd(since Hadoop 2.9.0)`

https://jira.apache.org/jira/browse/HADOOP-13578
https://github.com/rdblue/brotli-codec
https://jira.apache.org/jira/browse/HADOOP-13126
[GitHub] spark pull request #22358: [SPARK-25366][SQL]Zstd and brotli CompressionCode...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22358#discussion_r215874603

--- Diff: docs/sql-programming-guide.md ---
@@ -964,7 +964,7 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
     Sets the compression codec used when writing Parquet files. If either `compression` or
     `parquet.compression` is specified in the table-specific options/properties, the precedence would be
     `compression`, `parquet.compression`, `spark.sql.parquet.compression.codec`. Acceptable values include:
-    none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
--- End diff --

I prefer `none, uncompressed, snappy, gzip, lzo, brotli(need install ...), lz4, zstd(need install ...)`.
[GitHub] spark issue #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirComman...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22359

cc @gengliangwang
[GitHub] spark pull request #22359: [SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDi...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22359

[SPARK-25313][SQL][FOLLOW-UP] Fix InsertIntoHiveDirCommand output schema issue

## What changes were proposed in this pull request?

Fix `InsertIntoHiveDirCommand` output schema issue.

## How was this patch tested?

unit tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25313-FOLLOW-UP

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22359.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #22359

commit ff78fdb017d87a8320e8be33c4beceffbdaa3ab4
Author: Yuming Wang
Date: 2018-09-07T06:08:47Z

    Fix InsertIntoHiveDirCommand output schema
[GitHub] spark pull request #22287: [SPARK-25135][SQL] FileFormatWriter should respec...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22287
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22320

@gengliangwang We need to backport this PR to branch-2.3.
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22327 How about reverting it on branch-2.3, since we are about to release 2.3.2? We still have time to fix it before releasing 2.4.0.
[GitHub] spark issue #22320: [SPARK-25313][SQL]Fix regression in FileFormatWriter out...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22320 retest this please
[GitHub] spark pull request #21404: [SPARK-24360][SQL] Support Hive 3.0 metastore
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/21404#discussion_r215232389
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala ---
@@ -99,6 +99,7 @@ private[hive] object IsolatedClientLoader extends Logging {
      case "2.1" | "2.1.0" | "2.1.1" => hive.v2_1
      case "2.2" | "2.2.0" => hive.v2_2
      case "2.3" | "2.3.0" | "2.3.1" | "2.3.2" | "2.3.3" => hive.v2_3
+     case "3.0" | "3.0.0" => hive.v3_0
--- End diff --
@dongjoon-hyun Please update sql-programming-guide.md: https://github.com/apache/spark/blob/05974f9431e9718a5f331a9892b7d81aca8387a6/docs/sql-programming-guide.md#L1217
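Assuming `hive.v3_0` lands as in the diff above, a session would opt into the Hive 3.0 metastore client roughly like this (a sketch; the jar path is a placeholder):
```scala
import org.apache.spark.sql.SparkSession

// "3.0" here is one of the strings matched by IsolatedClientLoader above.
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "3.0")
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-3.0/lib/*")
  .enableHiveSupport()
  .getOrCreate()
```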
[GitHub] spark issue #17174: [SPARK-19145][SQL] Timestamp to String casting is slowin...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/17174 @tanejagagan Are you still working on this?
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r215106921
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala ---
@@ -56,7 +56,7 @@ case class InsertIntoHadoopFsRelationCommand(
     mode: SaveMode,
     catalogTable: Option[CatalogTable],
     fileIndex: Option[FileIndex],
-    outputColumns: Seq[Attribute])
+    outputColumnNames: Seq[String])
   extends DataWritingCommand {
   import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils.escapePathName
--- End diff --
Line 66: `query.schema` should be `DataWritingCommand.logicalPlanSchemaWithNames(query, outputColumnNames)`.
[GitHub] spark issue #22228: [SPARK-25124][ML]VectorSizeHint setSize and getSize don'...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22228 This is already merged. @huaxingao, could you please close this PR?
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 Thanks, @dongjoon-hyun
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22327 Yes, this is a Hadoop thing. I tried building Hadoop 2.7.7 with [`Configuration.getRestrictParserDefault(Object resource)`](https://github.com/apache/hadoop/blob/release-2.7.7-RC0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java#L236) set to true and to false. It succeeded with `Configuration.getRestrictParserDefault(Object resource)=false`, but failed with `Configuration.getRestrictParserDefault(Object resource)=true`.
[GitHub] spark issue #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22327 cc @srowen @steveloughran
[GitHub] spark pull request #22327: [SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22327
[SPARK-25330][BUILD] Revert Hadoop 2.7 to 2.7.3

## What changes were proposed in this pull request?
Revert Hadoop 2.7 to 2.7.3 to fix a permission issue. The issue was introduced by this commit: https://github.com/apache/hadoop/commit/feb886f2093ea5da0cd09c69bd1360a335335c86

## How was this patch tested?
unit tests and manual tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-25330
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22327.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22327

commit f89448b7b0598a59f750a324e869e7768cfedbc1
Author: Yuming Wang
Date: 2018-09-04T08:31:13Z
Revert Hadoop 2.7 to 2.7.3
[GitHub] spark pull request #22320: [SPARK-25313][SQL]Fix regression in FileFormatWri...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22320#discussion_r214786494
--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala ---
@@ -754,6 +754,47 @@ class HiveDDLSuite
     }
   }

+  test("Insert overwrite Hive table should output correct schema") {
+    withTable("tbl", "tbl2") {
+      withView("view1") {
+        spark.sql("CREATE TABLE tbl(id long)")
+        spark.sql("INSERT OVERWRITE TABLE tbl SELECT 4")
+        spark.sql("CREATE VIEW view1 AS SELECT id FROM tbl")
+        spark.sql("CREATE TABLE tbl2(ID long)")
+        spark.sql("INSERT OVERWRITE TABLE tbl2 SELECT ID FROM view1")
+        checkAnswer(spark.table("tbl2"), Seq(Row(4)))
--- End diff --
Please add a schema assertion. We can already read the data since [SPARK-25132](https://issues.apache.org/jira/browse/SPARK-25132), so checking the answer alone does not exercise the schema fix.
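A sketch of what the requested schema assertion could look like (the exact expected case is an assumption; the point is to check the written column name, not just the row values):
```scala
// Fails if the insert wrote the file with the view's lower-case name
// instead of the target table's declared name "ID".
assert(spark.table("tbl2").schema.fieldNames.toSeq == Seq("ID"))
```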
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 retest this please
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 retest this please
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 Sorry @dongjoon-hyun, I have only reproduced one test. kryo-parametrized-type-inheritance is language-related; it seems it can't be reproduced from Scala:
```scala
val ser = new KryoSerializer(new SparkConf).newInstance().asInstanceOf[KryoSerializerInstance]

class BaseType[R] {}

class CollectionType(val child: BaseType[_]*) extends BaseType[Boolean] {
  val children: List[BaseType[_]] = child.toList
}

class ValueType[R](val v: R) extends BaseType[R] {}

val value = new CollectionType(new ValueType("hello"))
ser.serialize(value)
```
SPARK-23131 may depend on the data; I can't reproduce it:
```scala
def modelToString(model: GeneralizedLinearRegressionModel): (String, String) = {
  val os: ByteArrayOutputStream = new ByteArrayOutputStream()
  val zos = new GZIPOutputStream(os)
  val oo: ObjectOutputStream = new ObjectOutputStream(zos)
  oo.writeObject(model)
  oo.close()
  zos.close()
  os.close()
  (model.uid, DatatypeConverter.printBase64Binary(os.toByteArray))
}
```
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 @dongjoon-hyun I'm trying to add test cases.
[GitHub] spark issue #21987: [SPARK-25015][BUILD] Update Hadoop 2.7 to 2.7.7
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/21987 It seems that this change caused a permission issue:
```
export HADOOP_PROXY_USER=user_a
spark-sql
```
This creates the directory `/tmp/hive-$%7Buser.name%7D/user_a/`. Then, after switching to another user:
```
export HADOOP_PROXY_USER=user_b
spark-sql
```
it throws:
```scala
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/user_b/6b446017-a880-4f23-a8d0-b62f37d3c413":user_a:hadoop:drwx--
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:1780)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getFileInfo(FSDirStatAndListingOp.java:108)
```
I'll verify this later.
[GitHub] spark pull request #22124: [SPARK-25135][SQL] Insert datasource table may al...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22124
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22124 Closing it; I have created a new PR.
[GitHub] spark pull request #22287: [SPARK-25135][SQL] FileFormatWriter should respec...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22287
[SPARK-25135][SQL] FileFormatWriter should respect the schema of Hive

## What changes were proposed in this pull request?
This PR fixes `FileFormatWriter` so that its `dataSchema` respects the schema of the Hive table. Otherwise there are two issues.
1. An exception is thrown (this can be reproduced by the added test case):
```scala
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:347)
at scala.None$.get(Option.scala:345)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$3$$anonfun$4.apply(FileFormatWriter.scala:87)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$3$$anonfun$4.apply(FileFormatWriter.scala:87)
```
2. The schema of the Hive table is not the same as the schema of the Parquet file.

## How was this patch tested?
- Unit tests verifying that FileFormatWriter respects the schema of Hive.
- Manual tests verifying that this does not re-break the UI issues fixed by [SPARK-22834](https://issues.apache.org/jira/browse/SPARK-22834):
![image](https://user-images.githubusercontent.com/5399861/44870021-94ce1700-acc1-11e8-8ef7-d7a8ba3c435d.png)

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-25135-view
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22287.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22287

commit b54953a8224aa0a7759289a83e876e3bfc166cb6
Author: Yuming Wang
Date: 2018-08-30T17:46:02Z
FileFormatWriter should respect the input query schema in HIVE
[GitHub] spark issue #22267: [SPARK-24716][TESTS][FOLLOW-UP] Test Hive metastore sche...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22267 cc @cloud-fan
[GitHub] spark pull request #22267: [SPARK-24716][TESTS][FOLLOW-UP] Test Hive metasto...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22267
[SPARK-24716][TESTS][FOLLOW-UP] Test Hive metastore schema and parquet schema are in different letter cases

## What changes were proposed in this pull request?
Since https://github.com/apache/spark/pull/21696, Spark uses the Parquet schema instead of the Hive metastore schema to do filter pushdown. This change avoids wrong records being returned when the Hive metastore schema and the Parquet schema are in different letter cases. This PR adds a test case for it.
More details: https://issues.apache.org/jira/browse/SPARK-25206

## How was this patch tested?
unit tests

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-24716-TESTS
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22267.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22267

commit f5559f40dc7d3bfd80ced7090f617998094811bf
Author: Yuming Wang
Date: 2018-08-29T10:03:15Z
Improvement test.
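A hedged sketch of the case-mismatch scenario the new test guards against (path and table name are made up):
```scala
// The Parquet file carries an upper-case column, the metastore a lower-case one.
val dir = "/tmp/case_mismatch_demo"
spark.range(10).selectExpr("id AS ID").write.mode("overwrite").parquet(dir)
spark.sql(s"CREATE TABLE lower_tbl (id bigint) USING parquet LOCATION '$dir'")

// If pushdown used the metastore-cased schema, this filter could match
// nothing in the file; pushing the Parquet-cased filter keeps it correct.
spark.sql("SELECT * FROM lower_tbl WHERE id > 5").show()
```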
[GitHub] spark issue #22250: [SPARK-25259][SQL] left/right join support push down dur...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22250 Fixed by [SPARK-21479](https://issues.apache.org/jira/browse/SPARK-21479).
[GitHub] spark pull request #22250: [SPARK-25259][SQL] left/right join support push d...
Github user wangyum closed the pull request at: https://github.com/apache/spark/pull/22250
[GitHub] spark issue #22263: [SPARK-25269][SQL] SQL interface support specify Storage...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22263 retest this please.
[GitHub] spark pull request #22263: [SPARK-25269][SQL] SQL interface support specify ...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22263
[SPARK-25269][SQL] SQL interface support specify StorageLevel when cache table

## What changes were proposed in this pull request?
The SQL interface supports specifying a `StorageLevel` when caching a table. The semantics are:
```sql
CACHE DISK_ONLY TABLE tableName;
```
All supported `StorageLevel`s are listed here:
https://github.com/apache/spark/blob/eefdf9f9dd8afde49ad7d4e230e2735eb817ab0a/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L172-L183

## How was this patch tested?
unit tests and manual tests

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-25269
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22263.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22263

commit 9e058d1b402dec85982a880bf086268a1dcec99e
Author: Yuming Wang
Date: 2018-08-29T06:31:44Z
SQL interface support specify StorageLevel when cache table
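For comparison, the Dataset API already exposes this behaviour; the proposed syntax would make the equivalent of the following reachable from pure SQL (the table name is a placeholder):
```scala
import org.apache.spark.storage.StorageLevel

// Today's programmatic route to the same effect as
// `CACHE DISK_ONLY TABLE tableName;`
spark.table("tableName").persist(StorageLevel.DISK_ONLY)
```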
[GitHub] spark issue #22250: [SPARK-25259][SQL] left/right join support push down dur...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22250 cc @cloud-fan
[GitHub] spark issue #22177: [SPARK-25119][Web UI] stages in wrong order within job p...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22177 How about backporting https://github.com/apache/spark/pull/21680?
[GitHub] spark issue #22250: [SPARK-25259][SQL] left/right join support push down dur...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22250 retest this please
[GitHub] spark pull request #22250: [SPARK-25259][SQL] left/right join support push d...
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22250
[SPARK-25259][SQL] left/right join support push down during-join predicates

## What changes were proposed in this pull request?
Prepare data:
```sql
create temporary view EMPLOYEE as select * from values
  ("10", "HAAS", "A00"),
  ("10", "THOMPSON", "B01"),
  ("30", "KWAN", "C01"),
  ("000110", "LUCCHESSI", "A00"),
  ("000120", "O'CONNELL", "A00"),
  ("000130", "QUINTANA", "C01")
  as EMPLOYEE(EMPNO, LASTNAME, WORKDEPT);

create temporary view DEPARTMENT as select * from values
  ("A00", "SPIFFY COMPUTER SERVICE DIV.", "10"),
  ("B01", "PLANNING", "20"),
  ("C01", "INFORMATION CENTER", "30"),
  ("D01", "DEVELOPMENT CENTER", null)
  as DEPARTMENT(DEPTNO, DEPTNAME, MGRNO);

create temporary view PROJECT as select * from values
  ("AD3100", "ADMIN SERVICES", "D01"),
  ("IF1000", "QUERY SERVICES", "C01"),
  ("IF2000", "USER EDUCATION", "E01"),
  ("MA2100", "WELD LINE AUTOMATION", "D01"),
  ("PL2100", "WELD LINE PLANNING", "01")
  as PROJECT(PROJNO, PROJNAME, DEPTNO);
```
For the query below, we can push `DEPTNO='E01'` to the right side to reduce the amount of data read:
```sql
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P
LEFT OUTER JOIN DEPARTMENT D
ON P.DEPTNO = D.DEPTNO AND P.DEPTNO='E01';
```
The optimized query is equivalent to:
```sql
SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
FROM PROJECT P
LEFT OUTER JOIN (SELECT * FROM DEPARTMENT WHERE DEPTNO='E01') D
ON P.DEPTNO = D.DEPTNO AND P.DEPTNO='E01';
```
This PR enhances `PushPredicateThroughJoin` to support this.

## How was this patch tested?
unit tests

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-25259
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22250.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22250

commit f9b32d5d044a899529959ad5042f8cf95c789ea8
Author: Yuming Wang
Date: 2018-08-28T06:18:05Z
left/right join support push down during-join predicates
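One way to verify the pushdown (a sketch; it assumes the temporary views above have been created in the session):
```scala
// With the rule in place, the optimized plan should show a
// Filter(DEPTNO = 'E01') beneath the right side of the join,
// not only inside the join condition.
spark.sql(
  """SELECT PROJNO, PROJNAME, P.DEPTNO, DEPTNAME
    |FROM PROJECT P LEFT OUTER JOIN DEPARTMENT D
    |ON P.DEPTNO = D.DEPTNO AND P.DEPTNO = 'E01'""".stripMargin)
  .explain(true)
```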
[GitHub] spark pull request #20020: [SPARK-22834][SQL] Make insertion commands have r...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/20020#discussion_r212850054
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala ---
@@ -20,30 +20,32 @@ package org.apache.spark.sql.execution.command
 import org.apache.hadoop.conf.Configuration
 import org.apache.spark.SparkContext
-import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.{Row, SparkSession}
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.plans.logical.{Command, LogicalPlan}
+import org.apache.spark.sql.execution.SparkPlan
 import org.apache.spark.sql.execution.datasources.BasicWriteJobStatsTracker
+import org.apache.spark.sql.execution.datasources.FileFormatWriter
 import org.apache.spark.sql.execution.metric.{SQLMetric, SQLMetrics}
 import org.apache.spark.util.SerializableConfiguration

 /**
- * A special `RunnableCommand` which writes data out and updates metrics.
+ * A special `Command` which writes data out and updates metrics.
  */
-trait DataWritingCommand extends RunnableCommand {
-
+trait DataWritingCommand extends Command {
   /**
    * The input query plan that produces the data to be written.
+   * IMPORTANT: the input query plan MUST be analyzed, so that we can carry its output columns
+   *            to [[FileFormatWriter]].
    */
   def query: LogicalPlan

-  // We make the input `query` an inner child instead of a child in order to hide it from the
-  // optimizer. This is because optimizer may not preserve the output schema names' case, and we
-  // have to keep the original analyzed plan here so that we can pass the corrected schema to the
-  // writer. The schema of analyzed plan is what user expects(or specifies), so we should respect
-  // it when writing.
-  override protected def innerChildren: Seq[LogicalPlan] = query :: Nil
+  override final def children: Seq[LogicalPlan] = query :: Nil

-  override lazy val metrics: Map[String, SQLMetric] = {
+  // Output columns of the analyzed input query plan
+  def outputColumns: Seq[Attribute]
--- End diff --
`outputColumns` changed from analyzed to optimized.
For example:
```scala
withTempDir { dir =>
  val path = dir.getCanonicalPath
  val cnt = 30
  val table1Path = s"$path/table1"
  val table3Path = s"$path/table3"
  spark.range(cnt).selectExpr("cast(id as bigint) as col1", "cast(id % 3 as bigint) as col2")
    .write.mode(SaveMode.Overwrite).parquet(table1Path)
  withTable("table1", "table3") {
    spark.sql(
      s"CREATE TABLE table1(col1 bigint, col2 bigint) using parquet location '$table1Path/'")
    spark.sql("CREATE TABLE table3(COL1 bigint, COL2 bigint) using parquet " +
      "PARTITIONED BY (COL2) " +
      s"CLUSTERED BY (COL1) INTO 2 BUCKETS location '$table3Path/'")
    withView("view1") {
      spark.sql("CREATE VIEW view1 as select col1, col2 from table1 where col1 > -20")
      spark.sql("INSERT OVERWRITE TABLE table3 select COL1, COL2 from view1 CLUSTER BY COL1")
      spark.table("table3").show
    }
  }
}
```
```
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(COL1#19L, COL2#20L)
outputColumns: List(col1#16L, col2#17L)
outputColumns: List(col1#16L, col2#17L)
outputColumns: List(col1#16L, col2#17L)
```
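Consistent with the `outputColumnNames: Seq[String]` approach seen in #22320 above, one sketch of a way out of this (the helper name and shape here are assumptions, not the committed fix) is to keep only the analyzed output names on the command and re-attach them to whatever attributes the optimized plan produces:
```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan

// Re-apply the analyzed, user-facing names onto the optimized plan's output,
// so optimizer rules such as RemoveRedundantAliases cannot change their case.
def outputWithNames(query: LogicalPlan, names: Seq[String]): Seq[Attribute] = {
  assert(query.output.length == names.length)
  query.output.zip(names).map {
    case (attr, name) if attr.name == name => attr
    case (attr, name) => attr.withName(name)
  }
}
```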
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 Thanks @srowen. [SPARK-25176](https://issues.apache.org/jira/browse/SPARK-25176) has a detailed description:
> I'm using the latest spark version spark-core_2.11:2.3.1 which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of kryo serializer contains an issue [1,2] which results in throwing ClassCastExceptions when serialising parameterised type hierarchy. This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? You can find a simple test to reproduce the issue [4].

[1] https://github.com/EsotericSoftware/kryo/issues/384
[2] https://github.com/EsotericSoftware/kryo/issues/377
[3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
[4] https://github.com/mpryahin/kryo-parametrized-type-inheritance
[GitHub] spark issue #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22179 cc @srowen
[GitHub] spark pull request #22179: [SPARK-23131][BUILD] Upgrade Kryo to 4.0.2
GitHub user wangyum opened a pull request: https://github.com/apache/spark/pull/22179
[SPARK-23131][BUILD] Upgrade Kryo to 4.0.2

## What changes were proposed in this pull request?
Upgrade chill to 0.9.3 and Kryo to 4.0.2 to pick up bug fixes and improvements.
More details:
https://github.com/twitter/chill/releases/tag/v0.9.3
https://github.com/twitter/chill/commit/cc3910d501a844f3c882249fef8fc2560b95b6dd

## How was this patch tested?
Existing tests.

You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wangyum/spark SPARK-23131
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22179.patch
To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:
This closes #22179

commit 8c28e078526445a31099ca5f1ae71ce76d782004
Author: Yuming Wang
Date: 2018-08-22T01:44:09Z
Upgrade chill from 0.8.4 to 0.9.3
[GitHub] spark issue #22159: [BUILD] Close stale PRs
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22159 Please add https://github.com/apache/spark/pull/18424; it has been fixed.
[GitHub] spark pull request #22124: [SPARK-25135][SQL] Insert datasource table may al...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22124#discussion_r211447728
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -384,7 +384,12 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
     }
   }

-  def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
+  def apply(plan: LogicalPlan): LogicalPlan = {
+    plan match {
+      case c: Command => c
+      case _ => removeRedundantAliases(plan, AttributeSet.empty)
--- End diff --
For example:
```scala
val path = "/tmp/spark/parquet"
val cnt = 30
spark.range(cnt).selectExpr("id as col1").write.mode("overwrite").parquet(path)
spark.sql(s"CREATE TABLE table1(col1 bigint) using parquet location '$path'")
spark.sql("create view view1 as select col1 from table1 where col1 > -20")
// The column name of table2 is inconsistent with the column name of view1.
spark.sql("create table table2 (COL1 BIGINT) using parquet")
// When querying the view, ensure that the column name of the query matches the column name of the target table.
spark.sql("insert overwrite table table2 select COL1 from view1")
```
The execution plan change track (old plan on the left, new plan on the right):
```scala
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project ['id AS col1#2]                    Project [id#0L AS col1#2L]
 +- Range (0, 30, step=1, splits=Some(1))    +- Range (0, 30, step=1, splits=Some(1))

17:02:55.061 WARN org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.CleanupAliases ===
 Project [id#0L AS col1#2L]                  Project [id#0L AS col1#2L]
 +- Range (0, 30, step=1, splits=Some(1))    +- Range (0, 30, step=1, splits=Some(1))

17:02:59.174 WARN org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.execution.datasources.DataSourceAnalysis ===
!'CreateTable `table1`, ErrorIfExists        CreateDataSourceTableCommand `table1`, false

17:02:59.909 WARN org.apache.hadoop.hive.metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
17:03:00.094 WARN org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
 'Project ['col1]                            'Project ['col1]
 +- 'Filter ('col1 > -20)                    +- 'Filter ('col1 > -20)
!   +- 'UnresolvedRelation `table1`             +- 'SubqueryAlias `default`.`table1`
!                                                  +- 'UnresolvedCatalogRelation `default`.`table1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe

17:03:00.254 WARN org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.execution.datasources.FindDataSourceTable ===
 'Project ['col1]                            'Project ['col1]
 +- 'Filter ('col1 > -20)                    +- 'Filter ('col1 > -20)
!   +- 'SubqueryAlias `default`.`table1`        +- SubqueryAlias `default`.`table1`
!      +- 'UnresolvedCatalogRelation `default`.`table1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe        +- Relation[col1#5L] parquet

17:03:00.267 WARN org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
 'Project ['col1]                            'Project ['col1]
!+- 'Filter ('col1 > -20)                    +- Filter (col1#5L > -20)
    +- SubqueryAlias `default`.`table1`         +- SubqueryAlias `default`.`table1`
       +- Relation[col1#5L] parquet                +- Relation[col1#5L] parquet

17:03:00.306 WARN org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts ===
 'Project ['col1]                            'Project ['col1]
!+- 'Filter (col1#5L > -20)                  +- Filter (col1#5L > cast(-20 as bigint))
    +- SubqueryAlias `default`.`table1`         +- SubqueryAlias `default`.`table1`
       +- Relation[col1#5L] parquet                +- Relation[col1#5L] parquet

17:03:00.309 WARN org.apache.spark.sql.hive.HiveSessionStateBui
[GitHub] spark pull request #22124: [SPARK-25135][SQL] Insert datasource table may al...
Github user wangyum commented on a diff in the pull request: https://github.com/apache/spark/pull/22124#discussion_r211208353
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ---
@@ -384,7 +384,12 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
     }
   }

-  def apply(plan: LogicalPlan): LogicalPlan = removeRedundantAliases(plan, AttributeSet.empty)
+  def apply(plan: LogicalPlan): LogicalPlan = {
+    plan match {
+      case c: Command => c
+      case _ => removeRedundantAliases(plan, AttributeSet.empty)
--- End diff --
Yes, this is correct. Without this PR, `RemoveRedundantAliases` works like this (old plan on the left, new plan on the right):
```scala
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases ===
InsertIntoHadoopFsRelationCommand file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehouse-ae504f50-9543-49fb-a    InsertIntoHadoopFsRelationCommand file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehouse-ae504f50-9543-49fb-acf
Database: default                                    Database: default
Table: table2                                        Table: table2
Owner: yumwang                                       Owner: yumwang
Created Time: Mon Aug 20 03:03:52 PDT 2018           Created Time: Mon Aug 20 03:03:52 PDT 2018
Last Access: Wed Dec 31 16:00:00 PST 1969            Last Access: Wed Dec 31 16:00:00 PST 1969
Created By: Spark 2.4.0-SNAPSHOT                     Created By: Spark 2.4.0-SNAPSHOT
Type: MANAGED                                        Type: MANAGED
Provider: hive                                       Provider: hive
Table Properties: [transient_lastDdlTime=1534759432]    Table Properties: [transient_lastDdlTime=1534759432]
Location: file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehouse-ae504f50-9543-49fb-acf0-8b2736665d26/table2    Location: file:/private/var/folders/tg/f5mz46090wg7swzgdc69f8q03965_0/T/warehouse-ae504f50-9543-49fb-acf0-8b2736665d26/table2
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat    InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat    OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]         Storage Properties: [serialization.format=1]
Partition Provider: Catalog                          Partition Provider: Catalog
Schema: root                                         Schema: root
|-- COL1: long (nullable = true)                     |-- COL1: long (nullable = true)
|-- COL2: long (nullable = true)                     |-- COL2: long (nullable = true)
!), org.apache.spark.sql.execution.datasources.InMemoryFileIndex@60582d55, [COL1#10L, COL2#11L]    ), org.apache.spark.sql.execution.datasources.InMemoryFileIndex@60582d55, [col1#8L, col2#9L]
!+- Project [col1#8L AS col1#10L, col2#9L AS col2#11L]    +- Project [col1#8L, col2#9L]
 +- Filter (col1#8L > -20)                           +- Filter (col1#8L > -20)
 +- Relation[col1#8L,col2#9L] parquet                +- Rela
```
[GitHub] spark issue #22148: [SPARK-25132][SQL] Case-insensitive field resolution whe...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22148 retest this please
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22124 Comes from: [[SPARK-22834][SQL] Make insertion commands have real children to fix UI issues](https://github.com/apache/spark/pull/20020).
[GitHub] spark issue #22124: [SPARK-25135][SQL] Insert datasource table may all null ...
Github user wangyum commented on the issue: https://github.com/apache/spark/pull/22124 The root project should be consistent with the schema of the target table, but it is currently inconsistent.

**Before this PR**:
[dataColumns](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L84): `col1#8L,col2#9L`
[plan](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L67):
```
*(1) Project [col1#8L, col2#9L]
+- *(1) Filter (isnotnull(col1#8L) && (col1#8L > -20))
   +- *(1) FileScan parquet default.table1[col1#8L,col2#9L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], ReadSchema: struct<col1:bigint,col2:bigint>
```

**After this PR**:
[dataColumns](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L84): `COL1#14L,COL2#15L`
[plan](https://github.com/apache/spark/blob/e6c6f90a55241905c420afbc803dd3bd6961d66b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L67):
```
*(1) Project [col1#8L AS COL1#14L, col2#9L AS COL2#15L]
+- *(1) Filter (isnotnull(col1#8L) && (col1#8L > -20))
   +- *(1) FileScan parquet default.table1[col1#8L,col2#9L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], ReadSchema: struct<col1:bigint,col2:bigint>
```

**Before [SPARK-22834](https://issues.apache.org/jira/browse/SPARK-22834)**:
[dataColumns](https://github.com/apache/spark/blob/ec122209fb35a65637df42eded64b0203e105aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L124): `COL1#19L,COL2#20L`
[queryExecution](https://github.com/apache/spark/blob/ec122209fb35a65637df42eded64b0203e105aae/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L104):
```
== Parsed Logical Plan ==
Project [COL1#19L, COL2#20L]
+- SubqueryAlias view1
   +- View (`default`.`view1`, [col1#19L,col2#20L])
      +- Project [col1#15L, col2#16L]
         +- Filter (col1#15L > cast(-20 as bigint))
            +- SubqueryAlias table1
               +- Relation[col1#15L,col2#16L] parquet

== Analyzed Logical Plan ==
COL1: bigint, COL2: bigint
Project [COL1#19L, COL2#20L]
+- SubqueryAlias view1
   +- View (`default`.`view1`, [col1#19L,col2#20L])
      +- Project [cast(col1#15L as bigint) AS col1#19L, cast(col2#16L as bigint) AS col2#20L]
         +- Project [col1#15L, col2#16L]
            +- Filter (col1#15L > cast(-20 as bigint))
               +- SubqueryAlias table1
                  +- Relation[col1#15L,col2#16L] parquet

== Optimized Logical Plan ==
Filter (isnotnull(col1#15L) && (col1#15L > -20))
+- Relation[col1#15L,col2#16L] parquet

== Physical Plan ==
*Project [col1#15L, col2#16L]
+- *Filter (isnotnull(col1#15L) && (col1#15L > -20))
   +- *FileScan parquet default.table1[col1#15L,col2#16L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/yumwang/spark/parquet], PartitionFilters: [], PushedFilters: [IsNotNull(col1), GreaterThan(col1,-20)], ReadSchema: struct<col1:bigint,col2:bigint>
```