[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21741

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202451540

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
--- End diff --

I would just note that push-down doesn't work for INT96 timestamps in the file. It should work for the others.
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202451374

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
+      .internal()
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

Because we're using the file schema, it doesn't matter what the write configuration is. It only matters what it was when the file was written. If the file has an INT96 timestamp, this should just not push anything down.
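The rule rdblue describes — push-down eligibility depends on the timestamp type recorded in the file being read, not on any write-side configuration — can be sketched as a pure function. This is an illustrative sketch with hypothetical names, not Spark's actual implementation:

```scala
// Sketch: decide whether a timestamp filter can be pushed down based on the
// Parquet type found in the file's schema. Names are hypothetical.
object TimestampPushdown {
  sealed trait ParquetTimestampType
  case object Int96 extends ParquetTimestampType
  case object TimestampMillis extends ParquetTimestampType
  case object TimestampMicros extends ParquetTimestampType

  // INT96 values are not comparable via Parquet's min/max statistics,
  // so no filter is generated for them; the INT64-backed logical types are fine.
  def canPushDown(fileType: ParquetTimestampType): Boolean = fileType match {
    case TimestampMillis | TimestampMicros => true
    case Int96                             => false
  }
}
```

The key design point is that the decision is made per file at read time, so a table mixing INT96 and INT64 timestamp files simply skips push-down for the INT96 files.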
Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202423258

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
--- End diff --

You need to explain how to use `spark.sql.parquet.outputTimestampType` to control the Parquet timestamp type Spark uses to write Parquet files.
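For context, the write-side type and the push-down flags interact roughly like this; a sketch of the relevant settings (assuming Spark 2.3+, where `INT96` is the default output type):

```properties
# Which Parquet type Spark uses when writing TIMESTAMP columns:
# INT96 (default), TIMESTAMP_MICROS, or TIMESTAMP_MILLIS.
spark.sql.parquet.outputTimestampType=TIMESTAMP_MICROS

# Master push-down switch, plus the per-type flag added by this PR.
# The timestamp flag only takes effect when reading files written with
# TIMESTAMP_MICROS or TIMESTAMP_MILLIS; INT96 files are never pushed down.
spark.sql.parquet.filterPushdown=true
spark.sql.parquet.filterPushdown.timestamp=true
```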
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202287558

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
--- End diff --

... I don't think users will understand any of them ..
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202277812

--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -578,3 +578,127 @@
 Native ORC Vectorized                     11622 / 12196   1.4   7
 Native ORC Vectorized (Pushdown)          11377 / 11654   1.4   723.3   1.0X
+
+Pushdown benchmark for Timestamp
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
--- End diff --

OK. I'll send a follow-up PR.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202277658

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -517,7 +585,6 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 
-
--- End diff --

OK
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202277483

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
--- End diff --

I think end users have a better understanding of `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS`.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202261810

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -517,7 +585,6 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 
-
--- End diff --

nit: I would revert this change if you are going to push more changes.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202261386

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
--- End diff --

Shall we note `INT64` here?
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r202260518

--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -578,3 +578,127 @@
 Native ORC Vectorized                     11622 / 12196   1.4   7
 Native ORC Vectorized (Pushdown)          11377 / 11654   1.4   723.3   1.0X
+
+Pushdown benchmark for Timestamp
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)): Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
--- End diff --

shall we add a new line after the benchmark name? e.g.

```
Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)):
Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
...
```

We can send a follow-up PR to fix this entire file.
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r201889164

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 
+  test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
+    val ts1 = Timestamp.valueOf("2018-06-14 08:28:53.123")
+    val ts2 = Timestamp.valueOf("2018-06-15 08:28:53.123")
+    val ts3 = Timestamp.valueOf("2018-06-16 08:28:53.123")
+    val ts4 = Timestamp.valueOf("2018-06-17 08:28:53.123")
+
+    val data = Seq(ts1, ts2, ts3, ts4)
+
+    withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
--- End diff --

I changed it to:

```scala
// spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS
val millisData = Seq(
  Timestamp.valueOf("2018-06-14 08:28:53.123"),
  Timestamp.valueOf("2018-06-15 08:28:53.123"),
  Timestamp.valueOf("2018-06-16 08:28:53.123"),
  Timestamp.valueOf("2018-06-17 08:28:53.123"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
    ParquetOutputTimestampType.TIMESTAMP_MILLIS.toString) {
  testTimestampPushdown(millisData)
}

// spark.sql.parquet.outputTimestampType = TIMESTAMP_MICROS
val microsData = Seq(
  Timestamp.valueOf("2018-06-14 08:28:53.123456"),
  Timestamp.valueOf("2018-06-15 08:28:53.123456"),
  Timestamp.valueOf("2018-06-16 08:28:53.123456"),
  Timestamp.valueOf("2018-06-17 08:28:53.123456"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
    ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) {
  testTimestampPushdown(microsData)
}
```

We shouldn't use the same data to test the `TIMESTAMP_MILLIS` type and the `TIMESTAMP_MICROS` type:

1. The `TIMESTAMP_MILLIS` type will truncate the trailing `456` if we use `microsData` to test.
2. It can't exercise `DateTimeUtils.fromJavaTimestamp(t.asInstanceOf[Timestamp])` if we use `millisData`.
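The precision difference that motivates the separate test data can be shown in plain Scala, without Spark. This is a sketch: `toMicros` mirrors the arithmetic of `DateTimeUtils.fromJavaTimestamp` (microseconds since the epoch), and `toMillisTruncatedMicros` models what a `TIMESTAMP_MILLIS` column can store; both helper names are hypothetical.

```scala
import java.sql.Timestamp

object TimestampPrecision {
  // Microseconds since the epoch, keeping the sub-millisecond digits that
  // java.sql.Timestamp stores in getNanos (mirrors DateTimeUtils.fromJavaTimestamp).
  def toMicros(t: Timestamp): Long =
    t.getTime * 1000L + (t.getNanos.toLong / 1000) % 1000L

  // What a TIMESTAMP_MILLIS column can represent: the sub-millisecond
  // digits (the trailing "456" in the micros test data) are lost.
  def toMillisTruncatedMicros(t: Timestamp): Long =
    t.getTime * 1000L
}

val ts = Timestamp.valueOf("2018-06-14 08:28:53.123456")
// toMicros(ts) ends in ...123456; toMillisTruncatedMicros(ts) ends in ...123000.
```

This is why `microsData` written as `TIMESTAMP_MILLIS` would silently round-trip to different values, breaking equality-based filter assertions.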
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r201697107

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 
+  test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
--- End diff --

I think we should also test the INT96 timestamp type, and maybe also the case when `PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED` is disabled.
Github user gengliangwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r201696558

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     }
   }
 
+  test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
+    val ts1 = Timestamp.valueOf("2018-06-14 08:28:53.123")
+    val ts2 = Timestamp.valueOf("2018-06-15 08:28:53.123")
+    val ts3 = Timestamp.valueOf("2018-06-16 08:28:53.123")
+    val ts4 = Timestamp.valueOf("2018-06-17 08:28:53.123")
+
+    val data = Seq(ts1, ts2, ts3, ts4)
+
+    withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
--- End diff --

This case is quite similar to the one below. Should we use a loop for setting the key `SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key` to avoid duplicated code?
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21741#discussion_r201233974

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.timestamp")
+      .doc("If true, enables Parquet filter push-down optimization for Timestamp. " +
+        "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is " +
+        "enabled and Timestamp stored as TIMESTAMP_MICROS or TIMESTAMP_MILLIS type.")
+      .internal()
+      .booleanConf
+      .createWithDefault(true)
--- End diff --

Maybe the default should be `false`, because the `PARQUET_OUTPUT_TIMESTAMP_TYPE` default is `INT96`.
GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/21741

[SPARK-24718][SQL] Timestamp support pushdown to parquet data source

## What changes were proposed in this pull request?

Support `Timestamp` push-down to the Parquet data source. Only `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` support push-down.

## How was this patch tested?

Unit tests and benchmark tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-24718

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21741.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21741

commit b2a90005b95b4fc6e51d43d74784a6ff4397005a
Author: Yuming Wang
Date:   2018-07-10T06:45:17Z

    Timestamp support pushdown to parquet data source