[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-14 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21741


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202451540
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
--- End diff --

I would just note that push-down doesn't work for INT96 timestamps in the 
file. It should work for the others.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202451374
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
+.internal()
+.booleanConf
+.createWithDefault(true)
--- End diff --

Because we're using the file schema, it doesn't mater what the write 
configuration is. It only matters what it was when the file was written. If the 
file has an INT96 timestamp, this should just not push anything down.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202423258
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
--- End diff --

You need to explain how to use `spark.sql.parquet.outputTimestampType` to 
control the Parquet timestamp type  Spark uses to writes parquet files.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202287558
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
--- End diff --

... I don't think users will understand any of them .. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202277812
  
--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -578,3 +578,127 @@ Native ORC Vectorized   11622 / 
12196  1.4 7
 Native ORC Vectorized (Pushdown)11377 / 11654  1.4 
723.3   1.0X
 
 

+
+Pushdown benchmark for Timestamp

+
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS 
timestamp)): Best/Avg Time(ms)Rate(M/s)   Per Row(ns)   Relative
--- End diff --

OK. I'll send a follow-up PR.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202277658
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -517,7 +585,6 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
 }
   }
 
-
--- End diff --

OK


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202277483
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
--- End diff --

I think end users have a better understanding of `TIMESTAMP_MICROS` and 
`TIMESTAMP_MILLIS`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202261810
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -517,7 +585,6 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
 }
   }
 
-
--- End diff --

nit: I would revert this change if you are going to push more changes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202261386
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
--- End diff --

Shell we note `INT64` here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-13 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r202260518
  
--- Diff: sql/core/benchmarks/FilterPushdownBenchmark-results.txt ---
@@ -578,3 +578,127 @@ Native ORC Vectorized   11622 / 
12196  1.4 7
 Native ORC Vectorized (Pushdown)11377 / 11654  1.4 
723.3   1.0X
 
 

+
+Pushdown benchmark for Timestamp

+
+
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
+Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
+
+Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS 
timestamp)): Best/Avg Time(ms)Rate(M/s)   Per Row(ns)   Relative
--- End diff --

shall we add a new line after the benchmark name? e.g.
```
Select 1 timestamp stored as INT96 row (value = CAST(7864320 AS timestamp)):
Best/Avg Time(ms)Rate(M/s)   Per Row(ns)   Relative
...
```

We can send a follow-up PR to fix this entire file.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-11 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r201889164
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
 }
   }
 
+  test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
+val ts1 = Timestamp.valueOf("2018-06-14 08:28:53.123")
+val ts2 = Timestamp.valueOf("2018-06-15 08:28:53.123")
+val ts3 = Timestamp.valueOf("2018-06-16 08:28:53.123")
+val ts4 = Timestamp.valueOf("2018-06-17 08:28:53.123")
+
+val data = Seq(ts1, ts2, ts3, ts4)
+
+withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
--- End diff --

I changed to:
```scala
// spark.sql.parquet.outputTimestampType = TIMESTAMP_MILLIS
val millisData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123"),
  Timestamp.valueOf("2018-06-15 08:28:53.123"),
  Timestamp.valueOf("2018-06-16 08:28:53.123"),
  Timestamp.valueOf("2018-06-17 08:28:53.123"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
  ParquetOutputTimestampType.TIMESTAMP_MILLIS.toString) {
  testTimestampPushdown(millisData)
}

// spark.sql.parquet.outputTimestampType = TIMESTAMP_MICROS
val microsData = Seq(Timestamp.valueOf("2018-06-14 08:28:53.123456"),
  Timestamp.valueOf("2018-06-15 08:28:53.123456"),
  Timestamp.valueOf("2018-06-16 08:28:53.123456"),
  Timestamp.valueOf("2018-06-17 08:28:53.123456"))
withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
  ParquetOutputTimestampType.TIMESTAMP_MICROS.toString) {
  testTimestampPushdown(microsData)
}
```
We shouldn't use same data to test `TIMESTAMP_MILLIS` type and 
`TIMESTAMP_MICROS` type:
1.  `TIMESTAMP_MILLIS` type will truncate `456` if use  `microsData` to 
test.
2. It can't test 
`DateTimeUtils.fromJavaTimestamp(t.asInstanceOf[Timestamp]` if use `millisData`.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r201697107
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
 }
   }
 
+  test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
--- End diff --

I think we should also test INT96 timestamp type. Also maybe when 
`PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED` is disabled.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-11 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r201696558
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
 ---
@@ -387,6 +389,82 @@ class ParquetFilterSuite extends QueryTest with 
ParquetTest with SharedSQLContex
 }
   }
 
+  test("filter pushdown - timestamp(TIMESTAMP_MILLIS)") {
+val ts1 = Timestamp.valueOf("2018-06-14 08:28:53.123")
+val ts2 = Timestamp.valueOf("2018-06-15 08:28:53.123")
+val ts3 = Timestamp.valueOf("2018-06-16 08:28:53.123")
+val ts4 = Timestamp.valueOf("2018-06-17 08:28:53.123")
+
+val data = Seq(ts1, ts2, ts3, ts4)
+
+withSQLConf(SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key ->
--- End diff --

This case is quite similar to the one below. Should we use a loop for 
setting the key `SQLConf.PARQUET_OUTPUT_TIMESTAMP_TYPE.key` to avoid duplicated 
code.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-10 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21741#discussion_r201233974
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,15 @@ object SQLConf {
 .booleanConf
 .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_TIMESTAMP_ENABLED =
+buildConf("spark.sql.parquet.filterPushdown.timestamp")
+  .doc("If true, enables Parquet filter push-down optimization for 
Timestamp. " +
+"This configuration only has an effect when 
'spark.sql.parquet.filterPushdown' is " +
+"enabled and Timestamp stored as TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS type.")
+.internal()
+.booleanConf
+.createWithDefault(true)
--- End diff --

May be default should be `false`. because `PARQUET_OUTPUT_TIMESTAMP_TYPE` 
default is `INT96`.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21741: [SPARK-24718][SQL] Timestamp support pushdown to ...

2018-07-10 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/21741

[SPARK-24718][SQL] Timestamp support pushdown to parquet data source

## What changes were proposed in this pull request?

`Timestamp` support pushdown to parquet data source.
Only `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` support push down.

## How was this patch tested?

unit tests and benchmark tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-24718

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21741.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21741


commit b2a90005b95b4fc6e51d43d74784a6ff4397005a
Author: Yuming Wang 
Date:   2018-07-10T06:45:17Z

Timestamp support pushdown to parquet data source




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org