[GitHub] spark pull request #20851: [SPARK-23727][SQL] Support for pushing down filte...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20851 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176971232 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -129,6 +154,10 @@ private[parquet] object ParquetFilters { case BinaryType => (n: String, v: Any) => FilterApi.gt(binaryColumn(n), Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])) +case DateType if SQLConf.get.parquetFilterPushDownDate => --- End diff -- we should refactor it, so that adding a new data type doesn't need to touch so many places. This can be done later.
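A possible shape for the refactoring cloud-fan suggests, as a hypothetical sketch (names like `FilterValueConverters` and `toParquetValue` are illustrative, not the actual Spark code): keep the per-type value conversion in exactly one place, so every predicate builder (eq, lt, gt, ...) reuses it and supporting a new data type means adding a single case instead of touching each builder.

```scala
import java.sql.Date

// Hypothetical sketch of the refactoring idea (not Spark's actual code):
// one central per-type value converter, reused by every predicate builder.
object FilterValueConverters {
  sealed trait DataType
  case object IntegerType extends DataType
  case object DateType    extends DataType

  // Days since 1970-01-01, derived from the Date's calendar fields only
  // (no timezone conversion), matching the discussion in this thread.
  private def dateToDays(d: Date): Int = d.toLocalDate.toEpochDay.toInt

  // The single switch over data types; eq/lt/gt/... would all call this
  // instead of each repeating a DateType (and future type) case.
  def toParquetValue(dataType: DataType): Any => Any = dataType match {
    case IntegerType => v => v.asInstanceOf[Integer]
    case DateType    => v => Int.box(dateToDays(v.asInstanceOf[Date]))
  }
}
```

With this layout, the DateType branch the PR adds in several methods collapses to one entry in `toParquetValue`.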
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176971146 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -50,6 +59,10 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), +Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer]).orNull) --- End diff -- Sorry, I was wrong. I took a look at how these dates get created, in `DataSourceStrategy.translateFilter`: they are actually created via `DateTimeUtils.toJavaDate` without a timezone, which means we should not use a timezone here either.
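The timezone point above can be illustrated with a minimal conversion (this is a sketch, not Spark's actual `DateTimeUtils` implementation): `java.sql.Date.toLocalDate` is the inverse of `Date.valueOf`, so a date built from a string like "2018-03-18" always maps back to the same calendar day regardless of the JVM's default timezone, whereas dividing the raw millis by 86,400,000 could be off by one day.

```scala
import java.sql.Date

// Illustration only (not Spark's actual dateToDays): convert a
// java.sql.Date to days-since-epoch via its calendar fields, the
// inverse of Date.valueOf, so no explicit timezone handling is needed.
def dateToDaysNoTz(date: Date): Int = date.toLocalDate.toEpochDay.toInt
```

For example, `dateToDaysNoTz(Date.valueOf("1970-02-01"))` yields 31 on any JVM, since January 1970 has 31 days.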
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176966288 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") +.doc("If true, enables Parquet filter push-down optimization for Date. " + + "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") +.internal() +.booleanConf +.createWithDefault(false) --- End diff -- @dongjoon-hyun, we are investigating using this feature in some of eBay's queries; if its performance is good, it will benefit us a lot. As per our discussion, I will turn it on by default. Thanks very much!
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176948516 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- Great! Thank you for confirmation.
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176946273 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- This is not a bug fix. We will not backport it to branch-2.3.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176922300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- If you think so, +1. :) BTW, based on the Apache Spark way, I assume that this will not land on `branch-2.3` with `spark.sql.parquet.filterPushdown.date=true`.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176897485 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- I think it's common that we turn on a new feature by default if there is no known regression, and turn it off if we find a regression later.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176762766 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- @yucai. The reason is that `spark.sql.orc.filterPushdown` is still `false` in Spark 2.3 while `spark.sql.parquet.filterPushdown` is `true`. We don't know whether this is safe or not. Anyway, we have 6 or more months until Apache Spark 2.4. We can enable this in the `master` branch temporarily for testing purposes only, and disable it at the last moment of the 2.4 release, like we did with the ORC conf, if there is some issue. BTW, have you used this a lot in production at your company?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176676405 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- I am fine either way.
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176660685 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- @dongjoon-hyun @HyukjinKwon, you suggested that such a configuration should start with "false", but ORC actually supports Date push-down by default. Any idea?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176636849 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(false) --- End diff -- an internal by-default-false conf usually means it's not available for users...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176633578 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,12 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") +.doc("If true, enables Parquet filter push-down optimization for Date. " + --- End diff -- Yes, it should be an internal conf. In the Spark 3.0 release, we will revisit all the internal confs and remove all the unnecessary ones.
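The conf under discussion follows SQLConf's builder pattern: a typed entry with a key, a doc string, and a default, read at the pushdown site. A stripped-down stand-in for that pattern (hypothetical; `BoolConf`, `MiniConf`, and `parquetFilterPushDownDate` here are illustrative, not the real SQLConf machinery):

```scala
// Minimal stand-in for the SQLConf flag pattern (hypothetical, not the
// real SQLConf API): a typed boolean entry with a default, plus a getter
// that the filter-conversion code consults before pushing dates down.
final case class BoolConf(key: String, doc: String, default: Boolean)

object MiniConf {
  val ParquetFilterPushdownDate: BoolConf = BoolConf(
    "spark.sql.parquet.filterPushdown.date",
    "If true, enables Parquet filter push-down optimization for Date.",
    default = true)

  private val overrides = scala.collection.mutable.Map.empty[String, Boolean]
  def set(c: BoolConf, v: Boolean): Unit = overrides(c.key) = v
  def get(c: BoolConf): Boolean = overrides.getOrElse(c.key, c.default)
}

// Pushdown site: the DateType case is only taken when the flag is on,
// mirroring `case DateType if SQLConf.get.parquetFilterPushDownDate`.
def parquetFilterPushDownDate: Boolean =
  MiniConf.get(MiniConf.ParquetFilterPushdownDate)
```

Flipping the default (the `true` vs `false` debate above) is then a one-line change in the conf definition, without touching the pushdown site.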
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176317120 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,48 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class StringToDate(s: String) { + def date: Date = Date.valueOf(s) +} + +val data = Seq("2018-03-18", "2018-03-19", "2018-03-20", "2018-03-21") + +withParquetDataFrame(data.map(i => Tuple1(i.date))) { implicit df => + checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) + checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], data.map(i => Row.apply(i.date))) + + checkFilterPredicate('_1 === "2018-03-18".date, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate('_1 <=> "2018-03-18".date, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate('_1 =!= "2018-03-18".date, classOf[NotEq[_]], +Seq("2018-03-19", "2018-03-20", "2018-03-21").map(i => Row.apply(i.date))) + + checkFilterPredicate('_1 < "2018-03-19".date, classOf[Lt[_]], "2018-03-18".date) + checkFilterPredicate('_1 > "2018-03-20".date, classOf[Gt[_]], "2018-03-21".date) + checkFilterPredicate('_1 <= "2018-03-18".date, classOf[LtEq[_]], "2018-03-18".date) + checkFilterPredicate('_1 >= "2018-03-21".date, classOf[GtEq[_]], "2018-03-21".date) + + checkFilterPredicate( +Literal("2018-03-18".date) === '_1, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-18".date) <=> '_1, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-19".date) > '_1, classOf[Lt[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-20".date) < '_1, classOf[Gt[_]], "2018-03-21".date) + checkFilterPredicate( +Literal("2018-03-18".date) >= '_1, classOf[LtEq[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-21".date) <= '_1, classOf[GtEq[_]], "2018-03-21".date) + + checkFilterPredicate(!('_1 < "2018-03-21".date), classOf[GtEq[_]], "2018-03-21".date) + checkFilterPredicate( +'_1 < "2018-03-19".date || '_1 > "2018-03-20".date, +classOf[Operators.Or], +Seq(Row("2018-03-18".date), Row("2018-03-21".date))) +} --- End diff -- This is different from what @HyukjinKwon gave you. :)
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176310447 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,49 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class StringToDate(s: String) { + def date: Date = Date.valueOf(s) +} + +withParquetDataFrame( + Seq("2018-03-18", "2018-03-19", "2018-03-20", "2018-03-21").map(i => Tuple1(i.date))) { +implicit df => + checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) + checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], +Seq("2018-03-18", "2018-03-19", "2018-03-20", "2018-03-21").map(i => Row.apply(i.date))) + + checkFilterPredicate('_1 === "2018-03-18".date, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate('_1 <=> "2018-03-18".date, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate('_1 =!= "2018-03-18".date, classOf[NotEq[_]], +Seq("2018-03-19", "2018-03-20", "2018-03-21").map(i => Row.apply(i.date))) + + checkFilterPredicate('_1 < "2018-03-19".date, classOf[Lt[_]], "2018-03-18".date) + checkFilterPredicate('_1 > "2018-03-20".date, classOf[Gt[_]], "2018-03-21".date) + checkFilterPredicate('_1 <= "2018-03-18".date, classOf[LtEq[_]], "2018-03-18".date) + checkFilterPredicate('_1 >= "2018-03-21".date, classOf[GtEq[_]], "2018-03-21".date) + + checkFilterPredicate( +Literal("2018-03-18".date) === '_1, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-18".date) <=> '_1, classOf[Eq[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-19".date) > '_1, classOf[Lt[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-20".date) < '_1, classOf[Gt[_]], "2018-03-21".date) + checkFilterPredicate( +Literal("2018-03-18".date) >= '_1, classOf[LtEq[_]], "2018-03-18".date) + checkFilterPredicate( +Literal("2018-03-21".date) <= '_1, classOf[GtEq[_]], "2018-03-21".date) + + checkFilterPredicate(!('_1 < "2018-03-21".date), classOf[GtEq[_]], "2018-03-21".date) + checkFilterPredicate( +'_1 < "2018-03-19".date || '_1 > "2018-03-20".date, +classOf[Operators.Or], +Seq(Row("2018-03-18".date), Row("2018-03-21".date))) +} --- End diff -- Thanks, @HyukjinKwon! That's better than what I thought.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176308130 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") +.doc("If true, enables Parquet filter push-down optimization for Date. " + + "This configuration only has an effect when 'spark.sql.parquet.filterPushdown' is enabled.") +.internal() +.booleanConf +.createWithDefault(true) --- End diff -- +1
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176308097 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,49 @@ [...] --- End diff --

```scala
test("filter pushdown - date") {
  implicit class StringToDate(s: String) {
    def date: Date = Date.valueOf(s)
  }

  val data = Seq("2018-03-18", "2018-03-19", "2018-03-20", "2018-03-21").map(i => Tuple1(i.date))

  withParquetDataFrame(data) { implicit df =>
    checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row])
    checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], data)

    checkFilterPredicate('_1 === "2018-03-18".date, classOf[Eq[_]], "2018-03-18".date)
    checkFilterPredicate('_1 <=> "2018-03-18".date, classOf[Eq[_]], "2018-03-18".date)
    checkFilterPredicate('_1 =!= "2018-03-18".date, classOf[NotEq[_]],
      Seq("2018-03-19", "2018-03-20", "2018-03-21").map(i => Row.apply(i.date)))

    checkFilterPredicate('_1 < "2018-03-19".date, classOf[Lt[_]], "2018-03-18".date)
    checkFilterPredicate('_1 > "2018-03-20".date, classOf[Gt[_]], "2018-03-21".date)
    checkFilterPredicate('_1 <= "2018-03-18".date, classOf[LtEq[_]], "2018-03-18".date)
    checkFilterPredicate('_1 >= "2018-03-21".date, classOf[GtEq[_]], "2018-03-21".date)

    checkFilterPredicate(Literal("2018-03-18".date) === '_1, classOf[Eq[_]], "2018-03-18".date)
    checkFilterPredicate(Literal("2018-03-18".date) <=> '_1, classOf[Eq[_]], "2018-03-18".date)
    checkFilterPredicate(Literal("2018-03-19".date) > '_1, classOf[Lt[_]], "2018-03-18".date)
    checkFilterPredicate(Literal("2018-03-20".date) < '_1, classOf[Gt[_]], "2018-03-21".date)
    checkFilterPredicate(Literal("2018-03-18".date) >= '_1, classOf[LtEq[_]], "2018-03-18".date)
    checkFilterPredicate(Literal("2018-03-21".date) <= '_1, classOf[GtEq[_]], "2018-03-21".date)

    checkFilterPredicate(!('_1 < "2018-03-21".date), classOf[GtEq[_]], "2018-03-21".date)
    checkFilterPredicate(
      '_1 < "2018-03-19".date || '_1 > "2018-03-20".date,
      classOf[Operators.Or],
      Seq(Row("2018-03-18".date), Row("2018-03-21".date)))
  }
}
```
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176306730 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,49 @@ [...] --- End diff -- Could you kindly show me how to improve? Thanks!
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176304057 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,49 @@ [...] --- End diff -- nit. Indentation.
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176302671 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -50,6 +59,11 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => FilterApi.eq( +intColumn(n), +Option(v).map(date => dateToDays(date.asInstanceOf[java.sql.Date]).asInstanceOf[Integer]) --- End diff -- Since we imported `Date` at line 20, we can remove `java.sql`.

```scala
-Option(v).map(date => dateToDays(date.asInstanceOf[java.sql.Date]).asInstanceOf[Integer])
+Option(v).map(date => dateToDays(date.asInstanceOf[Date]).asInstanceOf[Integer])
```
Github user dongjoon-hyun commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r176302394 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,13 @@ [...] +.createWithDefault(true) --- End diff -- For this kind of thing, we had better start with `false`.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175990680 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- the string one is more readable to me
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175989972 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- As per @cloud-fan's suggestion, I have changed `1.d` to `1.date`. Which one is preferred?
1. "2017-08-18".d
2. 1.date
3. "2017-08-18".date
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175987083 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- Yup, I think this one is better than the current one.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175986514 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- I think `"2017-08-19".d` is at least better than `1.d`.
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175975025 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { + def d: Date = new Date(Date.valueOf("2018-03-01").getTime + 24 * 60 * 60 * 1000 * (int - 1)) +} + +withParquetDataFrame((1 to 4).map(i => Tuple1(i.d))) { implicit df => + checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) + checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(i => Row.apply(i.d))) + + checkFilterPredicate('_1 === 1.d, classOf[Eq[_]], 1.d) --- End diff -- Got it, thanks very much for the explanation.
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175970994 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { + def d: Date = new Date(Date.valueOf("2018-03-01").getTime + 24 * 60 * 60 * 1000 * (int - 1)) +} + +withParquetDataFrame((1 to 4).map(i => Tuple1(i.d))) { implicit df => + checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) + checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(i => Row.apply(i.d))) + + checkFilterPredicate('_1 === 1.d, classOf[Eq[_]], 1.d) --- End diff -- `b` might be OK, but `d` is confusing as it's also a postfix for double.
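The double-suffix collision cloud-fan mentions can be shown in a small standalone snippet: in Scala, `d` is the literal suffix for `Double`, which is exactly why a date-building extension method named `d` reads ambiguously.

```scala
// `d` is Scala's literal suffix for Double: `1d` denotes the Double 1.0.
// A date-building extension method named `d` (used as `1.d`) therefore
// looks just like a numeric literal to the reader.
val asDouble = 1d
assert(asDouble == 1.0)
assert((asDouble: Any).isInstanceOf[Double])
```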
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175962888 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { + def d: Date = new Date(Date.valueOf("2018-03-01").getTime + 24 * 60 * 60 * 1000 * (int - 1)) +} + +withParquetDataFrame((1 to 4).map(i => Tuple1(i.d))) { implicit df => + checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) + checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(i => Row.apply(i.d))) + + checkFilterPredicate('_1 === 1.d, classOf[Eq[_]], 1.d) --- End diff -- I agree `1.date` is better, but binary is using `1.b`; should we keep the same pattern with it?
```
test("filter pushdown - binary") {
  implicit class IntToBinary(int: Int) {
    def b: Array[Byte] = int.toString.getBytes(StandardCharsets.UTF_8)
  }
```
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175854222 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -50,6 +52,12 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => FilterApi.eq( + intColumn(n), + Option(v).map { d => + DateTimeUtils.fromJavaDate(d.asInstanceOf[java.sql.Date]).asInstanceOf[Integer] --- End diff -- I think we should do `DateTimeUtils.fromJavaDate(d.asInstanceOf[java.sql.Date], SQLConf.get.sessionLocalTimeZone).asInstanceOf[Integer]`
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175853868 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -50,6 +52,12 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => FilterApi.eq( + intColumn(n), + Option(v).map { d => + DateTimeUtils.fromJavaDate(d.asInstanceOf[java.sql.Date]).asInstanceOf[Integer] --- End diff -- Shall we respect session local timezone here?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175851942 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +315,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { + def d: Date = new Date(Date.valueOf("2018-03-01").getTime + 24 * 60 * 60 * 1000 * (int - 1)) +} + +withParquetDataFrame((1 to 4).map(i => Tuple1(i.d))) { implicit df => + checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row]) + checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]], (1 to 4).map(i => Row.apply(i.d))) + + checkFilterPredicate('_1 === 1.d, classOf[Eq[_]], 1.d) --- End diff -- `1.d` is weird, can we name it `1.date`?
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175839248 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- Ok, do you mean this way? Looks like we need more words :).
```
implicit class StringToDate(s: String) {
  def d: Date = Date.valueOf(s)
}

withParquetDataFrame(
  Seq("2017-08-18", "2017-08-19", "2017-08-20", "2017-08-21").map(i => Tuple1(i.d))) { implicit df =>
  checkFilterPredicate('_1.isNull, classOf[Eq[_]], Seq.empty[Row])
  checkFilterPredicate('_1.isNotNull, classOf[NotEq[_]],
    Seq("2017-08-18", "2017-08-19", "2017-08-20", "2017-08-21").map(i => Row.apply(i.d)))

  checkFilterPredicate('_1 === "2017-08-18".d, classOf[Eq[_]], "2017-08-18".d)
  checkFilterPredicate('_1 =!= "2017-08-18".d, classOf[NotEq[_]],
    Seq("2017-08-19", "2017-08-20", "2017-08-21").map(i => Row.apply(i.d)))
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175687977 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- Either way looks fine, but I usually stick to what the other code around the fix does.
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175684322 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- How about this way? It is from ORC's date push down.
```
test("filter pushdown - date") {
  val dates = Seq("2017-08-18", "2017-08-19", "2017-08-20", "2017-08-21").map { day =>
    Date.valueOf(day)
  }
  withOrcDataFrame(dates.map(Tuple1(_))) { implicit df =>
    checkFilterPredicate('_1.isNull, PredicateLeaf.Operator.IS_NULL)

    checkFilterPredicate('_1 === dates(0), PredicateLeaf.Operator.EQUALS)
    checkFilterPredicate('_1 <=> dates(0), PredicateLeaf.Operator.NULL_SAFE_EQUALS)

    checkFilterPredicate('_1 < dates(1), PredicateLeaf.Operator.LESS_THAN)
    checkFilterPredicate('_1 > dates(2), PredicateLeaf.Operator.LESS_THAN_EQUALS)
    checkFilterPredicate('_1 <= dates(0), PredicateLeaf.Operator.LESS_THAN_EQUALS)
    checkFilterPredicate('_1 >= dates(3), PredicateLeaf.Operator.LESS_THAN)

    checkFilterPredicate(Literal(dates(0)) === '_1, PredicateLeaf.Operator.EQUALS)
    checkFilterPredicate(Literal(dates(0)) <=> '_1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
    checkFilterPredicate(Literal(dates(1)) > '_1, PredicateLeaf.Operator.LESS_THAN)
    checkFilterPredicate(Literal(dates(2)) < '_1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
    checkFilterPredicate(Literal(dates(0)) >= '_1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
    checkFilterPredicate(Literal(dates(3)) <= '_1, PredicateLeaf.Operator.LESS_THAN)
  }
}
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175651639 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -72,6 +82,14 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.notEq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => { --- End diff -- nit:
```
(n: String, v: Any) => FilterApi.notEq(
  intColumn(n),
  Option(v).map { d =>
    DateTimeUtils.fromJavaDate(d.asInstanceOf[java.sql.Date]).asInstanceOf[Integer]
  }.orNull)
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175649724 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -76,7 +77,9 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex expected: Seq[Row]): Unit = { val output = predicate.collect { case a: Attribute => a }.distinct -withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") { +withSQLConf( + SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true", + SQLConf.PARQUET_FILTER_PUSHDOWN_DATE_ENABLED.key -> "true") { --- End diff -- nit:
```scala
withSQLConf(
  SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true",
  SQLConf.PARQUET_FILTER_PUSHDOWN_DATE_ENABLED.key -> "true",
  SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
```
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175650228 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +316,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { --- End diff -- Shall we pass a string here and convert it into a date?
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175342992 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,12 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") +.doc("If true, enables Parquet filter push-down optimization for Date. " + --- End diff -- @gatorsmile any idea?
Github user manku-timma commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175327686 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -72,6 +82,15 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.notEq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => { +FilterApi.notEq( + intColumn(n), + Option(v).map { date => --- End diff -- These three lines are repeated several times in this change; please refactor them into a new method.
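The suggested refactor could look like the sketch below. The PR did end up with a `dateToDays` helper; this standalone version is only an illustration (Parquet stores `DATE` as an INT32 of days since the Unix epoch), and the plain truncating division stands in for Spark's actual `DateTimeUtils.fromJavaDate`.

```scala
import java.sql.Date

// Hypothetical standalone sketch of the suggested helper.
object ParquetDateHelper {
  private val MillisPerDay: Long = 24L * 60 * 60 * 1000

  // Spark's real code delegates to DateTimeUtils.fromJavaDate; truncating
  // division is shown only for illustration and is exact only for dates
  // on or after the epoch.
  def dateToDays(date: Date): Integer =
    (date.getTime / MillisPerDay).toInt.asInstanceOf[Integer]

  // The three repeated lines collapse into one null-preserving call site.
  def dateToDaysOrNull(v: Any): Integer =
    Option(v).map(d => dateToDays(d.asInstanceOf[Date])).orNull
}
```

Each `case DateType` branch would then pass `ParquetDateHelper.dateToDaysOrNull(v)` instead of repeating the `Option(v).map { ... }.orNull` dance.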
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175317152 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -50,6 +51,15 @@ private[parquet] object ParquetFilters { (n: String, v: Any) => FilterApi.eq( binaryColumn(n), Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull) +case DateType if SQLConf.get.parquetFilterPushDownDate => + (n: String, v: Any) => { +FilterApi.eq( + intColumn(n), + Option(v).map { date => +val days = date.asInstanceOf[java.sql.Date].getTime / (24 * 60 * 60 * 1000) +days.toInt.asInstanceOf[Integer] --- End diff -- Use `DateTimeUtils.fromJavaDate` here and below?
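One concrete reason to prefer the shared utility over the raw `getTime / (24 * 60 * 60 * 1000)` in the quoted diff: Long division in Scala truncates toward zero, so the naive expression is off by one for any pre-epoch date, whereas floor division yields the correct epoch day. A quick standalone check:

```scala
// -1 ms is 1969-12-31T23:59:59.999Z, i.e. epoch day -1.
val millisPerDay = 24L * 60 * 60 * 1000
val truncated = -1L / millisPerDay               // truncates toward zero
val floored = Math.floorDiv(-1L, millisPerDay)   // floors toward -infinity
assert(truncated == 0L)   // wrong day for a pre-epoch timestamp
assert(floored == -1L)    // the day we actually want
```

Centralizing the conversion in `DateTimeUtils.fromJavaDate` keeps such edge cases (and the timezone question discussed above) in one place.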
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175320790 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -353,6 +353,12 @@ object SQLConf { .booleanConf .createWithDefault(true) + val PARQUET_FILTER_PUSHDOWN_DATE_ENABLED = buildConf("spark.sql.parquet.filterPushdown.date") +.doc("If true, enables Parquet filter push-down optimization for Date. " + --- End diff -- Should be an internal conf, i.e., `.internal()`?
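If the reviewers settle on making the conf internal, the change is a one-line addition to the builder chain quoted above (a sketch against Spark's conf DSL, not standalone-compilable; the `.doc` text is elided here as in the diff):

```
buildConf("spark.sql.parquet.filterPushdown.date")
  .internal()   // added, per the review comment
  .doc(...)     // unchanged
  .booleanConf
  .createWithDefault(true)
```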
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175293032 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala --- @@ -148,6 +193,15 @@ private[parquet] object ParquetFilters { case BinaryType => (n: String, v: Any) => FilterApi.gtEq(binaryColumn(n), Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])) +case DateType => --- End diff -- I have added it; kindly help review.
Github user yucai commented on a diff in the pull request: https://github.com/apache/spark/pull/20851#discussion_r175293026 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala --- @@ -313,6 +314,36 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex } } + test("filter pushdown - date") { +implicit class IntToDate(int: Int) { + def d: Date = new Date(Date.valueOf("2018-03-01").getTime + 24 * 60 * 60 * 1000 * (int - 1)) +} + +withParquetDataFrame((1 to 4).map(i => Tuple1(i.d))) { implicit df => --- End diff -- Could you kindly give me some examples of the kinds of boundary tests you have in mind? I checked the Parquet integer push-down and the ORC date type push-down, and it seems I have covered all their tests.