[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-05-18 Thread HyukjinKwon
Github user HyukjinKwon closed the pull request at:

https://github.com/apache/spark/pull/17680





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-05-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r116579008
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -235,3 +238,39 @@ private[parquet] object ParquetFilters {
     }
   }
 }
+
+/**
+ * Note that, this is a hacky workaround to allow dots in column names. Currently, column APIs
+ * in Parquet's `FilterApi` only allows dot-separated names so here we resemble those columns
+ * but only allow single column path that allows dots in the names as we don't currently push
+ * down filters with nested fields.
--- End diff --

We need to explain that the functions in this object `ParquetColumns` are based on the code in `org.apache.parquet.filter2.predicate`. Thus, when upgrading the Parquet version, we need to check whether they are still the same.
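
One way the requested note could read, as a sketch (the wording here is illustrative, not the PR's final text):

```scala
/**
 * NOTE: the functions in this object are based on the code in
 * `org.apache.parquet.filter2.predicate`. When upgrading the Parquet version,
 * check whether that code has changed; if so, update these functions to match.
 */
private[parquet] object ParquetColumns {
  // column constructors mirroring Parquet's FilterApi go here
}
```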





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-05-15 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r116572835
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,53 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+    import testImplicits._
+
+    Seq(true, false).foreach { vectorized =>
+      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
+        withTempPath { path =>
+          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
+        }
--- End diff --

Instead of duplicating the code, please write a helper function.
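
A minimal sketch of such a helper, assuming access to the suite's `spark`, `withTempPath`, and `testImplicits` members; the name `checkDotsFilter` and its signature are hypothetical, not from the PR:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: writes the DataFrame to a temporary Parquet path,
// reads it back with the given filter, and checks the resulting row count.
private def checkDotsFilter(df: DataFrame, filter: String, expectedCount: Long): Unit = {
  withTempPath { path =>
    df.write.parquet(path.getAbsolutePath)
    assert(spark.read.parquet(path.getAbsolutePath).where(filter).count() == expectedCount)
  }
}

// Each duplicated block then collapses to a single call, e.g.:
// checkDotsFilter(Seq(Some(1), None).toDF("col.dots"), "`col.dots` > 0", 1L)
```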





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-20 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112531312
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+    import testImplicits._
+
+    Seq(true, false).foreach { vectorized =>
+      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
+        withTempPath { path =>
+          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(
+            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)
--- End diff --

thanks for adding the additional test below





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-20 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112529697
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala ---
@@ -487,6 +487,20 @@ class FileSourceStrategySuite extends QueryTest with SharedSQLContext with PredicateHelper
     }
   }
 
+  test("no filter puwhdown for nested field access") {
--- End diff --

nit: pushdown





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-20 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112530989
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
--- End diff --

thanks for the change -- I wasn't sure whether predicate pushdown worked on nested columns, and it looks like this change confirms it does not.





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112347873
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+    import testImplicits._
+
+    Seq(true, false).foreach { vectorized =>
+      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
+        withTempPath { path =>
+          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(
+            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)

Actually, `IS NULL` is not the problem here.

```scala
val path = "/tmp/abcde"
spark.read.parquet(path).where("`col.dots` IS NULL").show()
```

```
+--------+
|col.dots|
+--------+
|    null|
+--------+
```

The reason is that, after we upgraded Parquet to 1.8.2, Parquet permissively produces `null` when a requested column does not exist, AFAIK. If this reason needs to be verified, I will look into it further. But in terms of the output, the issue is not reproduced.





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112288596
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
--- End diff --

To my knowledge, we don't push down filters with nested columns. Let me check whether we already have the negative case explicitly, and add it if it is missing.





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112285677
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
--- End diff --

is there another existing test that checks that pushdown with the `struct.field1` syntax works correctly? I'm not sure how to reference inner fields of a struct column since I don't use them much personally, but I want to make sure that's not broken as a result of this change.





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread ash211
Github user ash211 commented on a diff in the pull request:

https://github.com/apache/spark/pull/17680#discussion_r112285883
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -536,4 +537,43 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContext
       // scalastyle:on nonascii
     }
   }
+
+  test("SPARK-20364: Predicate pushdown for columns with a '.' in them") {
+    import testImplicits._
+
+    Seq(true, false).foreach { vectorized =>
+      withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorized.toString) {
+        withTempPath { path =>
+          Seq(Some(1), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` > 0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1L), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` >= 1L").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0F), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` < 2.0").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(Some(1.0D), None).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` <= 1.0D").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq(true, false).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(spark.read.parquet(path.getAbsolutePath).where("`col.dots` == true").count() == 1)
+        }
+
+        withTempPath { path =>
+          Seq("apple", null).toDF("col.dots").write.parquet(path.getAbsolutePath)
+          assert(
+            spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NOT NULL").count() == 1)
--- End diff --

please also add the check that `IS NULL` returns 1 row too, so this is symmetric
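
A sketch of the symmetric assertion being requested, reusing the same temp path as the `IS NOT NULL` check above:

```scala
assert(
  spark.read.parquet(path.getAbsolutePath).where("`col.dots` IS NULL").count() == 1)
```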





[GitHub] spark pull request #17680: [SPARK-20364][SQL] Support Parquet predicate push...

2017-04-19 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/17680

[SPARK-20364][SQL] Support Parquet predicate pushdown on columns with dots

## What changes were proposed in this pull request?

Currently, if there are dots in the column name, predicate pushdown seems to fail in Parquet.

**With dots**

```
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
+--------+
```

**Without dots**

```
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("coldots").write.parquet(path)
spark.read.parquet(path).where("`coldots` IS NOT NULL").show()
```

```
+-------+
|coldots|
+-------+
|      1|
+-------+
```

It seems that, for a column name containing dots, `FilterApi` splits the field name on the dots (producing a `ColumnPath` with multiple path segments), whereas the actual column name is `col.dots`. (See [FilterApi.java#L71](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71), which calls [ColumnPath.java#L44](https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44).)
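
A small illustration of the difference, as a sketch against the Parquet 1.8.x API:

```scala
import scala.collection.JavaConverters._
import org.apache.parquet.hadoop.metadata.ColumnPath

// fromDotString treats the dot as a path separator: two segments,
// as if "dots" were a nested field inside a "col" struct.
val split = ColumnPath.fromDotString("col.dots")
println(split.asScala.toList)   // List(col, dots)

// get keeps the whole name as a single segment, preserving the dot.
val single = ColumnPath.get("col.dots")
println(single.asScala.toList)  // List(col.dots)
```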

I tried to come up with ways to resolve this and came up with the two options below:

 - One is simply not to push down filters when there are dots in the column names, so that Spark reads everything and filters on the Spark side.

 - The other is to create Spark's own variants of the column constructors in Parquet's `FilterApi` (the originals appear to be final) so that we always use a single-segment column path on the Spark side (this seems hacky); this is safe because we are not pushing down nested columns currently. This way, we can build the field name via `ColumnPath.get` rather than `ColumnPath.fromDotString`.

This PR proposes the latter way; I think we need to be sure that it passes the tests.
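
A minimal sketch of the second approach, showing just the `Int` case; the shape here is inferred from the review thread, not necessarily the PR's verbatim code:

```scala
import org.apache.parquet.filter2.predicate.Operators.{Column, SupportsLtGt}
import org.apache.parquet.hadoop.metadata.ColumnPath

private[parquet] object ParquetColumns {
  // Build the column with a single-segment ColumnPath.get, so a name like
  // "col.dots" is kept intact instead of being split into ["col", "dots"].
  def intColumn(columnPath: String): Column[Integer] with SupportsLtGt =
    new Column[Integer](ColumnPath.get(columnPath), classOf[Integer]) with SupportsLtGt
}
```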

**After**

```scala
val path = "/tmp/abcde"
Seq(Some(1), None).toDF("col.dots").write.parquet(path)
spark.read.parquet(path).where("`col.dots` IS NOT NULL").show()
```

```
+--------+
|col.dots|
+--------+
|       1|
+--------+
```


## How was this patch tested?

Existing tests should cover this. Some tests were additionally added in `ParquetFilterSuite.scala`. I ran the related tests manually, and the Jenkins tests will cover this.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-20364

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17680.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17680


commit 297c70bd47cb859d343345849776bd7c21396078
Author: hyukjinkwon 
Date:   2017-04-19T06:24:15Z

Parquet predicate pushdown on columns with dots return empty results



