[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22357 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216776055 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- Instead of reading any arbitrary field of simple type (which may not exist if it's a deeply nested struct), I think we should implement the pushdown with complex type in parquet with similar logic, and let parquet reader handle it. @viirya Can you create a followup JIRA for this? Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216714387 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- Yeah, I'm okay with leaving it as-is. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216708046 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- Btw, I think this is not seen as schema pruning case in the sense of original PR, so maybe we can leave it as it for now. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216702003 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- Currently under this PR, `name` will be fully read. This is not perfect. However, to pick one arbitrary field from `name` sounds a little bit hacky to me. WDYT? cc @dbtsai @cloud-fan --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216694201 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +163,60 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query1 = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query1, "struct>") +checkAnswer(query1, Row("Jane") :: Nil) + +val query2 = sql("select name.first, name.last from contacts where name.first = 'Jane'") +checkScan(query2, "struct>") +checkAnswer(query2, Row("Jane", "Doe") :: Nil) + +val query3 = sql("select name.first from contacts " + + "where employer.company.name = 'abc' and p = 1") +checkScan(query3, "struct," + + "employer:struct>>") +checkAnswer(query3, Row("Jane") :: Nil) + +val query4 = sql("select name.first, employer.company.name from contacts " + + "where employer.company is not null and p = 1") +checkScan(query4, "struct," + + "employer:struct>>") +checkAnswer(query4, Row("Jane", "abc") :: Nil) + } + + testSchemaPruning("select nullable complex field and having is null predicate") { --- End diff -- Oops, yes, thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216686762 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +163,60 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query1 = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query1, "struct>") +checkAnswer(query1, Row("Jane") :: Nil) + +val query2 = sql("select name.first, name.last from contacts where name.first = 'Jane'") +checkScan(query2, "struct>") +checkAnswer(query2, Row("Jane", "Doe") :: Nil) + +val query3 = sql("select name.first from contacts " + + "where employer.company.name = 'abc' and p = 1") +checkScan(query3, "struct," + + "employer:struct>>") +checkAnswer(query3, Row("Jane") :: Nil) + +val query4 = sql("select name.first, employer.company.name from contacts " + + "where employer.company is not null and p = 1") +checkScan(query4, "struct," + + "employer:struct>>") +checkAnswer(query4, Row("Jane", "abc") :: Nil) + } + + testSchemaPruning("select nullable complex field and having is null predicate") { --- End diff -- Do you mean `having is not null predicate`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216683076 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- @viirya, I see your point about the difference between a complex type being null and a subfield being null. So to answer the following query select address from contacts where name is not null do we need to read any of the fields in `name`? Or perhaps just read one arbitrary field of simple type, like `name.first`? That's surprising, but I'm starting to believe it's true. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216601125 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- A complex column is null and its fields are null are different. I think we don't need to read all the fields to check if the complex column is null. In other words, in above case, when we only read `employer.id` and it is null, the predicate `employer is not null` will still be true because it is a complex column containing a null field. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216565635 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- > I checked in the ParquetFilter, IsNotNull(employer) will be ignored since it's not a valid parquet filter as parquet doesn't support pushdown on the struct; thus, with this PR, this query will return wrong answer. We may not worry about wrong answer from datasource like Parquet in predicate pushdown. As not all predicates are supported by pushdown, we always have a SparkSQL Filter on top of scan node to make sure to receive correct answer. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216559045 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- For the first query, the constrain is `employer is not null`. When `employer.id` is not `null`, `employer` will always not be `null`; as a result, this PR will work. However, when `employer.id` is `null`, `employer` can be `null` or `something`, so we need to check if `employer` is `something` to return a null of `employer.id`. I checked in the `ParquetFilter`, `IsNotNull(employer)` will be ignored since it's not a valid parquet filter as parquet doesn't support pushdown on the struct; thus, with this PR, this query will return wrong answer. I think in this scenario, as @mallman suggested, we might need to read the full data. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216545091 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,17 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +// Kind of expressions don't need to access any fields of a root fields, e.g., `IsNotNull`. +// For them, if there are any nested fields accessed in the query, we don't need to add root +// field access of above expressions. +// For example, for a query `SELECT name.first FROM contacts WHERE name IS NOT NULL`, +// we don't need to read nested fields of `name` struct other than `first` field. --- End diff -- I'm having trouble accepting this, but perhaps I'm reading too much into it (or not enough). Let me illustrate with a couple of queries and their physical plans. Assuming the data model in `ParquetSchemaPruningSuite.scala`, the physical plan for the query select employer.id from contacts where employer is not null is ``` == Physical Plan == *(1) Project [employer#36.id AS id#46] +- *(1) Filter isnotnull(employer#36) +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct> ``` The physical plan for the query select employer.id from contacts where employer.id is not null is ``` == Physical Plan == *(1) Project [employer#36.id AS id#47] +- *(1) Filter (isnotnull(employer#36) && isnotnull(employer#36.id)) +- *(1) FileScan parquet [employer#36,p#37] Batched: false, Format: Parquet, PartitionCount: 2, PartitionFilters: [], PushedFilters: [IsNotNull(employer)], ReadSchema: struct> ``` The read schemata are the same, but the query filters are not. The file scan for the second query looks as I would expect, but the scan for the first query appears to only read `employer.id` even though it needs to check `employer is not null`. If it only reads `employer.id`, how does it check that `employer.company` is not null? Perhaps `employer.id` is null but `employer.company` is not null for some row... I have run some tests to validate that this PR is returning the correct results for both queries, and it is. But I don't understand why. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216256434 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -199,6 +209,15 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { case att: Attribute => RootField(StructField(att.name, att.dataType, att.nullable), derivedFromAtt = true) :: Nil case SelectedField(field) => RootField(field, derivedFromAtt = false) :: Nil + // Root field accesses by `IsNotNull` and `IsNull` are special cases as the expressions + // don't actually use any nested fields. These root field accesses might be excluded later + // if there are any nested fields accesses in the query plan. + case IsNotNull(SelectedField(field)) => +RootField(field, derivedFromAtt = false, contentAccessed = false) :: Nil + case IsNull(SelectedField(field)) => +RootField(field, derivedFromAtt = false, contentAccessed = false) :: Nil --- End diff -- @dbtsai The question you mentioned at https://github.com/apache/spark/pull/22357/files#r216204022 was addressed by this. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216255549 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query1 = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query1, "struct>") +checkAnswer(query1, Row("Jane") :: Nil) + +val query2 = sql("select name.first, name.last from contacts where name.first = 'Jane'") +checkScan(query2, "struct>") +checkAnswer(query2, Row("Jane", "Doe") :: Nil) + +val query3 = sql("select name.first from contacts " + + "where employer.company.name = 'abc' and p = 1") --- End diff -- When there is a nested field access in the query like `employer.company.name`, then we don't need other fields inside `employ.company` other than `name`. But if there is no such access but just `employer.company is not null` in where clause, it will read full schema of `employ.company`. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216244620 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query1 = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query1, "struct>") +checkAnswer(query1, Row("Jane") :: Nil) + +val query2 = sql("select name.first, name.last from contacts where name.first = 'Jane'") +checkScan(query2, "struct>") +checkAnswer(query2, Row("Jane", "Doe") :: Nil) + +val query3 = sql("select name.first from contacts " + + "where employer.company.name = 'abc' and p = 1") --- End diff -- Added one query test for this case. Thanks. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216218951 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -156,7 +161,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { // in the resulting schema may differ from their ordering in the logical relation's // original schema val mergedSchema = requestedRootFields - .map { case RootField(field, _) => StructType(Array(field)) } + .map { case RootField(field, _, _) => StructType(Array(field)) } --- End diff -- Not a big deal but `.map { root: RootField => StructType(Array(root.field)) }` per https://github.com/databricks/scala-style-guide#pattern-matching --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216218409 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +201,9 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + // Those expressions don't really use the nested fields of a root field. + case i@(IsNotNull(_: Attribute) | IsNull(_: Attribute)) => --- End diff -- nit: -> `i @ (IsNotNull(_: ...` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216204022 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +161,47 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query1 = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query1, "struct>") +checkAnswer(query1, Row("Jane") :: Nil) + +val query2 = sql("select name.first, name.last from contacts where name.first = 'Jane'") +checkScan(query2, "struct>") +checkAnswer(query2, Row("Jane", "Doe") :: Nil) + +val query3 = sql("select name.first from contacts " + + "where employer.company.name = 'abc' and p = 1") --- End diff -- Let's say a user adds `where employer.company is not null`, can we still read schema with `employer:struct>>` as we only mark `contentAccessed = false` when `IsNotNull` is on an attribute? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216202879 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -110,7 +110,12 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { val projectionRootFields = projects.flatMap(getRootFields) val filterRootFields = filters.flatMap(getRootFields) -(projectionRootFields ++ filterRootFields).distinct +val (rootFields, optRootFields) = (projectionRootFields ++ filterRootFields) + .distinct.partition(_.contentAccessed) --- End diff -- Some comments here please. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216200358 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.parquet -import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeReference, Expression, NamedExpression} +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeReference, Expression, IsNotNull, IsNull, NamedExpression} --- End diff -- which can be wildcard when there are more than 6 entities --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216199370 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -17,7 +17,7 @@ package org.apache.spark.sql.execution.datasources.parquet -import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeReference, Expression, NamedExpression} +import org.apache.spark.sql.catalyst.expressions.{And, Attribute, AttributeReference, Expression, IsNotNull, IsNull, NamedExpression} --- End diff -- line too long. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216199294 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -250,8 +258,9 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { } /** - * A "root" schema field (aka top-level, no-parent) and whether it was derived from - * an attribute or had a proper child. + * A "root" schema field (aka top-level, no-parent), whether it was derived from + * an attribute or had a proper child, and whether it was accessed with its content. */ - private case class RootField(field: StructField, derivedFromAtt: Boolean) + private case class RootField(field: StructField, derivedFromAtt: Boolean, + contentAccessed: Boolean = true) --- End diff -- Formatting and please elaborate the comment --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r216198335 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +201,9 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + // Those expressions don't really use the nested fields of a root field. + case i@(IsNotNull(_: Attribute) | IsNull(_: Attribute)) => +getRootFields(i.children(0)).map(_.copy(contentAccessed = false)) case att: Attribute => RootField(StructField(att.name, att.dataType, att.nullable), derivedFromAtt = true) :: Nil case SelectedField(field) => RootField(field, derivedFromAtt = false) :: Nil --- End diff -- How about ```scala case IsNotNull(_: Attribute) | IsNull(_: Attribute) => expr.children.flatMap(getRootFields).map(_.copy(contentAccessed = false)) case _ => expr.children.flatMap(getRootFields) ``` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215899595 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +155,30 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query, "struct>") +checkAnswer(query, Row("Jane") :: Nil) --- End diff -- Yes. Added test case for it. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215899485 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty --- End diff -- Thanks. This was a case I didn't test. Fixed it and added test case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215877729 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty --- End diff -- I mean, for instance, this case `select address is not null, name.last from contacts` it wouldn't work. I thought this is a quick bandaid fix to resolve a basic case. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215866501 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty --- End diff -- But the case mentioned here looks specific to the pushed filter itself. Can we add a simple test for project case as well? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215864442 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty --- End diff -- If this is in projects, I think we also don't need to include all nested fields? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215857463 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala --- @@ -196,6 +196,7 @@ private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] { */ private def getRootFields(expr: Expression): Seq[RootField] = { expr match { + case IsNotNull(_: Attribute) | IsNull(_: Attribute) => Seq.empty --- End diff -- H .. shouldn't we exclude this for filters only in `identifyRootFields`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/22357#discussion_r215853912 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala --- @@ -155,6 +155,30 @@ class ParquetSchemaPruningSuite Row(null) :: Row(null) :: Nil) } + testSchemaPruning("select a single complex field and in where clause") { +val query = sql("select name.first from contacts where name.first = 'Jane'") +checkScan(query, "struct>") +checkAnswer(query, Row("Jane") :: Nil) --- End diff -- can you add another tests that select `name.first` and `name.last,` and apply `where clause` on `name.first`. We should only read `name.first` and `name.last` without name.middle. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22357: [SPARK-25363][SQL] Fix schema pruning in where cl...
GitHub user viirya opened a pull request: https://github.com/apache/spark/pull/22357 [SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields ## What changes were proposed in this pull request? Schema pruning doesn't work if nested column is used in where clause. For example, ``` sql("select name.first from contacts where name.first = 'David'") == Physical Plan == *(1) Project [name#19.first AS first#40] +- *(1) Filter (isnotnull(name#19) && (name#19.first = David)) +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct> ``` In above query plan, the scan node reads the entire schema of `name` column. This issue is reported by: https://github.com/apache/spark/pull/21320#issuecomment-419290197 The cause is that we infer a root field from expression `IsNotNull(name)`. However, for such expression, we don't really use the nested fields of this root field, so we can ignore the unnecessary nested fields. ## How was this patch tested? Unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/viirya/spark-1 SPARK-25363 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22357.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22357 commit 3de6ee2c28e1242fead892f922d0053e31509a9f Author: Liang-Chi Hsieh Date: 2018-09-07T05:06:07Z Fix schema pruning when used in where clause. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org