[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21320

@mallman Glad to see this got merged in. Thanks for all of your work pushing this through. I'm looking forward to the next phase. Please let me know if I can help again. I did notice that window functions don't get pushed down fully and briefly started looking into that, but I don't want to duplicate any work you might be planning.
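As a concrete illustration of the window-function case mentioned above, here is a minimal sketch (the dataset path and view name are hypothetical, and whether the scan is actually pruned here is exactly the open question the comment raises):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("window-pruning-check").getOrCreate()

// Hypothetical nested dataset; only `name.first` and `id` are actually needed.
spark.read.parquet("/tmp/contacts").createOrReplaceTempView("contacts")

// A window function referencing a nested field. Inspecting the physical plan's
// ReadSchema shows whether the scan was pruned to name.first or reads all of `name`.
spark.sql(
  """SELECT name.first,
    |       row_number() OVER (PARTITION BY name.first ORDER BY id) AS rn
    |FROM contacts""".stripMargin).explain()
```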
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques closed the pull request at: https://github.com/apache/spark/pull/21889
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

Thanks for the responses, all. @mallman If it's really your preference, I will create a PR against that branch and close this one. My intention was never to take away from your efforts, and I still consider my work here to be just minor stylistic tweaks on top of your work. I did this as a service to help bridge the divide and hopefully alleviate frustrations. That said, it has been a bit frustrating to be stuck between two sides of this discussion with the merge strategy changing often, and I don't wish to remain in the middle like this. As such, I will create a PR, but I hope it does not get dragged out to settle differences of opinion between maintainers and submitters. My goal is to make sure this valuable feature gets merged so many can benefit.
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques commented on a diff in the pull request: https://github.com/apache/spark/pull/21889#discussion_r210170646

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ---

```
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.scalactic.Equality
+
+import org.apache.spark.sql.{DataFrame, QueryTest, Row}
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.execution.FileSourceScanExec
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  private val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  private val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
```

--- End diff --

This is a definition in a constructor, so I don't think we can do:

```scala
case class Contact(
  [...]
  friends: Array.empty[FullName],
  relatives: Map.empty[String, FullName]
)
```

Scala wants a colon, so I opted for:

```scala
case class Contact(
  [...]
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty
)
```
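For readers less familiar with Scala syntax, a minimal self-contained sketch of the point being made (the names mirror the test fixtures; `DefaultArgsDemo` is illustrative only): the position after the colon must be a type, and the default expression goes after `=`.

```scala
// Minimal illustration of default arguments in a case class constructor.
case class FullName(first: String, middle: String, last: String)

case class Contact(
  id: Int,
  name: FullName,
  // `friends: Array.empty[FullName]` would not compile: after the colon Scala
  // expects a type. The default value follows `=` instead.
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty)

object DefaultArgsDemo extends App {
  // Omitted parameters fall back to their defaults.
  val c = Contact(0, FullName("Jane", "X.", "Doe"))
  println(c.friends.length)  // 0
  println(c.relatives.size)  // 0
}
```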
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21320 @mallman if you're planning on making more code changes, would you be willing to work on a shared branch or something? I've been working to incorporate the CR comments.
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques commented on a diff in the pull request: https://github.com/apache/spark/pull/21889#discussion_r209830673

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala ---

```
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.{ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds to a
+ * SQL column, and a nested Parquet column corresponds to a [[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+    if (SQLConf.get.nestedSchemaPruningEnabled) {
+      apply0(plan)
+    } else {
+      plan
+    }
+
+  private def apply0(plan: LogicalPlan): LogicalPlan =
+    plan transformDown {
+      case op @ PhysicalOperation(projects, filters,
+          l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, _,
+            dataSchema, _, _: ParquetFileFormat, _), _, _, _)) =>
+        val projectionRootFields = projects.flatMap(getRootFields)
+        val filterRootFields = filters.flatMap(getRootFields)
+        val requestedRootFields = (projectionRootFields ++ filterRootFields).distinct
+
+        // If requestedRootFields includes a nested field, continue. Otherwise,
+        // return op
+        if (requestedRootFields.exists { case RootField(_, derivedFromAtt) => !derivedFromAtt }) {
+          val prunedDataSchema = buildPrunedDataSchema(dataSchema, requestedRootFields)
+
+          // If the data schema is different from the pruned data schema, continue. Otherwise,
+          // return op. We effect this comparison by counting the number of "leaf" fields in
+          // each schemata, assuming the fields in [[prunedDataSchema]] are a subset of the fields
+          // in dataSchema.
+          if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
+            val prunedParquetRelation =
+              hadoopFsRelation.copy(dataSchema = prunedDataSchema)(hadoopFsRelation.sparkSession)
+
+            val prunedRelation = buildPrunedRelation(l, prunedParquetRelation)
+            val projectionOverSchema = ProjectionOverSchema(prunedDataSchema)
+
+            // Construct a new target for our projection by rewriting and
+            // including the original filters where available
+            val projectionChild =
+              if (filters.nonEmpty) {
+                val projectedFilters = filters.map(_.transformDown {
+                  case projectionOverSchema(expr) => expr
+                })
+                val newFilterCondition = projectedFilters.reduce(And)
+                Filter(newFilterCondition, prunedRelation)
+              } else {
+                prunedRelation
+              }
+
+            // Construct the new projections of our Project by
+            // rewriting the original projections
+            val newProjects = projects.map(_.transformDown {
+              case projectionOverSche
```
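The schema comparison quoted above hinges on counting "leaf" fields. A minimal sketch of what such a leaf count could look like (an illustration of the idea, not necessarily the exact implementation in the patch; `SchemaLeafCount` is a hypothetical wrapper):

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

object SchemaLeafCount {
  // Counts "leaf" (primitive) fields of a schema, descending into structs,
  // arrays and maps. If the pruned schema's fields are a subset of the data
  // schema's fields, an equal leaf count means nothing was actually pruned.
  def countLeaves(dataType: DataType): Int = dataType match {
    case s: StructType => s.fields.map(f => countLeaves(f.dataType)).sum
    case a: ArrayType  => countLeaves(a.elementType)
    case m: MapType    => countLeaves(m.keyType) + countLeaves(m.valueType)
    case _             => 1
  }
}
```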
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@gatorsmile Do you think there is a non-deterministic failure in this change that causes it to fail inconsistently?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 @mallman, while we wait for the go/no-go, do you have the changes for the next PR ready? Is there anything you need help with?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

> > but @gatorsmile wants to review it in a follow-on PR.
>
> Where did he say it after the comment above?

It was my interpretation of this comment: https://github.com/apache/spark/pull/21320#issuecomment-406353694

@gatorsmile, @HyukjinKwon Do we wish to block this PR until the issue that appears with the feature enabled is fixed? It's not clear what your expectations are for this PR:

1. Are you okay with it not working 100% as long as it's disabled by default?
2. Or do you want this issue to be fixed at the cost of bringing more changes into this PR?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@HyukjinKwon Looks like most of your comments have already been addressed, but I've gone ahead and made a few more tweaks to help this get merged. Please let me know if any blocking comments have been missed. As mentioned, this feature is not known to have any regressions in its default, disabled state.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 Is there anything I can do to help with this PR?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 Jenkins build successful. Any PR comments/blockers to merge for phase 1? cc @HyukjinKwon, @gatorsmile, @cloud-fan
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

Alright, to make sure we're all on the same page, it sounds like we're ready to merge this PR pending:

* a successful build by Jenkins
* any PR comments from a maintainer

This feature will be merged in a disabled state and can't be enabled until the next PR is merged, but we do not expect any regression in behavior in the default disabled state.
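For context on what "merged in a disabled state" means operationally, here is a minimal sketch of toggling the feature. The quoted rule checks `SQLConf.get.nestedSchemaPruningEnabled`; the string key shown below (`spark.sql.optimizer.nestedSchemaPruning.enabled`) and the dataset path are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-pruning-toggle").getOrCreate()

// Off by default in this PR, so existing workloads are unaffected.
// Opting in explicitly activates the ParquetSchemaPruning rule.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// With the flag on, a query touching only name.first should prune the scan,
// which is visible in the ReadSchema of the physical plan.
spark.read.parquet("/tmp/contacts")   // hypothetical path
  .selectExpr("name.first")
  .explain()
```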
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@mallman Is it related to [this revert in ParquetReadSupport](https://github.com/apache/spark/pull/21889/commits/0312a5188f0d6c9fc5304195dbdc703bf0aa3fb7#diff-245e70c1f41e353e34cf29bd00fd9029L86)? I re-added this logic and all 32 tests in ParquetSchemaPruningSuite passed.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@mallman

`select id, name.middle, address from temp` - **Works**

`select name.middle, address from temp` - **Fails**
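A minimal reproduction sketch of the passing/failing pair above (the path and view name are hypothetical; the data layout mirrors the contacts fixture from the test suite, and the config key is an assumption for the flag this PR adds):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pruning-repro").getOrCreate()

// Assumes the nested pruning flag from this PR is turned on.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Hypothetical partitioned Parquet data with a nested `name` struct column.
spark.read.parquet("/tmp/contacts").createOrReplaceTempView("temp")

// Works: the scan retains the root-level `id` attribute.
spark.sql("select id, name.middle, address from temp").show()

// Fails at read time with ParquetDecodingException when only the nested field
// and a top-level string column are selected (the issue under discussion).
spark.sql("select name.middle, address from temp").show()
```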
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

The tests as committed pass for me, but when I removed the `order by id` I got that error. Are you saying it works with the specific query in my comment?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@mallman: I've rebased on top of your changes and pushed. I'm seeing the following. Given the following schema:

```
root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- pets: integer (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- relatives: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- p: integer (nullable = true)
```

The query `select name.middle, address from temp` throws:

```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/private/var/folders/ss/cw601dzn59b2nygs8k1bs78x75lhr0/T/spark-cab140ca-cbba-4dc1-9fe5-6ae739dab70a/contacts/p=2/part-0-91d2abf5-625f-4080-b34c-e373b89c9895-c000.snappy.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
    at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
    ... 20 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
    at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
    at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:97)
    at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:92)
    at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:278)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
    at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
    at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
    at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
    ... 25 more
```

No root cause yet, but I noticed this while working with the unit tests.
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques commented on a diff in the pull request: https://github.com/apache/spark/pull/21889#discussion_r207718713

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala ---

```
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with FileSchemaPruningTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map(),
+    p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+    contacts.map { case Contact(id, name, address, pets, friends, relatives) =>
+      ContactWithDataPartitionColumn(id, name, address, pets, friends, relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+    briefContacts.map { case BriefContact(id, name, address) =>
+      BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+    val query = sql("select name.middle from contacts order by id")
+    checkScanSchemata(query, "struct<name:struct<middle:string>>")
+    checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") {
+    val query = sql("select name.middle, name from contacts order by id")
+    checkScanSchemata(query, "struct<name:struct<first:string,middle:string,last:string>>")
+    checkAnswer(query,
+      Row("X.", Row("Jane", "X.", "Doe")) ::
+      Row("Y.", Row("John", "Y.", "Doe")) ::
+      Row(null, Row("Janet", null, "Jones")) ::
+      Row(null, Row("Jim", null, "Jones")) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent struct array") {
+    val query = sql("select friends.middle, friends from contacts where p=1 order by id")
+    checkScanSchemata(query,
+      "struct<friends:array<struct<first:string,middle:string,last:string>>>")
+    checkAnswer(query,
+      Row(Array("Z."
```
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889

@mallman: [This one](https://github.com/apache/spark/pull/21889/files#diff-0c6c7481232e9637b91c179f1005426aR120)? I just enabled it on my branch and the test passed. Was it fixed by your latest changes, or am I missing something?

```
Expected:
struct<name:struct<middle:string>,address:string>

Actual:
fileSourceScanSchemata = {ArrayBuffer@12560} "ArrayBuffer" size = 1
 0 = {StructType@15492} "StructType" size = 3
  0 = {StructField@15494} "StructField(id,IntegerType,true)"
  1 = {StructField@15495} "StructField(name,StructType(StructField(middle,StringType,true)),true)"
  2 = {StructField@15496} "StructField(address,StringType,true)"
```
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 Are there any other blockers to enabling this by default now that @mallman fixed the currently known broken queries?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 Anybody else able to reproduce this failure? It succeeded on my developer machine.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 These test failures are in Spark streaming. Is this just an intermittent test failure or actually caused by this PR?
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 @mallman, sounds good. I'll get this PR updated with your latest changes as soon as I can.
[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21889 Where does that leave both of these PRs? Do we still want this one with the code refactoring or to go back to the original? Are there any comments for this PR that would block merging? I've set the default to false in this PR.
[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...
GitHub user ajacques opened a pull request: https://github.com/apache/spark/pull/21889

[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt)

(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

**This is a restart of apache/spark#21320. Most of the discussion has taken place over there. I've only taken it as a base to make stylistic changes from the code review and help move things along.** Due to the urgency of the upcoming 2.4 code freeze, I'm opening this PR to collect any feedback. It can be closed if you prefer to continue the work in the original PR.

### What changes were proposed in this pull request?

One of the hallmarks of a column-oriented data storage format is the ability to read data from a subset of columns, efficiently skipping reads from other columns. Spark has long had support for pruning unneeded top-level schema fields from the scan of a parquet file. For example, consider a table, contacts, backed by parquet with the following Spark SQL schema:

```
root
 |-- name: struct
 |    |-- first: string
 |    |-- last: string
 |-- address: string
```

Parquet stores this table's data in three physical columns: name.first, name.last and address. To answer the query `select address from contacts`, Spark will read only from the address column of parquet data. However, to answer the query `select name.first from contacts`, Spark will read name.first and name.last from parquet. This PR modifies Spark SQL to support a finer grain of schema pruning. With this patch, Spark reads only the name.first column to answer the previous query.

### Implementation

There are two main components of this patch. First, there is a ParquetSchemaPruning optimizer rule for gathering the required schema fields of a PhysicalOperation over a parquet file, constructing a new schema based on those required fields and rewriting the plan in terms of that pruned schema. The pruned schema fields are pushed down to the parquet requested read schema. ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a catalyst expression in terms of a pruned schema.

Second, the ParquetRowConverter has been patched to ensure the ordinals of the parquet columns read are correct for the pruned schema. ParquetReadSupport has been patched to address a compatibility mismatch between Spark's built-in vectorized reader and the parquet-mr library's reader.

### Limitation

Among the complex Spark SQL data types, this patch supports parquet column pruning of nested sequences of struct fields only.

### How was this patch tested?

Care has been taken to ensure correctness and prevent regressions. A more advanced version of this patch incorporating optimizations for rewriting queries involving aggregations and joins has been running on a production Spark cluster at VideoAmp for several years. In that time, one bug was found and fixed early on, and we added a regression test for that bug. We forward-ported this patch to Spark master in June 2016 and have been running this patch against Spark 2.x branches on ad-hoc clusters since then.

### Known Issues

Highlighted in https://github.com/apache/spark/pull/21320#issuecomment-408271470

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ajacques/apache-spark spark-4502-parquet_column_pruning-foundation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21889.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21889

commit 86c572a26918e6380762508d1d868fd5ea231ed1
Author: Michael Allman
Date: 2016-06-24T17:21:24Z

    [SPARK-4502][SQL] Parquet nested column pruning

commit 1ffbc6c8f8eaa27b50048e2124afdd2ff9e1e2b4
Author: Michael Allman
Date: 2018-06-04T09:59:34Z

    Refactor SelectedFieldSuite to make its tests simpler and more comprehensible

commit 97cd30b221fe78a4ea61c6c24be370fc2dfdd498
Author: Michael Allman
Date: 2018-06-04T10:02:28Z

    Remove test "select function over nested data" of unknown origin and purpose

commit c27e87924ed788a4253d077691401d0852306406
Author: Michael Allman
Date: 2018-06-04T10:45:28Z

    Improve readability of ParquetSchemaPruning and ParquetSchemaPruningSuite. Add test to exercise whether the requested root fields in a query exclude any attributes

commit 2fc1173499e1fc4ca4205db80e78ca6f110fb5dd
Author: Michael Allman
Date: 2018-06-04T12:04:23Z

    Don't handle non-data-field partition column names specially when constructing the new projections in ParquetSchemaPruning. These column projections are left unch
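To make the read-only-name.first behaviour concrete, here is a minimal sketch of how one might observe the pruned scan (the dataset path is hypothetical; the ReadSchema entry in the physical plan is where the pruning shows up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pruning-demo").getOrCreate()

// Hypothetical contacts dataset with the schema described above.
spark.read.parquet("/tmp/contacts").createOrReplaceTempView("contacts")

// With nested schema pruning enabled, the FileScan's ReadSchema should request
// only struct<name:struct<first:string>> rather than the full name struct.
spark.sql("select name.first from contacts").explain()

// Without this patch (or with the feature disabled), the scan requests the
// whole name struct: struct<name:struct<first:string,last:string>>.
```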
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21320

To confirm: do we want to start a secondary PR based on my stylistic/minor fixes? As I'm still getting up to speed on this code, I won't be able to make heavy changes. I'll have some time tomorrow to take a look at this particular bug, but I'd appreciate any investigation you can make. Are we planning on shipping this with the feature disabled by default?
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r205329769

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala ---

```
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that builds a [[org.apache.spark.sql.types.StructField]] from a Catalyst
+ * complex type extractor. For example, consider a relation with the following schema:
+ *
+ * {{{
+ *   root
+ *    |-- name: struct (nullable = true)
+ *    |    |-- first: string (nullable = true)
+ *    |    |-- last: string (nullable = true)
+ * }}}
+ *
+ * Further, suppose we take the select expression `name.first`. This will parse into an
+ * `Alias(child, "first")`. Ignoring the alias, `child` matches the following pattern:
+ *
+ * {{{
+ *   GetStructFieldObject(
+ *     AttributeReference("name", StructType(_), _, _),
+ *     StructField("first", StringType, _, _))
+ * }}}
+ *
+ * [[SelectedField]] converts that expression into
+ *
+ * {{{
+ *   StructField("name", StructType(Array(StructField("first", StringType))))
+ * }}}
+ *
+ * by mapping each complex type extractor to a [[org.apache.spark.sql.types.StructField]] with the
+ * same name as its child (or "parent" going right to left in the select expression) and a data
+ * type appropriate to the complex type extractor. In our example, the name of the child expression
+ * is "name" and its data type is a [[org.apache.spark.sql.types.StructType]] with a single string
+ * field named "first".
+ *
+ * @param expr the top-level complex type extractor
+ */
+object SelectedField {
+  def unapply(expr: Expression): Option[StructField] = {
```

--- End diff --

```
Error:(61, 12) constructor cannot be instantiated to expected type;
 found   : org.apache.spark.sql.catalyst.expressions.Alias
 required: org.apache.spark.sql.catalyst.expressions.ExtractValue
      case Alias(child, _) => child
```

Alias takes: `Alias(child: Expression, name: String)`
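A minimal sketch of why the parameter type matters for the compile error above (`StripAlias` is a hypothetical name used only for illustration): when the extractor's input is typed as `ExtractValue`, the `Alias` pattern can never match and scalac rejects it; widening the input to `Expression` and unwrapping the alias first compiles.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, ExtractValue}

// This variant reproduces the error, because Alias is not an ExtractValue:
//
//   def unapply(expr: ExtractValue): Option[Expression] = expr match {
//     case Alias(child, _) => Some(child)  // "constructor cannot be instantiated to expected type"
//     case other           => Some(other)
//   }
//
// Taking an Expression and stripping the Alias before any ExtractValue
// matching is the shape that compiles:
object StripAlias {
  def unapply(expr: Expression): Option[Expression] = expr match {
    case Alias(child, _) => Some(child)
    case other           => Some(other)
  }
}
```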
[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...
Github user ajacques commented on a diff in the pull request: https://github.com/apache/spark/pull/21320#discussion_r205329633

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/ProjectionOverSchema.scala ---

```
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that projects an expression over a given schema. Data types,
+ * field indexes and field counts of complex type extractors and attributes
+ * are adjusted to fit the schema. All other expressions are left as-is. This
+ * class is motivated by columnar nested schema pruning.
+ */
+case class ProjectionOverSchema(schema: StructType) {
```

--- End diff --

We can move this to `sql.execution` if we move all three classes: `ProjectionOverSchema`, `GetStructFieldObject`, and `SelectedField`. Is there a difference between the catalyst.planning and execution packages?
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21320

@HyukjinKwon, I'm not totally familiar with Spark internals yet, so to be honest I don't feel confident making big changes; hopefully I can keep them simple at first. I've gone through the code review comments and made as many changes as possible [here](https://github.com/apache/spark/compare/master...ajacques:spark-4502-parquet_column_pruning-foundation). If this PR is mostly feature complete and only small things remain, then I can push forward. If the feedback pushes past the simple-refactoring level right now, I would prefer to let someone else take over, but feel free to use what I've done.
[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...
Github user ajacques commented on the issue: https://github.com/apache/spark/pull/21320

Hey @mallman, I want to thank you for your work on this so far. I've been watching this pull request hoping it would get merged into 2.4 since it would be a benefit to me, but I can see how the process might be frustrating. Unfortunately, I've only been following the comments and not the code/architecture itself, so I can't take over effectively, but I did try to make the minor changes as requested, hopefully to help out. I've started in commit 7ee616076f93d6cfd55b6646314f3d4a6d1530d3. This may not be super helpful right now, but hopefully it is if these were the only blockers to getting this change into mainline in time for 2.4.