[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-24 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21320
  
@mallman Glad to see this got merged in. Thanks for all of your work 
pushing this through. I'm looking forward to the next phase. Please let me know if I 
can help again. I did notice that window functions don't get pushed down fully, 
and I briefly started looking into that, but I don't want to duplicate any work you 
might be planning.


---




[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-16 Thread ajacques
Github user ajacques closed the pull request at:

https://github.com/apache/spark/pull/21889


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-16 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Thanks for the response, all. @mallman If it's really your preference, I 
will create a PR against that branch and close this one. My intention was never 
to take away from your efforts, and I still consider my work here to be just 
minor stylistic tweaks on top of your work. I did this as a service to help 
bridge the divide and hopefully alleviate frustrations. That said, it has been a bit 
frustrating being stuck between two sides of this, with the merge strategy changing 
often, and I don't wish to remain in the middle like this.

As such, I will create the PR, but I hope it does not get dragged out to settle 
differences of opinion between maintainers and submitters. My goal is to make 
sure this valuable feature gets merged so many can benefit.


---




[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-14 Thread ajacques
Github user ajacques commented on a diff in the pull request:

https://github.com/apache/spark/pull/21889#discussion_r210170646
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.scalactic.Equality
+
+import org.apache.spark.sql.{DataFrame, QueryTest, Row}
+import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
+import org.apache.spark.sql.execution.FileSourceScanExec
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+import org.apache.spark.sql.types.StructType
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  private val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  private val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
--- End diff --

This is a parameter declaration in a constructor, so I don't think we can write:
```scala
case class Contact(
  [...]
  friends: Array.empty[FullName],
  relatives: Map.empty[String, FullName]
)
```
Scala requires a type annotation after the colon (a type, not a value), so I opted for:
```scala
case class Contact(
  [...]
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty
)
```
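For anyone following along, a minimal standalone sketch (not part of the PR) showing that the default-argument form compiles and behaves as expected:
```scala
// Standalone sketch: a constructor parameter needs a type annotation after the colon;
// the default value comes after `=`. `Array.empty` and `Map.empty` infer their type
// parameters from the annotated parameter type.
case class FullName(first: String, middle: String, last: String)

case class Contact(
  id: Int,
  friends: Array[FullName] = Array.empty,
  relatives: Map[String, FullName] = Map.empty)

object DefaultsDemo extends App {
  val c = Contact(id = 0)        // both defaults are used
  println(c.friends.length)      // 0
  println(c.relatives.isEmpty)   // true
}
```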


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-14 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21320
  
@mallman if you're planning on making more code changes, would you be 
willing to work on a shared branch or something? I've been working to 
incorporate the CR comments.


---




[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-13 Thread ajacques
Github user ajacques commented on a diff in the pull request:

https://github.com/apache/spark/pull/21889#discussion_r209830673
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruning.scala
 ---
@@ -0,0 +1,200 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import org.apache.spark.sql.catalyst.expressions.{And, Attribute, Expression, NamedExpression}
+import org.apache.spark.sql.catalyst.planning.PhysicalOperation
+import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.execution.{ProjectionOverSchema, SelectedField}
+import org.apache.spark.sql.execution.datasources.{HadoopFsRelation, LogicalRelation}
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructField, StructType}
+
+/**
+ * Prunes unnecessary Parquet columns given a [[PhysicalOperation]] over a
+ * [[ParquetRelation]]. By "Parquet column", we mean a column as defined in the
+ * Parquet format. In Spark SQL, a root-level Parquet column corresponds to a
+ * SQL column, and a nested Parquet column corresponds to a [[StructField]].
+ */
+private[sql] object ParquetSchemaPruning extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan =
+    if (SQLConf.get.nestedSchemaPruningEnabled) {
+      apply0(plan)
+    } else {
+      plan
+    }
+
+  private def apply0(plan: LogicalPlan): LogicalPlan =
+    plan transformDown {
+      case op @ PhysicalOperation(projects, filters,
+          l @ LogicalRelation(hadoopFsRelation @ HadoopFsRelation(_, _,
+            dataSchema, _, _: ParquetFileFormat, _), _, _, _)) =>
+        val projectionRootFields = projects.flatMap(getRootFields)
+        val filterRootFields = filters.flatMap(getRootFields)
+        val requestedRootFields = (projectionRootFields ++ filterRootFields).distinct
+
+        // If requestedRootFields includes a nested field, continue. Otherwise,
+        // return op
+        if (requestedRootFields.exists { case RootField(_, derivedFromAtt) => !derivedFromAtt }) {
+          val prunedDataSchema = buildPrunedDataSchema(dataSchema, requestedRootFields)
+
+          // If the data schema is different from the pruned data schema, continue. Otherwise,
+          // return op. We effect this comparison by counting the number of "leaf" fields in
+          // each schemata, assuming the fields in [[prunedDataSchema]] are a subset of the fields
+          // in dataSchema.
+          if (countLeaves(dataSchema) > countLeaves(prunedDataSchema)) {
+            val prunedParquetRelation =
+              hadoopFsRelation.copy(dataSchema = prunedDataSchema)(hadoopFsRelation.sparkSession)
+
+            val prunedRelation = buildPrunedRelation(l, prunedParquetRelation)
+            val projectionOverSchema = ProjectionOverSchema(prunedDataSchema)
+
+            // Construct a new target for our projection by rewriting and
+            // including the original filters where available
+            val projectionChild =
+              if (filters.nonEmpty) {
+                val projectedFilters = filters.map(_.transformDown {
+                  case projectionOverSchema(expr) => expr
+                })
+                val newFilterCondition = projectedFilters.reduce(And)
+                Filter(newFilterCondition, prunedRelation)
+              } else {
+                prunedRelation
+              }
+
+            // Construct the new projections of our Project by
+            // rewriting the original projections
+            val newProjects = projects.map(_.transformDown {
+              case projectionOverSche

[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-09 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@gatorsmile Do you think there is a non-deterministic failure in this change 
that causes it to fail inconsistently? 


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-08 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman, while we wait for the go/no-go, do you have the changes for the 
next PR ready? Is there anything you need help with?


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-07 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
>> but @gatorsmile wants to review it in a follow-on PR.

> Where did he say it after the comment above?

It was my interpretation of this comment: 
https://github.com/apache/spark/pull/21320#issuecomment-406353694

@gatorsmile, @HyukjinKwon Do we wish to block this PR until the issue that occurs with 
the feature enabled is fixed? It's not clear what your expectations are for this PR.
1. Are you okay with it not working 100% as long as it's disabled by default?
2. Do you want the issue fixed even at the cost of bringing more changes 
into this PR?


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-07 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@HyukjinKwon Looks like most of your comments have already been addressed, 
but I've gone ahead and made a few more tweaks to help this get merged. Please 
let me know if any blocking comments have been missed.

As mentioned: This feature is not known to have any regressions in the 
default, disabled state.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-07 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Is there anything I can do to help with this PR? 


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Jenkins build successful. Any PR comments/blockers to merge for phase 1?

cc @HyukjinKwon, @gatorsmile, @cloud-fan 


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Alright, to make sure we're all on the same page, it sounds like we're ready 
to merge this PR pending:
* A successful Jenkins build
* Any remaining PR comments from a maintainer

This feature will be merged in a disabled state and can't be enabled until 
the next PR is merged, but we do not expect any regression in behavior in the 
default disabled state.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-05 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman Is it related to [this revert in 
ParquetReadSupport](https://github.com/apache/spark/pull/21889/commits/0312a5188f0d6c9fc5304195dbdc703bf0aa3fb7#diff-245e70c1f41e353e34cf29bd00fd9029L86)?
 I re-added this logic and all 32 tests in ParquetSchemaPruningSuite passed.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman 

`select id, name.middle, address from temp` - **Works**
`select name.middle, address from temp` - **Fails**


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
The tests as committed pass for me, but when I removed the `order by id` I 
got that error. Are you saying it works for you with the specific query in my comment?


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-04 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman: I've rebased on top of your changes and pushed. I'm seeing the 
following:

Given the following schema:
```
root
 |-- id: integer (nullable = true)
 |-- name: struct (nullable = true)
 |    |-- first: string (nullable = true)
 |    |-- middle: string (nullable = true)
 |    |-- last: string (nullable = true)
 |-- address: string (nullable = true)
 |-- pets: integer (nullable = true)
 |-- friends: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- relatives: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- first: string (nullable = true)
 |    |    |-- middle: string (nullable = true)
 |    |    |-- last: string (nullable = true)
 |-- p: integer (nullable = true)
```

The query: `select name.middle, address from temp` throws:
```
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/private/var/folders/ss/cw601dzn59b2nygs8k1bs78x75lhr0/T/spark-cab140ca-cbba-4dc1-9fe5-6ae739dab70a/contacts/p=2/part-0-91d2abf5-625f-4080-b34c-e373b89c9895-c000.snappy.parquet
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
  at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
  at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
  ... 20 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:657)
  at java.util.ArrayList.get(ArrayList.java:433)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
  at org.apache.parquet.io.GroupColumnIO.getFirst(GroupColumnIO.java:99)
  at org.apache.parquet.io.PrimitiveColumnIO.getFirst(PrimitiveColumnIO.java:97)
  at org.apache.parquet.io.PrimitiveColumnIO.isFirst(PrimitiveColumnIO.java:92)
  at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:278)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
  at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
  at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
  at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
  at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
  ... 25 more
```

No root cause yet, but I noticed this while working with the unit tests.
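For reference, a self-contained reproduction sketch (simplified from the schema above; the config key name, paths and helper object are assumptions of mine, not taken from the PR):
```scala
// Reproduction sketch (not from the PR): write partitioned contacts data, then run the
// query that was reported to fail when `id` is not selected and no `order by id` is used.
import org.apache.spark.sql.SparkSession

object NestedPruningRepro extends App {
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("nested-pruning-repro")
    // Assumed name of the flag this PR introduces (it is off by default).
    .config("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
    .getOrCreate()
  import spark.implicits._

  case class FullName(first: String, middle: String, last: String)
  case class Contact(id: Int, name: FullName, address: String, p: Int)

  val path = "/tmp/nested-pruning-repro"
  Seq(
    Contact(0, FullName("Jane", "X.", "Doe"), "123 Main Street", 1),
    Contact(1, FullName("John", "Y.", "Doe"), "321 Wall Street", 2))
    .toDS()
    .write.mode("overwrite").partitionBy("p").parquet(path)

  spark.read.parquet(path).createOrReplaceTempView("temp")

  // Reported to throw ParquetDecodingException / IndexOutOfBoundsException:
  spark.sql("select name.middle, address from temp").show()
}
```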


---




[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-08-04 Thread ajacques
Github user ajacques commented on a diff in the pull request:

https://github.com/apache/spark/pull/21889#discussion_r207718713
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaPruningSuite.scala
 ---
@@ -0,0 +1,205 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.parquet
+
+import java.io.File
+
+import org.apache.spark.sql.{QueryTest, Row}
+import org.apache.spark.sql.execution.FileSchemaPruningTest
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.test.SharedSQLContext
+
+class ParquetSchemaPruningSuite
+    extends QueryTest
+    with ParquetTest
+    with FileSchemaPruningTest
+    with SharedSQLContext {
+  case class FullName(first: String, middle: String, last: String)
+  case class Contact(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map())
+
+  val janeDoe = FullName("Jane", "X.", "Doe")
+  val johnDoe = FullName("John", "Y.", "Doe")
+  val susanSmith = FullName("Susan", "Z.", "Smith")
+
+  val contacts =
+    Contact(0, janeDoe, "123 Main Street", 1, friends = Array(susanSmith),
+      relatives = Map("brother" -> johnDoe)) ::
+    Contact(1, johnDoe, "321 Wall Street", 3, relatives = Map("sister" -> janeDoe)) :: Nil
+
+  case class Name(first: String, last: String)
+  case class BriefContact(id: Int, name: Name, address: String)
+
+  val briefContacts =
+    BriefContact(2, Name("Janet", "Jones"), "567 Maple Drive") ::
+    BriefContact(3, Name("Jim", "Jones"), "6242 Ash Street") :: Nil
+
+  case class ContactWithDataPartitionColumn(
+    id: Int,
+    name: FullName,
+    address: String,
+    pets: Int,
+    friends: Array[FullName] = Array(),
+    relatives: Map[String, FullName] = Map(),
+    p: Int)
+
+  case class BriefContactWithDataPartitionColumn(id: Int, name: Name, address: String, p: Int)
+
+  val contactsWithDataPartitionColumn =
+    contacts.map { case Contact(id, name, address, pets, friends, relatives) =>
+      ContactWithDataPartitionColumn(id, name, address, pets, friends, relatives, 1) }
+  val briefContactsWithDataPartitionColumn =
+    briefContacts.map { case BriefContact(id, name, address) =>
+      BriefContactWithDataPartitionColumn(id, name, address, 2) }
+
+  testSchemaPruning("select a single complex field") {
+    val query = sql("select name.middle from contacts order by id")
+    checkScanSchemata(query, "struct<name:struct<middle:string>>")
+    checkAnswer(query, Row("X.") :: Row("Y.") :: Row(null) :: Row(null) :: Nil)
+  }
+
+  testSchemaPruning("select a single complex field and its parent struct") {
+    val query = sql("select name.middle, name from contacts order by id")
+    checkScanSchemata(query, "struct<name:struct<first:string,middle:string,last:string>>")
+    checkAnswer(query,
+      Row("X.", Row("Jane", "X.", "Doe")) ::
+      Row("Y.", Row("John", "Y.", "Doe")) ::
+      Row(null, Row("Janet", null, "Jones")) ::
+      Row(null, Row("Jim", null, "Jones")) ::
+      Nil)
+  }
+
+  testSchemaPruning("select a single complex field array and its parent struct array") {
+    val query = sql("select friends.middle, friends from contacts where p=1 order by id")
+    checkScanSchemata(query,
+      "struct<friends:array<struct<first:string,middle:string,last:string>>>")
+    checkAnswer(query,
+      Row(Array("Z.

[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-03 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman: [This 
one](https://github.com/apache/spark/pull/21889/files#diff-0c6c7481232e9637b91c179f1005426aR120)?
 I just enabled it on my branch and the test passed. Was it fixed by your 
latest changes or am I missing something?

```
Expected:
struct,address:string>

Actual:
fileSourceScanSchemata = {ArrayBuffer@12560} "ArrayBuffer" size = 1
 0 = {StructType@15492} "StructType" size = 3
  0 = {StructField@15494} "StructField(id,IntegerType,true)"
  1 = {StructField@15495} 
"StructField(name,StructType(StructField(middle,StringType,true)),true)"
  2 = {StructField@15496} "StructField(address,StringType,true)"
```


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-03 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Are there any other blockers to enabling this by default now that @mallman 
fixed the currently known broken queries?


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-02 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Anybody else able to reproduce this failure? It succeeded on my developer 
machine.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-08-02 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
These test failures are in Spark streaming. Is this just an intermittent 
test failure or actually caused by this PR?


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-07-31 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
@mallman, sounds good. I'll get this PR updated with your latest changes as 
soon as I can.


---




[GitHub] spark issue #21889: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-07-31 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21889
  
Where does that leave both of these PRs? Do we still want this one with the 
code refactoring, or should we go back to the original? Are there any comments on 
this PR that would block merging? I've set the default to false in this PR.


---




[GitHub] spark pull request #21889: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-26 Thread ajacques
GitHub user ajacques opened a pull request:

https://github.com/apache/spark/pull/21889

[SPARK-4502][SQL] Parquet nested column pruning - foundation (2nd attempt)

(Link to Jira: https://issues.apache.org/jira/browse/SPARK-4502)

**This is a restart of apache/spark#21320. Most of the discussion has taken 
place over there. I've only used it as a base for stylistic changes suggested in 
the code review, to help move things along.**

Due to the urgency of the upcoming 2.4 code freeze, I'm opening this 
PR to collect any feedback. It can be closed if you prefer to continue the 
work in the original PR.

### What changes were proposed in this pull request?
One of the hallmarks of a column-oriented data storage format is the 
ability to read data from a subset of columns, efficiently skipping reads from 
other columns. Spark has long had support for pruning unneeded top-level schema 
fields from the scan of a parquet file. For example, consider a table, 
contacts, backed by parquet with the following Spark SQL schema:

```
root
 |-- name: struct
 ||-- first: string
 ||-- last: string
 |-- address: string
```
Parquet stores this table's data in three physical columns: name.first, 
name.last and address. To answer the query

`select address from contacts`
Spark will read only from the address column of parquet data. However, to 
answer the query

`select name.first from contacts`
Spark will read name.first and name.last from parquet.

This PR modifies Spark SQL to support finer-grained schema pruning. With 
this patch, Spark reads only the name.first column to answer the previous query.
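A quick usage sketch (in a spark-shell; the flag name shown is assumed to be the one this PR introduces, and the feature ships disabled by default):

```scala
// Usage sketch: enable the (default-off) flag and inspect the pushed-down read schema.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

spark.read.parquet("/path/to/contacts").createOrReplaceTempView("contacts")  // hypothetical path

val query = spark.sql("select name.first from contacts")
// With pruning enabled, the physical FileScan should report only the selected leaf,
// e.g. ReadSchema: struct<name:struct<first:string>>
query.explain()
```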

### Implementation
There are two main components of this patch. First, there is a 
ParquetSchemaPruning optimizer rule for gathering the required schema fields of 
a PhysicalOperation over a parquet file, constructing a new schema based on 
those required fields and rewriting the plan in terms of that pruned schema. 
The pruned schema fields are pushed down to the parquet requested read schema. 
ParquetSchemaPruning uses a new ProjectionOverSchema extractor for rewriting a 
catalyst expression in terms of a pruned schema.

Second, the ParquetRowConverter has been patched to ensure that the ordinals of 
the parquet columns read are correct for the pruned schema. ParquetReadSupport 
has been patched to address a compatibility mismatch between Spark's built-in 
vectorized reader and the parquet-mr library's reader.
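To illustrate the pruning decision made by the rule: the plan is only rewritten when the pruned schema has fewer "leaf" (primitive) fields than the original data schema. A simplified sketch of that comparison (paste-able into a spark-shell; not the PR's verbatim code):

```scala
import org.apache.spark.sql.types.{ArrayType, DataType, MapType, StructType}

// Count primitive leaf fields, recursing through structs, arrays and maps.
def countLeaves(dataType: DataType): Int = dataType match {
  case array: ArrayType   => countLeaves(array.elementType)
  case map: MapType       => countLeaves(map.keyType) + countLeaves(map.valueType)
  case struct: StructType => struct.map(field => countLeaves(field.dataType)).sum
  case _                  => 1
}

// Pruning is worthwhile only if countLeaves(dataSchema) > countLeaves(prunedDataSchema).
```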

### Limitation
Among the complex Spark SQL data types, this patch supports parquet column 
pruning of nested sequences of struct fields only.

### How was this patch tested?
Care has been taken to ensure correctness and prevent regressions. A more 
advanced version of this patch incorporating optimizations for rewriting 
queries involving aggregations and joins has been running on a production Spark 
cluster at VideoAmp for several years. In that time, one bug was found and 
fixed early on, and we added a regression test for that bug.

We forward-ported this patch to Spark master in June 2016 and have been 
running this patch against Spark 2.x branches on ad-hoc clusters since then.

### Known Issues
Highlighted in 
https://github.com/apache/spark/pull/21320#issuecomment-408271470

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ajacques/apache-spark 
spark-4502-parquet_column_pruning-foundation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21889.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21889


commit 86c572a26918e6380762508d1d868fd5ea231ed1
Author: Michael Allman 
Date:   2016-06-24T17:21:24Z

[SPARK-4502][SQL] Parquet nested column pruning

commit 1ffbc6c8f8eaa27b50048e2124afdd2ff9e1e2b4
Author: Michael Allman 
Date:   2018-06-04T09:59:34Z

Refactor SelectedFieldSuite to make its tests simpler and more
comprehensible

commit 97cd30b221fe78a4ea61c6c24be370fc2dfdd498
Author: Michael Allman 
Date:   2018-06-04T10:02:28Z

Remove test "select function over nested data" of unknown origin and
purpose

commit c27e87924ed788a4253d077691401d0852306406
Author: Michael Allman 
Date:   2018-06-04T10:45:28Z

Improve readability of ParquetSchemaPruning and
ParquetSchemaPruningSuite. Add test to exercise whether the
requested root fields in a query exclude any attributes

commit 2fc1173499e1fc4ca4205db80e78ca6f110fb5dd
Author: Michael Allman 
Date:   2018-06-04T12:04:23Z

Don't handle non-data-field partition column names specially when
constructing the new projections in ParquetSchemaPruning. These
column projections are left unch

[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-07-26 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21320
  
To confirm: do we want to start a secondary PR based on my stylistic/minor 
fixes? As I'm still getting up to speed on this code, I won't be able to make heavy 
changes. I'll have some time tomorrow to take a look at this particular bug, 
but I'd appreciate any investigation you can do.

Are we planning on shipping this with the feature disabled by default?


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-25 Thread ajacques
Github user ajacques commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r205329769
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala
 ---
@@ -0,0 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that builds a [[org.apache.spark.sql.types.StructField]] from a Catalyst
+ * complex type extractor. For example, consider a relation with the following schema:
+ *
+ *   {{{
+ *   root
+ *   |-- name: struct (nullable = true)
+ *   |    |-- first: string (nullable = true)
+ *   |    |-- last: string (nullable = true)
+ *   }}}
+ *
+ * Further, suppose we take the select expression `name.first`. This will parse into an
+ * `Alias(child, "first")`. Ignoring the alias, `child` matches the following pattern:
+ *
+ *   {{{
+ *   GetStructFieldObject(
+ *     AttributeReference("name", StructType(_), _, _),
+ *     StructField("first", StringType, _, _))
+ *   }}}
+ *
+ * [[SelectedField]] converts that expression into
+ *
+ *   {{{
+ *   StructField("name", StructType(Array(StructField("first", StringType
+ *   }}}
+ *
+ * by mapping each complex type extractor to a [[org.apache.spark.sql.types.StructField]] with the
+ * same name as its child (or "parent" going right to left in the select expression) and a data
+ * type appropriate to the complex type extractor. In our example, the name of the child expression
+ * is "name" and its data type is a [[org.apache.spark.sql.types.StructType]] with a single string
+ * field named "first".
+ *
+ * @param expr the top-level complex type extractor
+ */
+object SelectedField {
+  def unapply(expr: Expression): Option[StructField] = {
--- End diff --

```
Error:(61, 12) constructor cannot be instantiated to expected type;
 found   : org.apache.spark.sql.catalyst.expressions.Alias
 required: org.apache.spark.sql.catalyst.expressions.ExtractValue
  case Alias(child, _) => child
```

Alias takes: `Alias(child: Expression, name: String)`


---




[GitHub] spark pull request #21320: [SPARK-4502][SQL] Parquet nested column pruning -...

2018-07-25 Thread ajacques
Github user ajacques commented on a diff in the pull request:

https://github.com/apache/spark/pull/21320#discussion_r205329633
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/ProjectionOverSchema.scala
 ---
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that projects an expression over a given schema. Data 
types,
+ * field indexes and field counts of complex type extractors and attributes
+ * are adjusted to fit the schema. All other expressions are left as-is. 
This
+ * class is motivated by columnar nested schema pruning.
+ */
+case class ProjectionOverSchema(schema: StructType) {
--- End diff --

We can move this to `sql.execution` if we move all three classes: 
`ProjectionOverSchema`, `GetStructFieldObject`, and `SelectedField`. Is there a 
meaningful difference between the `catalyst.planning` and `execution` packages?


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-07-25 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21320
  
@HyukjinKwon, I'm not totally familiar with Spark internals yet, so to be 
honest I don't feel confident making big changes; hopefully I can keep them 
simple at first.

I've gone through the code review comments and made as many changes as 
possible 
[here](https://github.com/apache/spark/compare/master...ajacques:spark-4502-parquet_column_pruning-foundation).
 If this PR is mostly feature-complete and only small things remain, I can 
push forward.

If the feedback pushes past the simple-refactoring level right now, I would 
prefer to let someone else take over, but feel free to use what I've done.


---




[GitHub] spark issue #21320: [SPARK-4502][SQL] Parquet nested column pruning - founda...

2018-07-25 Thread ajacques
Github user ajacques commented on the issue:

https://github.com/apache/spark/pull/21320
  
Hey @mallman, I want to thank you for your work on this so far. I've been 
watching this pull request hoping it would get merged into 2.4, since it would 
be a benefit to me, but I can see how the process might be frustrating.

Unfortunately, I've only been following the comments and not the 
code/architecture itself, so I can't take over effectively, but I did try to 
address the minor comments as requested, hopefully to help out. I've started in 
7ee616076f93d6cfd55b6646314f3d4a6d1530d3. This may not be super helpful right 
now, but hopefully it is if these were the only blockers to getting this change 
into mainline in time for 2.4.


---
