[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...

2018-02-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20476


---




[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...

2018-02-01 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/20476#discussion_r165437683
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
 
     // TODO: add more push down rules.
 
-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
-      plan match {
-        case Project(projectList, child) =>
-          val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
-          pushDownRequiredColumns(child, required)
-
-        case Filter(condition, child) =>
-          val required = requiredByParent ++ condition.references
-          pushDownRequiredColumns(child, required)
-
-        case DataSourceV2Relation(fullOutput, reader) => reader match {
-          case r: SupportsPushDownRequiredColumns =>
-            // Match original case of attributes.
-            val attrMap = AttributeMap(fullOutput.zip(fullOutput))
-            val requiredColumns = requiredByParent.map(attrMap)
-            r.pruneColumns(requiredColumns.toStructType)
-          case _ =>
-        }
+    pushDownRequiredColumns(filterPushed, filterPushed.outputSet)
+    // After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
+    RemoveRedundantProject(filterPushed)
+  }
+
+  // TODO: nested fields pruning
+  private def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: AttributeSet): Unit = {
+    plan match {
+      case Project(projectList, child) =>
+        val required = projectList.flatMap(_.references)
+        pushDownRequiredColumns(child, AttributeSet(required))
+
+      case Filter(condition, child) =>
+        val required = requiredByParent ++ condition.references
+        pushDownRequiredColumns(child, required)
 
-        // TODO: there may be more operators can be used to calculate required columns, we can add
-        // more and more in the future.
-        case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
+      case relation: DataSourceV2Relation => relation.reader match {
+        case reader: SupportsPushDownRequiredColumns =>
+          val requiredColumns = relation.output.filter(requiredByParent.contains)
+          reader.pruneColumns(requiredColumns.toStructType)
+
+        case _ =>
       }
-    }
 
-    pushDownRequiredColumns(filterPushed, filterPushed.output)
-    // After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
-    RemoveRedundantProject(filterPushed)
+      // TODO: there may be more operators can be used to calculate required columns, we can add
+      // more and more in the future.
--- End diff --

Nit. `there may be more operators that can be used to calculate the required columns. We can add more and more in the future.`


---




[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...

2018-02-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20476#discussion_r165375489
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
 
     // TODO: add more push down rules.
 
-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
-      plan match {
-        case Project(projectList, child) =>
-          val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
-          pushDownRequiredColumns(child, required)
-
-        case Filter(condition, child) =>
-          val required = requiredByParent ++ condition.references
-          pushDownRequiredColumns(child, required)
-
-        case DataSourceV2Relation(fullOutput, reader) => reader match {
-          case r: SupportsPushDownRequiredColumns =>
-            // Match original case of attributes.
-            val attrMap = AttributeMap(fullOutput.zip(fullOutput))
-            val requiredColumns = requiredByParent.map(attrMap)
-            r.pruneColumns(requiredColumns.toStructType)
-          case _ =>
-        }
+    pushDownRequiredColumns(filterPushed, filterPushed.outputSet)
+    // After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
+    RemoveRedundantProject(filterPushed)
+  }
+
+  // TODO: nested fields pruning
+  private def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: AttributeSet): Unit = {
+    plan match {
+      case Project(projectList, child) =>
+        val required = projectList.flatMap(_.references)
+        pushDownRequiredColumns(child, AttributeSet(required))
+
+      case Filter(condition, child) =>
+        val required = requiredByParent ++ condition.references
+        pushDownRequiredColumns(child, required)
 
-        // TODO: there may be more operators can be used to calculate required columns, we can add
-        // more and more in the future.
-        case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
+      case relation: DataSourceV2Relation => relation.reader match {
+        case reader: SupportsPushDownRequiredColumns =>
+          val requiredColumns = relation.output.filter(requiredByParent.contains)
--- End diff --

This is a cleaner way to retain the original case of the attributes.
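
For context, here is a minimal sketch (illustrative names, not from the PR) of why filtering the relation's own output keeps the original case: Catalyst attributes compare by exprId, so `AttributeSet` membership ignores how the name happens to be spelled upstream.

    import org.apache.spark.sql.catalyst.expressions.{AttributeReference, AttributeSet}
    import org.apache.spark.sql.types.IntegerType

    // The source spells the column `id`; under case-insensitive analysis an
    // upstream operator may carry a copy spelled `ID` with the same exprId.
    val id = AttributeReference("id", IntegerType)()
    val requiredByParent = AttributeSet(id.withName("ID") :: Nil)

    // Membership is keyed by exprId, so the filter keeps the relation's own
    // instance -- original spelling and column order -- with no AttributeMap.
    val requiredColumns = Seq(id).filter(requiredByParent.contains)
    assert(requiredColumns.head.name == "id")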


---




[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...

2018-02-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20476#discussion_r165375177
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
 
     // TODO: add more push down rules.
 
-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
-      plan match {
-        case Project(projectList, child) =>
-          val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
--- End diff --

This line is wrong; I fixed it in 
https://github.com/apache/spark/pull/20476/files#diff-b7f3810e65a2bb1585de9609ea491469R93
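
To see the bug concretely, here is a minimal sketch using Catalyst's expression DSL (column names are made up): when the project list contains a non-attribute expression such as `a + b AS c`, the parent requires the alias's output attribute `c`, but the `Alias` expression itself never equals that attribute, so the old filter dropped it and the referenced columns were lost.

    import org.apache.spark.sql.catalyst.dsl.expressions._
    import org.apache.spark.sql.catalyst.expressions.{Alias, AttributeReference}
    import org.apache.spark.sql.types.IntegerType

    val a = AttributeReference("a", IntegerType)()
    val b = AttributeReference("b", IntegerType)()
    val c = Alias(a + b, "c")()

    val projectList = Seq(c)
    val requiredByParent = Seq(c.toAttribute) // the parent needs `c`

    // Old logic: the Alias never equals the attribute it produces, so it is
    // filtered out and the references `a` and `b` are never pushed down.
    val buggy = projectList.filter(requiredByParent.contains).flatMap(_.references)
    assert(buggy.isEmpty)

    // Fixed logic: everything a Project references is required from its child.
    val fixed = projectList.flatMap(_.references)
    assert(fixed.toSet == Set(a, b))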


---




[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...

2018-02-01 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20476#discussion_r165374987
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
 
     // TODO: add more push down rules.
 
-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
--- End diff --

Make it a private method instead of an inline (nested) method.


---




[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...

2018-02-01 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20476

[SPARK-23301][SQL] data source column pruning should work for arbitrary 
expressions

## What changes were proposed in this pull request?

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: the 
column pruning logic handles `Project` incorrectly.

## How was this patch tested?

A new test case for column pruning with arbitrary expressions, plus 
improvements to the existing tests to make sure `PushDownOperatorsToDataSource` 
really works.
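
As a rough illustration (not the PR's actual test; the source class and column names are hypothetical), the shape of query the new test exercises is a projection of an arbitrary expression on top of a DataSourceV2 scan:

    // Assuming a DataSourceV2 test source exposing integer columns `i` and `j`
    // whose reader implements SupportsPushDownRequiredColumns.
    val df = spark.read.format("my.example.ColumnPruningSource").load()

    // Before this fix the Alias `i + j AS res` matched no required attribute,
    // so no columns were pruned; now the scan should only read `i` and `j`.
    val q = df.select((df("i") + df("j")).as("res"))
    q.explain(true)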

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark push-down

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20476.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20476


commit 353dd6bc60ce7123c392d7b51a496d45b1d7ab5c
Author: Wenchen Fan 
Date:   2018-02-01T12:02:23Z

data source column pruning should work for arbitrary expressions




---
