[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20476
[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...
Github user gatorsmile commented on a diff in the pull request: https://github.com/apache/spark/pull/20476#discussion_r165437683

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
     // TODO: add more push down rules.

-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
-      plan match {
-        case Project(projectList, child) =>
-          val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
-          pushDownRequiredColumns(child, required)
-
-        case Filter(condition, child) =>
-          val required = requiredByParent ++ condition.references
-          pushDownRequiredColumns(child, required)
-
-        case DataSourceV2Relation(fullOutput, reader) => reader match {
-          case r: SupportsPushDownRequiredColumns =>
-            // Match original case of attributes.
-            val attrMap = AttributeMap(fullOutput.zip(fullOutput))
-            val requiredColumns = requiredByParent.map(attrMap)
-            r.pruneColumns(requiredColumns.toStructType)
-          case _ =>
-        }
+    pushDownRequiredColumns(filterPushed, filterPushed.outputSet)
+    // After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
+    RemoveRedundantProject(filterPushed)
+  }
+
+  // TODO: nested fields pruning
+  private def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: AttributeSet): Unit = {
+    plan match {
+      case Project(projectList, child) =>
+        val required = projectList.flatMap(_.references)
+        pushDownRequiredColumns(child, AttributeSet(required))
+
+      case Filter(condition, child) =>
+        val required = requiredByParent ++ condition.references
+        pushDownRequiredColumns(child, required)

-        // TODO: there may be more operators can be used to calculate required columns, we can add
-        // more and more in the future.
-        case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
+      case relation: DataSourceV2Relation => relation.reader match {
+        case reader: SupportsPushDownRequiredColumns =>
+          val requiredColumns = relation.output.filter(requiredByParent.contains)
+          reader.pruneColumns(requiredColumns.toStructType)
+
+        case _ =>
       }
-    }
-    pushDownRequiredColumns(filterPushed, filterPushed.output)
-    // After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
-    RemoveRedundantProject(filterPushed)
+      // TODO: there may be more operators can be used to calculate required columns, we can add
+      // more and more in the future.
--- End diff --

Nit. `there may be more operators that can be used to calculate the required columns. We can add more and more in the future.`
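[Editor's note] For readers following the new traversal in the diff above, here is a trimmed sketch of its logic with simplified plan types standing in for Catalyst's `LogicalPlan` nodes and `AttributeSet` (these toy classes are illustrative, not Spark's):

    sealed trait Plan
    case class Project(projectRefs: Set[String], child: Plan) extends Plan
    case class Filter(conditionRefs: Set[String], child: Plan) extends Plan
    case class Relation(output: Seq[String]) extends Plan

    // Mirrors pushDownRequiredColumns: a Project resets the required set to its
    // own references, a Filter adds its condition's references, and the relation
    // prunes to the required subset in its original column order.
    def pushDown(plan: Plan, requiredByParent: Set[String]): Seq[String] = plan match {
      case Project(refs, child) => pushDown(child, refs)
      case Filter(refs, child)  => pushDown(child, requiredByParent ++ refs)
      case Relation(output)     => output.filter(requiredByParent)
    }

    // SELECT i + 1 FROM t WHERE j > 0: the source must still produce i and j.
    val plan = Project(Set("i"), Filter(Set("j"), Relation(Seq("i", "j", "k"))))
    assert(pushDown(plan, Set("i")) == Seq("i", "j"))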
[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20476#discussion_r165375489

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
     // TODO: add more push down rules.

-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
-      plan match {
-        case Project(projectList, child) =>
-          val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
-          pushDownRequiredColumns(child, required)
-
-        case Filter(condition, child) =>
-          val required = requiredByParent ++ condition.references
-          pushDownRequiredColumns(child, required)
-
-        case DataSourceV2Relation(fullOutput, reader) => reader match {
-          case r: SupportsPushDownRequiredColumns =>
-            // Match original case of attributes.
-            val attrMap = AttributeMap(fullOutput.zip(fullOutput))
-            val requiredColumns = requiredByParent.map(attrMap)
-            r.pruneColumns(requiredColumns.toStructType)
-          case _ =>
-        }
+    pushDownRequiredColumns(filterPushed, filterPushed.outputSet)
+    // After column pruning, we may have redundant PROJECT nodes in the query plan, remove them.
+    RemoveRedundantProject(filterPushed)
+  }
+
+  // TODO: nested fields pruning
+  private def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: AttributeSet): Unit = {
+    plan match {
+      case Project(projectList, child) =>
+        val required = projectList.flatMap(_.references)
+        pushDownRequiredColumns(child, AttributeSet(required))
+
+      case Filter(condition, child) =>
+        val required = requiredByParent ++ condition.references
+        pushDownRequiredColumns(child, required)

-        // TODO: there may be more operators can be used to calculate required columns, we can add
-        // more and more in the future.
-        case _ => plan.children.foreach(child => pushDownRequiredColumns(child, child.output))
+      case relation: DataSourceV2Relation => relation.reader match {
+        case reader: SupportsPushDownRequiredColumns =>
+          val requiredColumns = relation.output.filter(requiredByParent.contains)
--- End diff --

A cleaner way to retain the original case of attributes.
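[Editor's note] To illustrate the case point: a toy sketch where attributes compare by id, the way Catalyst attributes compare by `exprId`. The `Attr` class is a stand-in, not a Catalyst type:

    // Equality by id only, mirroring exprId-based attribute equality.
    case class Attr(id: Long, name: String) {
      override def equals(other: Any): Boolean = other match {
        case Attr(otherId, _) => id == otherId
        case _ => false
      }
      override def hashCode: Int = id.hashCode
    }

    val relationOutput = Seq(Attr(1, "userId"), Attr(2, "userName"))

    // Case-insensitive resolution may hand the rule a re-cased spelling of
    // the same attribute (same id, different name).
    val requiredByParent = Set(Attr(1, "USERID"))

    // Filtering the relation's own output keeps the original spelling, with
    // no need for the old AttributeMap(fullOutput.zip(fullOutput)) round-trip.
    val requiredColumns = relationOutput.filter(requiredByParent.contains)
    assert(requiredColumns.map(_.name) == Seq("userId"))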
[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20476#discussion_r165375177

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
     // TODO: add more push down rules.

-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
-      plan match {
-        case Project(projectList, child) =>
-          val required = projectList.filter(requiredByParent.contains).flatMap(_.references)
--- End diff --

This line is wrong; I fixed it at https://github.com/apache/spark/pull/20476/files#diff-b7f3810e65a2bb1585de9609ea491469R93
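[Editor's note] Why that line is wrong, in a simplified standalone sketch. These are plain Scala stand-ins, not the real Catalyst `Attribute`/`Alias` classes:

    case class Attribute(name: String)
    case class Alias(childRefs: Seq[Attribute], name: String) {
      def references: Seq[Attribute] = childRefs
    }

    val i = Attribute("i")
    val projectList = Seq(Alias(Seq(i), "k"))  // SELECT i + 1 AS k
    val requiredByParent = Seq(Attribute("k")) // the parent asks for `k`

    // Old logic: keep only project-list entries that appear in requiredByParent.
    // An Alias is never equal to the Attribute it produces, so the entry for
    // `k` is dropped and the data source is asked for zero columns.
    val oldRequired =
      projectList.filter(e => requiredByParent.contains(e)).flatMap(_.references)
    assert(oldRequired.isEmpty) // bug: `i` is lost

    // New logic: a Project fully determines what it needs, so simply collect
    // the references of the whole project list.
    val newRequired = projectList.flatMap(_.references)
    assert(newRequired == Seq(i)) // correct: `i` survives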
[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20476#discussion_r165374987

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/PushDownOperatorsToDataSource.scala ---
@@ -81,35 +81,34 @@ object PushDownOperatorsToDataSource extends Rule[LogicalPlan] with PredicateHel
     // TODO: add more push down rules.

-    // TODO: nested fields pruning
-    def pushDownRequiredColumns(plan: LogicalPlan, requiredByParent: Seq[Attribute]): Unit = {
--- End diff --

Make it a private method instead of an inline method.
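[Editor's note] The refactoring in question, in miniature (hypothetical names; the real change is inside `PushDownOperatorsToDataSource`):

    object Before {
      def apply(x: Int): Int = {
        // Inline (nested) method: scoped to apply, which grows long.
        def helper(y: Int): Int = y + 1
        helper(x)
      }
    }

    object After {
      def apply(x: Int): Int = helper(x)

      // Private method: same visibility to outside callers, but apply stays
      // short and the helper reads as a standalone unit.
      private def helper(y: Int): Int = y + 1
    }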
[GitHub] spark pull request #20476: [SPARK-23301][SQL] data source column pruning sho...
GitHub user cloud-fan opened a pull request: https://github.com/apache/spark/pull/20476

[SPARK-23301][SQL] data source column pruning should work for arbitrary expressions

## What changes were proposed in this pull request?

This PR fixes a mistake in the `PushDownOperatorsToDataSource` rule: its column pruning logic handled `Project` incorrectly.

## How was this patch tested?

Added a new test case for column pruning with arbitrary expressions, and improved the existing tests to make sure `PushDownOperatorsToDataSource` really works.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark push-down

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20476.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20476

commit 353dd6bc60ce7123c392d7b51a496d45b1d7ab5c
Author: Wenchen Fan
Date: 2018-02-01T12:02:23Z

    data source column pruning should work for arbitrary expressions
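[Editor's note] For context, a minimal sketch of the kind of query this PR fixes. The source format name and schema here are illustrative, not taken from the PR:

    import org.apache.spark.sql.functions.col

    // A DataSource V2 table with columns i and j (format name is hypothetical).
    val df = spark.read.format("my.v2.source").load()

    // Selecting a bare column: pruning already worked before this PR.
    df.select(col("i")).explain()

    // Selecting an arbitrary expression over a column: with this fix, the
    // reader's pruneColumns is asked only for `i` here as well.
    df.select((col("i") + 1).as("k")).explain()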