[GitHub] [spark] viirya commented on a change in pull request #28556: [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join

2020-06-12 Thread GitBox


viirya commented on a change in pull request #28556:
URL: https://github.com/apache/spark/pull/28556#discussion_r439521936



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##
@@ -34,7 +34,8 @@ object NestedColumnAliasing {
 : Option[(Map[ExtractValue, Alias], Map[ExprId, Seq[Alias]])] = plan match 
{
 case Project(projectList, child)
 if SQLConf.get.nestedSchemaPruningEnabled && 
canProjectPushThrough(child) =>
-  getAliasSubMap(projectList)
+  val exprCandidatesToPrune = projectList ++ child.expressions
+  getAliasSubMap(exprCandidatesToPrune, child.producedAttributes.toSeq)
 
 case plan if SQLConf.get.nestedSchemaPruningEnabled && canPruneOn(plan) =>

Review comment:
   I will change it in other PR. Thanks.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #28556: [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join

2020-06-10 Thread GitBox


viirya commented on a change in pull request #28556:
URL: https://github.com/apache/spark/pull/28556#discussion_r438425121



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##
@@ -215,21 +221,14 @@ object GeneratorNestedColumnAliasing {
   NestedColumnAliasing.getAliasSubMap(
 g.generator.children, g.requiredChildOutput).map {
 case (nestedFieldToAlias, attrToAliases) =>
-  pruneGenerate(g, nestedFieldToAlias, attrToAliases)
+  // Defer updating `Generate.unrequiredChildIndex` to next round of 
`ColumnPruning`.
+  NestedColumnAliasing.replaceWithAliases(g, nestedFieldToAlias, 
attrToAliases)
   }
 
 case _ =>
   None
   }
 
-  private def pruneGenerate(
-  g: Generate,
-  nestedFieldToAlias: Map[ExtractValue, Alias],
-  attrToAliases: Map[ExprId, Seq[Alias]]): LogicalPlan = {
-// Defer updating `Generate.unrequiredChildIndex` to next round of 
`ColumnPruning`.
-NestedColumnAliasing.replaceChildrenWithAliases(g, nestedFieldToAlias, 
attrToAliases)
-  }
-

Review comment:
   Addressed the comment 
https://github.com/apache/spark/pull/28560/files#r437968812. @HyukjinKwon 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #28556: [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join

2020-05-21 Thread GitBox


viirya commented on a change in pull request #28556:
URL: https://github.com/apache/spark/pull/28556#discussion_r429048929



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##
@@ -338,6 +349,93 @@ abstract class SchemaPruningSuite
 }
   }
 
+  testSchemaPruning("select one deep nested complex field after repartition") {
+val query = sql("select * from contacts")
+  .repartition(100)
+  .where("employer.company.address is not null")
+  .selectExpr("employer.id as employer_id")
+checkScan(query,
+  "struct>>")
+checkAnswer(query, Row(0) :: Nil)
+  }
+
+  testSchemaPruning("select one deep nested complex field after repartition by 
expression") {
+val query1 = sql("select * from contacts")
+  .repartition(100, col("id"))
+  .where("employer.company.address is not null")
+  .selectExpr("employer.id as employer_id")
+checkScan(query1,
+  "struct>>")
+checkAnswer(query1, Row(0) :: Nil)
+
+val query2 = sql("select * from contacts")
+  .repartition(100, col("employer"))
+  .where("employer.company.address is not null")
+  .selectExpr("employer.id as employer_id")
+checkScan(query2,
+  
"struct>>")
+checkAnswer(query2, Row(0) :: Nil)
+
+val query3 = sql("select * from contacts")
+  .repartition(100, col("employer.company"))
+  .where("employer.company.address is not null")
+  .selectExpr("employer.company as employer_company")
+checkScan(query3,
+  "struct>>")
+checkAnswer(query3, Row(Row("abc", "123 Business Street")) :: Nil)
+
+val query4 = sql("select * from contacts")
+  .repartition(100, col("employer.company.address"))
+  .where("employer.company.address is not null")
+  .selectExpr("employer.company.address as employer_company_addr")
+checkScan(query4,
+  "struct>>")
+checkAnswer(query4, Row("123 Business Street") :: Nil)
+  }
+
+  testSchemaPruning("select one deep nested complex field after join") {
+val query1 = sql("select contacts.name.middle from contacts, departments 
where " +
+"contacts.id = departments.contactId")
+checkScan(query1,
+  "struct>",
+"struct")
+checkAnswer(query1, Row("X.") :: Row("Y.") :: Nil)
+
+val query2 = sql("select contacts.name.middle from contacts, departments 
where " +

Review comment:
   Seems all tests in this test suite are using lowercases. Changing all 
tests seems too bothering... :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #28556: [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join

2020-05-21 Thread GitBox


viirya commented on a change in pull request #28556:
URL: https://github.com/apache/spark/pull/28556#discussion_r428520430



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##
@@ -82,6 +87,8 @@ object NestedColumnAliasing {
 case _: LocalLimit => true
 case _: Repartition => true
 case _: Sample => true
+case _: RepartitionByExpression => true
+case _: Join => true

Review comment:
   Added some tests for outer join.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #28556: [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join

2020-05-21 Thread GitBox


viirya commented on a change in pull request #28556:
URL: https://github.com/apache/spark/pull/28556#discussion_r428473708



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##
@@ -34,7 +34,8 @@ object NestedColumnAliasing {
 : Option[(Map[ExtractValue, Alias], Map[ExprId, Seq[Alias]])] = plan match 
{
 case Project(projectList, child)
 if SQLConf.get.nestedSchemaPruningEnabled && 
canProjectPushThrough(child) =>
-  getAliasSubMap(projectList)
+  val exprsToPrune = projectList ++ child.expressions

Review comment:
   changed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] viirya commented on a change in pull request #28556: [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join

2020-05-21 Thread GitBox


viirya commented on a change in pull request #28556:
URL: https://github.com/apache/spark/pull/28556#discussion_r428473655



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala
##
@@ -204,15 +211,8 @@ object GeneratorNestedColumnAliasing {
   g: Generate,
   nestedFieldToAlias: Map[ExtractValue, Alias],
   attrToAliases: Map[ExprId, Seq[Alias]]): LogicalPlan = {
-val newGenerator = g.generator.transform {
-  case f: ExtractValue if nestedFieldToAlias.contains(f) =>
-nestedFieldToAlias(f).toAttribute
-}.asInstanceOf[Generator]
-
 // Defer updating `Generate.unrequiredChildIndex` to next round of 
`ColumnPruning`.
-val newGenerate = g.copy(generator = newGenerator)
-
-NestedColumnAliasing.replaceChildrenWithAliases(newGenerate, attrToAliases)
+NestedColumnAliasing.replaceChildrenWithAliases(g, nestedFieldToAlias, 
attrToAliases)

Review comment:
   Changed the method name.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org