[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2
brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r525389190

## File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java

```
@@ -0,0 +1,29 @@
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * An interface for exposing data columns for a table that are not in the table schema. For example,
+ * a file source could expose a "file" column that contains the path of the file that contained each
+ * row.
+ *
```

Review comment:
Should we document here the behavior of reserving names for metadata columns, or what happens on a name collision (data columns will be selected over metadata columns)?
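For context, here is a minimal sketch of a table implementing the proposed interface, with the collision rule from the comment noted inline. The diff above shows only imports and the class javadoc, so the method shape is an assumption: `metadataColumns()` returning a `StructType` is inferred from the imports and from the `.toAttributes` call in `DataSourceV2Relation` at this revision, and `ExampleFileTable` and its columns are hypothetical.

```
import java.util

import org.apache.spark.sql.connector.catalog.{SupportsMetadataColumns, Table, TableCapability}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

// Hypothetical table implementing the proposed interface. metadataColumns()
// returning a StructType is an assumption based on the imports in the diff
// and the .toAttributes call in DataSourceV2Relation at this revision.
class ExampleFileTable extends Table with SupportsMetadataColumns {
  override def name(): String = "example_file_table"

  // Regular data columns.
  override def schema(): StructType =
    new StructType().add("id", LongType).add("data", StringType)

  override def capabilities(): util.Set[TableCapability] =
    util.EnumSet.of(TableCapability.BATCH_READ)

  // Columns exposed in addition to schema(). Under the collision rule in the
  // comment above, a data column named "file" would shadow this one.
  override def metadataColumns(): StructType =
    new StructType()
      .add("file", StringType, nullable = false)       // source file path
      .add("row_position", LongType, nullable = false) // row offset in file
}
```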
[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2
brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512996706

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

```
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
   import DataSourceV2Implicits._

+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
```

Review comment:
What if we don't have the structs, but just use a UDF to access a metadata column? So you could do:
```
select id, source_metadata(file)
```
This would avoid name collisions and explicitly tell you which functions need to be resolved. You could also return better error messages, knowing that the name you tried to resolve was a metadata column rather than a data column. That said, I believe other systems like Oracle didn't have a function for pseudocolumns, so I'm also fine with the current approach.
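For reference, Spark already ships one function-style metadata accessor along these lines, `input_file_name()`. A small usage sketch follows; the parquet path, table layout, and column names are illustrative only:

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, input_file_name}

object InputFileNameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("input-file-name-demo")
      .master("local[*]")
      .getOrCreate()

    // input_file_name() resolves as a function, not a column, so it can
    // never collide with a data column name; that is the property the
    // proposed source_metadata(...) UDF would generalize to DSv2 sources.
    val df = spark.read.parquet("/tmp/events") // illustrative path
    df.select(col("id"), input_file_name().as("file")).show()

    spark.stop()
  }
}
```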
[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2
brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512364187

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

```
@@ -880,6 +880,12 @@ case class SubqueryAlias(
     val qualifierList = identifier.qualifier :+ alias
     child.output.map(_.withQualifier(qualifierList))
   }
+
+  override def metadataOutput: Seq[Attribute] = {
+    val qualifierList = identifier.qualifier :+ alias
+    child.metadataOutput.map(_.withQualifier(qualifierList))
+  }
```

Review comment:
Why is this differentiation needed? Won't the metadata columns be part of `output`?

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala

```
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
   import DataSourceV2Implicits._

+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
```

Review comment:
Don't you need to resolve column names case-insensitively (according to the case-sensitivity setting)? What if there's a match between a data column and a metadata column? I believe we'll have to come up with UDFs to access these metadata columns; `input_file_name()` could be one such UDF.

Review comment (on the same hunk):
Another option is that metadata columns can be part of a struct, and you can select fields out of those metadata columns. A `get_source_metadata()` UDF could be a catch-all, used like `get_source_metadata("file_size")` or `get_source_metadata("row_id")`. What do you think?
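A sketch of the case-insensitive variant asked for above, reusing the names from the hunk. Obtaining the resolver via `SQLConf.get` is an assumption (any handle on the session's `SQLConf` would do), and `.toAttributes` relies on the same implicits the class already imports:

```
import org.apache.spark.sql.internal.SQLConf

override lazy val metadataOutput: Seq[AttributeReference] = table match {
  case hasMeta: SupportsMetadataColumns =>
    // The resolver honors spark.sql.caseSensitive, so under the default
    // case-insensitive mode a data column "FILE" shadows a metadata
    // column named "file".
    val resolve = SQLConf.get.resolver
    val outputNames = outputSet.map(_.name).toSeq
    hasMeta.metadataColumns
      .filterNot(col => outputNames.exists(name => resolve(name, col.name)))
      .toAttributes
  case _ => Nil
}
```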