[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

2020-11-17 Thread GitBox


brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r525389190



##
File path: sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
##
@@ -0,0 +1,29 @@
+package org.apache.spark.sql.connector.catalog;
+
+import org.apache.spark.annotation.Evolving;
+import org.apache.spark.sql.connector.read.SupportsPushDownRequiredColumns;
+import org.apache.spark.sql.types.StructField;
+import org.apache.spark.sql.types.StructType;
+
+/**
+ * An interface for exposing data columns for a table that are not in the table schema. For example,
+ * a file source could expose a "file" column that contains the path of the file that contained each
+ * row.
+ * 

Review comment:
   Should we document here the behavior of reserving names for metadata columns, or what happens on a name collision (data columns are selected over metadata columns)?
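   The collision rule described above (on a name clash, the data column wins and the metadata column is hidden) could be sketched roughly like this; `visibleMetadataColumns` and the plain name lists are hypothetical stand-ins for the real schema objects, not Spark API:

   ```java
   import java.util.List;
   import java.util.Set;
   import java.util.stream.Collectors;

   public class MetadataCollision {
       // Keep only metadata columns whose names do not clash with a data
       // column; on a collision the data column is selected instead.
       static List<String> visibleMetadataColumns(List<String> dataColumns,
                                                  List<String> metadataColumns) {
           Set<String> dataNames = Set.copyOf(dataColumns);
           return metadataColumns.stream()
               .filter(name -> !dataNames.contains(name))
               .collect(Collectors.toList());
       }

       public static void main(String[] args) {
           // "file" exists both as a data and a metadata column, so only
           // "row_id" remains visible as metadata.
           System.out.println(visibleMetadataColumns(
               List.of("id", "file"), List.of("file", "row_id")));  // prints [row_id]
       }
   }
   ```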





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

2020-10-27 Thread GitBox


brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512996706



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
   What if we don't expose the structs, but instead require this UDF to access a metadata column? So you could write:
   ```
   select id, source_metadata(file)
   ```
   This would avoid name collisions and explicitly tell you which functions need to be resolved. You could also return better error messages, knowing whether the column you tried to resolve was a metadata column or a data column.
   
   I believe other systems like Oracle didn't use a function for pseudocolumns, so I'm also fine with this approach.








[GitHub] [spark] brkyvz commented on a change in pull request #28027: [SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2

2020-10-26 Thread GitBox


brkyvz commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r512364187



##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
##
@@ -880,6 +880,12 @@ case class SubqueryAlias(
     val qualifierList = identifier.qualifier :+ alias
     child.output.map(_.withQualifier(qualifierList))
   }
+
+  override def metadataOutput: Seq[Attribute] = {
+    val qualifierList = identifier.qualifier :+ alias
+    child.metadataOutput.map(_.withQualifier(qualifierList))
+  }

Review comment:
   Why is this differentiation needed? Won't the metadata columns be part of `output`?

##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
   Don't you need to resolve column names case-insensitively (according to the case sensitivity setting)?
   What if there's a match between a data column and a metadata column?
   
   I believe we'll have to come up with UDFs to access these metadata columns; `input_file_name()` could be one such UDF.
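   The case-sensitivity concern above could be sketched like this; the `caseSensitive` flag and `filterMetadata` helper are hypothetical illustrations, not Spark's actual resolver:

   ```java
   import java.util.List;
   import java.util.Locale;
   import java.util.Set;
   import java.util.stream.Collectors;

   public class CaseInsensitiveFilter {
       // Drop metadata columns that collide with a data column name,
       // optionally ignoring case when comparing names.
       static List<String> filterMetadata(List<String> dataColumns,
                                          List<String> metadataColumns,
                                          boolean caseSensitive) {
           Set<String> dataNames = dataColumns.stream()
               .map(n -> caseSensitive ? n : n.toLowerCase(Locale.ROOT))
               .collect(Collectors.toSet());
           return metadataColumns.stream()
               .filter(n -> !dataNames.contains(
                   caseSensitive ? n : n.toLowerCase(Locale.ROOT)))
               .collect(Collectors.toList());
       }

       public static void main(String[] args) {
           // Case-insensitive: "FILE" collides with data column "file".
           System.out.println(filterMetadata(
               List.of("id", "file"), List.of("FILE", "row_id"), false));  // [row_id]
           // Case-sensitive: "FILE" and "file" are distinct names.
           System.out.println(filterMetadata(
               List.of("id", "file"), List.of("FILE", "row_id"), true));   // [FILE, row_id]
       }
   }
   ```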

##
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
 
   import DataSourceV2Implicits._
 
+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:
   Another option is that metadata columns can be part of a struct, and you select fields out of that struct. A `get_source_metadata()` UDF could be a catch-all, used like `get_source_metadata("file_size")` or `get_source_metadata("row_id")`. What do you think?
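   One way to picture the struct idea above: metadata travels as a single struct-like value per row, and a catch-all function pulls fields out of it. A toy sketch with a plain map standing in for the struct (`getSourceMetadata` here is hypothetical, not an existing Spark function):

   ```java
   import java.util.Map;

   public class SourceMetadataSketch {
       // Toy stand-in for a per-row metadata struct.
       static final Map<String, Object> rowMetadata = Map.of(
           "file_size", 4096L,
           "row_id", 42L);

       // Catch-all accessor, in the spirit of get_source_metadata("file_size");
       // unknown fields can produce a precise error instead of a silent miss.
       static Object getSourceMetadata(String field) {
           if (!rowMetadata.containsKey(field)) {
               throw new IllegalArgumentException("Unknown metadata field: " + field);
           }
           return rowMetadata.get(field);
       }

       public static void main(String[] args) {
           System.out.println(getSourceMetadata("file_size"));  // prints 4096
           System.out.println(getSourceMetadata("row_id"));     // prints 42
       }
   }
   ```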




