bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517176346
########## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ##########

@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
   import DataSourceV2Implicits._

+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:

I think the column approach has the following benefits over the function approach:

- You can scope the location of the calls: you can't reference the metadata unless you're directly SELECTing from the table, so there are no calls in places where they don't make sense.
- You can use them immediately from a join, e.g. `SELECT t1._input_file_name, t2._input_file_name FROM Table1 t1 JOIN Table2 t2`. The function syntax doesn't allow this, unless you make it something "special" like `input_file_name(t1)`.
- You can easily push down the projection using existing mechanisms to turn off the overhead of generating the columns. The semantics of such pushdown with functions is iffy at best.
- If you convert this to the physical level directly, a column is more compatible with physical transformations, e.g. things that add readahead and buffering. The current function approach, like `input_file_name()`, fundamentally makes buffering and readahead hard; you'd have to convert it to columns under the covers anyway.

However, the column approach has name conflicts. This becomes problematic if people actually select these columns without assigning new aliases, write the result to files, and then try to read those files back.
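To make the conflict handling in the diff above concrete, here is a minimal, dependency-free Scala sketch of the same `filterNot` logic: metadata columns whose names collide with the table's data columns are hidden. `Column` and `metadataOutput` here are simplified stand-ins for Spark's attribute and `SupportsMetadataColumns` types, not the real API.

```scala
// Simplified sketch of DataSourceV2Relation.metadataOutput's conflict
// filtering. `Column` is a hypothetical stand-in for Spark's attributes.
object MetadataConflictSketch {
  final case class Column(name: String)

  // Mirrors `attrs.filterNot(col => outputNames.contains(col.name))`:
  // any metadata column shadowed by a data column of the same name
  // is dropped from the exposed metadata output.
  def metadataOutput(dataColumns: Seq[Column],
                     metadataColumns: Seq[Column]): Seq[Column] = {
    val outputNames = dataColumns.map(_.name).toSet
    metadataColumns.filterNot(col => outputNames.contains(col.name))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Column("id"), Column("_file"))         // table already has "_file"
    val meta = Seq(Column("_file"), Column("_partition")) // conflicting "_file" is hidden
    println(metadataOutput(data, meta).map(_.name).mkString(",")) // prints "_partition"
  }
}
```

This is exactly the failure mode described below: once a data column named `_file` exists, the metadata column of the same name silently becomes unreachable.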
That means that either DataFrames don't round-trip with 100% fidelity via files (which may break assumptions of libraries that people have built), or you can't access metadata columns on files that have columns with these names, which breaks libraries that assume these features work. If I wrote `table._some_metadata_column` in an auditing-related library, I'd want to be sure that I get what I asked for, and that it cannot be spoofed.

How about using _special syntax_ like `input_file_name(t1)`, or `metadata(t1, file_name)`, where `t1` must be a table (a _data source_, not a subquery/view) in the FROM clause, and is optional if there's only one table? It's not a real function, it's just function-call-like syntax. It can be converted into a special type of column reference under the covers, even at parsing time, giving it all the advantages of columns from an implementation perspective. The column reference could use a reserved name that is impossible to write otherwise. When used in the SELECT list, we wouldn't assign that column name as the SELECT list alias, so it wouldn't show up in files! You still get an error at analysis time if you call it from a place where there's no table, so there's no incorrect use like with normal nondeterministic functions. This has the advantage that you can't have name conflicts.

If we added `input_file_name(t1)` and `input_file_name()` as such special functions with special syntax, then we'd have backward compatibility with existing queries, and we could migrate from that mechanism to the new one. And to avoid having to add extra functions for everything, we'd add `metadata(...)` for all other metadata columns.
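The reserved-name idea above can be sketched as a parse-time rewrite: the function-call-like syntax is turned into a column reference whose name contains a character that ordinary SQL identifiers cannot spell, so it can neither collide with nor be spoofed by a real data column. Everything here is hypothetical illustration (the `__metadata$` prefix, the `rewrite` helper); it is not Spark's actual implementation.

```scala
// Hypothetical sketch of the proposed "special syntax" rewrite.
// `__metadata$` is an invented reserved prefix: the `$` keeps it out of
// the space of names a user can write in an unquoted SQL identifier.
object ReservedMetadataRef {
  private val ReservedPrefix = "__metadata$"

  def internalName(metadataField: String): String = ReservedPrefix + metadataField

  // Parse-time rewrite of the call-like syntax into a reserved column
  // reference, e.g. input_file_name(t1) -> t1.__metadata$file_name.
  // Uses Scala 2.13 string-interpolator patterns for brevity.
  def rewrite(call: String): Option[String] = call match {
    case s"input_file_name($table)"  => Some(s"$table.${internalName("file_name")}")
    case s"metadata($table, $field)" => Some(s"$table.${internalName(field)}")
    case _                           => None // a regular function call; leave it alone
  }
}
```

Because the rewritten reference is never used as the SELECT-list alias, the reserved name would not leak into written files, which is what makes the round-trip safe in this scheme.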