bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517176346
########## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala ##########

@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
   import DataSourceV2Implicits._

+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes

Review comment:

I think the column approach has the following benefits over the function approach:

- You can scope the location of the calls: you can't reference the metadata unless you're directly SELECTing from the table, so there are no calls in places where they don't make sense.
- You can use them immediately from a join, e.g. `SELECT t1._input_file_name, t2._input_file_name FROM Table1 t1 JOIN Table2 t2`. The function syntax doesn't allow this, unless you make it something "special" like `input_file_name(t1)`.
- You can easily push down the projection using existing mechanisms to turn off the overhead of generating the columns. The semantics of such pushdown with functions is iffy at best.
- If you convert this to the physical level directly, a column is more compatible with physical transformations, e.g. things that add readahead and buffering. The current function approach, like `input_file_name()`, fundamentally makes buffering and readahead hard; you'd have to convert it to columns under the covers anyway.

However, the column approach has name conflicts. This becomes problematic if people actually select these columns without assigning new aliases, write the result to files, and then try to read those files back.
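To make the conflict handling in the diff above concrete, here is a minimal, dependency-free Scala sketch of the same `filterNot` logic: metadata columns whose names collide with the table's data columns are hidden. `Column` and `metadataOutput` here are simplified stand-ins for Spark's attribute and `SupportsMetadataColumns` types, not the real API.

```scala
// Simplified sketch of DataSourceV2Relation.metadataOutput's conflict
// filtering. `Column` is a hypothetical stand-in for Spark's attributes.
object MetadataConflictSketch {
  final case class Column(name: String)

  // Mirrors `attrs.filterNot(col => outputNames.contains(col.name))`:
  // any metadata column shadowed by a data column of the same name
  // is dropped from the exposed metadata output.
  def metadataOutput(dataColumns: Seq[Column],
                     metadataColumns: Seq[Column]): Seq[Column] = {
    val outputNames = dataColumns.map(_.name).toSet
    metadataColumns.filterNot(col => outputNames.contains(col.name))
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(Column("id"), Column("_file"))         // table already has "_file"
    val meta = Seq(Column("_file"), Column("_partition")) // conflicting "_file" is hidden
    println(metadataOutput(data, meta).map(_.name).mkString(",")) // prints "_partition"
  }
}
```

This is exactly the failure mode described below: once a data column named `_file` exists, the metadata column of the same name silently becomes unreachable.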
That means that either DataFrames don't round-trip with 100% fidelity via files (which may break assumptions of libraries that people have built), or you can't access metadata columns on files that have columns with these names, which breaks libraries that assume these features work. If I wrote `table._some_metadata_column` in an auditing-related library, I'd want to be sure that I get what I asked for, and that it cannot be spoofed.

How about using _special syntax_ like `input_file_name(t1)`, or `metadata(t1, file_name)`, where `t1` must be a table (a _data source_, not a subquery/view) in the FROM clause, and is optional if there's only one table? It's not a real function, it's just function-call-like syntax. It can be converted into a special type of column reference under the covers, even at parsing time, giving it all the advantages of columns from an implementation perspective. The column reference could use a reserved name that is impossible to write otherwise. When used in the SELECT list, we wouldn't assign that column name as the SELECT list alias, so it wouldn't show up in files! You still get an error at analysis time if you call it from a place where there's no table, so there's no incorrect use like with normal nondeterministic functions. This has the advantage that you can't have name conflicts.

If we added `input_file_name(t1)` and `input_file_name()` as such special functions with special syntax, then we'd have backward compatibility with existing queries, and we could migrate from that mechanism to the new one. And to avoid having to add extra functions for everything, we'd add `metadata(...)` for all other metadata columns.
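The reserved-name idea above can be sketched as a parse-time rewrite: the function-call-like syntax is turned into a column reference whose name contains a character that ordinary SQL identifiers cannot spell, so it can neither collide with nor be spoofed by a real data column. Everything here is hypothetical illustration (the `__metadata$` prefix, the `rewrite` helper); it is not Spark's actual implementation.

```scala
// Hypothetical sketch of the proposed "special syntax" rewrite.
// `__metadata$` is an invented reserved prefix: the `$` keeps it out of
// the space of names a user can write in an unquoted SQL identifier.
object ReservedMetadataRef {
  private val ReservedPrefix = "__metadata$"

  def internalName(metadataField: String): String = ReservedPrefix + metadataField

  // Parse-time rewrite of the call-like syntax into a reserved column
  // reference, e.g. input_file_name(t1) -> t1.__metadata$file_name.
  // Uses Scala 2.13 string-interpolator patterns for brevity.
  def rewrite(call: String): Option[String] = call match {
    case s"input_file_name($table)"  => Some(s"$table.${internalName("file_name")}")
    case s"metadata($table, $field)" => Some(s"$table.${internalName(field)}")
    case _                           => None // a regular function call; leave it alone
  }
}
```

Because the rewritten reference is never used as the SELECT-list alias, the reserved name would not leak into written files, which is what makes the round-trip safe in this scheme.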