bart-samwel commented on a change in pull request #28027:
URL: https://github.com/apache/spark/pull/28027#discussion_r517651552
##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
##########

```diff
@@ -48,6 +48,15 @@ case class DataSourceV2Relation(
   import DataSourceV2Implicits._

+  override lazy val metadataOutput: Seq[AttributeReference] = table match {
+    case hasMeta: SupportsMetadataColumns =>
+      val attrs = hasMeta.metadataColumns
+      val outputNames = outputSet.map(_.name).toSet
+      attrs.filterNot(col => outputNames.contains(col.name)).toAttributes
```

Review comment:

> @bart-samwel, I like your last suggestion, but with a slight change: I think sources should reject column names that conflict with their metadata columns. That way, a source can choose a format that makes sense (e.g., `__offset`) and allow everything else. All Spark has to do is to define what the behavior will be when there is a conflict, and recommend that sources reject column names that are also exposed as metadata columns.

They probably should reject such column names, yes. Especially at write time -- you don't want to be in a situation where you wrote a file and can then never read it back. But if you did write such a file, how would you deal with it? Could you never read it with Spark? One alternative is to just hide the column if it's in the file. You wouldn't be able to access it, but at least that would satisfy the "auditing library requirement" that the metadata fields always do the expected thing. Another alternative would be to expose the column under a new name. But what if that new name is also taken? It gets very annoying very quickly. So maybe just hiding is the best policy.

> I think using a special syntax requires significantly more changes, so I don't think that's a good option. Specific replies are below.
>
> > If I'd write `table._some_metadata_column` in an auditing-related library, I'd like to be sure that I get what I asked for, and that this cannot be spoofed.
>
> I think this is a fair argument, but I expect it would very rarely be a problem that you wanted a metadata column and accidentally projected a data column.

It'll be rare, but if I have an auditing library that e.g. inserts a special execution node on top of a scan to track exactly what was read, then it would be great if people can't spoof that by adding a column with a particular special name to their underlying files. If the internal metadata column names take precedence over the columns in the file, then that problem goes away. (FWIW, I usually try to apply the principle of "name resolution must be environment independent": if a query defines something locally, e.g. an alias, then something in the environment (such as a table's schema) shouldn't be able to change what that part of the query means, because then things break randomly as a result of schema changes. This fits in the same category.)

> > How about using special syntax
>
> This is what I considered above, "we could use a function to signal that a column is a metadata column only, and keep the interface to sources the same . . ." I think the challenge here is carrying the information through analysis and optimization. This approach either has many of the same complications as using a function (if kept as an expression), or we would need to introduce a new attribute type for metadata attrs that would hopefully work in existing rules just like normal attributes do. We can't lose the fact that the attribute references metadata, so we can't replace a metadata attribute reference with a plain attribute reference.

Yeah, I was thinking about the special AttributeReference type. But more in the way of "use a very special naming convention that you can't type" -- maybe because we'd reject it in the parser if you tried to explicitly write an identifier like that. We'd have to do something to prevent that column name from showing up in output files etc.
though, so it's not entirely foolproof.

> I think the option most likely to not introduce more problems is to use a regular attribute for these columns, but that eliminates the benefit of a special syntax to track the fact that they are metadata column references.

This doesn't seem like a good option to me, if all we want to do is guarantee that metadata columns don't conflict with data columns.

> The special syntax is also more awkward to use. One of the use cases for this feature is to expose non-schema fields like Kafka topic, offset, and partition. I think it is much more natural to select `_offset` than `metadata_column("offset")`. And if these are exposed as columns, we can add them to `DESCRIBE EXTENDED`, like I've done in this PR.

That aspect of columns is certainly nice.

> > An alternative would be to make the column names use a recognizable pattern that we can forbid as column names.
>
> I like this suggestion, but I'd leave rejecting column names to sources as I said above. Another option is to make this available, but not enforce it. We could add a check that data columns are not of the form `metadata$.*`, but not require that metadata column names use the pattern. That way there is a way to guarantee that names don't conflict, but if a source chooses a name that may conflict, it still works fine. So we could set the example by using this pattern in the data sources that we control, but other data sources could choose otherwise.

I guess that works. I'm personally inclined to be a bit more prescriptive about these things, basically because if something is a best practice but the reviewer of the next data source doesn't know about it, then the best practice won't be followed. And before you know it, there's a data source out in the wild that you can't migrate to the best practice because there's already usage out there.
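To make the `metadata$.*` idea concrete, here's a minimal sketch of what such an opt-in check on data column names could look like. This is plain Scala with hypothetical names (`ReservedMetadataNames`, `validateDataColumnNames`, and the `metadata$` prefix are all illustrative), not the actual analyzer code:

```scala
object ReservedMetadataNames {
  // Hypothetical reserved prefix, as discussed above. Data columns may never
  // use it; metadata columns are merely *recommended* to use it, so sources
  // that pick other names (e.g. Kafka's _offset) still work.
  val ReservedPrefix = "metadata$"

  // Reject data column names of the form metadata$.*; throws on violation.
  def validateDataColumnNames(names: Seq[String]): Unit = {
    val bad = names.filter(_.startsWith(ReservedPrefix))
    require(
      bad.isEmpty,
      s"Data column names use the reserved '$ReservedPrefix' prefix: ${bad.mkString(", ")}")
  }
}
```

Under this scheme, a source that wants a hard guarantee names its metadata columns `metadata$file`, `metadata$offset`, etc., and the check makes conflicts impossible; a source that opts out of the pattern still works, but falls back to whatever conflict behavior Spark defines.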
Maybe it's already enough if all the existing data sources consistently use the pattern -- then at least it's "cargo cult proof".
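For reference, the conflict behavior in the diff at the top (a metadata column whose name collides with a data column is simply hidden) boils down to a name-set filter. Here is a stripped-down sketch using plain Scala collections; the real code works on `AttributeReference`s via `SupportsMetadataColumns`, so the names here are illustrative:

```scala
object MetadataConflicts {
  // Mirrors the filterNot in the diff: a metadata column is exposed only if
  // no data column already uses its name, so on a conflict the data column
  // wins and the metadata column is hidden (not renamed, not an error).
  def visibleMetadataColumns(
      dataColumns: Seq[String],
      metadataColumns: Seq[String]): Seq[String] = {
    val dataNames = dataColumns.toSet
    metadataColumns.filterNot(dataNames.contains)
  }
}
```

So with data columns `id` and `_offset` and metadata columns `_offset` and `_partition`, only `_partition` is exposed. This is the "just hide it" policy discussed above: spoof-proof for readers of metadata columns, at the cost that the shadowed metadata value becomes unreachable.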