[ https://issues.apache.org/jira/browse/SPARK-42704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Johnson updated SPARK-42704:
---------------------------------
Description:

The `AddMetadataColumns` analyzer rule intends to make available metadata columns resolvable, even if the plan already contains projections that did not explicitly mention the metadata column.

The `SubqueryAlias` plan node intentionally does not propagate metadata columns automatically from a non-leaf/non-subquery child node, because the following should _not_ work:

{code:java}
spark.read.table("t").select("a", "b").as("s").select("_metadata"){code}

However, today it is too strict and breaks the metadata chain even when the child node's output already includes the metadata column:

{code:java}
// expected to work (and does)
spark.read.table("t")
  .select("a", "b").select("_metadata")

// by extension, should also work (but does not)
spark.read.table("t").select("a", "b", "_metadata").as("s")
  .select("a", "b").select("_metadata"){code}

The solution is for `SubqueryAlias` to always propagate metadata columns that are already in the child's output, thus preserving the `metadataOutput` chain for those columns.

> SubqueryAlias should propagate metadata columns its child already selects
> --------------------------------------------------------------------------
>
>                 Key: SPARK-42704
>                 URL: https://issues.apache.org/jira/browse/SPARK-42704
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.2, 3.4.0
>            Reporter: Ryan Johnson
>            Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
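The proposed behavior can be sketched in plain Scala. This is not Catalyst code: `Plan`, `Relation`, `Project`, and `SubqueryAlias` below are simplified stand-ins for Spark's logical plan nodes, used only to illustrate the rule "an alias propagates a metadata column only if the child's output already contains it".

```scala
// Minimal model of metadata-column propagation -- hypothetical types,
// not the actual Spark/Catalyst implementation.
object MetadataPropagationSketch {
  case class Attr(name: String, isMetadata: Boolean = false)

  sealed trait Plan {
    def output: Seq[Attr]
    def metadataOutput: Seq[Attr]
  }

  // Leaf relation: exposes "_metadata" as an implicit metadata column.
  case class Relation(output: Seq[Attr]) extends Plan {
    val metadataOutput: Seq[Attr] = Seq(Attr("_metadata", isMetadata = true))
  }

  // Projection: keeps only the selected attributes (resolving names
  // against output plus metadata output) and passes the metadata
  // chain through, as AddMetadataColumns relies on.
  case class Project(names: Seq[String], child: Plan) extends Plan {
    def output: Seq[Attr] =
      names.flatMap(n => (child.output ++ child.metadataOutput).find(_.name == n))
    def metadataOutput: Seq[Attr] = child.metadataOutput
  }

  // SubqueryAlias with the proposed fix: propagate only those metadata
  // columns that the child's output already contains.
  case class SubqueryAlias(alias: String, child: Plan) extends Plan {
    def output: Seq[Attr] = child.output
    def metadataOutput: Seq[Attr] =
      child.metadataOutput.filter(m => child.output.exists(_.name == m.name))
  }

  def main(args: Array[String]): Unit = {
    val t = Relation(Seq(Attr("a"), Attr("b")))

    // select("a", "b").as("s"): _metadata is not in the child's output,
    // so the alias must not expose it (the case that should NOT work).
    val hidden = SubqueryAlias("s", Project(Seq("a", "b"), t))
    assert(hidden.metadataOutput.isEmpty)

    // select("a", "b", "_metadata").as("s"): _metadata is in the child's
    // output, so the alias preserves the metadataOutput chain.
    val kept = SubqueryAlias("s", Project(Seq("a", "b", "_metadata"), t))
    assert(kept.metadataOutput.map(_.name) == Seq("_metadata"))
  }
}
```

Running the sketch shows both behaviors from the description: the alias over a plain projection hides `_metadata`, while the alias over a projection that already selected `_metadata` keeps it resolvable downstream.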