Kaya Kupferschmidt created SPARK-39838: ------------------------------------------
Summary: Passing an empty Metadata object to Column.as() should clear the metadata Key: SPARK-39838 URL: https://issues.apache.org/jira/browse/SPARK-39838 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Kaya Kupferschmidt h2. Description The Spark DataFrame API allows developers to attach arbiotrary metadata to individual columns as key/value pairs. The attachment is performed via the method "Column.as(name, metadata)". This works as expected, as long as the metadata object is not empty. But when passing an empty metadata object, the final column in the resulting DataFrame will still hold the metadata of the original incoming column, i.e. you cannot use this method to essentially reset the metadata of a column. This is not the expected behaviour and has changed in Spark 3.3.0. In Spark 3.2.1 and earlier, passing an empty metadata object to the method "Column.as(name, metadata)" resets the columns metadata as expected. h2. Steps to Reproduce The following code snippet will show the issue in Spark shell: {code:scala} import org.apache.spark.sql.types.MetadataBuilder // Create a DataFrame with one column with Metadata attached val df1 = spark.range(1,10) .withColumn("col_with_metadata", col("id").as("col_with_metadata", new MetadataBuilder().putString("metadata", "value").build())) // Create a derived DataFrame which should reset the metadata of the column val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new MetadataBuilder().build())) // Display metadata of both DataFrames columns println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}") println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}") {code} Expected would be that df1 has non-empty metadata, but df2 has empty metadata. But this is not the case, df2 still holds the same metadata as df1. h2. Analysis I think the problem stems from the changes in the method "trimNonTopLevelAliases" in the class AliasHelper: {code:scala} protected def trimNonTopLevelAliases[T <: Expression](e: T): T = { val res = e match { case a: Alias => val metadata = if (a.metadata == Metadata.empty) { None } else { Some(a.metadata) } a.copy(child = trimAliases(a.child))( exprId = a.exprId, qualifier = a.qualifier, explicitMetadata = metadata, nonInheritableMetadataKeys = a.nonInheritableMetadataKeys) case a: MultiAlias => a.copy(child = trimAliases(a.child)) case other => trimAliases(other) } res.asInstanceOf[T] } {code} The method will remove any empty metadata object from an Alias, which in turn means that Alias will inherit its childs metadata. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org