[ https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569842#comment-17569842 ]
Kaya Kupferschmidt commented on SPARK-39838: -------------------------------------------- Please find a PR at https://github.com/apache/spark/pull/37251 > Passing an empty Metadata object to Column.as() should clear the metadata > ------------------------------------------------------------------------- > > Key: SPARK-39838 > URL: https://issues.apache.org/jira/browse/SPARK-39838 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.3.0 > Reporter: Kaya Kupferschmidt > Priority: Major > > h2. Description > The Spark DataFrame API allows developers to attach arbiotrary metadata to > individual columns as key/value pairs. The attachment is performed via the > method "Column.as(name, metadata)". This works as expected, as long as the > metadata object is not empty. But when passing an empty metadata object, the > final column in the resulting DataFrame will still hold the metadata of the > original incoming column, i.e. you cannot use this method to essentially > reset the metadata of a column. > This is not the expected behaviour and has changed in Spark 3.3.0. In Spark > 3.2.1 and earlier, passing an empty metadata object to the method > "Column.as(name, metadata)" resets the columns metadata as expected. > h2. Steps to Reproduce > The following code snippet will show the issue in Spark shell: > {code:scala} > import org.apache.spark.sql.types.MetadataBuilder > // Create a DataFrame with one column with Metadata attached > val df1 = spark.range(1,10) > .withColumn("col_with_metadata", col("id").as("col_with_metadata", new > MetadataBuilder().putString("metadata", "value").build())) > // Create a derived DataFrame which should reset the metadata of the column > val df2 = df1.select(col("col_with_metadata").as("col_without_metadata", new > MetadataBuilder().build())) > // Display metadata of both DataFrames columns > println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}") > println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}") > {code} > This code results in the following lines printed onto the console > {code} > df1 metadata: {"metadata":"value"} > df2 metadata: {"metadata":"value"} > {code} > This result does not meet my expectations. I expect that df1 has non-empty > metadata, but df2 should have empty metadata. But this is not the case, df2 > still holds the same metadata as df1. > h2. Analysis > I think the problem stems from the changes in the method > "trimNonTopLevelAliases" in the class AliasHelper: > {code:scala} > protected def trimNonTopLevelAliases[T <: Expression](e: T): T = { > val res = e match { > case a: Alias => > val metadata = if (a.metadata == Metadata.empty) { > None > } else { > Some(a.metadata) > } > a.copy(child = trimAliases(a.child))( > exprId = a.exprId, > qualifier = a.qualifier, > explicitMetadata = metadata, > nonInheritableMetadataKeys = a.nonInheritableMetadataKeys) > case a: MultiAlias => > a.copy(child = trimAliases(a.child)) > case other => trimAliases(other) > } > res.asInstanceOf[T] > } > {code} > The method will remove any empty metadata object from an Alias, which in turn > means that Alias will inherit its childs metadata. -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org