[ 
https://issues.apache.org/jira/browse/SPARK-39838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569842#comment-17569842
 ] 

Kaya Kupferschmidt commented on SPARK-39838:
--------------------------------------------

Please find a PR at https://github.com/apache/spark/pull/37251

> Passing an empty Metadata object to Column.as() should clear the metadata
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39838
>                 URL: https://issues.apache.org/jira/browse/SPARK-39838
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Kaya Kupferschmidt
>            Priority: Major
>
> h2. Description
> The Spark DataFrame API allows developers to attach arbitrary metadata to
> individual columns as key/value pairs. The attachment is performed via the 
> method "Column.as(name, metadata)". This works as expected, as long as the 
> metadata object is not empty. But when passing an empty metadata object, the 
> final column in the resulting DataFrame will still hold the metadata of the 
> original incoming column, i.e. you cannot use this method to essentially 
> reset the metadata of a column.
> This is not the expected behaviour and has changed in Spark 3.3.0. In Spark 
> 3.2.1 and earlier, passing an empty metadata object to the method 
> "Column.as(name, metadata)" resets the column's metadata as expected.
> h2. Steps to Reproduce
> The following code snippet will show the issue in Spark shell:
> {code:scala}
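> // Already in scope in spark-shell; needed when run elsewhere
> import org.apache.spark.sql.functions.col
> import org.apache.spark.sql.types.MetadataBuilder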
> // Create a DataFrame with one column with Metadata attached
> val df1 = spark.range(1, 10).withColumn(
>   "col_with_metadata",
>   col("id").as(
>     "col_with_metadata",
>     new MetadataBuilder().putString("metadata", "value").build()))
> // Create a derived DataFrame which should reset the metadata of the column
> val df2 = df1.select(
>   col("col_with_metadata").as(
>     "col_without_metadata",
>     new MetadataBuilder().build()))
> // Display the column metadata of both DataFrames
> println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
> println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
> {code} 
> This code prints the following lines to the console:
> {code}
> df1 metadata: {"metadata":"value"}
> df2 metadata: {"metadata":"value"}
> {code}
> This is not the result I expect: df1 should have non-empty metadata and df2 
> should have empty metadata. Instead, df2 still holds the same metadata as 
> df1.
> h2. Analysis
> I think the problem stems from the changes in the method 
> "trimNonTopLevelAliases" in the class AliasHelper:
> {code:scala}
>   protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
>     val res = e match {
>       case a: Alias =>
>         val metadata = if (a.metadata == Metadata.empty) {
>           None
>         } else {
>           Some(a.metadata)
>         }
>         a.copy(child = trimAliases(a.child))(
>           exprId = a.exprId,
>           qualifier = a.qualifier,
>           explicitMetadata = metadata,
>           nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
>       case a: MultiAlias =>
>         a.copy(child = trimAliases(a.child))
>       case other => trimAliases(other)
>     }
>     res.asInstanceOf[T]
>   }
> {code}
> The method replaces any empty metadata object on an Alias with None, which 
> in turn means that the Alias inherits its child's metadata.
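> The effect can be sketched without Spark. The following is a simplified 
> model, not Spark's actual classes: Attr, Alias, EmptyMetadata and 
> trimLikeSpark330 are hypothetical stand-ins, and the model assumes that an 
> Alias falls back to its child's metadata whenever its explicit metadata is 
> None.
> {code:scala}
> // Simplified stand-ins for illustration only, not Spark's real classes.
> object AliasMetadataSketch {
>   type Metadata = Map[String, String]
>   val EmptyMetadata: Metadata = Map.empty
>
>   // A column reference that carries its own metadata.
>   final case class Attr(name: String, metadata: Metadata)
>
>   // Assumption: explicitMetadata = None means "inherit from the child".
>   final case class Alias(
>       child: Attr,
>       name: String,
>       explicitMetadata: Option[Metadata]) {
>     def metadata: Metadata = explicitMetadata.getOrElse(child.metadata)
>   }
>
>   // Mirrors the conversion quoted above: empty metadata becomes None.
>   def trimLikeSpark330(a: Alias): Alias = {
>     val md = if (a.metadata == EmptyMetadata) None else Some(a.metadata)
>     a.copy(explicitMetadata = md)
>   }
>
>   def main(args: Array[String]): Unit = {
>     val child = Attr("col_with_metadata", Map("metadata" -> "value"))
>     val alias = Alias(child, "col_without_metadata", Some(EmptyMetadata))
>     // Before trimming, the explicit empty metadata wins.
>     println(alias.metadata)                   // Map()
>     // After trimming, the child's metadata shows through again.
>     println(trimLikeSpark330(alias).metadata) // Map(metadata -> value)
>   }
> }
> {code}
> In this model, an explicitly empty metadata object and "no explicit 
> metadata" become indistinguishable after trimming, which matches the 
> behaviour observed for df2.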



