Kaya Kupferschmidt created SPARK-39838:
------------------------------------------

             Summary: Passing an empty Metadata object to Column.as() should 
clear the metadata
                 Key: SPARK-39838
                 URL: https://issues.apache.org/jira/browse/SPARK-39838
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kaya Kupferschmidt


h2. Description

The Spark DataFrame API allows developers to attach arbitrary metadata to 
individual columns as key/value pairs. The attachment is performed via the 
method "Column.as(name, metadata)". This works as expected as long as the 
metadata object is not empty. But when an empty metadata object is passed, the 
resulting column still holds the metadata of the original incoming column, 
i.e. you cannot use this method to reset the metadata of a column.

This is not the expected behaviour and is a change introduced in Spark 3.3.0. 
In Spark 3.2.1 and earlier, passing an empty metadata object to the method 
"Column.as(name, metadata)" resets the column's metadata as expected.

h2. Steps to Reproduce

The following code snippet reproduces the issue in the Spark shell:
{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Create a DataFrame with one column that has metadata attached
val df1 = spark.range(1, 10)
  .withColumn("col_with_metadata",
    col("id").as("col_with_metadata",
      new MetadataBuilder().putString("metadata", "value").build()))

// Create a derived DataFrame which should reset the metadata of the column
val df2 = df1.select(
  col("col_with_metadata").as("col_without_metadata",
    new MetadataBuilder().build()))

// Display the metadata of both DataFrames' columns
println(s"df1 metadata: ${df1.schema("col_with_metadata").metadata}")
println(s"df2 metadata: ${df2.schema("col_without_metadata").metadata}")
{code}

The expected result is that df1 has non-empty metadata while df2 has empty 
metadata. This is not the case: df2 still holds the same metadata as df1.
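
As a quick check (my own addition, not part of the snippet above), the 
following assertions express the expected behaviour; they pass on Spark 3.2.1 
but the first one fails on 3.3.0 because df2 inherits df1's metadata:
{code:scala}
import org.apache.spark.sql.types.Metadata

// Expected: aliasing with an empty Metadata object clears the column metadata.
// On Spark 3.3.0 this fails because the child's metadata is inherited instead.
assert(df2.schema("col_without_metadata").metadata == Metadata.empty)
assert(df1.schema("col_with_metadata").metadata != Metadata.empty)
{code}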

h2. Analysis

I think the problem stems from the changes in the method 
"trimNonTopLevelAliases" in the class AliasHelper:
{code:scala}
  protected def trimNonTopLevelAliases[T <: Expression](e: T): T = {
    val res = e match {
      case a: Alias =>
        val metadata = if (a.metadata == Metadata.empty) {
          None
        } else {
          Some(a.metadata)
        }
        a.copy(child = trimAliases(a.child))(
          exprId = a.exprId,
          qualifier = a.qualifier,
          explicitMetadata = metadata,
          nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
      case a: MultiAlias =>
        a.copy(child = trimAliases(a.child))
      case other => trimAliases(other)
    }

    res.asInstanceOf[T]
  }
{code}

The method removes any empty metadata object from an Alias, which in turn 
means that the Alias will inherit its child's metadata.
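
A possible direction for a fix (only a sketch, under the assumption that 
explicitly supplied metadata should be preserved as-is, not a tested patch) 
would be to propagate the alias's explicitMetadata field directly instead of 
re-deriving it from a.metadata, so that an explicitly empty Metadata object 
survives the rewrite:
{code:scala}
      case a: Alias =>
        // Sketch: keep whatever was explicitly set on the Alias (including an
        // explicitly empty Metadata object) instead of re-deriving it from
        // a.metadata, which falls back to the child's metadata when no
        // explicit metadata was given.
        a.copy(child = trimAliases(a.child))(
          exprId = a.exprId,
          qualifier = a.qualifier,
          explicitMetadata = a.explicitMetadata,
          nonInheritableMetadataKeys = a.nonInheritableMetadataKeys)
{code}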


