Re: [PR] [SPARK-52772][SQL] Inconsistent table attribute handling during updates [spark]

via GitHub Tue, 22 Jul 2025 10:29:34 -0700


ksbeyer commented on code in PR #51466:
URL: https://github.com/apache/spark/pull/51466#discussion_r2223315147



##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala:
##########
@@ -222,6 +227,37 @@ case class InsertIntoHadoopFsRelationCommand(
     Seq.empty[Row]
   }
 
+  /**
+   * The JSON writer [[org.apache.spark.sql.catalyst.json.JacksonGenerator]] has a special feature
+   * that changes the null handling of top-level columns that have a default value such that a
+   * explicit null is written.  This is detected today by looking for the metadata key
+   * [[ResolveDefaultColumnsUtils#EXISTS_DEFAULT_COLUMN_METADATA_KEY]] on the query attribute.
+   * This function copies this key from the table attribute to the query attribute only
+   * when a table metadata is available, only for JSON output, and only when the configuration
+   * requests the special feature.
+   *
+   * We should instead pass the table description down to the writers instead of using query
+   * attribute metadata, but this is a nontrivial change.
+   */
+  private def markColumnsWithDefaultForJson(outputColumns: Seq[Attribute]): Seq[Attribute] = {
+    if (catalogTable.isEmpty || !fileFormat.isInstanceOf[JsonFileFormat] ||

Review Comment:
   My goal is to eliminate the propagation. I don't think we should propagate 
table attribute metadata to query attributes, so I want to limit the 
propagation to this one documented purpose. I should also file another JIRA to 
fix the JSON writer to get the table info directly and remove this propagation.
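
   For illustration, the guarded propagation described in the doc comment above 
could be sketched roughly as follows. This is a minimal sketch, not the code in 
the PR: `Attr`, `DefaultMarkerSketch`, `markColumnsWithDefault`, the two boolean 
flags, and the key string are hypothetical stand-ins for Spark's `Attribute`, 
the file-format check, the configuration check, and 
`EXISTS_DEFAULT_COLUMN_METADATA_KEY`. Matching columns by exact name is also an 
assumption made here for brevity.

```scala
// Simplified stand-in for a column attribute; NOT Spark's Attribute class.
final case class Attr(name: String, metadata: Map[String, String] = Map.empty)

object DefaultMarkerSketch {
  // Placeholder key (assumption); the real key lives in ResolveDefaultColumnsUtils.
  val ExistsDefaultKey = "EXISTS_DEFAULT_PLACEHOLDER"

  /**
   * Copies only the default-value marker from table columns to query columns,
   * and only when all three conditions hold: table metadata is available,
   * the output is JSON, and the feature is enabled. Otherwise the query
   * output is returned unchanged.
   */
  def markColumnsWithDefault(
      queryOutput: Seq[Attr],
      tableColumns: Option[Seq[Attr]],
      isJsonOutput: Boolean,
      featureEnabled: Boolean): Seq[Attr] = {
    tableColumns match {
      case Some(cols) if isJsonOutput && featureEnabled =>
        // Collect the marker value for each table column that carries it.
        val defaults: Map[String, String] =
          cols.flatMap(c => c.metadata.get(ExistsDefaultKey).map(v => c.name -> v)).toMap
        // Copy just that one key; no other table metadata is propagated.
        queryOutput.map { a =>
          defaults.get(a.name) match {
            case Some(v) => a.copy(metadata = a.metadata + (ExistsDefaultKey -> v))
            case None => a
          }
        }
      case _ => queryOutput
    }
  }
}
```

   The point of the shape is that exactly one key crosses from the table side to 
the query side, under explicit conditions; any other table metadata stays 
behind.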
   
   At a broader scale, the use of schema and attribute metadata seems pretty 
ad hoc and error-prone. E.g., I see questionable propagation of char/varchar 
info that makes it unclear where the length constraints / padding should be 
enforced. This forced me to add unnecessary aliases to hide the metadata that 
otherwise causes the write logic to get upset. I'm still unsure if / where the 
constraints are enforced. I think we should push to eliminate the metadata; 
perhaps the single-pass analyzer can help propagate the right information 
during analysis rather than ad hoc tags.



