ksbeyer commented on code in PR #51466:
URL: https://github.com/apache/spark/pull/51466#discussion_r2223315147
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala:
##########
@@ -222,6 +227,37 @@ case class InsertIntoHadoopFsRelationCommand(
Seq.empty[Row]
}
+ /**
+ * The JSON writer [[org.apache.spark.sql.catalyst.json.JacksonGenerator]]
has a special feature
+ * that changes the null handling of top-level columns that have a default
value such that a
+ * explicit null is written. This is detected today by looking for the
metadata key
+ * [[ResolveDefaultColumnsUtils#EXISTS_DEFAULT_COLUMN_METADATA_KEY]] on the
query attribute.
+ * This function copies this key from the table attribute to the query
attribute only
+ * when a table metadata is available, only for JSON output, and only when
the configuration
+ * requests the special feature.
+ *
+ * We should instead pass the table description down to the writers instead
of using query
+ * attribute metadata, but this is a nontrivial change.
+ */
+ private def markColumnsWithDefaultForJson(outputColumns: Seq[Attribute]):
Seq[Attribute] = {
+ if (catalogTable.isEmpty || !fileFormat.isInstanceOf[JsonFileFormat] ||
Review Comment:
My goal is to eliminate the propagation. I don't think we should propagate
the table attribute metadata to query attributes. I want to limit the
propagation for a documented purpose. Is should also file another jira to fix
the json writer to get table info and remove this propagation.
At a broader scale, the use of schema and attribute metadata seems pretty
adhoc and error prone. Eg, I see questionable propagation of char/varchar info
that makes it unclear where the length constraints / padding should be
enforced. This forced me to add unnecessary aliases to hide the metdata that
otherwise causes the write logic to get upset. I'm still unsure if / where the
constraints are enforced. I think we should push to eliminate the metadata;
perhaps the 1-pass analyzer can help propagate the right information during
analysis rather than adhoc tags.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]