[ https://issues.apache.org/jira/browse/SPARK-28008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949125#comment-16949125 ]
Terry Moschou commented on SPARK-28008:
---------------------------------------

We also have a use case for propagating application-specific metadata other than {{comment}}, which is currently dropped by {{SchemaConverters}}. The Avro [specification|http://avro.apache.org/docs/current/spec.html#schemas] does support user-defined attributes whose names are not reserved:

bq. Attributes not defined in this document are permitted as metadata, but must not affect the format of serialized data.

So something like a {{"metadata"}} key would work. I guess {{doc}} could be a shortcut for {{metadata.comment}}?

{code:json}
{
  "type": "record",
  "name": "topLevelRecord",
  "fields": [
    {
      "name": "a",
      "type": "string",
      "metadata": {
        "comment": "AAAAAAA",
        "foo": "bar"
      }
    }
  ]
}
{code}

> Default values & column comments in AVRO schema converters
> ----------------------------------------------------------
>
>                 Key: SPARK-28008
>                 URL: https://issues.apache.org/jira/browse/SPARK-28008
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Mathew Wicks
>            Priority: Major
>
> Currently in both `toAvroType` and `toSqlType` [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] there are two behaviours which are unexpected.
>
> h2. Nullable fields in Spark are converted to UNION[TYPE, NULL] and no default value is set:
>
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : [ "string", "null" ]
>   } ]
> }
> {code}
> *Expected Behaviour:*
> (NOTE: The reversal of "null" & "string" in the union, needed for a default value of null)
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable = true)
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : [ "null", "string" ],
>     "default" : null
>   } ]
> }
> {code}
>
> h2. Field comments/metadata are not propagated:
>
> *Current Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : "string"
>   } ]
> }
> {code}
> *Expected Behaviour:*
> {code:java}
> import org.apache.spark.sql.avro.SchemaConverters
> import org.apache.spark.sql.types._
> val schema = new StructType().add("a", "string", nullable=false, comment="AAAAAAA")
> val avroSchema = SchemaConverters.toAvroType(schema)
> println(avroSchema.toString(true))
> {
>   "type" : "record",
>   "name" : "topLevelRecord",
>   "fields" : [ {
>     "name" : "a",
>     "type" : "string",
>     "doc" : "AAAAAAA"
>   } ]
> }
> {code}
>
> The behaviour should be similar (but the reverse) for `toSqlType`.
> I think we should aim to get this in before 3.0, as it will probably be a breaking change for some usage of the AVRO API.
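
To make the mapping discussed in this thread concrete, here is a minimal sketch in plain Scala with no Spark or Avro dependency. {{Column}}, {{AvroFieldSketch}} and {{toAvroField}} are hypothetical names made up for illustration; none of them exist in {{SchemaConverters}}. The sketch assumes one possible reading of the proposal: a nullable column becomes a null-first union with a null default, {{comment}} is copied to {{doc}}, and the full metadata map is kept under a user-defined {{"metadata"}} attribute.

```scala
// Hypothetical stand-in for a Spark column; this is NOT Spark's StructField.
final case class Column(
    name: String,
    avroType: String,
    nullable: Boolean,
    metadata: Map[String, String] = Map.empty)

object AvroFieldSketch {
  // Wraps a string in double quotes, escaping backslashes and quotes.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Renders a single Avro field declaration as a JSON string.
  def toAvroField(c: Column): String = {
    val namePart = Seq(quote("name") + ": " + quote(c.name))

    // Nullable columns: put "null" first in the union and set a null default,
    // which is the reversal the issue description asks for.
    val typePart =
      if (c.nullable)
        Seq(
          quote("type") + ": [" + quote("null") + ", " + quote(c.avroType) + "]",
          quote("default") + ": null")
      else
        Seq(quote("type") + ": " + quote(c.avroType))

    // Assumption: "doc" acts as a shortcut for metadata.comment.
    val docPart =
      c.metadata.get("comment").map(v => quote("doc") + ": " + quote(v)).toSeq

    // Assumption: all metadata is kept under a user-defined "metadata" key.
    // Keys are sorted so the output is deterministic.
    val metaPart =
      if (c.metadata.isEmpty) Seq.empty[String]
      else {
        val entries = c.metadata.toSeq.sortBy(_._1).map {
          case (k, v) => quote(k) + ": " + quote(v)
        }
        Seq(quote("metadata") + ": " + entries.mkString("{", ", ", "}"))
      }

    (namePart ++ typePart ++ docPart ++ metaPart).mkString("{", ", ", "}")
  }

  def main(args: Array[String]): Unit = {
    val col = Column("a", "string", nullable = true,
      metadata = Map("comment" -> "AAAAAAA", "foo" -> "bar"))
    println(toAvroField(col))
  }
}
```

A real change would presumably build the schema through the Avro library rather than by string concatenation; the sketch only pins down which attributes would end up on the field.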