Mathew Wicks created SPARK-28008:
------------------------------------

             Summary: Default values & column comments in AVRO schema converters
                 Key: SPARK-28008
                 URL: https://issues.apache.org/jira/browse/SPARK-28008
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: Mathew Wicks

Currently, both `toAvroType` and `toSqlType` in [SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134] exhibit two unexpected behaviours.

h2. Nullable fields in Spark are converted to UNION[TYPE, NULL] and no default value is set:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))

// Output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "string", "null" ]
  } ]
}
{code}

*Expected Behaviour:*

(NOTE: "null" and "string" are reversed in the union. This is needed for a default value of null, because Avro matches a union field's default against the first branch of the union.)
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))

// Expected output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
{code}

h2. Field comments/metadata are not propagated:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = false, comment = "AAAAAAA")
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))

// Output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string"
  } ]
}
{code}

*Expected Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = false, comment = "AAAAAAA")
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))

// Expected output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string",
    "doc" : "AAAAAAA"
  } ]
}
{code}

The behaviour of `toSqlType` should mirror this in the reverse direction: an Avro field's "doc" should become the Spark column comment.

I think we should aim to get this in before 3.0, as it will probably be a breaking change for some usage of the AVRO API.
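
For reference, both expected schemas can be built by hand with Avro's own `SchemaBuilder`. This is only a sketch of the target output, not a proposed patch to `SchemaConverters`; the object name `ExpectedAvroSchemas` and its two vals are made up for illustration, and it assumes the Avro library that ships with the spark-avro module is on the classpath.

```scala
import org.apache.avro.{Schema, SchemaBuilder}

// Hypothetical helper object: builds the two "expected" schemas directly
// with Avro's SchemaBuilder, independently of Spark's SchemaConverters.
object ExpectedAvroSchemas {

  // Nullable field: optional() produces a union with "null" as the first
  // branch and sets the field default to null.
  val nullableField: Schema = SchemaBuilder
    .record("topLevelRecord")
    .fields()
    .name("a").`type`().optional().stringType()
    .endRecord()

  // Non-nullable field with a comment: doc() sets the field's "doc"
  // attribute, and noDefault() leaves the field without a default value.
  val documentedField: Schema = SchemaBuilder
    .record("topLevelRecord")
    .fields()
    .name("a").doc("AAAAAAA").`type`().stringType().noDefault()
    .endRecord()

  def main(args: Array[String]): Unit = {
    println(nullableField.toString(true))
    println(documentedField.toString(true))
  }
}
```

Comparing the output of `toAvroType` against schemas built this way could also serve as the basis for a regression test once the converter behaviour is changed.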