Mathew Wicks created SPARK-28008:
------------------------------------

             Summary: Default values & column comments in AVRO schema converters
                 Key: SPARK-28008
                 URL: https://issues.apache.org/jira/browse/SPARK-28008
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.3
            Reporter: Mathew Wicks


Currently, both `toAvroType` and `toSqlType` in 
[SchemaConverters.scala#L134|https://github.com/apache/spark/blob/branch-2.4/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L134]
 show two unexpected behaviours.
h2. Nullable fields in Spark are converted to UNION[TYPE, NULL] and no default value is set:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))
// Current output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "string", "null" ]
  } ]
}
{code}
*Expected Behaviour:*

(NOTE: "null" and "string" are reversed in the union. This is needed for a 
default value of null, because an Avro default for a union field must match 
the first branch of the union.)
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = true)
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))
// Expected output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}{code}
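
For reference, the expected shape above can be built directly with Avro's `SchemaBuilder`. This is only a sketch of the desired output, not a description of how `SchemaConverters` works internally; the object name `NullableDefaultSketch` is made up for illustration.

```scala
import org.apache.avro.{JsonProperties, Schema, SchemaBuilder}

// Hypothetical object name, used only for this illustration.
object NullableDefaultSketch {
  // optionalString is Avro's shortcut for a union of null and string
  // with a null default, i.e. the "null"-first ordering requested above.
  val target: Schema = SchemaBuilder
    .record("topLevelRecord")
    .fields()
    .optionalString("a")
    .endRecord()

  def main(args: Array[String]): Unit = {
    // Pretty-print the schema so it can be compared with the expected JSON.
    println(target.toString(true))
  }
}
```

With "null" as the first branch, the null default is valid for the union, which is why the ordering matters.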
h2. Field comments/metadata are not propagated:

*Current Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = false, comment = "AAAAAAA")
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))
// Current output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string"
  } ]
}{code}
*Expected Behaviour:*
{code:java}
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

val schema = new StructType().add("a", "string", nullable = false, comment = "AAAAAAA")
val avroSchema = SchemaConverters.toAvroType(schema)

println(avroSchema.toString(true))
// Expected output:
{
  "type" : "record",
  "name" : "topLevelRecord",
  "fields" : [ {
    "name" : "a",
    "type" : "string",
    "doc" : "AAAAAAA"
  } ]
}{code}
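
As a rough illustration of the mapping being asked for, the sketch below copies `StructField.getComment` into the Avro field's `doc`. The helper `toAvroWithDocs` is hypothetical and deliberately narrow (flat records of non-nullable strings only); it is not the existing `SchemaConverters` code.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.spark.sql.types._

// Hypothetical helper, for illustration only.
object CommentToDocSketch {
  def toAvroWithDocs(sqlSchema: StructType,
                     recordName: String = "topLevelRecord"): Schema = {
    val start = SchemaBuilder.record(recordName).fields()
    val assembler = sqlSchema.fields.foldLeft(start) { (acc, f) =>
      // Assumption: this sketch supports only non-nullable string columns.
      require(f.dataType == StringType && !f.nullable,
        s"sketch supports only non-nullable strings, got ${f.name}")
      val named = acc.name(f.name)
      // getComment() is an Option; set "doc" only when a comment exists.
      val documented = f.getComment().fold(named)(c => named.doc(c))
      documented.`type`().stringType().noDefault()
    }
    assembler.endRecord()
  }
}
```

A real implementation would need to handle every Spark type and nested records; the point here is only that the comment has an obvious home in Avro's `doc` attribute.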
 

The behaviour should be similar (but the reverse) for `toSqlType`.
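
For the reverse direction, a possible workaround is to post-process the result of `toSqlType` and attach each Avro `doc` as a column comment. The helper `toSqlWithComments` below is hypothetical, handles top-level fields only, and assumes the Avro schema is a record.

```scala
import org.apache.avro.Schema
import org.apache.spark.sql.avro.SchemaConverters
import org.apache.spark.sql.types._

// Hypothetical helper, for illustration only.
object DocToCommentSketch {
  def toSqlWithComments(avroSchema: Schema): StructType = {
    require(avroSchema.getType == Schema.Type.RECORD, "expected an Avro record")
    // toSqlType returns a SchemaType(dataType, nullable); a record maps to a StructType.
    val converted =
      SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
    StructType(converted.fields.map { f =>
      // doc() is null when the Avro field has no documentation.
      Option(avroSchema.getField(f.name).doc()).fold(f)(f.withComment)
    })
  }
}
```

Nested records are left untouched by this sketch; propagating their `doc` values would need a recursive walk over both schemas.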

I think we should aim to get this in before 3.0, as it will probably be a 
breaking change for some usage of the AVRO API.


