Brian Lindblom created SPARK-24855:
--------------------------------------

             Summary: Built-in AVRO support should support specified schema on write
                 Key: SPARK-24855
                 URL: https://issues.apache.org/jira/browse/SPARK-24855
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.4.0
            Reporter: Brian Lindblom


spark-avro appears to have been brought in from an upstream project, 
https://github.com/databricks/spark-avro.  I opened a PR there a while ago to 
add support for 'forceSchema', which allows us to specify an AVRO schema with 
which to write our records, to handle some use cases we have.  That code was 
never merged, but I would like to add this feature to the AVRO reader/writer 
code that was brought in.  The PR is here, and I will follow up with a more 
formal PR/patch rebased on the Spark master branch.

 

This change allows us to specify a schema, which should be compatible with the 
schema that spark-avro generates from the dataset definition.  This lets a user 
specify default values, change union ordering, or, in the case where an AVRO 
dataset is read in, some fields are cleansed in-line, and the result is written 
back out, preserve the original schema in the output container files.  I've had 
several use cases where this behavior was desired, and there were several other 
asks for it in the spark-avro project.
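
A rough sketch of the intended usage, assuming the 'forceSchema' option name 
from the unmerged spark-avro PR (the option name, paths, and schema here are 
illustrative, not a merged API):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, trim}

val spark = SparkSession.builder().appName("avro-schema-on-write").getOrCreate()

// An explicit AVRO schema: nullable union with "null" first and a default
// value, which the auto-generated schema would not necessarily preserve.
val avroSchema = """{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id",  "type": "long"},
    {"name": "tag", "type": ["null", "string"], "default": null}
  ]
}"""

// Round-trip: read an AVRO dataset, cleanse a field in-line, then write it
// back out with the original schema rather than a regenerated one.
val df = spark.read.format("avro").load("/data/events.avro")
val cleaned = df.withColumn("tag", trim(col("tag")))

cleaned.write
  .format("avro")
  .option("forceSchema", avroSchema)  // hypothetical option per the PR
  .save("/data/events_cleaned.avro")
{code}

The writer would be expected to validate that the supplied schema is 
compatible with the schema derived from the DataFrame, and fail the write 
otherwise.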



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
