[ https://issues.apache.org/jira/browse/SPARK-24855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-24855:
----------------------------------
    Affects Version/s:     (was: 2.4.0)
                           3.0.0

> Built-in AVRO support should support specified schema on write
> --------------------------------------------------------------
>
>                 Key: SPARK-24855
>                 URL: https://issues.apache.org/jira/browse/SPARK-24855
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Brian Lindblom
>            Assignee: Brian Lindblom
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> spark-avro appears to have been brought in from an upstream project,
> https://github.com/databricks/spark-avro. I opened a PR a while ago to
> enable support for 'forceSchema', which allows us to specify an AVRO schema
> with which to write our records, to handle some use cases we have. I didn't
> get this code merged, but I would like to add this feature to the AVRO
> reader/writer code that was brought in. The PR is here, and I will follow up
> with a more formal PR/patch rebased on the Spark master branch:
> https://github.com/databricks/spark-avro/pull/222
>
> This change allows us to specify a schema, which should be compatible with
> the schema generated by spark-avro from the dataset definition. This lets a
> user do things like specify default values or change union ordering, or, in
> the case where an AVRO data set is read in, cleansed in-line, and written
> back out, preserve the original schema in the output container files. I've
> had several use cases where this behavior was desired, and there were
> several other asks for it in the spark-avro project.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
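As a sketch of the intended usage: the caller builds an AVRO schema that is compatible with the one Spark would derive from the dataset, but with a different union ordering or explicit defaults, and passes it at write time. (The record name `Person`, the field layout, and the output path below are illustrative assumptions; the option is named `forceSchema` in the linked spark-avro PR, while Spark's built-in Avro source exposes an `avroSchema` option.)

```python
import json

# Hypothetical schema, compatible with a dataset of (name: string, age: int).
# Spark would derive the nullable field "age" as the union ["null", "int"];
# here the union order is flipped and a default is supplied -- choices that
# only a user-specified writer schema can express.
custom_schema = {
    "type": "record",
    "name": "Person",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["int", "null"], "default": 0},
    ],
}

schema_json = json.dumps(custom_schema)

# With the feature in place, the schema string would be handed to the writer,
# roughly (path and DataFrame are placeholders):
#
#   df.write.format("avro") \
#       .option("avroSchema", schema_json) \
#       .save("/path/to/output")
```

The writer schema is then recorded in the output container files, so downstream readers see the user-chosen union ordering and defaults rather than the derived schema.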