Baohe Zhang created SPARK-34336:
-----------------------------------

             Summary: Use GenericData as Avro serialization data model can improve Avro write/read performance
                 Key: SPARK-34336
                 URL: https://issues.apache.org/jira/browse/SPARK-34336
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output, SQL
    Affects Versions: 3.1.2
            Reporter: Baohe Zhang

We found that using "org.apache.avro.generic.GenericData" as the Avro serialization data model in the Avro writer significantly improves Avro write performance and slightly improves Avro read performance. This optimization was originally proposed by [~samkhan] in this PR: https://github.com/apache/spark/pull/29354

We re-evaluated the change "Use GenericData instead of ReflectData when writing Avro data" from that PR and verified that it improves performance in the Avro write/read benchmarks. The base branch is branch-3.1 as of today (2/2/21).

Besides the existing Avro read/write benchmarks, I also ran extra benchmarks for reading and writing nested structs and arrays. These benchmarks were proposed in https://github.com/apache/spark/pull/29352 but have not been merged.

Benchmark results are added in a comment on this issue.