Baohe Zhang created SPARK-34336:
-----------------------------------

             Summary: Using GenericData as the Avro serialization data model 
can improve Avro write/read performance
                 Key: SPARK-34336
                 URL: https://issues.apache.org/jira/browse/SPARK-34336
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output, SQL
    Affects Versions: 3.1.2
            Reporter: Baohe Zhang


We found that using "org.apache.avro.generic.GenericData" as the Avro 
serialization data model in the Avro writer can significantly improve Avro 
write performance and slightly improve Avro read performance.

This optimization was originally proposed by [~samkhan] in this PR: 
https://github.com/apache/spark/pull/29354.

We re-evaluated the change "Use GenericData instead of ReflectData when writing 
Avro data" from that PR and verified that it improves performance in the Avro 
write/read benchmarks.

The base branch is branch-3.1 as of today (2/2/21).

Besides the existing Avro read/write benchmarks, I also ran extra benchmarks 
for reading and writing nested structs and arrays. These benchmarks were 
proposed in this PR: https://github.com/apache/spark/pull/29352, but they have 
not been merged.

Benchmark results are posted in a comment on this issue.




