[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277488#comment-17277488 ]
Erik Krogen commented on SPARK-34336:
-------------------------------------

Thanks for bringing this up, [~Baohe Zhang]. I came across PR 29354 and was concerned that it would fall by the wayside now that [~samkhan] is no longer at Verizon Media :) I am not a committer, so I cannot give a binding +1, but I will be happy to help review and do what I can to get this in (y)

> Use GenericData as Avro serialization data model can improve Avro write/read performance
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-34336
>                 URL: https://issues.apache.org/jira/browse/SPARK-34336
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, SQL
>    Affects Versions: 3.1.2
>            Reporter: Baohe Zhang
>            Priority: Major
>         Attachments: base_read.txt, base_write.txt, generic_data_read.txt,
>                      generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as the Avro
> serialization data model in the Avro writer can significantly improve Avro
> write performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan] in this PR:
> https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when
> writing Avro data" in that PR and verified that it improves performance in
> the Avro write/read benchmarks.
> The base branch is today's (2/2/21) branch-3.1.
> Besides the current Avro read/write benchmarks, I also ran some extra
> benchmarks for reading and writing nested structs and arrays. These
> benchmarks were put up in this PR, https://github.com/apache/spark/pull/29352,
> but haven't been merged.
> Benchmark results are added in the comment.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)