[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277495#comment-17277495 ]

Apache Spark commented on SPARK-34336:
--------------------------------------

User 'baohe-zhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31446

> Use GenericData as Avro serialization data model can improve Avro write/read performance
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-34336
>                 URL: https://issues.apache.org/jira/browse/SPARK-34336
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, SQL
>    Affects Versions: 3.1.2
>            Reporter: Baohe Zhang
>            Priority: Major
>         Attachments: base_read.txt, base_write.txt, generic_data_read.txt, generic_data_write.txt, read_comparison.png, write_comparison.png
>
>
> We found that using "org.apache.avro.generic.GenericData" as the Avro serialization data model in the Avro writer can significantly improve Avro write performance and slightly improve Avro read performance.
> This optimization was originally put up by [~samkhan] in this PR: https://github.com/apache/spark/pull/29354.
> We re-evaluated the change "Use GenericData instead of ReflectData when writing Avro data" in that PR and verified that it improves performance in the Avro write/read benchmarks.
> The base branch is branch-3.1 as of today (2/2/21).
> Besides the current Avro read/write benchmarks, I also ran some extra benchmarks for nested struct and array read/write. These benchmarks were put up in this PR, https://github.com/apache/spark/pull/29352, but haven't been merged.
> Benchmark results are added in the comments.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277488#comment-17277488 ]

Erik Krogen commented on SPARK-34336:
-------------------------------------

Thanks for bringing this up [~Baohe Zhang], I came across PR 29354 and was concerned that it would fall by the wayside now that [~samkhan] was no longer at Verizon Media :) I am not a committer so cannot give binding +1 but will be happy to help review and do what I can to get this in (y)
[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277485#comment-17277485 ]

Baohe Zhang commented on SPARK-34336:
-------------------------------------

Full benchmark results are added as txt attachments.
[jira] [Commented] (SPARK-34336) Use GenericData as Avro serialization data model can improve Avro write/read performance
[ https://issues.apache.org/jira/browse/SPARK-34336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277483#comment-17277483 ]

Baohe Zhang commented on SPARK-34336:
-------------------------------------

Column chart comparison on avg time:

Avro write:
!write_comparison.png!

Avro read:
!read_comparison.png!