[ https://issues.apache.org/jira/browse/AVRO-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681607#comment-16681607 ]
ASF GitHub Bot commented on AVRO-2247: -------------------------------------- rstata commented on issue #354: AVRO-2247 - improved java reading performance with new reader URL: https://github.com/apache/avro/pull/354#issuecomment-437398604 I've been meaning to comment on this for a while. Looking at your code quickly, I wasn't convinced that it worked for recursive records (and maybe not even for nested records). Also, the solution as posted re-implements schema resolution. The schema-resolution code is subjected to a large number of regression tests that came about because the resolution logic is subtle in places. A re-implementation of that logic should subject itself to that test suite, which yours does not. Inspired by both your JIRA (AVRO-2247) and my own thoughts about further improving performance of reading with resolution, I have refactored the schema-resolution logic away from the resolving-grammar generation logic. I have published this in the branch [`'refactor-resolving-2018-11-09`](https://github.com/rstata-projects/avro/tree/refactor-resolving-2018-11-09) of my Avro fork. This code is "bug for bug" compatible with Avro's existing schema resolution (e.g., it implements the funky "best match" algorithm currently used for unions), and it passes the full schema-resolution regression suite. You might want to look to see if this would be a good foundation for implementing your improvements. Start at the new [Resolver](https://github.com/rstata-projects/avro/blob/refactor-resolving-2018-11-09/lang/java/avro/src/main/java/org/apache/avro/Resolver.java) class, and also look to see how ResolvingGrammarGenerator uses the output of Resolver. However, be warned that I intend to "re-write history" on this code pretty severely before proposing it as an actual improvement, so you might want to wait about a week before actually depending upon this code. Over the last few weeks I've been working on the performance-testing suite. What I've found is that the variance between runs of this suite varies _significantly:_ in places, over 30%! Across the board, I see variance of over 5% between runs for over 40% of the test cases. With this much variance, it's impossible to say if a proposed performance improvement is really an improvement (and impossible to tell whether or not an attempt to improve one set of performance cases has degraded performance elsewhere). By the end of next week I should have a proposed set of changes to the performance benchmark, plus a "cookbook" for using it (on AWS), which minimizes variance between runs of the suite. With that in place, I will return to the `refactor-resolving` work and submit it along with testing that shows it doesn't degrade performance (and, in fact, improves it in places). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Improve Java reading performance with a new reader > -------------------------------------------------- > > Key: AVRO-2247 > URL: https://issues.apache.org/jira/browse/AVRO-2247 > Project: Apache Avro > Issue Type: Improvement > Components: java > Reporter: Martin Jubelgas > Priority: Major > Fix For: 1.9.0 > > Attachments: Perf-Comparison.md > > > Complementary to AVRO-2090, I have been working on decoding of Avro objects > in Java and am suggesting a new implementation of a DatumReader that improves > read performance for both generic and specific records by approximately 20% > (and even more in cases of nested objects with defaults, a case I encounter a > lot in practical use). > Key concept is to create a detailed execution plan once at DatumReader. This > execution plan contains all required defaulting/lookup values so they need > not be looked up during object traversal while reading. > The reader implementation can be enabled and disabled per GenericData > instance. The system default is set via the system variable > "org.apache.avro.fastread" (defaults to "false"). > Attached a performance comparison of the existing implementation with the > proposed one. Will open a pull request with respective code in a bit (not > including interoperability with the optimizations of AVRO-2090 yet). Please > let me know your opinion of whether this is worth pursuing further. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)