[ 
https://issues.apache.org/jira/browse/AVRO-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16681607#comment-16681607
 ] 

ASF GitHub Bot commented on AVRO-2247:
--------------------------------------

rstata commented on issue #354: AVRO-2247 - improved java reading performance 
with new reader
URL: https://github.com/apache/avro/pull/354#issuecomment-437398604
 
 
   I've been meaning to comment on this for a while.  Looking at your code 
quickly, I wasn't convinced that it worked for recursive records (and maybe not 
even for nested records).  Also, the solution as posted re-implements schema 
resolution.  The schema-resolution code is subjected to a large number of 
regression tests that came about because the resolution logic is subtle in 
places.  A re-implementation of that logic should subject itself to that test 
suite, which yours does not.
   
   Inspired by both your JIRA (AVRO-2247) and my own thoughts about further 
improving performance of reading with resolution, I have refactored the 
schema-resolution logic away from the resolving-grammar generation logic.  I 
have published this in the branch 
[`'refactor-resolving-2018-11-09`](https://github.com/rstata-projects/avro/tree/refactor-resolving-2018-11-09)
 of my Avro fork.  This code is "bug for bug" compatible with Avro's existing 
schema resolution (e.g., it implements the funky "best match" algorithm 
currently used for unions), and it passes the full schema-resolution regression 
suite.
   
   You might want to look to see if this would be a good foundation for 
implementing your improvements.  Start at the new 
[Resolver](https://github.com/rstata-projects/avro/blob/refactor-resolving-2018-11-09/lang/java/avro/src/main/java/org/apache/avro/Resolver.java)
 class, and also look to see how ResolvingGrammarGenerator uses the output of 
Resolver.  However, be warned that I intend to "re-write history" on this code 
pretty severely before proposing it as an actual improvement, so you might want 
to wait about a week before actually depending upon this code.
   
   Over the last few weeks I've been working on the performance-testing suite.  
What I've found is that the variance between runs of this suite varies 
_significantly:_ in places, over 30%!  Across the board, I see variance of over 
5% between runs for over 40% of the test cases.  With this much variance, it's 
impossible to say if a proposed performance improvement is really an 
improvement (and impossible to tell whether or not an attempt to improve one 
set of performance cases has degraded performance elsewhere).
   
   By the end of next week I should have a proposed set of changes to the 
performance benchmark, plus a "cookbook" for using it (on AWS), which minimizes 
variance between runs of the suite.  With that in place, I will return to the 
`refactor-resolving` work and submit it along with testing that shows it 
doesn't degrade performance (and, in fact, improves it in places).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve Java reading performance with a new reader
> --------------------------------------------------
>
>                 Key: AVRO-2247
>                 URL: https://issues.apache.org/jira/browse/AVRO-2247
>             Project: Apache Avro
>          Issue Type: Improvement
>          Components: java
>            Reporter: Martin Jubelgas
>            Priority: Major
>             Fix For: 1.9.0
>
>         Attachments: Perf-Comparison.md
>
>
> Complementary to AVRO-2090, I have been working on decoding of Avro objects 
> in Java and am suggesting a new implementation of a DatumReader that improves 
> read performance for both generic and specific records by approximately 20% 
> (and even more in cases of nested objects with defaults, a case I encounter a 
> lot in practical use).
> Key concept is to create a detailed execution plan once at DatumReader. This 
> execution plan contains all required defaulting/lookup values so they need 
> not be looked up during object traversal while reading.
> The reader implementation can be enabled and disabled per GenericData 
> instance. The system default is set via the system variable 
> "org.apache.avro.fastread" (defaults to "false").
> Attached a performance comparison of the existing implementation with the 
> proposed one. Will open a pull request with respective code in a bit (not 
> including interoperability with the optimizations of AVRO-2090 yet). Please 
> let me know your opinion of whether this is worth pursuing further.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to