[ https://issues.apache.org/jira/browse/NIFI-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16630956#comment-16630956 ]
ASF GitHub Bot commented on NIFI-5640:
--------------------------------------

Github user mattyb149 commented on the issue:

    https://github.com/apache/nifi/pull/3036

    +1 LGTM, ran full build and tried various Avro conversions, reading/writing. Thanks for the improvement! Merging to master

> Improve efficiency of Avro Record Reader
> ----------------------------------------
>
>                 Key: NIFI-5640
>                 URL: https://issues.apache.org/jira/browse/NIFI-5640
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Mark Payne
>            Priority: Major
>             Fix For: 1.8.0
>
>
> There are a few things that we are doing in the Avro Reader that cause subpar
> performance. Firstly, in the AvroTypeUtil, when converting an Avro
> GenericRecord to our Record, the building of the RecordSchema is slow because
> we call toString() (which is quite expensive) on the Avro schema in order to
> provide a textual version to RecordSchema. However, the text is typically not
> used and it is optional to provide the schema text, so we should avoid
> calling Schema#toString() whenever possible.
>
> The AvroTypeUtil class also calls #getNonNullSubSchemas() a lot. In some
> cases we don't really need to do this and can avoid creating the sublist. In
> other cases, we do need to call it. However, the method uses the stream()
> method on an existing List just to filter out 0 or 1 elements. While use of
> the stream() method makes the code very readable, it is quite a bit more
> expensive than just iterating over the existing list and adding to an
> ArrayList. We should avoid use of the {{stream()}} method for trivial pieces
> of code in time-critical parts of the codebase.
>
> Additionally, I've found that Avro's GenericDatumReader is extremely
> inefficient, at least in some cases, when reading Strings because it uses an
> IdentityHashMap to cache details about the schema. But IdentityHashMap is far
> slower than if it were to just use HashMap so we could subclass the reader in
> order to avoid the slow caching.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
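
For illustration only, the stream() versus plain-loop point in the issue description can be sketched roughly as below. This is not the NiFi code: the class `NonNullSubSchemas`, the enum `Type` (a stand-in for Avro's schema types) and the methods `viaStream` and `viaLoop` are hypothetical names made up for this sketch. It only shows that both approaches produce the same list when filtering the null branch out of a small union.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class NonNullSubSchemas {
    // Hypothetical stand-in for Avro schema types; only NULL vs. non-NULL matters here.
    public enum Type { NULL, STRING, INT, LONG }

    // Stream-based filter: readable, but sets up a stream pipeline on every call.
    public static List<Type> viaStream(List<Type> unionTypes) {
        return unionTypes.stream()
                .filter(t -> t != Type.NULL)
                .collect(Collectors.toList());
    }

    // Loop-based filter: same result, iterating the existing list and adding to an ArrayList.
    public static List<Type> viaLoop(List<Type> unionTypes) {
        List<Type> nonNull = new ArrayList<>(unionTypes.size());
        for (Type t : unionTypes) {
            if (t != Type.NULL) {
                nonNull.add(t);
            }
        }
        return nonNull;
    }

    public static void main(String[] args) {
        // A typical nullable field is a union of NULL and one other type.
        List<Type> union = Arrays.asList(Type.NULL, Type.STRING);
        System.out.println(viaStream(union));
        System.out.println(viaLoop(union));
    }
}
```

Whether the loop is measurably faster in a given setup would need to be confirmed with a benchmark; the sketch only shows the two equivalent forms.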