[ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Luca Della Toffola updated TIKA-1149:
-------------------------------------

    Attachment: CompositeParser.patch
                ParseContext.patch

> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>
>                 Key: TIKA-1149
>                 URL: https://issues.apache.org/jira/browse/TIKA-1149
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3, 1.4
>            Reporter: Luca Della Toffola
>            Priority: Minor
>              Labels: performance
>         Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid
> recomputing the parsers map over and over in
> CompositeParser.getParsers(...) when the context is empty, and to cache the
> returned value instead.
> This can be done safely even if the media registry and the list of
> component parsers change while Tika is executing, by invalidating the cache
> in that case.
> Our attached patch computes the parsers map once per instance of
> CompositeParser.
> The patch checks for the case where the context is empty, and it
> invalidates the cache when either the media registry or the list of
> component parsers is changed through its setter.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR
> files (i.e., Java class library + Tika app + other apps), the patch reduces
> the running time from 32 seconds to 29 seconds, i.e., a speedup of ~12%.
> Speedups of the same order of magnitude are also found for smaller
> workloads.
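
The caching idea described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the attached patch or Tika's actual CompositeParser: the classes SimpleParser and CachingComposite, the setAliases setter (standing in for the media-registry setter), the rebuildCount counter, and the use of a plain Map as the "context" are all hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a component parser: a name plus the media types it handles. */
final class SimpleParser {
    final String name;
    final Set<String> supportedTypes;

    SimpleParser(String name, Set<String> supportedTypes) {
        this.name = name;
        this.supportedTypes = supportedTypes;
    }
}

/** Hypothetical simplified composite; it only illustrates the cache-and-invalidate pattern. */
final class CachingComposite {
    private List<SimpleParser> parsers;
    // Assumption: the "media registry" is reduced to an alias -> canonical type map.
    private Map<String, String> aliases = new HashMap<>();
    // Cached result for the empty-context case; null means "must be rebuilt".
    private Map<String, SimpleParser> cachedMap;
    // Counts how often the map is actually computed (for demonstration only).
    int rebuildCount = 0;

    CachingComposite(List<SimpleParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    /** Changing the component parsers invalidates the cache. */
    synchronized void setParsers(List<SimpleParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        this.cachedMap = null;
    }

    /** Changing the registry stand-in invalidates the cache as well. */
    synchronized void setAliases(Map<String, String> aliases) {
        this.aliases = new HashMap<>(aliases);
        this.cachedMap = null;
    }

    /** Returns the type -> parser map, reusing the cached copy when the context is empty. */
    synchronized Map<String, SimpleParser> getParsers(Map<String, Object> context) {
        if (!context.isEmpty()) {
            // A non-empty context may alter the result, so it is never cached here.
            return buildMap();
        }
        if (cachedMap == null) {
            cachedMap = buildMap();
        }
        return cachedMap;
    }

    /** The expensive step that the cache avoids repeating. */
    private Map<String, SimpleParser> buildMap() {
        rebuildCount++;
        Map<String, SimpleParser> map = new HashMap<>();
        for (SimpleParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                // Normalize each type through the alias map before registering it.
                map.put(aliases.getOrDefault(type, type), parser);
            }
        }
        return Collections.unmodifiableMap(map);
    }
}
```

In this sketch, repeated getParsers calls with an empty context build the map only once, while a call to either setter forces one rebuild on the next lookup. Returning an unmodifiable map is a design choice of the sketch so that callers cannot corrupt the shared cached copy.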