[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 7/23/13 3:07 PM:
-------------------------------------------------------------------

I took a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy or clean way to gain a meaningful amount of performance 
(> 10%) by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. 
Using the full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers from the contents of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the current 
{{CompositeParser.getParsers(ParseContext)}}: instead of building the full 
type->parser map each time, we would search the supported types returned by 
each parser in the (new, combined) parsers list. A quick test of this 
strategy showed only a 1.85% speedup (not counting the cost of building the 
new list). Would that be a feasible solution for you?
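
To make the two lookup strategies concrete, here is a minimal, self-contained sketch. It does not use Tika's classes: {{SketchParser}} and {{ParserLookup}} are hypothetical names, media types are plain strings, and the "later parser wins" rule is an assumption made so that both strategies return the same result.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface SketchParser {
    Set<String> supportedTypes();
}

final class ParserLookup {
    // Strategy A: rebuild the full type->parser map on every lookup.
    static SketchParser viaMap(List<SketchParser> parsers, String type) {
        Map<String, SketchParser> map = new HashMap<>();
        for (SketchParser p : parsers) {
            for (String t : p.supportedTypes()) {
                map.put(t, p); // assumption: a later parser overrides an earlier one
            }
        }
        return map.get(type);
    }

    // Strategy B: search the supported types of each parser in the list.
    // Iterating from the end keeps the "later parser wins" rule of strategy A.
    static SketchParser viaListSearch(List<SketchParser> parsers, String type) {
        for (int i = parsers.size() - 1; i >= 0; i--) {
            SketchParser p = parsers.get(i);
            if (p.supportedTypes().contains(type)) {
                return p;
            }
        }
        return null; // no parser supports this type
    }
}
```

Strategy B avoids allocating a map per lookup, but it still calls {{supportedTypes()}} on potentially every parser, which may be why the measured gain is small.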

 
                
> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>
>                 Key: TIKA-1149
>                 URL: https://issues.apache.org/jira/browse/TIKA-1149
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3, 1.4
>            Reporter: Luca Della Toffola
>            Priority: Minor
>              Labels: performance
>         Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid 
> recomputing the parsers map over and over in 
> CompositeParser.getParsers(...) when the context is empty, and to cache the 
> returned value instead. 
> This can be done safely even if the media registry and the list of component 
> parsers change while Tika is executing, by invalidating the cache in that 
> case.
> Our attached patch computes the parsers map once per instance of 
> CompositeParser.
> The patch checks for the case where the context is empty, and the 
> corresponding setters invalidate the cache when the media registry or the 
> list of component parsers changes.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
> (i.e., Java class library + Tika app + other apps), the patch reduces the 
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the 
> same order of magnitude are also found for smaller workloads.
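
The caching idea described above could look roughly like the following sketch. It is not the attached patch and does not use Tika's classes: {{TypedParser}}, {{CachingComposite}}, the {{contextIsEmpty}} flag and the {{rebuilds}} counter are hypothetical names introduced for illustration, and the media registry is left out (its setter would presumably invalidate the cache in the same way as {{setParsers}}).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parser; this is NOT Tika's Parser interface.
interface TypedParser {
    Set<String> supportedTypes();
}

final class CachingComposite {
    private List<TypedParser> parsers;
    private Map<String, TypedParser> cache; // null means "not computed or invalidated"
    int rebuilds = 0;                       // illustration only: counts map rebuilds

    CachingComposite(List<TypedParser> parsers) {
        this.parsers = parsers;
    }

    // Changing the component parsers invalidates the cached map.
    synchronized void setParsers(List<TypedParser> parsers) {
        this.parsers = parsers;
        this.cache = null;
    }

    // With an empty context the map depends only on this instance's state,
    // so it can be computed once and reused until a setter invalidates it.
    synchronized Map<String, TypedParser> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            return build(); // context may affect the result: do not cache
        }
        if (cache == null) {
            cache = Collections.unmodifiableMap(build());
        }
        return cache;
    }

    private Map<String, TypedParser> build() {
        rebuilds++;
        Map<String, TypedParser> map = new HashMap<>();
        for (TypedParser p : parsers) {
            for (String type : p.supportedTypes()) {
                map.put(type, p); // a later parser overrides an earlier one
            }
        }
        return map;
    }
}
```

The cached map is wrapped as unmodifiable because the same instance is handed to every caller; the methods are synchronized on the assumption that one composite may be shared between threads.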

