[jira] [Comment Edited] (TIKA-1149) Improve parser lookup performance

2013-08-13 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738130#comment-13738130
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 8/13/13 12:27 PM:


I did a quick test with the new patch. By letting {{CompositeParser}} inherit 
from {{SimpleParser}} and commenting out the current 
{{CompositeParser.getSupportedTypes(ParseContext)}} method, I obtain a ~5% 
speedup. I used the same workload as before and ran Tika with {{-d --text}}, 
redirecting the output to {{/dev/null}}. As expected, not all test cases pass 
in my case either.

  
> Improve parser lookup performance
> -
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3, 1.4
> Reporter: Luca Della Toffola
> Priority: Minor
> Labels: performance
> Attachments: 0001-TIKA-1149-Improve-parser-lookup-performance.patch, 
> CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid 
> recomputing the parsers map over and over 
> in CompositeParser.getParsers(...) when the context is empty, and to cache the 
> returned value instead. 
> This can be done safely, even if the media registry and 
> the list of component parsers change while Tika is executing, by 
> invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of 
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the 
> cache in the corresponding setters when the media registry or the list of 
> component parsers changes.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
> (i.e., Java class library + Tika app + other apps), the patch reduces the 
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the 
> same order of magnitude are also found for smaller workloads.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1149) Improve parser lookup performance

2013-08-13 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738130#comment-13738130
 ] 

Luca Della Toffola commented on TIKA-1149:
--

I did a quick test with the new patch. By letting {{CompositeParser}} inherit 
from {{SimpleParser}} and commenting out the current 
{{CompositeParser.getSupportedTypes(ParseContext)}} method, I obtain a ~5% 
speedup. I used the same workload as before and ran Tika with {{-d --text}}. 
As expected, not all test cases pass in my case either.



[jira] [Comment Edited] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-23 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 7/23/13 3:16 PM:
---

I tried to take a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy/clean way to gain a meaningful amount of performance (> 10%) 
by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. Using the 
full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers based on the content of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the 
current {{CompositeParser.getParsers(ParseContext)}}. Instead of building the 
full type->parser map each time, we would search the supported types returned 
by each parser in the (new combined) parsers list. A quick test using this 
strategy, but with the existing list of parsers in an instance of 
{{CompositeParser}}, showed only a ~1.85% speedup with the same workload as 
mentioned. Would that be a feasible solution for you?
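
To make the two lookup strategies discussed above concrete, here is a minimal sketch. It does not use Tika's classes: {{MiniParser}}, {{lookupViaMap}} and {{lookupViaSearch}} are hypothetical names, media types are plain strings, and the "later parsers override earlier ones" rule is an assumption chosen so that both strategies return the same parser.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ListSearchSketch {

    /** Hypothetical stand-in for a component parser; not Tika's Parser interface. */
    interface MiniParser {
        Set<String> supportedTypes();
    }

    /** Map strategy: build the full type->parser map on every call, then look up. */
    static MiniParser lookupViaMap(List<MiniParser> parsers, String type) {
        Map<String, MiniParser> map = new HashMap<>();
        for (MiniParser parser : parsers) {
            for (String supported : parser.supportedTypes()) {
                map.put(supported, parser); // assumption: later parsers override earlier ones
            }
        }
        return map.get(type);
    }

    /** Search strategy: scan the list directly, without allocating a map. */
    static MiniParser lookupViaSearch(List<MiniParser> parsers, String type) {
        // Iterate in reverse so that the last matching parser wins, as in the map strategy.
        for (int i = parsers.size() - 1; i >= 0; i--) {
            MiniParser parser = parsers.get(i);
            if (parser.supportedTypes().contains(type)) {
                return parser;
            }
        }
        return null; // no parser supports this type
    }

    public static void main(String[] args) {
        MiniParser generic = () -> new HashSet<>(Arrays.asList("text/plain", "text/html"));
        MiniParser html = () -> Collections.singleton("text/html");
        List<MiniParser> parsers = Arrays.asList(generic, html);

        // Both strategies pick the same parser for a type claimed by two parsers.
        System.out.println(lookupViaMap(parsers, "text/html") == lookupViaSearch(parsers, "text/html")); // true
        System.out.println(lookupViaSearch(parsers, "text/plain") == generic); // true
        System.out.println(lookupViaSearch(parsers, "image/png") == null); // true
    }
}
```

The search variant saves the per-call map allocation but still asks every parser for its supported types, which may be one reason the measured gain stays small compared with caching the map.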

 


 
  


[jira] [Comment Edited] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-23 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 7/23/13 3:16 PM:
---

I tried to take a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy/clean way to gain a meaningful amount of performance (> 10%) 
by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. Using the 
full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers based on the content of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the 
current {{CompositeParser.getParsers(ParseContext)}}. Instead of building the 
full type->parser map each time, we would search the supported types returned 
by each parser in the (new combined) parsers list. A quick test using this 
strategy, but with the existing list of parsers in an instance of 
{{CompositeParser}}, showed only a 1.85% speedup with the same workload as 
mentioned before. Would that be a feasible solution for you?

 


 
  


[jira] [Comment Edited] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-23 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 7/23/13 3:16 PM:
---

I tried to take a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy/clean way to gain a meaningful amount of performance (> 10%) 
by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. Using the 
full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers based on the content of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the 
current {{CompositeParser.getParsers(ParseContext)}}. Instead of building the 
full type->parser map each time, we would search the supported types returned 
by each parser in the (new combined) parsers list. A quick test using this 
strategy, but with the existing list of parsers in an instance of 
{{CompositeParser}}, showed only a ~1.85% speedup with the same workload as 
mentioned before. Would that be a feasible solution for you?

 


 
  


[jira] [Comment Edited] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-23 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 7/23/13 3:09 PM:
---

I tried to take a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy/clean way to gain a meaningful amount of performance (> 10%) 
by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. Using the 
full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers based on the content of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the 
current {{CompositeParser.getParsers(ParseContext)}}. Instead of building the 
full type->parser map each time, we would search the supported types returned 
by each parser in the (new combined) parsers list. A quick test using this 
strategy showed only a 1.85% speedup with the same workload as mentioned before 
(without taking into account building the new list). Would that be a feasible 
solution for you?

 


 
  


[jira] [Comment Edited] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-23 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola edited comment on TIKA-1149 at 7/23/13 3:07 PM:
---

I tried to take a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy/clean way to gain a meaningful amount of performance (> 10%) 
by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. Using the 
full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers based on the content of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the 
current {{CompositeParser.getParsers(ParseContext)}}. Instead of building the 
full type->parser map each time, we would search the supported types returned 
by each parser in the (new combined) parsers list. A quick test using this 
strategy showed only a 1.85% speedup (without taking into account building the 
new list). Would that be a feasible solution for you?

 


 
  


[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-23 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454
 ] 

Luca Della Toffola commented on TIKA-1149:
--

I tried to take a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that 
there is no easy/clean way to gain a meaningful amount of performance (> 10%) 
by refactoring {{CompositeParser.getParser(Metadata, ParseContext)}}. Using the 
full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to 
{{DefaultParser}} that builds a (new) list of parsers based on the content of 
{{CompositeParser.parsers}} and the dynamic lookup mechanism in 
{{ServiceLoader}}. 
Searching for the appropriate parser would then look similar to the 
current {{CompositeParser.getParsers(ParseContext)}}. Instead of building the 
full type->parser map each time, we would search the supported types returned 
by each parser in the (new combined) parsers list. A quick test using this 
strategy showed only a 1.85% speedup. Would that be a feasible solution for you?

 



[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715675#comment-13715675
 ] 

Luca Della Toffola commented on TIKA-1149:
--

First of all, thanks for the very fast response!
Tomorrow I will take some time to run a few experiments with the optimization 
that you suggested.




[jira] [Updated] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Della Toffola updated TIKA-1149:
-

Attachment: CompositeParser.patch
ParseContext.patch



[jira] [Created] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)
Luca Della Toffola created TIKA-1149:


 Summary: 12% performance improvement by caching in CompositeParser
 Key: TIKA-1149
 URL: https://issues.apache.org/jira/browse/TIKA-1149
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4, 1.3
Reporter: Luca Della Toffola
Priority: Minor


We found an easy way to improve Tika's performance. The idea is to avoid 
recomputing the parsers map over and over 
in CompositeParser.getParsers(...) when the context is empty, and to cache the 
returned value instead. 
This can be done safely, even if the media registry and 
the list of component parsers change while Tika is executing, by 
invalidating the cache in that case.
Our attached patch computes the parsers map once per instance of 
CompositeParser.
The patch checks for the case where the context is empty and invalidates the 
cache in the corresponding setters when the media registry or the list of 
component parsers changes.
For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
(i.e., Java class library + Tika app + other apps), the patch reduces the 
running time
from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same 
order of magnitude are also found for smaller workloads.
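
The caching idea described above can be sketched roughly as follows. This is not the attached patch and does not use Tika's classes: MiniParser, MiniComposite, the boolean contextIsEmpty flag and the rebuilds counter are hypothetical names introduced only for illustration, and media types are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CachedLookupSketch {

    /** Hypothetical stand-in for a component parser; not Tika's Parser interface. */
    interface MiniParser {
        Set<String> supportedTypes();
    }

    /** Hypothetical composite that caches its type->parser map per instance. */
    static class MiniComposite {
        private List<MiniParser> parsers;
        private Map<String, MiniParser> cache; // null means "must be rebuilt"
        int rebuilds = 0;                      // how often the map was actually computed

        MiniComposite(List<MiniParser> parsers) {
            this.parsers = new ArrayList<>(parsers);
        }

        /** Setter that invalidates the cache, as the description proposes. */
        synchronized void setParsers(List<MiniParser> parsers) {
            this.parsers = new ArrayList<>(parsers);
            this.cache = null;
        }

        /** Returns the cached map only when the context is empty. */
        synchronized Map<String, MiniParser> getParsers(boolean contextIsEmpty) {
            if (!contextIsEmpty) {
                return buildMap(); // a non-empty context may change the result; do not cache
            }
            if (cache == null) {
                cache = Collections.unmodifiableMap(buildMap());
            }
            return cache;
        }

        private Map<String, MiniParser> buildMap() {
            rebuilds++;
            Map<String, MiniParser> map = new HashMap<>();
            for (MiniParser parser : parsers) {
                for (String type : parser.supportedTypes()) {
                    map.put(type, parser);
                }
            }
            return map;
        }
    }

    public static void main(String[] args) {
        MiniParser text = () -> Collections.singleton("text/plain");
        MiniParser pdf = () -> Collections.singleton("application/pdf");

        MiniComposite composite = new MiniComposite(Arrays.asList(text));
        composite.getParsers(true);
        composite.getParsers(true);
        System.out.println(composite.rebuilds); // 1: the second call hit the cache

        composite.setParsers(Arrays.asList(text, pdf)); // invalidates the cache
        composite.getParsers(true);
        System.out.println(composite.rebuilds); // 2: rebuilt once after the setter
    }
}
```

The synchronized methods are one simple way to keep the cache consistent if parsers are replaced while lookups are running; whether the real patch needs or uses locking is not stated in the description.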

