[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser
[ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715675#comment-13715675 ]

Luca Della Toffola commented on TIKA-1149:
------------------------------------------

First of all, thanks for the very fast response!
Tomorrow I will take some time to run a few experiments with the optimization that you suggested.

> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3, 1.4
> Reporter: Luca Della Toffola
> Priority: Minor
> Labels: performance
> Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid
> recomputing parsers map over and over
> in CompositeParser.getParsers(...) if the context is empty and to cache the
> returned value instead.
> This can be done safely even under the assumption that the media-registry and
> the list of component parsers do change while Tika is executing, by
> invalidating the cache in the case.
> Our attached patch computes the parsers map once per instance of
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the
> cache if both media-registry and the list of component parsers change in the
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files
> (i.e., Java class library + Tika app + other apps), the patch reduces the
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the
> same order of magnitude are found also for smaller workloads.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts
Ray Gauss II created TIKA-1151:
-------------------------------

Summary: Maven Build Should Automatically Produce test-jar Artifacts
Key: TIKA-1151
URL: https://issues.apache.org/jira/browse/TIKA-1151
Project: Tika
Issue Type: Improvement
Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II

The Maven build should be updated to produce test jar artifacts for appropriate sub-projects (see below) such that developers can extend test classes by adding the {{test-jar}} artifact as a dependency, i.e.:

{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}

The following sub-projects contain tests that developers might want to extend and their corresponding {{pom.xml}} should have the [attached tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp
Re: Tika Core and Parsers Test Artifacts
Hi Ken,

Yes, by other tika projects I meant tika-app, tika-bundle, tika-xmp, etc., and yes, each sub-project would end up with its own test-jar.

It probably makes more sense to just add the plugin to each project individually.

Since there's been no opposition to the concept in general, I'll create a JIRA issue where we can discuss the details.

Regards,

Ray

On Jul 21, 2013, at 3:25 PM, Ken Krugler wrote:

> Hi Ray,
>
> On Jul 18, 2013, at 6:37am, Ray Gauss II wrote:
>
>> Hi Ken,
>>
>> They recommend test-jar instead of classifier now [1], but yes.
>
> Thanks for the reference.
>
>> Perhaps the other tika projects could benefit from this as well and it could
>> just go into tika-parent's build plugins.
>
> By "other tika projects" do you mean things like tika-app?
>
> And if it's in the tika-parent's build plugins, does that mean each
> sub-project would wind up with its own corresponding test-jar?
>
> Thanks,
>
> -- Ken
>
>> [1] http://maven.apache.org/guides/mini/guide-attached-tests.html
>>
>>
>> On Jul 18, 2013, at 9:19 AM, Ken Krugler wrote:
>>
>>> Hi Ray,
>>>
>>> On Jul 18, 2013, at 5:14am, Ray Gauss II wrote:
>>>
>>>> I don't recall if we've discussed this already (I did do a brief search and didn't see anything).
>>>>
>>>> Is there any opposition to adding test-jar Maven artifacts for tika-core and tika-parsers?
>>>>
>>>> Seems like it would be good to allow others to extend from tests there if need be.
>>>
>>> +1
>>>
>>> I assume you're talking about adding a
>>> tika-(core|parsers)-<version>-tests.jar, so that we'd pull it in via:
>>>
>>> <dependency>
>>>   <groupId>org.apache.tika</groupId>
>>>   <artifactId>tika-parsers</artifactId>
>>>   <version>1.4</version>
>>>   <classifier>tests</classifier>
>>>   <scope>test</scope>
>>> </dependency>
>>>
>>> -- Ken
>>>
>>> --
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
[jira] [Updated] (TIKA-1150) Extract text from textbox in XLSX
[ https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1150:
------------------------------

Attachment: testEXCEL_textbox.xlsx

Simple file that shows the issue.

> Extract text from textbox in XLSX
> ---------------------------------
>
> Key: TIKA-1150
> URL: https://issues.apache.org/jira/browse/TIKA-1150
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.4
> Reporter: Tim Allison
> Priority: Minor
> Attachments: testEXCEL_textbox.xlsx
>
>
> Underlying POI library doesn't appear to support easy extraction of text from
> text boxes in XLSX files. Personal preference would be to wait for
> modifications in POI and then make a few small changes to Tika to run
> XSSFTextBox code.
[jira] [Created] (TIKA-1150) Extract text from textbox in XLSX
Tim Allison created TIKA-1150:
------------------------------

Summary: Extract text from textbox in XLSX
Key: TIKA-1150
URL: https://issues.apache.org/jira/browse/TIKA-1150
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor

The underlying POI library doesn't appear to support easy extraction of text from text boxes in XLSX files. My personal preference would be to wait for modifications in POI and then make a few small changes to Tika to run the XSSFTextBox code.
[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser
[ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715180#comment-13715180 ]

Jukka Zitting commented on TIKA-1149:
-------------------------------------

Note that for example {{DefaultParser.getParsers(ParseContext)}} can return a different set of parsers on each invocation, thanks to the dynamic service lookup mechanism in {{ServiceLoader}}. Thus caching the return value can lead to incorrect behavior.

An alternative optimization would be to refactor the {{CompositeParser.getParser(Metadata, ParseContext)}} method so that it doesn't need to always instantiate the full type->parser map. Instead it could for example restrict the search to only the specified type and its supertypes.

> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3, 1.4
> Reporter: Luca Della Toffola
> Priority: Minor
> Labels: performance
> Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid
> recomputing parsers map over and over
> in CompositeParser.getParsers(...) if the context is empty and to cache the
> returned value instead.
> This can be done safely even under the assumption that the media-registry and
> the list of component parsers do change while Tika is executing, by
> invalidating the cache in the case.
> Our attached patch computes the parsers map once per instance of
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the
> cache if both media-registry and the list of component parsers change in the
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files
> (i.e., Java class library + Tika app + other apps), the patch reduces the
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the
> same order of magnitude are found also for smaller workloads.
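The alternative raised in the comment above, searching only the requested type and its supertypes instead of building the full type->parser map, could look roughly like the following. This is a minimal sketch and not Tika code: {{TypedParserSketch}}, {{SupertypeLookupSketch}}, the string-based media types and the child-to-parent map are all hypothetical stand-ins for the real parser and media-registry classes, and the sketch assumes that later parsers take precedence and that the supertype chain has no cycles.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a component parser (not Tika's Parser interface).
final class TypedParserSketch {
    final String name;
    final Set<String> supportedTypes;

    TypedParserSketch(String name, String... supportedTypes) {
        this.name = name;
        this.supportedTypes = new HashSet<>(Arrays.asList(supportedTypes));
    }
}

final class SupertypeLookupSketch {
    // Hypothetical registry: media type -> direct supertype (assumed acyclic).
    private final Map<String, String> supertypeOf = new HashMap<>();
    private final List<TypedParserSketch> parsers;

    SupertypeLookupSketch(List<TypedParserSketch> parsers) {
        this.parsers = parsers;
    }

    void addSupertype(String type, String supertype) {
        supertypeOf.put(type, supertype);
    }

    // Walks from the requested type up its supertype chain and returns the
    // first parser that claims one of those types, or null if none does.
    // No full type->parser map is built along the way.
    TypedParserSketch findParser(String type) {
        for (String t = type; t != null; t = supertypeOf.get(t)) {
            // Scan backwards so that later parsers win over earlier ones
            // (an assumption of this sketch).
            for (int i = parsers.size() - 1; i >= 0; i--) {
                if (parsers.get(i).supportedTypes.contains(t)) {
                    return parsers.get(i);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        SupertypeLookupSketch lookup = new SupertypeLookupSketch(Arrays.asList(
                new TypedParserSketch("xml", "application/xml"),
                new TypedParserSketch("text", "text/plain")));
        lookup.addSupertype("image/svg+xml", "application/xml");
        lookup.addSupertype("application/xml", "text/plain");

        System.out.println(lookup.findParser("image/svg+xml").name); // prints "xml"
        System.out.println(lookup.findParser("text/plain").name);    // prints "text"
        System.out.println(lookup.findParser("image/png"));          // prints "null"
    }
}
```

Because nothing is cached here, a parser list that changes between calls is picked up automatically, which is the concern raised about {{ServiceLoader}}. Whether this beats the map-based lookup in practice would depend on the number of parsers and would need to be measured.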
[jira] [Updated] (TIKA-1149) 12% performance improvement by caching in CompositeParser
[ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Della Toffola updated TIKA-1149:
------------------------------------

Attachment: CompositeParser.patch
            ParseContext.patch

> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.3, 1.4
> Reporter: Luca Della Toffola
> Priority: Minor
> Labels: performance
> Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid
> recomputing parsers map over and over
> in CompositeParser.getParsers(...) if the context is empty and to cache the
> returned value instead.
> This can be done safely even under the assumption that the media-registry and
> the list of component parsers do change while Tika is executing, by
> invalidating the cache in the case.
> Our attached patch computes the parsers map once per instance of
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the
> cache if both media-registry and the list of component parsers change in the
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files
> (i.e., Java class library + Tika app + other apps), the patch reduces the
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the
> same order of magnitude are found also for smaller workloads.
[jira] [Created] (TIKA-1149) 12% performance improvement by caching in CompositeParser
Luca Della Toffola created TIKA-1149:
-------------------------------------

Summary: 12% performance improvement by caching in CompositeParser
Key: TIKA-1149
URL: https://issues.apache.org/jira/browse/TIKA-1149
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.4, 1.3
Reporter: Luca Della Toffola
Priority: Minor

We found an easy way to improve Tika's performance. The idea is to avoid recomputing the parsers map over and over in CompositeParser.getParsers(...) if the context is empty, and to cache the returned value instead.

This can be done safely even under the assumption that the media-registry and the list of component parsers do change while Tika is executing, by invalidating the cache in that case.

Our attached patch computes the parsers map once per instance of CompositeParser. The patch checks for the case where the context is empty, and invalidates the cache in the corresponding setters whenever the media-registry or the list of component parsers changes.

For example, when running Tika 1.3 on a set of large (~50k classes) JAR files (i.e., Java class library + Tika app + other apps), the patch reduces the running time from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same order of magnitude are also found for smaller workloads.
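The caching idea described in this issue can be illustrated with a small, self-contained sketch. This is not the attached patch and does not use Tika's classes: {{SketchParser}} and {{CachingCompositeSketch}} are hypothetical stand-ins, the "empty context" check is reduced to a boolean argument, the {{rebuildCount}} field exists only to make rebuilds visible, and only the parser-list setter is modelled (a media-registry setter would invalidate the cache in the same way).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a component parser: a name plus the types it claims.
final class SketchParser {
    final String name;
    final List<String> supportedTypes;

    SketchParser(String name, List<String> supportedTypes) {
        this.name = name;
        this.supportedTypes = supportedTypes;
    }
}

// Simplified composite that memoizes its type -> parser map per instance.
final class CachingCompositeSketch {
    private List<SketchParser> parsers;
    private Map<String, SketchParser> cachedMap; // null = not computed or invalidated
    int rebuildCount = 0; // exposed only so the sketch can show when rebuilds happen

    CachingCompositeSketch(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    // Setter invalidates the cache, as the issue text describes.
    synchronized void setParsers(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        this.cachedMap = null;
    }

    synchronized Map<String, SketchParser> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            return buildMap(); // context-dependent result, never cached
        }
        if (cachedMap == null) {
            cachedMap = buildMap();
        }
        return cachedMap;
    }

    private Map<String, SketchParser> buildMap() {
        rebuildCount++;
        Map<String, SketchParser> map = new HashMap<>();
        for (SketchParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.put(type, parser); // later parsers override earlier ones
            }
        }
        return map;
    }

    public static void main(String[] args) {
        SketchParser text = new SketchParser("text", Arrays.asList("text/plain"));
        CachingCompositeSketch composite = new CachingCompositeSketch(Arrays.asList(text));

        composite.getParsers(true);
        composite.getParsers(true);
        System.out.println(composite.rebuildCount); // prints 1: second call hit the cache

        composite.setParsers(Arrays.asList(text));
        composite.getParsers(true);
        System.out.println(composite.rebuildCount); // prints 2: setter invalidated the cache
    }
}
```

Note that this kind of cache only sees changes that go through the setters. A composite whose parser list is recomputed on every call would need a different invalidation strategy.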