date:20130722

[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715675#comment-13715675
 ] 

Luca Della Toffola commented on TIKA-1149:
--

First of all, thanks for the very fast response!
Tomorrow I will take some time to run a few experiments with the optimization 
that you suggested.


> 12% performance improvement by caching in CompositeParser
> -
>
> Key: TIKA-1149
> URL: https://issues.apache.org/jira/browse/TIKA-1149
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3, 1.4
>Reporter: Luca Della Toffola
>Priority: Minor
>  Labels: performance
> Attachments: CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid 
> recomputing the parsers map over and over 
> in CompositeParser.getParsers(...) if the context is empty and to cache the 
> returned value instead. 
> This can be done safely even under the assumption that the media-registry and 
> the list of component parsers do change while Tika is executing, by 
> invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of 
> CompositeParser.
> The patch checks for the case where the context is empty and invalidates the 
> cache if both media-registry and the list of component parsers change in the 
> corresponding setters.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
> (i.e., Java class library + Tika app + other apps), the patch reduces the 
> running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the 
> same order of magnitude are also found for smaller workloads.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Created] (TIKA-1151) Maven Build Should Automatically Produce test-jar Artifacts

2013-07-22 Thread Ray Gauss II (JIRA)

Ray Gauss II created TIKA-1151:
--

 Summary: Maven Build Should Automatically Produce test-jar 
Artifacts
 Key: TIKA-1151
 URL: https://issues.apache.org/jira/browse/TIKA-1151
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Reporter: Ray Gauss II
Assignee: Ray Gauss II


The Maven build should be updated to produce test jar artifacts for appropriate 
sub-projects (see below) such that developers can extend test classes by adding 
the {{test-jar}} artifact as a dependency, i.e.:
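{code}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.5-SNAPSHOT</version>
  <type>test-jar</type>
  <scope>test</scope>
</dependency>
{code}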

The following sub-projects contain tests that developers might want to extend 
and their corresponding {{pom.xml}} should have the [attached 
tests|http://maven.apache.org/guides/mini/guide-attached-tests.html] added:
- tika-app
- tika-bundle
- tika-core
- tika-parsers
- tika-server
- tika-xmp




Re: Tika Core and Parsers Test Artifacts

2013-07-22 Thread Ray Gauss II

Hi Ken, 

Yes, by other tika projects I meant tika-app, tika-bundle, tika-xmp, etc., and 
yes, each sub-project would end up with its own test-jar.

It probably makes more sense to just add the plugin to each project 
individually.

Since there's been no opposition to the concept in general, I'll create a JIRA 
issue where we can discuss the details.

Regards,

Ray


On Jul 21, 2013, at 3:25 PM, Ken Krugler  wrote:

> Hi Ray,
> 
> On Jul 18, 2013, at 6:37am, Ray Gauss II wrote:
> 
>> Hi Ken,
>> 
>> They recommend test-jar instead of classifier now [1], but yes.
> 
> Thanks for the reference.
> 
>> Perhaps the other tika projects could benefit from this as well and it could 
>> just go into tika-parent's build plugins.
> 
> By "other tika projects" do you mean things like tika-app?
> 
> And if it's in the tika-parent's build plugins, does that mean each 
> sub-project would wind up with its own corresponding test-jar?
> 
> Thanks,
> 
> -- Ken
> 
>> [1] http://maven.apache.org/guides/mini/guide-attached-tests.html
>> 
>> 
>> On Jul 18, 2013, at 9:19 AM, Ken Krugler  wrote:
>> 
>>> Hi Ray,
>>> 
>>> On Jul 18, 2013, at 5:14am, Ray Gauss II wrote:
>>> 
>>>> I don't recall if we've discussed this already (I did do a brief search 
>>>> and didn't see anything).
>>>> 
>>>> Is there any opposition to adding test-jar Maven artifacts for tika-core 
>>>> and tika-parsers?
>>>> 
>>>> Seems like it would be good to allow others to extend from tests there if 
>>>> need be.
>>> 
>>> +1
>>> 
>>> I assume you're talking about adding a 
>>> tika-(core|parsers)-<version>-tests.jar, so that we'd pull it in via:
>>> 
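>>> <dependency>
>>>   <groupId>org.apache.tika</groupId>
>>>   <artifactId>tika-parsers</artifactId>
>>>   <version>1.4</version>
>>>   <classifier>tests</classifier>
>>>   <scope>test</scope>
>>> </dependency>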
>>> 
>>> -- Ken
>>> 
>>> --
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://www.scaleunlimited.com
>>> custom big data solutions & training
>>> Hadoop, Cascading, Cassandra & Solr
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
> 
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 
> 
> 
> 
>

[jira] [Updated] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1150:
--

Attachment: testEXCEL_textbox.xlsx

Simple file that shows issue.

> Extract text from textbox in XLSX
> -
>
> Key: TIKA-1150
> URL: https://issues.apache.org/jira/browse/TIKA-1150
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.4
>Reporter: Tim Allison
>Priority: Minor
> Attachments: testEXCEL_textbox.xlsx
>
>
> Underlying POI library doesn't appear to support easy extraction of text from 
> text boxes in XLSX files. Personal preference would be to wait for 
> modifications in POI and then make a few small changes to Tika to run 
> XSSFTextBox code.


[jira] [Created] (TIKA-1150) Extract text from textbox in XLSX

2013-07-22 Thread Tim Allison (JIRA)

Tim Allison created TIKA-1150:
-

 Summary: Extract text from textbox in XLSX
 Key: TIKA-1150
 URL: https://issues.apache.org/jira/browse/TIKA-1150
 Project: Tika
  Issue Type: New Feature
  Components: parser
Affects Versions: 1.4
Reporter: Tim Allison
Priority: Minor


Underlying POI library doesn't appear to support easy extraction of text from 
text boxes in XLSX files. Personal preference would be to wait for 
modifications in POI and then make a few small changes to Tika to run 
XSSFTextBox code.


[jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Jukka Zitting (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13715180#comment-13715180
 ] 

Jukka Zitting commented on TIKA-1149:
-

Note that for example {{DefaultParser.getParsers(ParseContext)}} can return a 
different set of parsers on each invocation, thanks to the dynamic service 
lookup mechanism in {{ServiceLoader}}. Thus caching the return value can lead 
to incorrect behavior.

An alternative optimization would be to refactor the 
{{CompositeParser.getParser(Metadata, ParseContext)}} method so that it doesn't 
need to always instantiate the full type->parser map. Instead it could for 
example restrict the search to only the specified type and its supertypes.
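
To make the suggested alternative concrete, here is a minimal sketch of a lookup that walks only the requested type and its chain of supertypes instead of building the full type->parser map. The class, its methods, and the string-based type names are hypothetical stand-ins for illustration; this is not Tika's API and not code from the issue, and it assumes the supertype relation has no cycles.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; none of these names come from Tika.
class SupertypeLookupSketch {
    // Direct supertype of each media type (assumed acyclic).
    private final Map<String, String> supertypeOf = new HashMap<>();
    // Parser name -> media types it declares support for, in registration order.
    private final Map<String, List<String>> parserTypes = new LinkedHashMap<>();

    void addSupertype(String type, String supertype) {
        supertypeOf.put(type, supertype);
    }

    void addParser(String name, List<String> supportedTypes) {
        parserTypes.put(name, supportedTypes);
    }

    // Checks the requested type first, then each supertype in turn.
    // Returns the name of the first matching parser, or null if none matches.
    String findParser(String type) {
        for (String t = type; t != null; t = supertypeOf.get(t)) {
            for (Map.Entry<String, List<String>> e : parserTypes.entrySet()) {
                if (e.getValue().contains(t)) {
                    return e.getKey();
                }
            }
        }
        return null;
    }
}
```

Because nothing is stored between calls, a parser list that changes from one invocation to the next is always seen in its current state, which is the property the caching approach puts at risk.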



[jira] [Updated] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Della Toffola updated TIKA-1149:
-

Attachment: CompositeParser.patch
ParseContext.patch



[jira] [Created] (TIKA-1149) 12% performance improvement by caching in CompositeParser

2013-07-22 Thread Luca Della Toffola (JIRA)

Luca Della Toffola created TIKA-1149:


 Summary: 12% performance improvement by caching in CompositeParser
 Key: TIKA-1149
 URL: https://issues.apache.org/jira/browse/TIKA-1149
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4, 1.3
Reporter: Luca Della Toffola
Priority: Minor


We found an easy way to improve Tika's performance. The idea is to avoid 
recomputing the parsers map over and over 
in CompositeParser.getParsers(...) if the context is empty and to cache the 
returned value instead. 
This can be done safely even under the assumption that the media-registry and 
the list of component parsers do change while Tika is executing, by 
invalidating the cache in that case.
Our attached patch computes the parsers map once per instance of 
CompositeParser.
The patch checks for the case where the context is empty and invalidates the 
cache if both media-registry and the list of component parsers change in the 
corresponding setters.
For example, when running Tika 1.3 on a set of large (~50k classes) JAR files 
(i.e., Java class library + Tika app + other apps), the patch reduces the 
running time
from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same 
order of magnitude are also found for smaller workloads.
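
The caching idea described above can be sketched roughly as follows. This is not the attached patch: the interface, class, and method names are hypothetical, the boolean argument merely models "the context is empty", and only the parser-list setter is shown (a media-registry setter would invalidate the cache in the same way).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a component parser; not Tika's Parser interface.
interface SketchParser {
    List<String> supportedTypes();
}

// Hypothetical stand-in for CompositeParser, reduced to the caching logic.
class CachingCompositeSketch {
    private List<SketchParser> parsers;
    private Map<String, SketchParser> cachedMap; // null means "not computed yet"
    private int rebuilds;                        // instrumentation for this sketch only

    CachingCompositeSketch(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    // The setter invalidates the cache so that the next lookup rebuilds the map.
    synchronized void setParsers(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        this.cachedMap = null;
    }

    // contextIsEmpty models "the parse context carries no overrides".
    synchronized Map<String, SketchParser> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            return buildMap(); // context-specific result, never cached
        }
        if (cachedMap == null) {
            cachedMap = buildMap();
        }
        return cachedMap;
    }

    synchronized int rebuildCount() {
        return rebuilds;
    }

    private Map<String, SketchParser> buildMap() {
        rebuilds++;
        Map<String, SketchParser> map = new HashMap<>();
        for (SketchParser p : parsers) {
            for (String type : p.supportedTypes()) {
                map.put(type, p); // a later parser replaces an earlier one for the same type
            }
        }
        return Collections.unmodifiableMap(map);
    }
}
```

In this sketch, repeated calls with an empty context reuse one map until setParsers(...) is called. The cache stays correct only if every change to the parser list goes through that setter.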


