[jira] [Comment Edited] (TIKA-1294) Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs

Tim Allison (JIRA) Mon, 19 May 2014 17:42:12 -0700

    [ https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002647#comment-14002647 ]


Tim Allison edited comment on TIKA-1294 at 5/20/14 12:41 AM:
-------------------------------------------------------------

As very preliminary work towards TIKA-1302, I ran Tika 1.5 and 1.6-trunk 
against 500 randomly selected PDFs from 
[govdocs1 | http://digitalcorpora.org/corpora/files].

For each file, I measured the time in milliseconds to extract all embedded 
documents and all embedded metadata.  The summary statistics are:

||Measure (ms per file)||Tika 1.5||Tika 1.6-SNAPSHOT||
|median|122.0|177.5|
|mean|362.3|1345.1|
|stdev|746.4|5237.6|

These summary statistics suggest to me that 1.6 may be slower, but also that 
some monster files are probably skewing the numbers: the median rises by less 
than half, while the mean and stdev rise several-fold.  Review of the data 
confirms this.  Slightly more than 20 files have > 100 attachments in 1.6 but 
only 0 or 1 attachments in 1.5, and slightly more than 100 files have > 10 
attachments in 1.6 but only 0 or 1 in 1.5.

The worst offender, 
[905020 | http://digitalcorpora.org/corp/nps/files/govdocs1/905/905020.pdf], 
has 4041 attachments and 1.6 million metadata elements with Tika 1.6, but 0 
attachments with Tika 1.5.
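
To illustrate why a few monster files move the mean and stdev far more than 
the median, here is a minimal sketch in plain Java.  The class name, method 
names, and the timing values are all made up for illustration; they are not 
the measured data behind the table, and I am assuming a sample (n - 1) 
standard deviation, which may not be how the figures in the table were 
computed.

```java
import java.util.Arrays;

class SkewDemo {
    // Arithmetic mean of the values.
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Median: middle value of the sorted copy (average of the two middle
    // values when the length is even).
    static double median(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1
                ? sorted[n / 2]
                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Sample standard deviation (divides by n - 1); an assumption here.
    static double sampleStdev(double[] xs) {
        double m = mean(xs);
        double sumSquares = 0;
        for (double x : xs) {
            sumSquares += (x - m) * (x - m);
        }
        return Math.sqrt(sumSquares / (xs.length - 1));
    }

    public static void main(String[] args) {
        // Hypothetical per-file times in ms, NOT the govdocs1 measurements.
        double[] typical = {100, 110, 120, 130, 140};
        double[] withMonster = {100, 110, 120, 130, 50000};

        for (double[] times : new double[][] {typical, withMonster}) {
            System.out.printf("median=%.1f mean=%.1f stdev=%.1f%n",
                    median(times), mean(times), sampleStdev(times));
        }
    }
}
```

Replacing a single value with a very large one leaves the median unchanged 
while the mean and stdev jump, which is the same pattern the table shows 
between 1.5 and 1.6.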

The above is based on a small sample of mostly American government PDFs.  
Your mileage may vary.


> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
>                 Key: TIKA-1294
>                 URL: https://issues.apache.org/jira/browse/TIKA-1294
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Trivial
>         Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded 
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types 
> of embedded resources.  I see two ways of allowing the client to choose 
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them 
> as embedded PDXObjectImages vs regular image attachments.  The client can 
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.
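
The two options in the description can be contrasted with a rough sketch.  
Everything named here is hypothetical and for illustration only: the metadata 
key and value, the Config class, and the method names are not taken from the 
attached patch or from Tika's API, and embedded resources are modelled as 
plain maps rather than real parser output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class InlineImageOptions {
    // Hypothetical metadata key and value marking an image pulled from a
    // PDXObjectImage; these are not real Tika metadata names.
    static final String TYPE_KEY = "embeddedResourceType";
    static final String INLINE_IMAGE = "pdf-inline-image";

    // Hypothetical stand-in for a parser config object (option 2).
    static class Config {
        boolean extractInlineImages = true;
    }

    static boolean isInlineImage(Map<String, String> resource) {
        return INLINE_IMAGE.equals(resource.get(TYPE_KEY));
    }

    // Option 1: the parser emits every embedded resource with a marker in
    // its metadata, and the client discards the ones it does not want.
    static List<Map<String, String>> clientFilter(
            List<Map<String, String>> emitted) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> resource : emitted) {
            if (!isInlineImage(resource)) {
                kept.add(resource);
            }
        }
        return kept;
    }

    // Option 2: the parser consults the config and never emits inline
    // images when the flag is off, so no extraction cost is paid for them.
    static List<Map<String, String>> parse(
            List<Map<String, String>> found, Config config) {
        List<Map<String, String>> emitted = new ArrayList<>();
        for (Map<String, String> resource : found) {
            if (config.extractInlineImages || !isInlineImage(resource)) {
                emitted.add(resource);
            }
        }
        return emitted;
    }
}
```

In this sketch both options hand the client the same resources; the 
difference is that option 1 still does the extraction work before the client 
filters, while option 2 skips it inside the parser.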



--
This message was sent by Atlassian JIRA
(v6.2#6252)
