[
https://issues.apache.org/jira/browse/TIKA-3067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated TIKA-3067:
------------------------------
Description:
I ran inline image extraction on a local sample of 20k files from Common Crawl
and govdocs1.
These are embedded files missing in 1.23 when compared with 1.24-pre-rc1:
||MIME_STRING||CNT||
|image/png|175,413|
|image/tiff|59,507|
|image/jpeg|6,435|
|image/x-jbig2|4,998|
|image/jp2|4,573|
|image/x-jp2-codestream|1|
This suggests that we're gaining ~175k png files with the new method. However,
in other files, it looks like we're losing a number of embedded images as
well.
These are embedded files missing in 1.24-pre-rc1 when compared with 1.23:
||MIME_STRING||CNT||
|image/png|105,885|
|image/tiff|55,636|
|image/jpeg|3,289|
|image/x-jbig2|291|
|text/plain; charset=windows-1252|2|
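To make the two tables easier to compare, here is a rough sketch that computes the net change per MIME type. The counts are copied from the tables above; the names `missing_in_1_23`, `missing_in_1_24` and `net_change` are hypothetical and exist only for this illustration. It assumes that "missing in 1.23" means "extracted only by 1.24-pre-rc1" and vice versa, so the net figure is simply gained minus lost.

```python
# Counts copied from the two tables above.
# Files missing in 1.23, i.e. assumed to be extracted only by 1.24-pre-rc1.
missing_in_1_23 = {
    "image/png": 175_413,
    "image/tiff": 59_507,
    "image/jpeg": 6_435,
    "image/x-jbig2": 4_998,
    "image/jp2": 4_573,
    "image/x-jp2-codestream": 1,
}
# Files missing in 1.24-pre-rc1, i.e. assumed to be extracted only by 1.23.
missing_in_1_24 = {
    "image/png": 105_885,
    "image/tiff": 55_636,
    "image/jpeg": 3_289,
    "image/x-jbig2": 291,
    "text/plain; charset=windows-1252": 2,
}


def net_change(gained, lost):
    """Return gained minus lost for every MIME type seen in either table."""
    mimes = set(gained) | set(lost)
    return {m: gained.get(m, 0) - lost.get(m, 0) for m in mimes}


net = net_change(missing_in_1_23, missing_in_1_24)

# Print the largest net gains first, then the overall total.
for mime, delta in sorted(net.items(), key=lambda kv: -kv[1]):
    print(f"{mime}\t{delta:+,}")
print(f"total\t{sum(net.values()):+,}")
```

A positive number means 1.24-pre-rc1 extracts more embedded files of that type than 1.23 across the sample. It says nothing about which individual files gained or lost images.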
was:
I ran extract inline images on a local sample of 20k files of common crawl and
govdocs1.
These are embedded files missing in 1.23 when compared with 1.24-pre-rc1:
||MIME_STRING||CNT||
|image/png|175,413|
|image/tiff|59,507|
|image/jpeg|6,435|
|image/x-jbig2|4,998|
|image/jp2|4,573|
|image/x-jp2-codestream|1|
This would look like we're gaining ~175k png files with the new
method...However, in other files, it looks like we're losing a bunch of files
as well.
These are embedded files missing in 1.24-pre-rc1
||MIME_STRING||CNT||
|image/png|105,885|
|image/tiff|55,636|
|image/jpeg|3,289|
|image/x-jbig2|291|
|text/plain; charset=windows-1252|2|
> Different numbers of embedded inline images with PDF inline image extraction
> code
> ---------------------------------------------------------------------------------
>
> Key: TIKA-3067
> URL: https://issues.apache.org/jira/browse/TIKA-3067
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
> Attachments: 437698_tika_1_23.tgz, 437698_tika_1_24.tgz,
> attachment_diffs_with_exceptions.xlsx
>