[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1445:
------------------------------
    Attachment: 000003.doc

I'm sorry that I haven't had a chance to kick the tires on the fix for this 
issue.

I just discovered that the current fix is not pulling metadata from embedded 
image files in tika-trunk or tika-1.7-rc2.

Test doc from govdocs1 attached.

We should be extracting these values (at least) in the embedded tiff:

{noformat}
"Data Precision":"8 bits",
"Image Height":"169 pixels",
"Image Width":"752 pixels",
"Number of Components":"3",
"Resolution Units":"inch",
"X Resolution":"300 dots",
"Y Resolution":"300 dots",
"resourceName":"image1.jpg",
"tiff:BitsPerSample":"8",
"tiff:ImageLength":"169",
"tiff:ImageWidth":"752",
"tika.mime.file":"image1.jpg"
{noformat}
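
A regression check for this could look roughly like the sketch below. It is only an illustration: EmbeddedImageMetadataCheck and missingKeys are hypothetical names, not part of Tika, and the sketch assumes the metadata for the embedded image has already been collected into a plain Map by whatever means the test uses. The key list is copied from the expected values above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: reports which of the expected
// metadata keys are absent from the metadata returned for an embedded image.
final class EmbeddedImageMetadataCheck {

    // Image metadata keys taken from the expected values listed above.
    static final List<String> EXPECTED_KEYS = Arrays.asList(
            "Data Precision", "Image Height", "Image Width",
            "Number of Components", "Resolution Units",
            "X Resolution", "Y Resolution",
            "tiff:BitsPerSample", "tiff:ImageLength", "tiff:ImageWidth");

    // Returns the expected keys that do not appear in the actual metadata,
    // in the same order as EXPECTED_KEYS. An empty list means nothing is missing.
    static List<String> missingKeys(Map<String, String> actual) {
        List<String> missing = new ArrayList<>();
        for (String key : EXPECTED_KEYS) {
            if (!actual.containsKey(key)) {
                missing.add(key);
            }
        }
        return missing;
    }
}
```

A test would fail if missingKeys returned a non-empty list for the embedded image in the attached document.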

> Figure out how to add Image metadata extraction to Tesseract parser
> -------------------------------------------------------------------
>
>                 Key: TIKA-1445
>                 URL: https://issues.apache.org/jira/browse/TIKA-1445
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: 000003.doc, TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
