[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1445:

Priority: Blocker (was: Major)

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
Fix For: 1.7
Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1445:

Attachment: 03.doc

I'm sorry that I haven't had a chance to kick the tires on the fix for this issue. I just discovered that the current fix is not pulling metadata from embedded image files in tika-trunk or tika-1.7-rc2. Test doc from govdocs1 attached.

We should be extracting these values (at least) in the embedded tiff:

{noformat}
Data Precision:8 bits,Image Height:169 pixels,Image Width:752 pixels,Number of Components:3,Resolution Units:inch,X Resolution:300 dots,Y Resolution:300 dots,resourceName:image1.jpg,tiff:BitsPerSample:8,tiff:ImageLength:169,tiff:ImageWidth:752,tika.mime.file:image1.jpg
{noformat}

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1445:

Attachment: TIKA-1445_tallison_v3_20141027.patch

This version subclasses Parser to create an ImageMetaParser class, which our current image metadata parsers then extend. This adds a DefaultImageMetadataparser that is a copy and paste of DefaultParser...can't override static loader unfortunately! We now specify regular parsers in the Parser services file and ImageMetadataParsers in a separate services file.

I don't like that this creates a new class of parsers, but I can't think of another way of guaranteeing that the OCRParser will find an image metadata parser correctly.

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1445:

Attachment: TIKA-1445_tallison_20141027.patch.txt

Something along these lines?

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich updated TIKA-1445:

Attachment: TIKA-1445.Palsulich.102614.patch

Here is an updated patch with the above idea. I created a new public method in CompositeParser and DefaultParser, {{getAllParsersFor(ParseContext, MediaType)}}, which returns a list of all Parsers that support the given type. TesseractOCRParser then searches this list for a second Parser for the image being parsed. I created a dummy BodyContentHandler to drop all content from the second Parser.

Thoughts?

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
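The approach in the message above (look up every parser that supports a type, pick a second one, and discard its content) can be illustrated with a small JDK-only sketch. This is not the attached patch and does not use Tika's classes: `SimpleParser`, `MetadataParser`, `OcrParser`, and `extract` are hypothetical names, media types are plain strings, and a throwaway `StringBuilder` stands in for the dummy content handler. Only the method name `getAllParsersFor` is borrowed from the message, and its signature here is simplified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SecondParserSketch {

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    public interface SimpleParser {
        Set<String> supportedTypes();
        void parse(String mediaType, byte[] data, StringBuilder content, Map<String, String> metadata);
    }

    /** Returns every registered parser that supports the given media type. */
    public static List<SimpleParser> getAllParsersFor(List<SimpleParser> registered, String mediaType) {
        List<SimpleParser> matches = new ArrayList<>();
        for (SimpleParser parser : registered) {
            if (parser.supportedTypes().contains(mediaType)) {
                matches.add(parser);
            }
        }
        return matches;
    }

    /** Pretend image metadata parser: records the byte count and emits some content. */
    public static class MetadataParser implements SimpleParser {
        public Set<String> supportedTypes() {
            return new HashSet<>(Arrays.asList("image/tiff", "image/jpeg"));
        }
        public void parse(String mediaType, byte[] data, StringBuilder content, Map<String, String> metadata) {
            metadata.put("byteCount", String.valueOf(data.length));
            content.append("content from the metadata parser");
        }
    }

    /** Pretend OCR parser: emits text, then borrows metadata from a second parser. */
    public static class OcrParser implements SimpleParser {
        private final List<SimpleParser> registered;
        public OcrParser(List<SimpleParser> registered) {
            this.registered = registered;
        }
        public Set<String> supportedTypes() {
            return new HashSet<>(Arrays.asList("image/tiff"));
        }
        public void parse(String mediaType, byte[] data, StringBuilder content, Map<String, String> metadata) {
            content.append("ocr-text");
            for (SimpleParser other : getAllParsersFor(registered, mediaType)) {
                if (other != this) {
                    // The throwaway buffer drops the second parser's content; only metadata is kept.
                    other.parse(mediaType, data, new StringBuilder(), metadata);
                    break;
                }
            }
        }
    }

    /** Runs the first parser registered for the type and returns the collected metadata. */
    public static Map<String, String> extract(String mediaType, byte[] data, StringBuilder content) {
        List<SimpleParser> registered = new ArrayList<>();
        registered.add(new OcrParser(registered));
        registered.add(new MetadataParser());
        Map<String, String> metadata = new LinkedHashMap<>();
        List<SimpleParser> matches = getAllParsersFor(registered, mediaType);
        if (!matches.isEmpty()) {
            matches.get(0).parse(mediaType, data, content, metadata);
        }
        return metadata;
    }

    public static void main(String[] args) {
        StringBuilder content = new StringBuilder();
        Map<String, String> metadata = extract("image/tiff", new byte[] {1, 2, 3}, content);
        System.out.println(content + " " + metadata);
    }
}
```

The sketch passes a byte array rather than a stream, which sidesteps the question of whether the first parser has already consumed the input.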
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1445:

Fix Version/s: (was: 1.7) 1.8

- push to 1.8

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: TIKA-1445.Mattmann.101214.patch.txt

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
[jira] [Updated] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-1445:

Attachment: TIKA-1445.Mattmann.101214.patch.txt

- thinking about this, this may not be the right solution since each parser may consume the InputStream when generating metadata
- may need to think about this more, but here's where I left off.

Figure out how to add Image metadata extraction to Tesseract parser

Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.7
Attachments: TIKA-1445.Mattmann.101214.patch.txt

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.
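The concern raised above, that a parser may consume the InputStream before another parser can read it, can be shown with a minimal JDK-only sketch. It is an illustration under stated assumptions, not what any of the attached patches do: `consume` is a hypothetical stand-in for a parser that reads to end of stream, and `buffer` is one possible workaround (copy the bytes once, then hand each parser a fresh stream). Buffering a whole image in memory may not suit large files.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class StreamReuseSketch {

    /** Hypothetical "parser": reads the stream to the end and returns how many bytes it saw. */
    public static int consume(InputStream in) {
        try {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Copies the whole stream into memory so that it can be replayed for several parsers. */
    public static byte[] buffer(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] image = {10, 20, 30, 40};

        // Sharing one stream: the second reader finds it already exhausted.
        InputStream shared = new ByteArrayInputStream(image);
        System.out.println("first reader saw " + consume(shared) + " bytes");
        System.out.println("second reader saw " + consume(shared) + " bytes");

        // Buffering once: each reader gets its own stream over the same bytes.
        byte[] copy = buffer(new ByteArrayInputStream(image));
        System.out.println("OCR reader saw " + consume(new ByteArrayInputStream(copy)) + " bytes");
        System.out.println("metadata reader saw " + consume(new ByteArrayInputStream(copy)) + " bytes");
    }
}
```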