[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272570#comment-14272570 ]

Chris A. Mattmann commented on TIKA-1445:

yeesh, caught up on all this great work. Awesome job guys.

Figure out how to add Image metadata extraction to Tesseract parser
---
Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
Fix For: 1.7
Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271201#comment-14271201 ]

Tim Allison commented on TIKA-1445:

No major problems found via quick and dirty govdocs1 eval. Let's roll!

Better:
* Fewer pdf exceptions, better pdf text extraction (thank you, [~tilman]!)
* fixed exceptions: 2426 xls, 895 ppt, 158 pdf, 17 pps and 5 doc. Note: fixed exceptions for xls are driven entirely by [~gagravarr]'s addition of parsing for xls .4. Thank you, Nick!!!
* More attachments for 27 pdf and 1 doc
* More metadata values for all comparable file pairs (no exceptions, = number of attachments)

Areas for investigation:
* new exceptions: 27 xls
* 173 exceptions for newly added parsing of vnd.ms.excel.sheet.3
* Fewer attachments for 19 ppt, 6 doc and 1 rtf
* Permanent hangs/oom. These numbers differ by run because of multi-threading, but we went from 4 to 3.

I'll follow up with investigation of these issues and open appropriate tickets next week.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271266#comment-14271266 ]

Tim Allison commented on TIKA-1445:

Might have been neater, but you figured out how to get it to actually work with MimeTypesRegistry etc in integrated ForkParser tests! :) I really like the caching strategy to prevent the use of the parser if Tesseract isn't installed. Thank you!
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271599#comment-14271599 ]

Nick Burch commented on TIKA-1445:

Please open a ticket for the Excel 3 issue, and if you can, share a small file that shows it. The Excel 3 support was written from reading the OpenOffice-provided spec document, plus a bit of guessing, in the absence of any test files...
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269765#comment-14269765 ]

Nick Burch commented on TIKA-1445:

If we're going to close this for 1.7, then we need to split out the composite parser, with its strategy for which available parsers / parser combinations to use, as a new task for 1.8.

Then we need to come up with some better names for the strategies :)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269768#comment-14269768 ]

Tim Allison commented on TIKA-1445:

Completely agree! Opening new issues now.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269800#comment-14269800 ]

Tyler Palsulich commented on TIKA-1445:

Thanks guys! [~tallison], let me know once you finish running against govdocs and I'll roll a new RC.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269454#comment-14269454 ]

Tim Allison commented on TIKA-1445:

I'll have time to rerun trunk against govdocs1 and compare with 1.6 by tomorrow (January 9) 10am EST. If the community is willing to wait a day, let's hold off. Another day might also allow others to identify small issues (similar to [~davemeikle]'s recent find).

Figure out how to add Image metadata extraction to Tesseract parser
---
Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267553#comment-14267553 ]

Nick Burch commented on TIKA-1445:

I wonder if it wouldn't be better to do the "is tesseract there?" check in the `getSupportedTypes` method? That way, if tesseract can't be found, then the main composite parser (eg AutoDetectParser, if being used) would just skip over the Tesseract one, and fall back to the Jpeg or Image one as appropriate.

We could then do an additional check at parse time, in case of a direct call to the parser.

I'll have a go at working that up shortly.

Oh, and the fallback parser you've come up with looks much neater than mine :)
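The idea in the comment above (offer no supported types when the external tool is missing, so a composite parser falls through to another parser, and check again at parse time for direct callers) might look roughly like the sketch below. This is only an illustration under stated assumptions: `SketchParser`, `OcrSketchParser`, `PlainImageSketchParser` and `CompositeSketch` are hypothetical stand-ins, not Tika's real `Parser`, `TesseractOCRParser` or `CompositeParser`, and the first-match selection rule is a simplification rather than Tika's actual lookup.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical stand-in for a parser; NOT Tika's real Parser interface.
interface SketchParser {
    Set<String> getSupportedTypes();
    String parse(String mimeType);
}

// OCR-style parser that offers no types when its external tool is missing.
final class OcrSketchParser implements SketchParser {
    private final String toolPath;
    // Assumed probe: answers "is the tool usable at this path?"
    private final Predicate<String> toolAvailable;

    OcrSketchParser(String toolPath, Predicate<String> toolAvailable) {
        this.toolPath = toolPath;
        this.toolAvailable = toolAvailable;
    }

    @Override
    public Set<String> getSupportedTypes() {
        // No tool, no types: a composite parser will never select this parser.
        if (!toolAvailable.test(toolPath)) {
            return Collections.emptySet();
        }
        return new LinkedHashSet<>(Arrays.asList("image/jpeg", "image/png"));
    }

    @Override
    public String parse(String mimeType) {
        // Second check at parse time, in case someone calls this parser directly.
        if (!toolAvailable.test(toolPath)) {
            return "";
        }
        return "ocr:" + mimeType;
    }
}

// Plain image parser that only extracts metadata.
final class PlainImageSketchParser implements SketchParser {
    @Override
    public Set<String> getSupportedTypes() {
        return Collections.singleton("image/jpeg");
    }

    @Override
    public String parse(String mimeType) {
        return "metadata:" + mimeType;
    }
}

// Simplified composite: the first parser that claims the type wins.
final class CompositeSketch {
    static String parse(List<SketchParser> parsers, String mimeType) {
        for (SketchParser p : parsers) {
            if (p.getSupportedTypes().contains(mimeType)) {
                return p.parse(mimeType);
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // The probe always says "missing", so the OCR parser is skipped
        // and the plain image parser handles the JPEG.
        List<SketchParser> parsers = Arrays.asList(
                new OcrSketchParser("/no/such/dir", path -> false),
                new PlainImageSketchParser());
        System.out.println(parse(parsers, "image/jpeg"));
    }
}
```

Passing the probe in as a `Predicate` keeps the sketch testable without a real tesseract binary; a real implementation would presumably run the external command instead.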
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267584#comment-14267584 ]

Nick Burch commented on TIKA-1445:

As of r1650051, I think we're correctly handling both cases: falling back to the normal parsers when tesseract isn't installed, and calling the normal image parsers after tesseract is done. I've got a couple of unit tests that seem to show that.

Any chance you could add a unit test based on your govdocs word file, and check that it's working correctly for embedded images as well?
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267586#comment-14267586 ]

Hudson commented on TIKA-1445:

UNSTABLE: Integrated in tika-trunk-jdk1.7 #411 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/411/])

TIKA-1445 Unit test to check a JPEG via Tesseract gets both OCR text and normal JPEG metadata (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650050)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testOCR.jpg

TIKA-1445 Unit test to show that when an invalid tesseract config is given, and tesseract cannot be found, TesseractOCRParser will return no types and will not be selected by DefaultParser (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650046)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java

Cleaner workaround parser call from Tim Allison from TIKA-1445 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650045)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java

TIKA-1445 If Tesseract isn't available, don't offer any supported mime types, so the parser avoids being picked by DefaultParser or similar (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650044)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267643#comment-14267643 ]

Nick Burch commented on TIKA-1445:

Ah, true, I hadn't thought so much about the system call each time. I guess the only thing we need to cache is a mapping from tesseract path to yes/no, since you could pass in different config objects with different paths. Maybe we do a quick bit of caching based on that, and use that to avoid the extra calls?

Oh, and I do have tesseract installed now, I installed it to help :)
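A minimal sketch of the caching idea in the comment above, assuming the cache key is simply the configured tesseract path and the cached value is a yes/no answer. `ToolPresenceCache` and its `probe` argument are hypothetical names chosen for illustration; this is not the code that went into TesseractOCRParser.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical helper: remembers, per configured path, whether the tool was found.
final class ToolPresenceCache {
    private final Map<String, Boolean> presentByPath = new ConcurrentHashMap<>();
    // Assumed expensive check (e.g. a system call); injected so it can be swapped out.
    private final Predicate<String> probe;

    ToolPresenceCache(Predicate<String> probe) {
        this.probe = probe;
    }

    boolean isPresent(String toolPath) {
        // The probe runs only the first time a given path is seen.
        return presentByPath.computeIfAbsent(toolPath, probe::test);
    }

    public static void main(String[] args) {
        AtomicInteger probeCalls = new AtomicInteger();
        ToolPresenceCache cache = new ToolPresenceCache(path -> {
            probeCalls.incrementAndGet();
            return path.startsWith("/opt"); // made-up rule standing in for a real check
        });
        cache.isPresent("/opt/tesseract");
        cache.isPresent("/opt/tesseract"); // answered from the cache
        cache.isPresent("/no/such/dir");
        System.out.println("probe calls: " + probeCalls.get());
    }
}
```

One trade-off of this design is that a cached "no" stays "no" for the lifetime of the cache, so a tool installed later at the same path would not be noticed without clearing it.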
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267756#comment-14267756 ]

Nick Burch commented on TIKA-1445:

I've no idea why the fork parser is failing when run under OSGi. It looks like it isn't sending the parser-related classes from tika-core over (eg ExternalParser).

I've put in a hacky workaround in r1650083, and raised a new issue for it: TIKA-1507
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267626#comment-14267626 ]

Hudson commented on TIKA-1445:

UNSTABLE: Integrated in tika-trunk-jdk1.7 #412 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/412/])

TIKA-1445 Use assertContains, and fix a problem with the ForkParser integration tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650051)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267766#comment-14267766 ]

Tim Allison commented on TIKA-1445:

Y, and why did the tests work before and how does it work without tika-core?!? I don't see how recent changes are now causing this failure, either. Argh...
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267773#comment-14267773 ]

Nick Burch commented on TIKA-1445:

The only other parser that uses ExternalParser is gdal, and I'm guessing that that doesn't get touched by the OSGi fork test...
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267786#comment-14267786 ]

Luis Filipe Nassif commented on TIKA-1445:

It is not directly related to this issue, but I think the user should at least be able to disable OCR parsing in the config object, even if tesseract is installed. It is a very slow task, and the user could choose not to run it over all images.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267792#comment-14267792 ]

Nick Burch commented on TIKA-1445:

[~lfcnassif] Longer term we'll have different config objects that let you pick what you want. See [this comment|https://issues.apache.org/jira/browse/TIKA-1445?focusedCommentId=14222510&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14222510] for one possible plan.

Short term, just pass an OCR config with an invalid path on it into the parse context, as one of the unit tests does.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267724#comment-14267724 ]

Tim Allison commented on TIKA-1445:

Not to repeat Jenkins, well, apologies for repeating Jenkins...I'm getting a failure with the ForkParser tests now in BundleIT: can't find ExternalParser class. Once trunk is back to stable, I'll add in the extra tests.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267854#comment-14267854 ]

Tim Allison commented on TIKA-1445:

[~gagravarr], see if you have success with r1650117. I don't have Tesseract installed, so it'll be good to see if the tests pass with it installed.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267871#comment-14267871 ]

Hudson commented on TIKA-1445:

SUCCESS: Integrated in tika-trunk-jdk1.6 #399 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/399/])

TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java

TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accomodate two options for parser (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111)
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267879#comment-14267879 ]

Tyler Palsulich commented on TIKA-1445:

All tests pass with and without Tesseract installed on my computer (Java 1.7, Ubuntu 14.04, Tesseract 3.03).
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267840#comment-14267840 ]

Tim Allison commented on TIKA-1445:

Fixed the tika-server test failure with r1650111. Going to add mods to TesseractOCRParserTest next.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268021#comment-14268021 ]

Hudson commented on TIKA-1445:

SUCCESS: Integrated in tika-trunk-jdk1.6 #401 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/401/])

TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small import and comment changes. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268003#comment-14268003 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #416 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/416/]) TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small import and comment changes. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268006#comment-14268006 ] Tyler Palsulich commented on TIKA-1445: --- Done. I made some small changes and split one of the tests in two. [~talli...@apache.org], [~gagravarr], or anyone else, any more changes/features needed for this issue/1.7? It looks like we grab normal metadata regardless of whether or not Tesseract is installed. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267618#comment-14267618 ] Tim Allison commented on TIKA-1445: --- Yes, that's a great idea. I was disturbed by the current plan of making a system call for every image file if Tesseract is not installed; I was thinking of a static check, but your solution is far cleaner. The patch I submitted last night caused the integrated ForkParser tests to fail: class loading issues. So, I now have a slightly more manual hack class that borrows from CompositeParser. Instead of the govdocs1 doc, I'll add tests based on our current test docs in the next 8 hours or so. [~tpalsulich], after I add those tests, would you mind testing with Tesseract installed? I don't have it installed, and IIRC, I don't think Nick does either... Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267892#comment-14267892 ] Tim Allison commented on TIKA-1445: --- Thank you! Do you mind doing a quick code review of TesseractOCRParser? I made a number of mods... Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267934#comment-14267934 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #415 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/415/]) TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accommodate two options for parser (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111) * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267161#comment-14267161 ] Tim Allison commented on TIKA-1445: --- Looking into this a bit more: we aren't even getting metadata out of regular images. For example, our testJPEG.jpg from tika-parser's test-documents yields no useful metadata with trunk; it looks like this isn't even being touched by the TesseractOCRParser:
{noformat}
Content-Length: 7686
Content-Type: image/jpeg
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: testJPEG.jpg
{noformat}
Again, my apologies if I need to make modifications to our config...
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252952#comment-14252952 ] Nick Burch commented on TIKA-1445: -- For 1.7, how about we just have the Tesseract Parser call out to the normal image parser (as appropriate), so that you always get both ocr and metadata? (Hopefully very quick to do) Then for 1.8, we can implement the config as described above, without that blocking the 1.7 release Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252956#comment-14252956 ] Tyler Palsulich commented on TIKA-1445: --- +1, Nick. That sounds good to me. I'll implement it in the next couple days, if no one else does first. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14252973#comment-14252973 ] Nick Burch commented on TIKA-1445: -- In r1646624 I've added what I think should do the trick for now. I don't have Tesseract installed to check though, could someone who does verify + update unit tests? Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252985#comment-14252985 ] Hudson commented on TIKA-1445: -- UNSTABLE: Integrated in tika-trunk-jdk1.7 #371 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/371/]) Temporary workaround for TIKA-1445 for Tika 1.7 - always pass the image to the regular parser to get the metadata set. Will be replaced in 1.8 with composite parsers + user selected config with strategy (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1646624) * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253057#comment-14253057 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #372 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/372/]) TIKA-1445 - Allow you to exclude certain mimetypes from a parser that would otherwise handle them, in your Tika Config xml (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1646626) * /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java * /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java * /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java * /tika/trunk/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java * /tika/trunk/tika-core/src/test/resources/org/apache/tika/config/TIKA-1445-default-except.xml Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253076#comment-14253076 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #356 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/356/]) TIKA-1445 - Allow you to exclude certain mimetypes from a parser that would otherwise handle them, in your Tika Config xml (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1646626) * /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java * /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java * /tika/trunk/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java * /tika/trunk/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java * /tika/trunk/tika-core/src/test/resources/org/apache/tika/config/TIKA-1445-default-except.xml Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222510#comment-14222510 ] Nick Burch commented on TIKA-1445: -- I quite like Tim's idea. We can have things like {{TikaConfig.getDefaultConfig()}}, {{TikaConfig.getMaxiumMetadataConfig()}}, {{TikaConfig.getTryEachInTurnConfig()}} etc. People with specific needs can either pass those in as options to a TikaConfig constructor, or they can provide a tika config xml file that lists their preferences, perhaps with an expanded syntax like
{code}
<parser class="composite">
  <childparser>org.apache.tika.parser.jpeg.JPegParser</childparser>
  <childparser>...</childparser>
  <childparser>...</childparser>
  <childparser>org.apache.tika.parser.ocr.TesseractOCR</childparser>
</parser>
<parser class="tryinturn">
  <childparser>org.apache.tika.text</childparser>
  <childparser>org.apache.tika.text.findtextstrings</childparser>
</parser>
<parser class="defaultparser">
  <exclude>org.apache.tika.netcdf</exclude>
</parser>
{code}
The above, slightly pseudocode, example would merge the output of all the image parsers in turn; for plain text it would try the normal parser and then fall back to the talked-about strings-like extraction if that failed; and it would use the default parser for everything else, excluding the netcdf parser.
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222512#comment-14222512 ] Chris A. Mattmann commented on TIKA-1445: - Yep I like the idea too. Time to figure out how to implement and get some cycles to do so :) Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217685#comment-14217685 ] Dave Meikle commented on TIKA-1445: ---
bq. Hey Guys, to be honest, the way I see that we solve the ServiceLoading problem is somehow to move away from it. Relying on the JVM to implicitly decide which parser to load based on ClassLoading is not scalable IMO. At worst, even capturing an ordered preference file that isn't ServiceLoading is 1000x better IMO than relying on the JVM and the classpath. We need somehow to bring this logic into Tika (still thinking about how and will try to prototype something).
+1 - I think this is an example of something we will probably hit more and more as we further extend Tika, i.e. wanting multiple parsers to have an interest in and then parse content of the same mime type, and moving away from using the re-ordering approach seems like the only way to go here. _ServiceLoading_ per se is not a problem, indeed this is a nice way to make it simple for external providers to be added, but I think we need to think about Parsers in a pipeline and allow users to customise the parsers that participate in the pipeline through positive exclusions via config. The above is a big change and I think if we went with something like this, it would need to be a 2.X of Tika.
I suspect the problem with clashing Metadata entries is not really there, as most parsers look for different keys, or in cases where they process common ones (e.g. title, size, description, etc) they should hopefully be getting the same value anyway. IMO I think we could send the same Metadata object through the 'pipeline', adding any unique new value in for a key. Will join the party and try to flesh out thoughts on a branch.
bq. 3) It is a good idea to identify which parser produced each content with a div tag.
+1 - this will be really helpful.
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
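The "same Metadata object through the pipeline" idea in the comment above can be sketched roughly as follows. This is an illustration under stated assumptions, not Tika code: a plain `Map<String, List<String>>` stands in for Tika's `Metadata` class, and `MetadataMergeSketch`, `mergeUnique` and the keys used are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map stands in for Tika's Metadata class.
final class MetadataMergeSketch {

    // Copies values from one parser's output into the shared map,
    // skipping any value already recorded under the same key.
    static void mergeUnique(Map<String, List<String>> shared,
                            Map<String, List<String>> fromParser) {
        for (Map.Entry<String, List<String>> entry : fromParser.entrySet()) {
            List<String> existing =
                    shared.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
            for (String value : entry.getValue()) {
                if (!existing.contains(value)) {
                    existing.add(value);
                }
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> shared = new LinkedHashMap<>();

        Map<String, List<String>> imageParser = new LinkedHashMap<>();
        imageParser.put("title", Arrays.asList("Scan 1"));
        imageParser.put("width", Arrays.asList("100"));

        Map<String, List<String>> ocrParser = new LinkedHashMap<>();
        ocrParser.put("title", Arrays.asList("Scan 1")); // same value, not duplicated
        ocrParser.put("language", Arrays.asList("eng"));

        mergeUnique(shared, imageParser);
        mergeUnique(shared, ocrParser);
        System.out.println(shared);
    }
}
```

When parsers agree on a value, the key keeps a single entry. When they disagree, this approach records both values, so a property intended to be single-valued could end up with more than one.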
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217965#comment-14217965 ] Tim Allison commented on TIKA-1445: --- How about using the order of parsers as specified in TikaConfig? That should accommodate 6 class files in different jars, no? Via TikaConfig, we could also specify which subclass of a default composite parser to use. I now see at least three use cases: 1) Tika classic: pick the first parser that applies and hope that it is the one you meant, ignore the others. :) 2) The use case we've been discussing, where each parser is additive. 3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this) Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
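Use case 3 in the comment above (back off to the next parser when one throws) might look roughly like this. It is only a sketch: `BackOffSketch` and `firstSuccessful` are hypothetical names, and plain `Function` objects stand in for Tika parsers so that the example runs without Tika on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a "back off on exception" strategy.
final class BackOffSketch {

    // Tries each candidate in order and returns the first result
    // produced without an exception.
    static <I, O> O firstSuccessful(List<Function<I, O>> candidates, I input) {
        RuntimeException last = null;
        for (Function<I, O> candidate : candidates) {
            try {
                return candidate.apply(input);
            } catch (RuntimeException e) {
                last = e; // remember the failure and move on to the next candidate
            }
        }
        throw new IllegalStateException("no candidate succeeded", last);
    }

    public static void main(String[] args) {
        List<Function<String, String>> candidates = new ArrayList<>();
        candidates.add(s -> { throw new IllegalArgumentException("primary failed"); });
        candidates.add(s -> "fallback text for " + s);
        System.out.println(firstSuccessful(candidates, "testJPEG.jpg"));
    }
}
```

Use case 1 corresponds to calling only the first candidate, and use case 2 to calling every candidate and merging the results.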
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216351#comment-14216351 ] Tim Allison commented on TIKA-1445: --- [~gagravarr], thank you for explaining the original design decision. I knew there must be a good reason. My idea was to create one list of non-o.a.t parsers and one list of o.a.t parsers and then prioritize the non-o.a.t. in a joint list, but within each list, the parsers would be in the order they were when loaded. Is it common for people to have more than the out-of-the-box o.a.t.p.Parser services file and then maybe one user-defined one? Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
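The two-list idea described above can be sketched as a stable partition over parser class names. This is a hedged illustration: `ParserOrderSketch` and `prioritize` are hypothetical, and the input list simply represents whatever order the parsers happened to be loaded in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika.
final class ParserOrderSketch {
    static final String TIKA_PREFIX = "org.apache.tika.";

    // Puts non-o.a.t parsers ahead of o.a.t parsers while keeping
    // the original load order inside each group.
    static List<String> prioritize(List<String> loadedOrder) {
        List<String> external = new ArrayList<>();
        List<String> builtIn = new ArrayList<>();
        for (String className : loadedOrder) {
            if (className.startsWith(TIKA_PREFIX)) {
                builtIn.add(className);
            } else {
                external.add(className);
            }
        }
        List<String> joint = new ArrayList<>(external);
        joint.addAll(builtIn);
        return joint;
    }

    public static void main(String[] args) {
        System.out.println(prioritize(Arrays.asList(
                "org.apache.tika.parser.html.HtmlParser",
                "com.example.tika.ocr.CustomOcrParser",
                "org.apache.tika.parser.image.ImageParser")));
    }
}
```

The order within each group still depends on the order in which the services files were loaded, so this sketch alone does not make the result deterministic across classpaths.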
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216365#comment-14216365 ] Tim Allison commented on TIKA-1445: --- Copied from dev discussion to record points on this issue. Will not duplicate in future. Sorry! On issue 1: The proposal is that we'd send in a fresh Metadata object to each parser and then combine that information into a new Metadata object either via add or set. If we go this route, we'll lose the restrictions that Properties may have originally held (e.g. one value as in TikaCoreProperties.TITLE). On Issue 2: I think we're talking about different things. Yes, we'll definitely need to reset or spool the stream depending on its length. My concern was more with the handlers. If the first parser calls endDocument() and we don't shield that, then if someone uses the BodyContentHandler, then they might not see contents from the second/third parser because the initial parser ended the document. I need to test this concern, but I think that this was the root of TIKA-1124. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
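The handler concern on Issue 2 above can be illustrated with a small SAX decorator that forwards content events but swallows the document boundaries, so that an early `endDocument()` from one parser never reaches the shared handler. This is a sketch using only the JDK's `org.xml.sax` classes; `DocumentBoundaryShield` is a hypothetical name, and the caller is assumed to call `startDocument()` and `endDocument()` on the real handler itself, once, around all parsers.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical decorator: forwards element and character events, hides document boundaries.
// Events not overridden here (prefix mappings, processing instructions) are dropped.
final class DocumentBoundaryShield extends DefaultHandler {
    private final ContentHandler downstream;

    DocumentBoundaryShield(ContentHandler downstream) {
        this.downstream = downstream;
    }

    @Override
    public void startDocument() {
        // Swallowed: the caller opens the shared document exactly once.
    }

    @Override
    public void endDocument() {
        // Swallowed: a later parser must still be able to write content.
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        downstream.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        downstream.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        downstream.characters(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        downstream.ignorableWhitespace(ch, start, length);
    }
}
```

Each parser in the chain would receive its own shield around the same downstream handler, so content from the second and third parsers still arrives after the first parser has finished.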
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216444#comment-14216444 ] Nick Burch commented on TIKA-1445: -- I think it's fairly common for people to have 4-5 parser services files, and whatever we do needs to accept that as a normal use case. Pretty much anyone depending on tika-parsers is going to have at least 2. Think of the case of
{code:title=tika-parsers.jar:META-INF/services/org.apache.tika.parser.Parser}
org.apache.tika.parser.gdal.GDALParser
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.image.ImageParser
{code}
{code:title=my-tika-extension.jar:META-INF/services/org.apache.tika.parser.Parser}
com.example.tika.ocr.customocrparser
org.apache.tika.parser.image.ImageParser
{code}
Under your plan, given that the JVM could return the two service files to you in any order, how do you decide which of GDALParser or ImageParser goes second after the OCR one? In one parser file, Image comes first, in the other it's second. Which wins? How do we make it deterministic, and not just based on which jar the JVM spots first?
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216451#comment-14216451 ] Chris A. Mattmann commented on TIKA-1445: - Hey Guys, to be honest, the way I see that we solve the ServiceLoading problem is somehow to move away from it. Relying on the JVM to implicitly decide which parser to load based on ClassLoading is not scalable IMO. At worst, even capturing an ordered preference file that isn't ServiceLoading is 1000x better IMO than relying on the JVM and the classpath. We need somehow to bring this logic into Tika (still thinking about how and will try to prototype something). Further, as for the use case of 4-5 service files being common - I guess I'm the outlier, b/c I've never ever created or used more than the default one? Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216466#comment-14216466 ] Nick Burch commented on TIKA-1445: -- Anyone using tika-parser OOTB has two parsers services files - built-in and vorbis. Anyone adding a third party parser under a non-ASLv2 license off the wiki will get a third. Anyone adding their own custom parsers following the instructions on the website will get a few more. My hunch is that most users won't care at all about what order the parsers are asked "hey, can you handle this file type" in. My second hunch is that users who do care will typically only care about it for a handful of formats, e.g. "for jpeg try ocr then image, everything else default is fine". We also need to support those users who currently say "I don't care what you find on the classpath, I only ever want you to use these 5 parsers and in this explicit order I'm passing you now". I can describe the problem, but I'm not sure on the right solution at this point... Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216960#comment-14216960 ] Chris A. Mattmann commented on TIKA-1445: - Hi Nick: I think we need to be careful to define users. In my case, users aren't developers (who I think you are talking about when discussing adding new parsers above). My users simply want metadata and parsing that currently are partitioned amongst multiple Parsers in Tika, for the same MIME/MediaType. I could make one super Parser that combines them together; use the services trick per class to declare priority parsers, or delegates, or whatever. I think a much more modular and thus more easily maintainable way would be to provide a mechanism in which we allow multiple Parsers to be called for the same MediaType and to fill the Metadata object and Content stream. That said, I don't have a solution yet, but I am trying to think of one. Glad to have the conversation with you guys here. It's a tough problem. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217100#comment-14217100 ] Lewis John McGibbney commented on TIKA-1445: We can run many extractors against one MediaType with Any23. In this case we produce triples as output. In the case of Tika, if we were to start with a scenario where we were *just* populating the Metadata container then I think it would be an excellent start. I'm going to investigate how we currently chain the extractors together in Any23 tonight and will make best efforts to report it here. [~p_ansell] can maybe help out here as well, as he has been influential in refactoring Any23 extractor behavior in the past. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217407#comment-14217407 ] Lewis John McGibbney commented on TIKA-1445: OK, so in Any23, take the example of a *single document extraction* (0). For any given document, when we run the extraction (1) we:
* from all registered extractors, filter the extractors by MimeType (2)
* from all matching extractors for the given MimeType, create the extractor (3)
* loop through the matching extractors and actually run (4) each extractor on the local document source, for instance as an InputStream (5).
We also have Extraction Context and Extraction Reporting layers within Any23 which may be of use to Tika. To be honest, I find the report and context objects extremely useful for obtaining metrics from extraction... maybe we could do the same for Tika? There are some improvements which can be made to SingleDocumentExtraction within Any23; however, that conversation is not relevant here. Hopefully this high-level overview of the chained extraction algorithm within Any23 is of some value to this conversation.
(0) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java (1) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L205 (2) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L223 (3) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L252 (4) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L440 (5) https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L465 Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
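The filter, create and run steps described in this comment can be sketched roughly as follows. This is not the Any23 or the Tika code: the `Extractor` interface and the `fixed` helper are hypothetical, and the document is assumed to fit in memory as a byte array so that each extractor gets a fresh stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ChainedExtractionSketch {

    /** Hypothetical extractor contract; not the Any23 or Tika interface. */
    interface Extractor {
        boolean supports(String mimeType);
        void extract(InputStream in, Map<String, List<String>> metadata);
    }

    /** Filter the registered extractors by MIME type, then run each match in turn. */
    static Map<String, List<String>> run(List<Extractor> registered, String mimeType,
                                         byte[] document) {
        List<Extractor> matching = new ArrayList<>();
        for (Extractor e : registered) {
            if (e.supports(mimeType)) {
                matching.add(e);
            }
        }
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        for (Extractor e : matching) {
            // Each extractor sees the document from the first byte.
            e.extract(new ByteArrayInputStream(document), metadata);
        }
        return metadata;
    }

    /** Stand-in extractor that contributes one fixed value for one MIME type. */
    static Extractor fixed(final String type, final String key, final String value) {
        return new Extractor() {
            public boolean supports(String mimeType) {
                return type.equals(mimeType);
            }
            public void extract(InputStream in, Map<String, List<String>> metadata) {
                metadata.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            }
        };
    }

    public static void main(String[] args) {
        List<Extractor> registered = new ArrayList<>();
        registered.add(fixed("image/jpeg", "X-Parsed-By", "exif"));
        registered.add(fixed("application/pdf", "X-Parsed-By", "pdf"));
        registered.add(fixed("image/jpeg", "X-Parsed-By", "ocr"));
        System.out.println(run(registered, "image/jpeg", new byte[0]));
    }
}
```

Values are appended per key here, which matches the "just populate the Metadata container" starting point suggested earlier in the thread.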
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214668#comment-14214668 ] Tim Allison commented on TIKA-1445: --- This might muddy results, initially, but users could choose to turn off/not load parsers that they didn't want. It would be a significant change over what we're currently doing. How will we handle:
1) Two parsers both set a value in the Metadata object? Will the second overwrite the value of the first?
2) Content: How will we know when a document ends? AutoDetectParser would wrap the handler in an EndDocumentShieldingContentHandler and then call endDocument when done?
3) Will the user be able to parse the output from the handler to figure out which parser is responsible for which content? Let's say a user wants to pull the electronic text out of a PDF _and_ render the page as an image and then run it through OCR, would we have something like <div parser="o.a.t.p.PDFParser"> or similar?
If we go this route, we'd want to make sure we don't have literally duplicate parsers (as we do now). This sounds more complicated than having parent parsers know which children they control and how to control them, but it might make sense. Aside from OCR, what other use cases do we have where we might want multiple parsers operating on the same doc type?
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
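Questions 2 and 3 in the comment above (when the document ends, and which parser produced which content) could be answered along these lines. The sketch below works on plain strings rather than SAX events, and `TaggedOutputSketch`, `combine` and `escape` are hypothetical names; it only illustrates opening the document once, wrapping each parser's output in a div carrying a parser attribute, and closing the document once after the last parser.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one document, several parsers, each parser's text in its own div. */
class TaggedOutputSketch {

    /** Minimal escaping for text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Takes parser name to extracted text, in call order, and builds one combined body. */
    static String combine(Map<String, String> textByParser) {
        StringBuilder out = new StringBuilder("<body>"); // the document starts once
        for (Map.Entry<String, String> e : textByParser.entrySet()) {
            out.append("<div parser=\"").append(escape(e.getKey())).append("\">")
               .append(escape(e.getValue()))
               .append("</div>");
        }
        return out.append("</body>").toString(); // and ends once, after the last parser
    }

    public static void main(String[] args) {
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("o.a.t.p.PDFParser", "electronic text");
        parts.put("o.a.t.p.ocr.TesseractOCRParser", "text from OCR");
        System.out.println(combine(parts));
    }
}
```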
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215170#comment-14215170 ] Luis Filipe Nassif commented on TIKA-1445: -- +1 to respect the order of parsers in the service file, instead of sorting the full class names.
1) Creating a service loading of ImageMetadataParsers, afaik, can have the same problem of different parsers trying to set the same metadata values. Metadata values are multivalued, so can we simply add the values produced by different parsers?
2) Yes, I think CompositeParser should append the content produced by the different supported parsers. If users do not want all the parsers, they should customize the parser service loading file.
3) It is a good idea to identify which parser produced each piece of content with a div tag.
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215292#comment-14215292 ] Nick Burch commented on TIKA-1445: -- +1 to respect the order of parsers in the service file, instead of sorting the full class names. The problem is that you can have multiple service files on your classpath. How do we respect the order of parsers in that case, when the order we get the service files in can be random due to the JVM's behaviour? (It was this non-determinism of service files that led us to initially add explicit sorting of parsers, so we'd have consistent behaviour between multiple runs) Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215303#comment-14215303 ] Chris A. Mattmann commented on TIKA-1445: - Hey [~talli...@apache.org]: Here are my replies (also I moved this convo to the dev list since I think it's super important!):
{noformat}
#1 We will use a default policy of “append” which allows the Metadata object to append values to the same key, rather than replace them. We could also couple this with X-Parsed-By, which is an ordered list of what Parser parsed what so that we can reconstruct what Parser contributed what field. If it’s multi-valued, we can also add fields for Offsets, etc. An alternative here would also be to prefix metadata keys in this CompositeParser by the X-Parsed-By parser name, to avoid conflicts. Users would be able to switch the policy from “append” to “overwrite” in which this isn’t a problem, and we simply allow the last parser to input into a conflicting key to be the one that takes precedence. One option with overwrite would be to allow in this policy for providing a precedence order of Parsers (e.g., the current service list could be a precedence order). That said, how sure are we that this is a *real* problem? Some parsers parse the same MediaType but contribute vastly different and non overlapping keys to the metadata object?

#2 I like your suggestion - or the alternative as I suggested would be to reset the stream to the beginning after each parser, or alternatively keep a clone of the original stream as a copy, and then clone it for each called Parser attempt?

#3 I like your idea about wrapping content provided by handlers with the parser attribute. Very neat, let’s try that!
{noformat}
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
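The "append" versus "overwrite" policy from reply #1, together with an ordered X-Parsed-By record, might look something like the sketch below. It uses a plain map of string lists in place of Tika's Metadata object; `MetadataMergeSketch`, `Policy` and `merge` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of merging one parser's output into shared, multi-valued metadata. */
class MetadataMergeSketch {
    enum Policy { APPEND, OVERWRITE }

    static void merge(Map<String, List<String>> target, String parserName,
                      Map<String, String> produced, Policy policy) {
        // Ordered record of which parsers ran.
        target.computeIfAbsent("X-Parsed-By", k -> new ArrayList<>()).add(parserName);
        for (Map.Entry<String, String> e : produced.entrySet()) {
            List<String> values = target.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            if (policy == Policy.OVERWRITE) {
                values.clear(); // the last parser to set the key wins
            }
            values.add(e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        Map<String, String> fromJpeg = new LinkedHashMap<>();
        fromJpeg.put("width", "640");
        Map<String, String> fromOcr = new LinkedHashMap<>();
        fromOcr.put("width", "641");
        merge(metadata, "JpegParser", fromJpeg, Policy.APPEND);
        merge(metadata, "TesseractOCRParser", fromOcr, Policy.APPEND);
        System.out.println(metadata);
    }
}
```

With APPEND, a conflicting key keeps both values in call order, so the position of a value lines up with the position of its parser in X-Parsed-By only when every parser sets that key; the key-prefixing alternative mentioned in the reply avoids that ambiguity.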
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213858#comment-14213858 ] Chris A. Mattmann commented on TIKA-1445: - Tim, I wonder if it's possible to clone the original InputStream provided and to simply reset it to its original state after each Parser is run so that they can simply augment rather than replace what's there. I honestly think we should run all sets of matching Parsers for a given or detected MediaType. Thoughts? Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
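One way to read the "reset it to its original state after each Parser" idea is the JDK's mark/reset mechanism. The sketch below is an assumption-laden illustration, not a proposal for the actual patch: `StreamResetSketch`, `runAll` and `countBytes` are hypothetical, parsers are modelled as plain functions over an InputStream, and the document is assumed to be no longer than `maxBytes`, because a mark is only guaranteed to survive within that read limit.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class StreamResetSketch {

    /** Runs every parser on the same stream, rewinding between runs. */
    static <T> List<T> runAll(InputStream source, int maxBytes,
                              List<Function<InputStream, T>> parsers) {
        // Wrap streams that cannot rewind by themselves.
        InputStream in = source.markSupported() ? source : new BufferedInputStream(source);
        List<T> results = new ArrayList<>();
        try {
            for (Function<InputStream, T> parser : parsers) {
                in.mark(maxBytes);              // remember the start of the document
                results.add(parser.apply(in));
                in.reset();                     // rewind for the next parser
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    /** Stand-in "parser" that just counts the bytes it can read. */
    static Integer countBytes(InputStream in) {
        try {
            int n = 0;
            while (in.read() != -1) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Function<InputStream, Integer>> parsers = new ArrayList<>();
        parsers.add(StreamResetSketch::countBytes);
        parsers.add(StreamResetSketch::countBytes);
        InputStream doc = new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println(runAll(doc, 1024, parsers));
    }
}
```

The trade-off is memory: a buffered stream has to hold everything read since the mark, so large documents would need a different approach.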
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212246#comment-14212246 ] Tim Allison commented on TIKA-1445: --- The AutoDetectParser was doing its regular lookup for which parser supported x file type. No luck in that. Now, there is unfortunately something approaching luck in how we're handling the case where multiple parsers support a given file type. Our current algorithm, if I understand it correctly, is to sort parsers in reverse alphabetical order by their package+class name (with a special case of prefer non-o.a.t parsers) and then pick the first parser that claims that it will parse the given file type. From the DefaultParser:
{noformat}
List<Parser> parsers = loader.loadStaticServiceProviders(Parser.class);
Collections.sort(parsers, new Comparator<Parser>() {
    public int compare(Parser p1, Parser p2) {
        String n1 = p1.getClass().getName();
        String n2 = p2.getClass().getName();
        boolean t1 = n1.startsWith("org.apache.tika.");
        boolean t2 = n2.startsWith("org.apache.tika.");
        if (t1 == t2) {
            return n1.compareTo(n2);
        } else if (t1) {
            return -1;
        } else {
            return 1;
        }
    }
});
{noformat}
and
{noformat}
if (loader != null) {
    // Add dynamic parser service (they always override static ones)
    MediaTypeRegistry registry = getMediaTypeRegistry();
    List<Parser> parsers = loader.loadDynamicServiceProviders(Parser.class);
    Collections.reverse(parsers); // best parser last
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            map.put(registry.normalize(type), parser);
        }
    }
}
{noformat}
The luck so far is that, for example, the org.apache.tika.parser.gdal.GDALParser parser (which supports jpeg and gif) happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers. If you run the GDALParser on /test-documents/testJPEG_EXIF.jpg, you get no metadata. 
:( Depending on what the community thinks, we may want to open a separate issue and change DefaultParser's method of selecting a parser so that it: 1) selects non-o.a.t. parsers first 2) respects the order of parsers in the services files This wouldn't change the behavior, but it would allow users to select parser preference by a means other than relying on reverse alphabetical order. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
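The two-point proposal above (non-o.a.t. parsers first, otherwise respect the order of the services file) can be expressed as a stable sort on a single boolean key. The sketch below works on class names as strings and is not the DefaultParser code; `ServiceOrderSketch`, `order` and `com.example.CustomParser` are hypothetical, and the convention assumed here is that the first entry in the result is the preferred parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ServiceOrderSketch {

    /**
     * Puts non-org.apache.tika parsers first and otherwise keeps the order in
     * which the names were listed. Relies on Collections.sort being stable.
     */
    static List<String> order(List<String> namesInServiceFileOrder) {
        List<String> ordered = new ArrayList<>(namesInServiceFileOrder);
        // false sorts before true, so names outside org.apache.tika come first.
        Collections.sort(ordered, Comparator.comparing(
                (String name) -> name.startsWith("org.apache.tika.")));
        return ordered;
    }

    public static void main(String[] args) {
        List<String> listed = Arrays.asList(
                "org.apache.tika.parser.ocr.TesseractOCRParser",
                "com.example.CustomParser",
                "org.apache.tika.parser.jpeg.JpegParser");
        System.out.println(order(listed));
    }
}
```

This does not address the case of several services files arriving in an unpredictable order; it only shows how order within one listing could be preserved.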
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212258#comment-14212258 ] Tim Allison commented on TIKA-1445: --- This is what we're currently doing in CompositeParser#getParsers(ParseContext context)
{noformat}
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with o.a.t.p.hdf.HDFParser@488a5770 for application/x-hdf
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with o.a.t.p.image.ImageParser@72729f44 for image/x-ms-bmp
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with o.a.t.p.image.ImageParser@72729f44 for image/png
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with o.a.t.p.image.ImageParser@72729f44 for image/gif
clobbering: o.a.t.p.image.ImageParser@72729f44 with o.a.t.p.image.ImageParser@72729f44 for image/x-ms-bmp
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with o.a.t.p.jpeg.JpegParser@4336640f for image/jpeg
clobbering: o.a.t.p.microsoft.TNEFParser@27e33742 with o.a.t.p.microsoft.TNEFParser@27e33742 for application/vnd.ms-tnef
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with o.a.t.p.netcdf.NetCDFParser@3640e283 for application/x-netcdf
clobbering: o.a.t.p.image.ImageParser@72729f44 with o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/x-ms-bmp
clobbering: o.a.t.p.jpeg.JpegParser@4336640f with o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/jpeg
clobbering: o.a.t.p.image.ImageParser@72729f44 with o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/png
clobbering: o.a.t.p.image.TiffParser@570bd519 with o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/tiff
clobbering: o.a.t.p.image.ImageParser@72729f44 with o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/gif
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.image-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.spreadsheet-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.chart-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.formula
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.text-web
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.text
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.formula-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.spreadsheet
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.text-master
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.text-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.graphics
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.graphics-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.presentation
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.image
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.presentation-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with o.a.t.p.odf.OpenDocumentParser@49d388f4 for application/vnd.oasis.opendocument.chart
clobbering: o.a.t.p.pkg.CompressorParser@5ec47109 with o.a.t.p.pkg.CompressorParser@5ec47109 for application/gzip
{noformat}
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types,
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211277#comment-14211277 ] Tyler Palsulich commented on TIKA-1445: --- [~talli...@apache.org], what was the system before the Tesseract Parser? Were we just getting lucky that metadata was extracted? Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185574#comment-14185574 ] Tim Allison commented on TIKA-1445: --- I played with this a bit with a png test file. The problem there is that besides the TesseractOCRParser, the GDALParser and the ImageParser both process png files. So, there's no way to guarantee that the other parser actually parses Metadata. One hack would be to hardcode checking the ImageParser or the JpegParser only to see if there is a match. A better option would be something along the lines of what we do with the service loading pattern with AutoDetectReader. The user could specify ImageMetadataParsers in a service listing, and we would try each one in turn to see if there is a match on type. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185644#comment-14185644 ] Tyler Palsulich commented on TIKA-1445: --- bq. Doh! Send in a DefaultHandler instead of BodyContentHandler to the otherParser I made the same mistake. I think our ideas are very similar. But, I offloaded the dynamic loading to {{DefaultParser.getAllParsersFor}}, since it already has service loading. But, my logic for getting the underlying DefaultParser from the AutoDetectParser is somewhat hacky. +1 to the expanded tests and always parsing with the otherParser, though! Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183873#comment-14183873 ] Tyler Palsulich commented on TIKA-1445: --- I've been trying my hand at this for some time now. An idea I had was to create a temporary file from the input InputStream, then create new input streams from that file to run each Parser on. But, before this OCR Parser, we only ran one Parser on the image, anyway. So, what if there was a way to get the second best default parser for the image? An option is to hard code the exact working Parsers. But, in my opinion, we should load them dynamically. So, that would require getting a {{List<Parser>}}, instead of just the best Parser for a given MediaType ({{CompositeParser.getParsers(ParseContext)}}). If we only chose the second best Parser, we wouldn't have to merge the Metadata results, since the OCRParser doesn't add Metadata. But, it might call the ContentHandler. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1445.Mattmann.101214.patch.txt Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
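The temporary-file idea in this comment could be sketched with plain JDK calls as below. Everything named here (`TempFileSketch`, `runAll`, `firstByte`, the `tika-sketch-` prefix) is hypothetical, and parsers are again modelled as functions over an InputStream rather than as Tika Parser instances.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class TempFileSketch {

    /** Spools the stream to a temp file once, then opens a fresh stream per parser. */
    static <T> List<T> runAll(InputStream source, List<Function<InputStream, T>> parsers) {
        try {
            Path tmp = Files.createTempFile("tika-sketch-", ".bin");
            try {
                // createTempFile already created the file, so it has to be replaced.
                Files.copy(source, tmp, StandardCopyOption.REPLACE_EXISTING);
                List<T> results = new ArrayList<>();
                for (Function<InputStream, T> parser : parsers) {
                    try (InputStream in = Files.newInputStream(tmp)) {
                        results.add(parser.apply(in));
                    }
                }
                return results;
            } finally {
                Files.deleteIfExists(tmp); // clean up even if a parser fails
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in "parser" that reports the first byte it sees. */
    static Integer firstByte(InputStream in) {
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<Function<InputStream, Integer>> parsers = new ArrayList<>();
        parsers.add(TempFileSketch::firstByte);
        parsers.add(TempFileSketch::firstByte);
        System.out.println(runAll(new ByteArrayInputStream(new byte[] {7, 8}), parsers));
    }
}
```

Compared with rewinding an in-memory stream, this costs one extra write to disk but does not depend on the document fitting in a buffer.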
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169090#comment-14169090 ] Hong-Thai Nguyen commented on TIKA-1445: Interesting question! In my view, parser selection and parser priority should be decided at runtime by configuration, not inside a parser. The image parsers are an interesting case of competing parsers (Tesseract vs. the classical image parsers). We have two problems here:
1. When many parsers can work with the same MIME type, which one is selected?
2. When we have many parsers, can we apply several of them and merge the results (metadata handler)?
* For case 1, if we use an override configuration of parsers at runtime, we can declare many parsers with a matching MIME type, and the later one in the list will be selected. We may extend the CLI/web service to inject this kind of configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser to accept a 'many' parsers mode and call the matching parsers in a chain. Merging the results is another problem. We could accept that a metadata name is overridden by another parser. The perfect solution is (again) to use a nested structure in our metadata, which would let us store each parser's result.
Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.7 Attachments: TIKA-1445.Mattmann.101214.patch.txt Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)