[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267553#comment-14267553 ]

Nick Burch commented on TIKA-1445:
----------------------------------

I wonder if it wouldn't be better to do the "is Tesseract there" check in the `getSupportedTypes` method? That way, if Tesseract can't be found, the main composite parser (e.g. AutoDetectParser, if being used) would just skip over the Tesseract one and fall back to the Jpeg or Image one as appropriate.

We could then do an additional check at parse time, in case of a direct call to the parser. I'll have a go at working that up shortly.

Oh, and the fallback parser you've come up with looks much neater than mine :)

Figure out how to add Image metadata extraction to Tesseract parser
-------------------------------------------------------------------
Key: TIKA-1445
URL: https://issues.apache.org/jira/browse/TIKA-1445
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Fix For: 1.8
Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch

Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
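The idea in the comment above (advertise no supported types when the external tool is missing, so a composite parser falls through to the next candidate) can be sketched roughly as follows. This is only an illustration under stated assumptions: `SimpleParser`, `OcrParser`, `JpegParser`, `select` and `parsers` are hypothetical names invented here, not Tika's classes, and the first-match selection rule is a simplification rather than a description of how Tika's composite parser actually resolves types.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model. None of these types are Tika's real classes.
class SupportedTypesFallbackSketch {

    /** Minimal stand-in for a parser that advertises the MIME types it handles. */
    interface SimpleParser {
        String name();
        Set<String> supportedTypes();
    }

    /** OCR parser: offers no types at all when its external tool is missing. */
    static final class OcrParser implements SimpleParser {
        private final boolean toolAvailable;

        OcrParser(boolean toolAvailable) {
            this.toolAvailable = toolAvailable;
        }

        public String name() {
            return "ocr";
        }

        public Set<String> supportedTypes() {
            if (!toolAvailable) {
                return Collections.emptySet();
            }
            return new HashSet<>(Arrays.asList("image/jpeg", "image/png"));
        }
    }

    /** Plain image parser that is always available. */
    static final class JpegParser implements SimpleParser {
        public String name() {
            return "jpeg";
        }

        public Set<String> supportedTypes() {
            return Collections.singleton("image/jpeg");
        }
    }

    /** Picks the first parser, in priority order, that claims the type. */
    static String select(List<SimpleParser> parsers, String mimeType) {
        for (SimpleParser parser : parsers) {
            if (parser.supportedTypes().contains(mimeType)) {
                return parser.name();
            }
        }
        return "none";
    }

    /** OCR parser first, plain JPEG parser as the fallback. */
    static List<SimpleParser> parsers(boolean ocrToolAvailable) {
        return Arrays.<SimpleParser>asList(new OcrParser(ocrToolAvailable), new JpegParser());
    }

    public static void main(String[] args) {
        System.out.println(select(parsers(true), "image/jpeg"));  // prints "ocr"
        System.out.println(select(parsers(false), "image/jpeg")); // prints "jpeg"
    }
}
```

A direct call to the OCR parser bypasses this selection step, which is why the comment also suggests a second availability check at parse time.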
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267584#comment-14267584 ]

Nick Burch commented on TIKA-1445:
----------------------------------

As of r1650051, I think we're correctly handling both cases: falling back to the normal parsers when Tesseract isn't installed, and calling the normal image parsers after Tesseract is done. I've got a couple of unit tests that seem to show that.

Any chance you could add a unit test based on your govdocs Word file, and check that it's working correctly for embedded images as well?
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267586#comment-14267586 ]

Hudson commented on TIKA-1445:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.7 #411 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/411/])

TIKA-1445 Unit test to check a JPEG via Tesseract gets both OCR text and normal JPEG metadata (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650050)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testOCR.jpg

TIKA-1445 Unit test to show that when an invalid tesseract config is given, and tesseract cannot be found, TesseractOCRParser will return no types and will not be selected by DefaultParser (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650046)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java

Cleaner workaround parser call from Tim Allison from TIKA-1445 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650045)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java

TIKA-1445 If Tesseract isn't available, don't offer any supported mime types, so the parser avoids being picked by DefaultParser or similar (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650044)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267643#comment-14267643 ]

Nick Burch commented on TIKA-1445:
----------------------------------

Ah, true, I hadn't thought so much about the system call each time. I guess the only thing we need to cache is a yes/no answer per Tesseract path, since you could pass in different config objects with different paths. Maybe we do a quick bit of caching based on that, and use it to avoid the extra calls?

Oh, and I do have Tesseract installed now. I installed it to help :)
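The caching idea floated above, one yes/no answer per configured tool path so the expensive system call runs at most once per path, might look something like the sketch below. It is an assumption-laden illustration rather than the code that went into Tika: `ToolPresenceCacheSketch` and `isAvailable` are hypothetical names, and the probe is passed in as a `Predicate` so the example stays self-contained instead of launching a real process.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

// Hypothetical sketch: remembers, per tool path, whether the tool was found.
class ToolPresenceCacheSketch {

    // Path -> "was the tool found there?". ConcurrentHashMap rejects null keys,
    // so callers should pass "" (not null) to mean "use the default search path".
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();

    // The expensive check, e.g. something that spawns an external process.
    private final Predicate<String> probe;

    ToolPresenceCacheSketch(Predicate<String> probe) {
        this.probe = probe;
    }

    /** Runs the probe only the first time a given path is seen. */
    boolean isAvailable(String toolPath) {
        return cache.computeIfAbsent(toolPath, probe::test);
    }

    public static void main(String[] args) {
        AtomicInteger probeCalls = new AtomicInteger();
        ToolPresenceCacheSketch cache = new ToolPresenceCacheSketch(path -> {
            probeCalls.incrementAndGet();
            return path.startsWith("/usr"); // pretend only /usr paths have the tool
        });

        System.out.println(cache.isAvailable("/usr/bin/"));  // probe runs
        System.out.println(cache.isAvailable("/usr/bin/"));  // served from the cache
        System.out.println(cache.isAvailable("/made/up/"));  // different path, probe runs again
        System.out.println("probe calls: " + probeCalls.get());
    }
}
```

Keying on the path rather than keeping a single flag matters because, as the comment notes, different config objects can point at different installations. A cache like this never notices a tool being installed or removed while the JVM is running, which may or may not be acceptable.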
[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment
[ https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268605#comment-14268605 ]

Lewis John McGibbney commented on TIKA-894:
-------------------------------------------

I have a half-baked patch locally for webapp and WAR support similar to what we have over on Any23. I'll try my best to hammer this soon folks. Sorry about the ridiculous wait. God

Add webapp mode for Tika Server, simplifies deployment
------------------------------------------------------
Key: TIKA-894
URL: https://issues.apache.org/jira/browse/TIKA-894
Project: Tika
Issue Type: Improvement
Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Chris Wilson
Labels: maven, newbie, patch
Fix For: 1.8
Attachments: tika-server-webapp.patch

For use in production services, Tika Server should really be deployed as a WAR file, under a reliable servlet container that knows how to run as a system service, for example Tomcat or JBoss. This is especially important on Windows, where I wasted an entire day trying to make TikaServerCli run as some kind of a service.

Maven makes building a webapp pretty trivial. With the attached patch applied, mvn war:war should work. It seems to run fine in Tomcat, which makes Windows deployment much simpler. Just install Tomcat and drop the WAR file into Tomcat's webapps directory and you're away.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [VOTE] Apache Tika 1.7 Release
-1 on this for me too, as there is a small unit test failure from ODFParser on Windows from TIKA-1412. I have added the tweak to fix this on trunk.

(I have also tested the latest changes added by Tim and Tyler in TIKA-1445 on Windows, Mac and Ubuntu with a decent batch of files, and everything is working nicely at this end.)

On 7 January 2015 at 01:11, Allison, Timothy B. talli...@mitre.org wrote:

-1

I'm sorry that I haven't had a chance to kick the tires on the recent changes to the metadata extraction from images until now, but it looks like 1.7-rc2 and trunk are not pulling metadata from embedded images. I've posted a test file from govdocs1 to TIKA-1445. I may have time tomorrow to see what's going on.

I should also have time tomorrow to finish the analysis of the comparison between 1.6 and 1.7 on govdocs1.

Sorry for my delay, all! And even greater apologies if user error is at fault and metadata is successfully being extracted from embedded images. :)

Thank you, Tyler, for running this release!

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Tuesday, January 06, 2015 11:36 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.7 Release

On Tue, 6 Jan 2015, Tyler Palsulich wrote:

A candidate for the Tika 1.7 release is available at:
https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/

The SHA1 checksum of the archive is 0307a8367ae6f8b1103824fd11337fd89e24e6a4.

In addition, a staged maven repository is available here:
https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/

Looks good to me, I'm +1

Nick
[jira] [Created] (TIKA-1507) Under OSGi, ForkParser failes to send core parser classes like ExternalParser
Nick Burch created TIKA-1507:
-----------------------------

Summary: Under OSGi, ForkParser failes to send core parser classes like ExternalParser
Key: TIKA-1507
URL: https://issues.apache.org/jira/browse/TIKA-1507
Project: Tika
Issue Type: Bug
Components: packaging, parser
Affects Versions: 1.6, 1.7
Reporter: Nick Burch

Under OSGi, if you try to use ForkParser with the Tesseract OCR parser, it will fail with:

java.lang.NoClassDefFoundError: org/apache/tika/parser/external/ExternalParser
    at org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
    at org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:91)
    at org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
    at org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
    at org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:622)
    at org.apache.tika.fork.ForkServer.call(ForkServer.java:144)
    at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124)
    at org.apache.tika.fork.ForkServer.main(ForkServer.java:69)
Caused by: java.lang.ClassNotFoundException: Unable to find class org.apache.tika.parser.external.ExternalParser
    at org.apache.tika.fork.ClassLoaderProxy.findClass(ClassLoaderProxy.java:117)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
    ... 13 more

ExternalParser lives in the Tika Core jar, not the Tika Parsers one. This all works fine outside of OSGi, so it looks like something about the OSGi bundling is causing the fork parser to fail to send the parser-related classes from Tika Core over to the forked JVM.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1507) Under OSGi, ForkParser failes to send core parser classes like ExternalParser
[ https://issues.apache.org/jira/browse/TIKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267754#comment-14267754 ]

Nick Burch commented on TIKA-1507:
----------------------------------

To reproduce this, remove the try/catch for NoClassDefFoundError in TesseractOCRParser.hasTesseract.
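The workaround mentioned above, a try/catch for NoClassDefFoundError around the availability check, follows a general pattern that can be sketched as below. What the real catch block in TesseractOCRParser does is not shown in this thread, so treating the error as "tool not available" is an assumption made here, and `LinkageGuardSketch` and `safeCheck` are hypothetical names.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of guarding a check against a class that cannot be linked.
class LinkageGuardSketch {

    /**
     * Runs the check, treating a missing class as "not available" instead of
     * letting the linkage error escape to the caller. Assumption: returning
     * false is an acceptable answer when the helper class cannot be loaded.
     */
    static boolean safeCheck(BooleanSupplier check) {
        try {
            return check.getAsBoolean();
        } catch (NoClassDefFoundError e) {
            // A class needed by the check is not visible to this class loader.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(safeCheck(() -> true));
        System.out.println(safeCheck(() -> {
            // Simulates the failure seen in the stack trace above.
            throw new NoClassDefFoundError("org/example/Missing");
        }));
    }
}
```

Catching an `Error` like this hides the underlying class-loading problem rather than fixing it, which is consistent with the thread calling the change a temporary workaround.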
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267756#comment-14267756 ]

Nick Burch commented on TIKA-1445:
----------------------------------

I've no idea why the fork parser is failing when run under OSGi. It looks like it isn't sending the parser-related classes from tika-core over (e.g. ExternalParser).

I've put in a hacky workaround in r1650083, and raised a new issue for it: TIKA-1507.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267626#comment-14267626 ]

Hudson commented on TIKA-1445:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.7 #412 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/412/])

TIKA-1445 Use assertContains, and fix a problem with the ForkParser integration tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650051)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267766#comment-14267766 ]

Tim Allison commented on TIKA-1445:
-----------------------------------

Y, and why did the tests work before and how does it work without tika-core?!? I don't see how recent changes are now causing this failure, either. Argh...
[jira] [Commented] (TIKA-1495) Parser for BPG (Better Portable Graphics) format
[ https://issues.apache.org/jira/browse/TIKA-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267775#comment-14267775 ]

Hudson commented on TIKA-1495:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.7 #414 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/414/])

Disabled exif related bpg tests for TIKA-1495 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650084)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/image/BPGParserTest.java

Parser for BPG (Better Portable Graphics) format
------------------------------------------------
Key: TIKA-1495
URL: https://issues.apache.org/jira/browse/TIKA-1495
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.7
Reporter: Nick Burch

Following on from TIKA-1491, it would be good to also have a parser for BPG files as well. Likely this would pull out some very basic metadata from the header, then locate the EXIF and XMP blocks + hand those on for parsing.

There doesn't appear to be a suitable Java library yet, but based on reading the file format spec at http://bellard.org/bpg/bpg_spec.txt it doesn't look like a basic parser would be that much work!

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267773#comment-14267773 ]

Nick Burch commented on TIKA-1445:
----------------------------------

The only other parser that uses ExternalParser is gdal, and I'm guessing that that doesn't get touched by the OSGi fork test...
[jira] [Commented] (TIKA-1507) Under OSGi, ForkParser failes to send core parser classes like ExternalParser
[ https://issues.apache.org/jira/browse/TIKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267774#comment-14267774 ]

Hudson commented on TIKA-1507:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.7 #414 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/414/])

Temporary workaround for the TIKA-1507 ForkParser / OGI issue (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650083)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267786#comment-14267786 ]

Luis Filipe Nassif commented on TIKA-1445:
------------------------------------------

It is not directly related to this issue, but I think the user should at least be able to disable OCR parsing in the config object, even if Tesseract is installed. OCR is a very slow task, and the user could choose not to run it over all images.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267792#comment-14267792 ]

Nick Burch commented on TIKA-1445:
----------------------------------

[~lfcnassif] Longer term we'll have different config objects that let you pick what you want - see [this comment|https://issues.apache.org/jira/browse/TIKA-1445?focusedCommentId=14222510&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14222510] for one possible plan.

Short term, just pass an OCR config with an invalid path on it into the parser context, as one of the unit tests does.
[jira] [Commented] (TIKA-1495) Parser for BPG (Better Portable Graphics) format
[ https://issues.apache.org/jira/browse/TIKA-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267798#comment-14267798 ]

Hudson commented on TIKA-1495:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.6 #398 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/398/])

Disabled exif related bpg tests for TIKA-1495 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650084)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/image/BPGParserTest.java
[jira] [Commented] (TIKA-1507) Under OSGi, ForkParser failes to send core parser classes like ExternalParser
[ https://issues.apache.org/jira/browse/TIKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267797#comment-14267797 ]

Hudson commented on TIKA-1507:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.6 #398 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/398/])

Temporary workaround for the TIKA-1507 ForkParser / OGI issue (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650083)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser
[ https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268337#comment-14268337 ]

Hudson commented on TIKA-1412:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.6 #403 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/403/])

TIKA-1412: Fixed test issue on Windows build (dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650163)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java

NPE in OpenDocumentParser
-------------------------
Key: TIKA-1412
URL: https://issues.apache.org/jira/browse/TIKA-1412
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki
Fix For: 1.7
Attachments: TIKA-1412.diff

There's a missing else in OpenDocumentParser when it constructs a ZipInputStream from the InputStream, which results in NPE when the InputStream is an instance of TikaInputStream but has neither openContainer nor file:

{code}
...
Caused by: java.lang.NullPointerException
    at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161) ~[tika-parsers-1.6.jar:1.6]
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) ~[tika-core-1.6.jar:1.6]
{code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
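The shape of the bug described above, a wrapper stream that has neither an open container nor a backing file and so falls through every branch, can be illustrated with the following sketch. It is not OpenDocumentParser's code: `PlainStream`, `RichStream` and `chooseSource` are hypothetical stand-ins, and the strings returned only label which branch was taken.

```java
// Hypothetical model of the branch structure; not Tika's actual classes.
class MissingElseSketch {

    /** Stand-in for an ordinary input stream. */
    static class PlainStream {
    }

    /** Stand-in for a richer stream that may carry an open container or a file. */
    static class RichStream extends PlainStream {
        final Object openContainer; // may be null
        final String file;          // may be null

        RichStream(Object openContainer, String file) {
            this.openContainer = openContainer;
            this.file = file;
        }
    }

    /** Decides where the zip data should be read from. */
    static String chooseSource(PlainStream stream) {
        if (stream instanceof RichStream) {
            RichStream rich = (RichStream) stream;
            if (rich.openContainer != null) {
                return "container";
            } else if (rich.file != null) {
                return "file";
            } else {
                // Without this final else, nothing is chosen for a rich stream
                // with neither field set, and later code dereferences null.
                return "stream";
            }
        }
        return "stream";
    }

    public static void main(String[] args) {
        System.out.println(chooseSource(new RichStream(new Object(), null)));
        System.out.println(chooseSource(new RichStream(null, "doc.odt")));
        System.out.println(chooseSource(new RichStream(null, null)));
        System.out.println(chooseSource(new PlainStream()));
    }
}
```

The third call in `main` is the case from the report: a rich stream with both fields null, which only gets a usable source because of the final `else`.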
[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser
[ https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268303#comment-14268303 ]

Hudson commented on TIKA-1412:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #418 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/418/])

TIKA-1412: Fixed test issue on Windows build (dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650163)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267724#comment-14267724 ] Tim Allison commented on TIKA-1445: --- Not to repeat Jenkins, well, apologies for repeating Jenkins... I'm getting a failure with the ForkParser tests now in BundleIT: can't find ExternalParser class. Once trunk is back to stable, I'll add in the extra tests. Figure out how to add Image metadata extraction to Tesseract parser --- Key: TIKA-1445 URL: https://issues.apache.org/jira/browse/TIKA-1445 Project: Tika Issue Type: Bug Components: parser Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Fix For: 1.8 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch Now that Tesseract is the default image parser in Tika for many image types, consider how to add back in the metadata extraction capabilities by the other Image parsers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267854#comment-14267854 ] Tim Allison commented on TIKA-1445: --- [~gagravarr], see if you have success with r1650117. I don't have Tesseract installed, so it'll be good to see if the tests pass with it installed.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267871#comment-14267871 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #399 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/399/]) TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accomodate two options for parser (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111) * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267879#comment-14267879 ] Tyler Palsulich commented on TIKA-1445: --- All tests pass with and without Tesseract installed on my computer (Java 1.7, Ubuntu 14.04, Tesseract 3.03).
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267840#comment-14267840 ] Tim Allison commented on TIKA-1445: --- Fixed the tika-server test failure with r1650111. Going to add mods to TesseractOCRParserTest.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268021#comment-14268021 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.6 #401 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/401/]) TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small import and comment changes. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268003#comment-14268003 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #416 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/416/]) TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small import and comment changes. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268006#comment-14268006 ] Tyler Palsulich commented on TIKA-1445: --- Done. I made some small changes and split one of the tests in two. [~talli...@apache.org], [~gagravarr], or anyone else, any more changes/features needed for this issue/1.7? It looks like we grab normal metadata regardless of whether or not Tesseract is installed.
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267618#comment-14267618 ] Tim Allison commented on TIKA-1445: --- Yes, that's a great idea. I was disturbed by the current plan of making a system call for every image file if Tesseract is not installed; I was thinking of a static check, but your solution is far cleaner. The patch I submitted last night caused the integrated ForkParser tests to fail: class loading issues. So, I now have a slightly more manual hack class that borrows from CompositeParser. Instead of the govdocs1 doc, I'll add tests based on our current test docs in the next 8 hours or so. [~tpalsulich], after I add those tests, would you mind testing with Tesseract installed? I don't have it installed, and IIRC, I don't think Nick does either...
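As a rough illustration of the approach discussed in this thread (checking for the external tool when a parser reports its supported types, instead of making a system call for every image), here is a minimal sketch. It is not the committed TesseractOCRParser code: `OptionalToolParserSketch` and `ToolProbe` are hypothetical names, and caching the probe result after the first call is an assumption made for the example.

```java
import java.util.Collections;
import java.util.Set;

public class OptionalToolParserSketch {

    /** Hypothetical probe, e.g. one that tries to launch the external binary. */
    public interface ToolProbe {
        boolean isAvailable();
    }

    private final ToolProbe probe;
    private final Set<String> typesWhenAvailable;
    private Boolean available; // null until the probe has run once

    public OptionalToolParserSketch(ToolProbe probe, Set<String> typesWhenAvailable) {
        this.probe = probe;
        this.typesWhenAvailable = typesWhenAvailable;
    }

    /** Offers no types when the tool is missing, so a composite parser would skip this one. */
    public synchronized Set<String> getSupportedTypes() {
        if (available == null) {
            available = probe.isAvailable(); // the expensive check runs only once
        }
        return available ? typesWhenAvailable : Collections.<String>emptySet();
    }
}
```

With an empty set of supported types, a dispatching parser has nothing to route to this parser and can fall back to the ordinary image parsers. A direct call to the parser would still need its own check at parse time.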
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267892#comment-14267892 ] Tim Allison commented on TIKA-1445: --- Thank you! Do you mind doing a quick code review of TesseractOCRParser? I made a number of mods...
[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267892#comment-14267892 ] Tim Allison edited comment on TIKA-1445 at 1/7/15 5:21 PM: --- Thank you! Do you mind doing a quick code review of TesseractOCRParserTest? I made a number of mods... was (Author: talli...@mitre.org): Thank you! Do you mind doing a quick code review of TesseractOCRParser? I made a number of mods...
[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser
[ https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267934#comment-14267934 ] Hudson commented on TIKA-1445: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #415 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/415/]) TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117) * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accomodate two options for parser (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111) * /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java