[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267553#comment-14267553
 ] 

Nick Burch commented on TIKA-1445:
--

I wonder if it wouldn't be better to do the "is Tesseract there?" check in the 
`getSupportedTypes` method? That way, if Tesseract can't be found, the main 
composite parser (e.g. AutoDetectParser, if being used) would just skip over 
the Tesseract one, and fall back to the Jpeg or Image one as appropriate.

We could then do an additional check at parse time, in case of a direct call to 
the parser.
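
As a rough illustration of the idea (a self-contained sketch, not Tika's real 
Parser or CompositeParser API; every class and method name below is made up 
for the example), a parser that reports no supported types when its external 
tool is missing is simply never selected, so the composite falls back to the 
plain image parser:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a parser interface; not Tika's real Parser API.
interface SketchParser {
    Set<String> supportedTypes();
    String name();
}

// Offers its types only when the external tool is reported as available.
final class OcrSketchParser implements SketchParser {
    private final boolean toolAvailable;

    OcrSketchParser(boolean toolAvailable) {
        this.toolAvailable = toolAvailable;
    }

    public Set<String> supportedTypes() {
        return toolAvailable
                ? Collections.singleton("image/jpeg")
                : Collections.<String>emptySet();
    }

    public String name() {
        return "ocr";
    }
}

// Plain image parser that is always available.
final class JpegSketchParser implements SketchParser {
    public Set<String> supportedTypes() {
        return Collections.singleton("image/jpeg");
    }

    public String name() {
        return "jpeg";
    }
}

final class CompositeSketch {
    // Assumption for this sketch: later parsers win for a given type.
    // A parser that reports no types never enters the map at all.
    static String select(List<SketchParser> parsers, String type) {
        Map<String, SketchParser> byType = new LinkedHashMap<>();
        for (SketchParser parser : parsers) {
            for (String supported : parser.supportedTypes()) {
                byType.put(supported, parser);
            }
        }
        SketchParser chosen = byType.get(type);
        return chosen == null ? null : chosen.name();
    }
}
```

A direct call to the OCR parser bypasses this selection step, which is why the 
additional check at parse time would still be needed.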

I'll have a go at working that up shortly

Oh, and the fallback parser you've come up with looks much neater than mine :)

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
 TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
 TIKA-1445_tallison_v3_20141027.patch


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267584#comment-14267584
 ] 

Nick Burch commented on TIKA-1445:
--

As of r1650051, I think we're correctly handling the case of tesseract not 
being installed by falling back to the normal parsers, and also calling the 
normal image parsers after tesseract is done. I've got a couple of unit tests 
that seem to show that.

Any chance you could add a unit test based on your govdocs word file, and check 
that it's working correctly for embedded images as well?



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267586#comment-14267586
 ] 

Hudson commented on TIKA-1445:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #411 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/411/])
TIKA-1445 Unit test to check a JPEG via Tesseract gets both OCR text and normal 
JPEG metadata (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650050)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testOCR.jpg
TIKA-1445 Unit test to show that when an invalid tesseract config is given, and 
tesseract cannot be found, TesseractOCRParser will return no types and will not 
be selected by DefaultParser (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650046)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
Cleaner workaround parser call from Tim Allison from TIKA-1445 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650045)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
TIKA-1445 If Tesseract isn't available, don't offer any supported mime types, 
so the parser avoids being picked by DefaultParser or similar (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650044)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267643#comment-14267643
 ] 

Nick Burch commented on TIKA-1445:
--

Ah, true, I hadn't thought so much about the system call each time. I guess the 
only thing we need to cache is a yes/no result per tesseract path, since you 
could pass in different config objects with different paths. Maybe we do a 
quick bit of caching based on that, and use that to avoid the extra calls?
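
Something along these lines might do for the caching, as a minimal sketch only. 
The class name and the probe callback are invented for the example; in the 
real parser the probe would be whatever system call currently checks for 
tesseract:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Predicate;

// Hypothetical helper: remembers one yes/no availability result per path.
final class ToolPathCache {
    private final ConcurrentMap<String, Boolean> resultsByPath =
            new ConcurrentHashMap<>();
    private final Predicate<String> probe;

    // The probe stands in for the expensive check (e.g. a system call).
    ToolPathCache(Predicate<String> probe) {
        this.probe = probe;
    }

    // Runs the probe only the first time a given path is seen.
    // The path must be non-null; use "" for "no path configured".
    boolean isAvailable(String toolPath) {
        return resultsByPath.computeIfAbsent(toolPath, probe::test);
    }
}
```

Keying on the path means two config objects with different paths each get 
their own cached answer, while repeated calls with the same path avoid the 
extra system calls.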

Oh, and I do have tesseract installed now, I installed it to help :)



[jira] [Commented] (TIKA-894) Add webapp mode for Tika Server, simplifies deployment

2015-01-07 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268605#comment-14268605
 ] 

Lewis John McGibbney commented on TIKA-894:
---

I have a half baked patch locally for webapp and WAR support similar to what we 
have over on Any23.
I'll try my best to hammer this soon folks. Sorry about the ridiculous wait. God

 Add webapp mode for Tika Server, simplifies deployment
 --

 Key: TIKA-894
 URL: https://issues.apache.org/jira/browse/TIKA-894
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Chris Wilson
  Labels: maven, newbie, patch
 Fix For: 1.8

 Attachments: tika-server-webapp.patch


 For use in production services, Tika Server should really be deployed as a 
 WAR file, under a reliable servlet container that knows how to run as a 
 system service, for example Tomcat or JBoss.
 This is especially important on Windows, where I wasted an entire day trying 
 to make TikaServerCli run as some kind of a service. 
 Maven makes building a webapp pretty trivial. With the attached patch 
 applied, mvn war:war should work. It seems to run fine in Tomcat, which 
 makes Windows deployment much simpler. Just install Tomcat and drop the WAR 
 file into tomcat's webapps directory and you're away.





Re: [VOTE] Apache Tika 1.7 Release

2015-01-07 Thread David Meikle
-1 on this for me too as there is a small unit test failure from ODFParser
on Windows from TIKA-1412.

I have added the tweak to fix this on trunk.

(I have also tested the latest changes added by Tim and Tyler in TIKA-1445
on Windows, Mac and Ubuntu with a decent batch of files, and everything is
working nicely at this end.)

On 7 January 2015 at 01:11, Allison, Timothy B. talli...@mitre.org wrote:

 -1

 I'm sorry that I haven't had a chance to kick the tires on the recent
 changes to the metadata extraction from images until now, but it looks like
 1.7-rc2 and trunk are not pulling metadata from embedded images.

 I've posted a test file from govdocs1 to TIKA-1445.  I may have time
 tomorrow to see what's going on.  I should also have time tomorrow to
 finish the analysis of the comparison between 1.6 and 1.7 on govdocs1.

 Sorry for my delay, all!  And even greater apologies if user error is at
 fault and metadata is successfully being extracted from embedded images. :)

 Thank you, Tyler, for running this release!


 -Original Message-
 From: Nick Burch [mailto:apa...@gagravarr.org]
 Sent: Tuesday, January 06, 2015 11:36 AM
 To: dev@tika.apache.org
 Subject: Re: [VOTE] Apache Tika 1.7 Release

 On Tue, 6 Jan 2015, Tyler Palsulich wrote:
  A candidate for the Tika 1.7 release is available at:
 https://dist.apache.org/repos/dist/dev/tika/
 
  The release candidate is a zip archive of the sources in:
 http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/
 
  The SHA1 checksum of the archive is
 0307a8367ae6f8b1103824fd11337fd89e24e6a4.
 
  In addition, a staged maven repository is available here:
 
 
 https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/

 Looks good to me, I'm +1

 Nick



[jira] [Created] (TIKA-1507) Under OSGi, ForkParser fails to send core parser classes like ExternalParser

2015-01-07 Thread Nick Burch (JIRA)
Nick Burch created TIKA-1507:


 Summary: Under OSGi, ForkParser fails to send core parser classes 
like ExternalParser
 Key: TIKA-1507
 URL: https://issues.apache.org/jira/browse/TIKA-1507
 Project: Tika
  Issue Type: Bug
  Components: packaging, parser
Affects Versions: 1.6, 1.7
Reporter: Nick Burch


Under OSGi, if you try to use ForkParser with the Tesseract OCR parser, it will 
fail with:

java.lang.NoClassDefFoundError: org/apache/tika/parser/external/ExternalParser
at 
org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
at 
org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:91)
at 
org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
at 
org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
at 
org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:622)
at org.apache.tika.fork.ForkServer.call(ForkServer.java:144)
at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:124)
at org.apache.tika.fork.ForkServer.main(ForkServer.java:69)
Caused by: java.lang.ClassNotFoundException: Unable to find class 
org.apache.tika.parser.external.ExternalParser
at 
org.apache.tika.fork.ClassLoaderProxy.findClass(ClassLoaderProxy.java:117)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
... 13 more

ExternalParser lives in the Tika Core jar, not the Tika Parsers one. This all 
works fine outside of OSGi, so it looks like something about the OSGi bundling 
is causing the fork parser to fail to send the parser-related classes from Tika 
Core over to the forked JVM





[jira] [Commented] (TIKA-1507) Under OSGi, ForkParser fails to send core parser classes like ExternalParser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267754#comment-14267754
 ] 

Nick Burch commented on TIKA-1507:
--

To reproduce this, remove the try/catch NoClassDefFoundError in 
TesseractOCRParser.hasTesseract
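
For anyone who hasn't looked at the code, the guard in question has roughly 
this shape. This is a self-contained sketch with invented names, not the 
actual body of hasTesseract; the callback stands in for the code that touches 
ExternalParser:

```java
// Hypothetical illustration of guarding an availability check against
// a class that cannot be loaded in the current class loader.
final class AvailabilityGuard {
    // Stand-in for the real check, which would reference the missing class.
    interface Check {
        boolean run();
    }

    // Treats "class could not be loaded" the same as "tool not available".
    static boolean safeCheck(Check check) {
        try {
            return check.run();
        } catch (NoClassDefFoundError e) {
            // The class needed for the check is not visible here, so
            // report the tool as unavailable instead of failing the parse.
            return false;
        }
    }
}
```

With the catch removed, the NoClassDefFoundError propagates up through 
getSupportedTypes, which matches the stack trace in the issue description.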



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267756#comment-14267756
 ] 

Nick Burch commented on TIKA-1445:
--

I've no idea why the fork parser is failing when run under OSGi. It looks like 
it isn't sending the parser-related classes from tika-core over (e.g. 
ExternalParser).

I've put in a hacky workaround in r1650083, and raised a new issue for it - 
TIKA-1507



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267626#comment-14267626
 ] 

Hudson commented on TIKA-1445:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #412 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/412/])
TIKA-1445 Use assertContains, and fix a problem with the ForkParser integration 
tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650051)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267766#comment-14267766
 ] 

Tim Allison commented on TIKA-1445:
---

Yes, and why did the tests work before, and how does it work without tika-core?!? 
I don't see how recent changes are now causing this failure, either. Argh...



[jira] [Commented] (TIKA-1495) Parser for BPG (Better Portable Graphics) format

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267775#comment-14267775
 ] 

Hudson commented on TIKA-1495:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #414 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/414/])
Disabled exif related bpg tests for TIKA-1495 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650084)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/image/BPGParserTest.java


 Parser for BPG (Better Portable Graphics) format
 

 Key: TIKA-1495
 URL: https://issues.apache.org/jira/browse/TIKA-1495
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Nick Burch

 Following on from TIKA-1491, it would be good to also have a parser for BPG 
 files as well. Likely this would pull out some very basic metadata from the 
 header, then locate the EXIF and XMP blocks + hand those on for parsing
 There doesn't appear to be a suitable Java library yet, but based on reading 
 the file format spec at http://bellard.org/bpg/bpg_spec.txt it doesn't look 
 like a basic parser would be that much work!





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267773#comment-14267773
 ] 

Nick Burch commented on TIKA-1445:
--

The only other parser that uses ExternalParser is gdal, and I'm guessing that 
that doesn't get touched by the OSGi fork test...



[jira] [Commented] (TIKA-1507) Under OSGi, ForkParser fails to send core parser classes like ExternalParser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267774#comment-14267774
 ] 

Hudson commented on TIKA-1507:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #414 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/414/])
Temporary workaround for the TIKA-1507 ForkParser / OSGi issue (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650083)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267786#comment-14267786
 ] 

Luis Filipe Nassif commented on TIKA-1445:
--

It is not related directly to this issue, but I think the user should be able 
at least to disable the ocr parsing even if tesseract is installed, in the 
config object. It is a very slow task and the user could choose to not run it 
over all images.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267792#comment-14267792
 ] 

Nick Burch commented on TIKA-1445:
--

[~lfcnassif] Longer term we'll have different config objects that let you pick 
what you want - see [this 
comment|https://issues.apache.org/jira/browse/TIKA-1445?focusedCommentId=14222510&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14222510]
 for one possible plan

Short term, just pass in an ocr config to the parser context with an invalid 
path on it, as one of the unit tests does
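
To show the mechanism without pulling in Tika itself, here is a self-contained 
sketch. The config, context and switch classes are all hypothetical stand-ins, 
and treating "the configured directory exists" as "the tool is usable" is an 
assumption made purely for the example:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;

// Hypothetical config holder; only the tool directory matters here.
final class OcrConfigSketch {
    final String toolDir;

    OcrConfigSketch(String toolDir) {
        this.toolDir = toolDir;
    }
}

// Hypothetical typed context: at most one object per class key.
final class ContextSketch {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key, T fallback) {
        Object value = entries.get(key);
        return value == null ? fallback : key.cast(value);
    }
}

final class OcrSwitch {
    // Assumption for the sketch: the default config points at the current
    // directory, and OCR counts as enabled when the directory exists.
    static boolean ocrEnabled(ContextSketch context) {
        OcrConfigSketch config =
                context.get(OcrConfigSketch.class, new OcrConfigSketch("."));
        return new File(config.toolDir).isDirectory();
    }
}
```

A caller who wants OCR off for one parse sets a config whose directory does 
not exist on the context, and leaves the context empty otherwise.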



[jira] [Commented] (TIKA-1495) Parser for BPG (Better Portable Graphics) format

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267798#comment-14267798
 ] 

Hudson commented on TIKA-1495:
--

UNSTABLE: Integrated in tika-trunk-jdk1.6 #398 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/398/])
Disabled exif related bpg tests for TIKA-1495 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650084)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/image/BPGParserTest.java




[jira] [Commented] (TIKA-1507) Under OSGi, ForkParser fails to send core parser classes like ExternalParser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267797#comment-14267797
 ] 

Hudson commented on TIKA-1507:
--

UNSTABLE: Integrated in tika-trunk-jdk1.6 #398 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/398/])
Temporary workaround for the TIKA-1507 ForkParser / OSGi issue (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650083)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268337#comment-14268337
 ] 

Hudson commented on TIKA-1412:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #403 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/403/])
TIKA-1412: Fixed test issue on Windows build (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650163)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java


 NPE in OpenDocumentParser
 -

 Key: TIKA-1412
 URL: https://issues.apache.org/jira/browse/TIKA-1412
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Andrzej Bialecki 
 Fix For: 1.7

 Attachments: TIKA-1412.diff


 There's a missing else in OpenDocumentParser when it constructs a 
 ZipInputStream from the InputStream, which results in NPE when the 
 InputStream is an instance of TikaInputStream but has neither openContainer 
 nor file:
 {code}
 ...
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161) ~[tika-parsers-1.6.jar:1.6]
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244) ~[tika-core-1.6.jar:1.6]
 {code}
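
The shape of the problem described above can be sketched with JDK types only. This is not the OpenDocumentParser source: `ZipSourceChooser`, `TrackedStream` and `Source` are hypothetical stand-ins, with `TrackedStream` playing the role of a TikaInputStream that may or may not carry an open container or a backing file. The point is the final `else`, which covers the case where the stream has neither.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FilterInputStream;
import java.io.InputStream;

final class ZipSourceChooser {

    /** Where the ZIP data would be read from. */
    enum Source { OPEN_CONTAINER, BACKING_FILE, RAW_STREAM }

    /** Hypothetical stand-in for a stream that may carry a container or a file. */
    static final class TrackedStream extends FilterInputStream {
        final Object openContainer; // may be null
        final File file;            // may be null

        TrackedStream(InputStream in, Object openContainer, File file) {
            super(in);
            this.openContainer = openContainer;
            this.file = file;
        }
    }

    /** Picks a source for every combination, so nothing is left unset. */
    static Source choose(InputStream stream) {
        if (stream instanceof TrackedStream) {
            TrackedStream tracked = (TrackedStream) stream;
            if (tracked.openContainer != null) {
                return Source.OPEN_CONTAINER;
            } else if (tracked.file != null) {
                return Source.BACKING_FILE;
            } else {
                // Without this branch no source is selected for a tracked
                // stream that has neither a container nor a file.
                return Source.RAW_STREAM;
            }
        }
        return Source.RAW_STREAM;
    }

    public static void main(String[] args) {
        InputStream bare = new TrackedStream(
                new ByteArrayInputStream(new byte[0]), null, null);
        System.out.println(choose(bare)); // prints RAW_STREAM
    }
}
```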





[jira] [Commented] (TIKA-1412) NPE in OpenDocumentParser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268303#comment-14268303
 ] 

Hudson commented on TIKA-1412:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #418 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/418/])
TIKA-1412: Fixed test issue on Windows build (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650163)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267724#comment-14267724
 ] 

Tim Allison commented on TIKA-1445:
---

Apologies for repeating Jenkins, but I'm now getting a failure in the 
ForkParser tests in BundleIT: the ExternalParser class can't be found.

Once trunk is back to stable, I'll add in the extra tests.

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
 TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
 TIKA-1445_tallison_v3_20141027.patch


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267854#comment-14267854
 ] 

Tim Allison commented on TIKA-1445:
---

[~gagravarr], see if you have success with r1650117.  I don't have Tesseract 
installed, so it'll be good to see if the tests pass with it installed.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267871#comment-14267871
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #399 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/399/])
TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accommodate two 
options for parser (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111)
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267879#comment-14267879
 ] 

Tyler Palsulich commented on TIKA-1445:
---

All tests pass with and without Tesseract installed on my computer (Java 1.7, 
Ubuntu 14.04, Tesseract 3.03).



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267840#comment-14267840
 ] 

Tim Allison commented on TIKA-1445:
---

Fixed the tika-server test failure with r1650111.

Going to add mods to TesseractOCRParserTest



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268021#comment-14268021
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #401 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/401/])
TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small 
import and comment changes. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268003#comment-14268003
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #416 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/416/])
TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small 
import and comment changes. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268006#comment-14268006
 ] 

Tyler Palsulich commented on TIKA-1445:
---

Done. I made some small changes and split one of the tests in two. 
[~talli...@apache.org], [~gagravarr], or anyone else, any more changes/features 
needed for this issue/1.7? It looks like we grab normal metadata regardless of 
whether or not Tesseract is installed.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267618#comment-14267618
 ] 

Tim Allison commented on TIKA-1445:
---

Yes, that's a great idea.  I was disturbed by the current plan of making a 
system call for every image file if Tesseract is not installed; I was thinking 
of a static check, but your solution is far cleaner.
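
To make the idea concrete, here is a rough sketch of probing for an external tool once and offering no supported types when it is missing, so that a composite parser would skip over this parser. It is not the code committed to Tika. `OptionalToolParser`, `hasTool` and `commandRuns` are hypothetical names, media types are plain strings for brevity, and the `tesseract -v` probe with a zero exit code meaning "installed" is an assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Collections;
import java.util.Set;
import java.util.function.BooleanSupplier;

final class OptionalToolParser {

    private final BooleanSupplier probe;
    private final Set<String> typesWhenAvailable;
    private Boolean available; // null until the first probe has run

    OptionalToolParser(BooleanSupplier probe, Set<String> typesWhenAvailable) {
        this.probe = probe;
        this.typesWhenAvailable = typesWhenAvailable;
    }

    /** Runs the probe at most once and remembers the answer. */
    synchronized boolean hasTool() {
        if (available == null) {
            available = probe.getAsBoolean();
        }
        return available;
    }

    /** Offers no types when the tool is missing. */
    Set<String> getSupportedTypes() {
        return hasTool() ? typesWhenAvailable : Collections.<String>emptySet();
    }

    /** Assumed probe: true if the command starts and exits with status 0. */
    static boolean commandRuns(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .start();
            try (InputStream out = process.getInputStream()) {
                while (out.read() != -1) {
                    // Drain output so the child cannot block on a full pipe.
                }
            }
            return process.waitFor() == 0;
        } catch (IOException e) {
            return false; // e.g. the command was not found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // "tesseract -v" is an assumed probe command; adjust for the local setup.
        OptionalToolParser parser = new OptionalToolParser(
                () -> commandRuns("tesseract", "-v"),
                Collections.singleton("image/png"));
        System.out.println("Supported types: " + parser.getSupportedTypes());
    }
}
```

Because the result is cached, the external command is launched once per parser instance instead of once per image. A direct call to the parser's parse method would still need its own check, as suggested earlier in the thread.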

The patch I submitted last night caused the integrated ForkParser tests to 
fail: class loading issues.  So, I now have a slightly more manual hack class 
that borrows from CompositeParser.

Instead of the govdocs1 doc, I'll add tests based on our current test docs in 
the next 8 hours or so.

[~tpalsulich], after I add those tests, would you mind testing with Tesseract 
installed?  I don't have it installed, and IIRC, I don't think Nick does 
either...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267892#comment-14267892
 ] 

Tim Allison commented on TIKA-1445:
---

Thank you!  Do you mind doing a quick code review of TesseractOCRParser?  I 
made a number of mods...



[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267892#comment-14267892
 ] 

Tim Allison edited comment on TIKA-1445 at 1/7/15 5:21 PM:
---

Thank you!  Do you mind doing a quick code review of TesseractOCRParserTest?  I 
made a number of mods...


was (Author: talli...@mitre.org):
Thank you!  Do you mind doing a quick code review of TesseractOCRParser?  I 
made a number of mods...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267934#comment-14267934
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #415 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/415/])
TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accommodate two 
options for parser (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111)
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java

