[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-10 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272570#comment-14272570
 ] 

Chris A. Mattmann commented on TIKA-1445:
-

yeesh, caught up on all this great work. Awesome job guys.

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Blocker
 Fix For: 1.7

 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
 TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
 TIKA-1445_tallison_v3_20141027.patch


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271201#comment-14271201
 ] 

Tim Allison commented on TIKA-1445:
---

No major problems found via quick and dirty govdocs1 eval.  

Let's roll!


Better:
Fewer pdf exceptions, better pdf text extraction (thank you, [~tilman]!)

fixed exceptions: 2426 xls, 895 ppt, 158 pdf, 17 pps and 5 doc 

Note: fixed exceptions for xls are driven entirely by [~gagravarr]'s addition 
of parsing for xls .4.  Thank you, Nick!!!

More attachments for 27 pdf and 1 doc

More metadata values for all comparable file pairs (no exceptions, = number of 
attachments)

Areas for investigation:
new exceptions 27 xls
173 exceptions for newly added parsing of vnd.ms.excel.sheet.3
Fewer attachments for 19 ppt, 6 doc and 1 rtf
Permanent hangs/oom. These numbers differ by run because of multi-threading, 
but we went from 4 to 3.


I'll follow up with investigation of these issues and open appropriate tickets 
next week.




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-09 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271266#comment-14271266
 ] 

Tim Allison commented on TIKA-1445:
---

Might have been neater, but you figured out how to get it to actually work with 
MimeTypesRegistry etc in integrated ForkParser tests! :)

I really like the caching strategy to prevent the use of the parser if 
Tesseract isn't installed.  Thank you!



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-09 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14271599#comment-14271599
 ] 

Nick Burch commented on TIKA-1445:
--

Please open a ticket for the excel 3 issue, and if you can, share a small file 
that shows it. The Excel 3 support was written from reading the OpenOffice 
provided spec document, and a bit of guessing, in the absence of any test 
files...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269765#comment-14269765
 ] 

Nick Burch commented on TIKA-1445:
--

If we're going to close this for 1.7, then we need to pull out the composite 
parser, with its strategy for choosing which available parsers / parser 
combinations to use, as a new task for 1.8

Then we need to come up with some better names for the strategies :)



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269768#comment-14269768
 ] 

Tim Allison commented on TIKA-1445:
---

Completely agree! Opening new issues now.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269800#comment-14269800
 ] 

Tyler Palsulich commented on TIKA-1445:
---

Thanks guys! [~tallison], let me know once you finish running against govdocs  
and I'll roll a new RC.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-08 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269454#comment-14269454
 ] 

Tim Allison commented on TIKA-1445:
---

I'll have time to rerun trunk against govdocs1 and compare with 1.6 by tomorrow 
(January 9) 10am EST.  If the community is willing to wait a day, let's hold 
off.  Another day might also allow others to identify small issues (similar to 
[~davemeikle]'s recent find).

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: 03.doc, TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_20150106_tallison.patch, 
 TIKA-1445_tallison_20141027.patch.txt, TIKA-1445_tallison_v2_20141027.patch, 
 TIKA-1445_tallison_v3_20141027.patch


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267553#comment-14267553
 ] 

Nick Burch commented on TIKA-1445:
--

I wonder if it wouldn't be better to do the "is Tesseract there" check in the 
`getSupportedTypes` method? That way, if Tesseract can't be found, then the 
main composite parser (e.g. AutoDetectParser, if being used) would just skip over 
the Tesseract one, and fall back to the Jpeg or Image one as appropriate.

We could then do an additional check at parse time, in case of a direct call to 
the parser.
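The idea above could look roughly like the following. This is only a minimal sketch, not the committed code: `OcrGateSketch`, its fixed list of types, and the file-based `hasTesseract()` check are all hypothetical stand-ins (the real parser may well detect Tesseract differently, e.g. by trying to run it).

```java
import java.io.File;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for an OCR parser; not Tika's TesseractOCRParser.
class OcrGateSketch {
    // Types this sketch would claim when the OCR binary is present (illustrative list).
    private static final Set<String> OCR_TYPES = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("image/jpeg", "image/png", "image/tiff")));

    private final String tesseractDir; // directory expected to hold the binary

    OcrGateSketch(String tesseractDir) {
        this.tesseractDir = tesseractDir;
    }

    // Assumption: "installed" means an executable file named tesseract(.exe) in the directory.
    boolean hasTesseract() {
        File unix = new File(tesseractDir, "tesseract");
        File windows = new File(tesseractDir, "tesseract.exe");
        return (unix.isFile() && unix.canExecute())
                || (windows.isFile() && windows.canExecute());
    }

    // Offer no types when the binary is missing, so a composite parser never selects us.
    Set<String> getSupportedTypes() {
        return hasTesseract() ? OCR_TYPES : Collections.<String>emptySet();
    }

    // Second check at parse time, covering a direct call that bypasses type selection.
    String parse(String imageName) {
        if (!hasTesseract()) {
            return ""; // no OCR text; the caller can fall back to another parser
        }
        return "ocr text for " + imageName; // placeholder for the real OCR call
    }
}
```

With a path that does not exist, `getSupportedTypes()` returns an empty set and `parse` yields no text, which is the "skip over and fall back" behaviour described above.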

I'll have a go at working that up shortly

Oh, and the fallback parser you've come up with looks much neater than mine :)



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267584#comment-14267584
 ] 

Nick Burch commented on TIKA-1445:
--

As of r1650051, I think we're correctly handling the case of tesseract not 
being installed falling back to the normal parsers, and calling the normal 
image parsers after tesseract is done. I've got a couple of unit tests that 
seem to show that
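For readers following along, the behaviour described here (OCR first when available, then the normal image parser for metadata) can be sketched as below. This is an illustration under assumptions, not the code from r1650051: `OcrThenMetadataSketch` and its `Step` interface are hypothetical names, and plain strings and maps stand in for Tika's stream, handler and metadata objects.

```java
import java.util.Map;

// Hypothetical composite: run OCR if available, then always run the metadata step.
class OcrThenMetadataSketch {
    // One processing step; both steps share the same text buffer and metadata map.
    interface Step {
        void apply(String imageName, StringBuilder text, Map<String, String> metadata);
    }

    private final Step ocr;
    private final Step imageMetadata;
    private final boolean ocrAvailable;

    OcrThenMetadataSketch(Step ocr, Step imageMetadata, boolean ocrAvailable) {
        this.ocr = ocr;
        this.imageMetadata = imageMetadata;
        this.ocrAvailable = ocrAvailable;
    }

    void parse(String imageName, StringBuilder text, Map<String, String> metadata) {
        if (ocrAvailable) {
            ocr.apply(imageName, text, metadata); // contributes extracted text
        }
        // The regular image step runs in both cases, so metadata is never lost.
        imageMetadata.apply(imageName, text, metadata);
    }
}
```

Either way the caller ends up with the image metadata; the OCR text is simply absent when the OCR step is unavailable.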

Any chance you could add a unit test based on your govdocs word file, and check 
that it's working correctly for embedded images as well?



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267586#comment-14267586
 ] 

Hudson commented on TIKA-1445:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #411 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/411/])
TIKA-1445 Unit test to check a JPEG via Tesseract gets both OCR text and normal 
JPEG metadata (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650050)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* /tika/trunk/tika-parsers/src/test/resources/test-documents/testOCR.jpg
TIKA-1445 Unit test to show that when an invalid tesseract config is given, and 
tesseract cannot be found, TesseractOCRParser will return no types and will not 
be selected by DefaultParser (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650046)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
Cleaner workaround parser call from Tim Allison from TIKA-1445 (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650045)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
TIKA-1445 If Tesseract isn't available, don't offer any supported mime types, 
so the parser avoids being picked by DefaultParser or similar (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650044)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267643#comment-14267643
 ] 

Nick Burch commented on TIKA-1445:
--

Ah, true, I hadn't thought so much about the system call each time. I guess the 
only thing we need to cache is tesseract path -> yes/no, since you could pass in 
different config objects with different paths. Maybe we do a quick bit of 
caching based on that, and use that to avoid the extra calls?
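A path-keyed cache of that kind might be sketched as follows. The names here (`TesseractPresenceCache`, `Probe`) are hypothetical and the probe is injected rather than being a real system call, so this shows the caching shape only, not what was eventually committed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical cache: remembers, per configured path, whether the binary was found.
class TesseractPresenceCache {
    // The expensive check (e.g. launching the binary); injected so it can be swapped out.
    interface Probe {
        boolean isPresent(String path);
    }

    private final ConcurrentMap<String, Boolean> results =
            new ConcurrentHashMap<String, Boolean>();
    private final Probe probe;

    TesseractPresenceCache(Probe probe) {
        this.probe = probe;
    }

    boolean hasTesseract(String path) {
        Boolean cached = results.get(path);
        if (cached != null) {
            return cached; // answered before for this path: no extra system call
        }
        boolean present = probe.isPresent(path);
        // Two racing threads may both probe once; they store the same answer.
        results.put(path, present);
        return present;
    }
}
```

Keying on the path matters because different config objects may point at different locations; a single global boolean would give wrong answers for the second path. One trade-off of this sketch is that a result is never refreshed, so installing Tesseract later would go unnoticed until restart.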

Oh, and I do have tesseract installed now, I installed it to help :)



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267756#comment-14267756
 ] 

Nick Burch commented on TIKA-1445:
--

I've no idea why the fork parser is failing when run under OSGi. It looks like 
it isn't sending the parser-related classes from tika-core over (e.g. external 
parser)

I've put in a hacky workaround in r1650083, and raised a new issue for it - 
TIKA-1507



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267626#comment-14267626
 ] 

Hudson commented on TIKA-1445:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #412 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/412/])
TIKA-1445 Use assertContains, and fix a problem with the ForkParser integration 
tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650051)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267766#comment-14267766
 ] 

Tim Allison commented on TIKA-1445:
---

Y, and why did the tests work before and how does it work without tika-core?!?  
I don't see how recent changes are now causing this failure, either. Argh...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267773#comment-14267773
 ] 

Nick Burch commented on TIKA-1445:
--

The only other parser that uses ExternalParser is gdal, and I'm guessing that 
that doesn't get touched by the OSGi fork test...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267786#comment-14267786
 ] 

Luis Filipe Nassif commented on TIKA-1445:
--

It is not directly related to this issue, but I think the user should at least 
be able to disable OCR parsing in the config object, even if tesseract is 
installed. It is a very slow task and the user could choose not to run it 
over all images.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267792#comment-14267792
 ] 

Nick Burch commented on TIKA-1445:
--

[~lfcnassif] Longer term we'll have different config objects that let you pick 
what you want - see [this 
comment|https://issues.apache.org/jira/browse/TIKA-1445?focusedCommentId=14222510&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14222510]
 for one possible plan

Short term, just pass in an ocr config to the parser context with an invalid 
path on it, as one of the unit tests does



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267724#comment-14267724
 ] 

Tim Allison commented on TIKA-1445:
---

Not to repeat Jenkins, well, apologies for repeating Jenkins...I'm getting a 
failure with the ForkParser tests now in BundleIT: can't find ExternalParser 
class.

Once trunk is back to stable, I'll add in the extra tests.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267854#comment-14267854
 ] 

Tim Allison commented on TIKA-1445:
---

[~gagravarr], see if you have success with r1650117.  I don't have Tesseract 
installed, so it'll be good to see if the tests pass with it installed.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267871#comment-14267871
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #399 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/399/])
TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accomodate two 
options for parser (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111)
* /tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267879#comment-14267879
 ] 

Tyler Palsulich commented on TIKA-1445:
---

All tests pass with and without Tesseract installed on my computer (Java 1.7, 
Ubuntu 14.04, Tesseract 3.03).



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267840#comment-14267840
 ] 

Tim Allison commented on TIKA-1445:
---

Fixed the tika-server test failure with r1650111.

Going to add mods to TesseractOCRParserTest



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268021#comment-14268021
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #401 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/401/])
TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small 
import and comment changes. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268003#comment-14268003
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #416 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/416/])
TIKA-1445. Split TesseractOCRParser#offersNoTypesIfNotFound in two. Small 
import and comment changes. (tpalsulich: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650133)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268006#comment-14268006
 ] 

Tyler Palsulich commented on TIKA-1445:
---

Done. I made some small changes and split one of the tests in two. 
[~talli...@apache.org], [~gagravarr], or anyone else, any more changes/features 
needed for this issue/1.7? It looks like we grab normal metadata regardless of 
whether or not Tesseract is installed.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267618#comment-14267618
 ] 

Tim Allison commented on TIKA-1445:
---

Yes, that's a great idea.  I was disturbed by the current plan of making a 
system call for every image file if Tesseract is not installed; I was thinking 
of a static check, but your solution is far cleaner.
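The "static check" idea mentioned above can be sketched as a small cache that
probes for the external tool once and remembers the answer. This is only an
illustration, not necessarily the solution being praised (which is not shown in
this message). The class name ToolAvailabilityCache and the injected probe are
hypothetical; a real probe would try to launch the tesseract binary.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical helper: remembers whether an external tool was found at a given
// path, so the expensive probe (for example a system call) runs once per path.
final class ToolAvailabilityCache {
    private final Map<String, Boolean> known = new ConcurrentHashMap<>();
    private final Predicate<String> probe;

    // The probe is injected (an assumption of this sketch) so that it can be
    // replaced by a stub in tests instead of launching a real process.
    ToolAvailabilityCache(Predicate<String> probe) {
        this.probe = probe;
    }

    // Runs the probe only on the first call for each path; later calls for
    // the same path return the cached answer.
    boolean isAvailable(String toolPath) {
        return known.computeIfAbsent(toolPath, probe::test);
    }
}
```

With a cache like this, parsing thousands of images on a machine without
Tesseract would cost one failed probe rather than one per file.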

The patch I submitted last night caused the integrated ForkParser tests to 
fail: class loading issues.  So, I now have a slightly more manual hack class 
that borrows from CompositeParser.

Instead of the govdocs1 doc, I'll add tests based on our current test docs in 
the next 8 hours or so.

[~tpalsulich], after I add those tests, would you mind testing with Tesseract 
installed?  I don't have it installed, and IIRC, I don't think Nick does 
either...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267892#comment-14267892
 ] 

Tim Allison commented on TIKA-1445:
---

Thank you!  Do you mind doing a quick code review of TesseractOCRParser?  I 
made a number of mods...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-07 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267934#comment-14267934
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #415 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/415/])
TIKA-1445: add tests to TesseractOCRParserTest to ensure metadata is extracted 
(tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650117)
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
TIKA-1445: need to fix TikaMimeTypesTest in tika-server to accomodate two 
options for parser (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1650111)
* 
/tika/trunk/tika-server/src/test/java/org/apache/tika/server/TikaMimeTypesTest.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2015-01-06 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267161#comment-14267161
 ] 

Tim Allison commented on TIKA-1445:
---

Looking into this a bit more: we aren't even getting metadata out of regular 
images. For example, our testJPEG.jpg from tika-parsers' test-documents yields 
no useful metadata with trunk. It looks like this file isn't even being touched 
by the TesseractOCRParser:

{noformat}
Content-Length: 7686
Content-Type: image/jpeg
X-Parsed-By: org.apache.tika.parser.DefaultParser
resourceName: testJPEG.jpg
{noformat}

Again, my apologies if I need to make modifications to our config...



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-12-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252952#comment-14252952
 ] 

Nick Burch commented on TIKA-1445:
--

For 1.7, how about we just have the Tesseract Parser call out to the normal 
image parser (as appropriate), so that you always get both ocr and metadata? 
(Hopefully very quick to do)

Then for 1.8, we can implement the config as described above, without that 
blocking the 1.7 release
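
A minimal sketch of the delegation idea above, assuming a simplified parser
interface. SimpleParser and OcrWithMetadataParser are hypothetical names used
only for illustration; they are not Tika's Parser API, and this is not the code
that was later committed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a parser: fills in metadata and returns any text.
interface SimpleParser {
    String parse(byte[] image, Map<String, String> metadata);
}

// Runs the regular image parser for metadata, then the OCR parser for text,
// so a caller always receives both.
final class OcrWithMetadataParser implements SimpleParser {
    private final SimpleParser imageParser;
    private final SimpleParser ocrParser;

    OcrWithMetadataParser(SimpleParser imageParser, SimpleParser ocrParser) {
        this.imageParser = imageParser;
        this.ocrParser = ocrParser;
    }

    @Override
    public String parse(byte[] image, Map<String, String> metadata) {
        // Metadata pass: any text the image parser returns is ignored here.
        imageParser.parse(image, metadata);
        // OCR pass: its text becomes the result of the combined parser.
        return ocrParser.parse(image, metadata);
    }
}

final class OcrWithMetadataDemo {
    public static void main(String[] args) {
        SimpleParser image = (bytes, m) -> { m.put("Content-Type", "image/jpeg"); return ""; };
        SimpleParser ocr = (bytes, m) -> "recognized text";
        Map<String, String> collected = new LinkedHashMap<>();
        String text = new OcrWithMetadataParser(image, ocr).parse(new byte[0], collected);
        System.out.println(text + " " + collected);
    }
}
```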



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-12-18 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252956#comment-14252956
 ] 

Tyler Palsulich commented on TIKA-1445:
---

+1, Nick. That sounds good to me. I'll implement it in the next couple days, if 
no one else does first.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-12-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252973#comment-14252973
 ] 

Nick Burch commented on TIKA-1445:
--

In r1646624 I've added what I think should do the trick for now. I don't have 
Tesseract installed to check though, could someone who does verify + update 
unit tests?



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-12-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252985#comment-14252985
 ] 

Hudson commented on TIKA-1445:
--

UNSTABLE: Integrated in tika-trunk-jdk1.7 #371 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/371/])
Temporary workaround for TIKA-1445 for Tika 1.7 - always pass the image to the 
regular parser to get the metadata set. Will be replaced in 1.8 with composite 
parsers + user selected config with strategy (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1646624)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-12-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253057#comment-14253057
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #372 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/372/])
TIKA-1445 - Allow you to exclude certain mimetypes from a parser that would 
otherwise handle them, in your Tika Config xml (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1646626)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java
* 
/tika/trunk/tika-core/src/test/resources/org/apache/tika/config/TIKA-1445-default-except.xml




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-12-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14253076#comment-14253076
 ] 

Hudson commented on TIKA-1445:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #356 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/356/])
TIKA-1445 - Allow you to exclude certain mimetypes from a parser that would 
otherwise handle them, in your Tika Config xml (nick: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1646626)
* /tika/trunk/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
* 
/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java
* /tika/trunk/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java
* 
/tika/trunk/tika-core/src/test/resources/org/apache/tika/config/TIKA-1445-default-except.xml




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-23 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222510#comment-14222510
 ] 

Nick Burch commented on TIKA-1445:
--

I quite like Tim's idea. We can have things like 
{{TikaConfig.getDefaultConfig()}}, {{TikaConfig.getMaxiumMetadataConfig()}}, 
{{TikaConfig.getTryEachInTurnConfig()}} etc. People with specific needs can 
either pass those in as options to a TikaConfig constructor, or they can 
provide a tika config xml file that lists their preferences, perhaps with an 
expanded syntax like
{code}
<parser class="composite">
  <childparser>org.apache.tika.parser.jpeg.JPegParser</childparser>
  <childparser>...</childparser>
  <childparser>...</childparser>
  <childparser>org.apache.tika.parser.ocr.TesseractOCR</childparser>
</parser>
<parser class="tryinturn">
  <childparser>org.apache.tika.text</childparser>
  <childparser>org.apache.tika.text.findtextstrings</childparser>
</parser>
<parser class="defaultparser">
  <exclude>org.apache.tika.netcdf</exclude>
</parser>
{code}

The above slightly pseudocode example would try to merge all the image parsers 
output in turn, would for plain text try the normal parser then fall back to 
the talked-about bit like strings if that failed, and would use the default 
parser for everything else but excluding the netcdf parser
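
The "try each in turn" group could behave roughly as sketched below: each child
is attempted in the configured order, and the first one that does not fail
supplies the result. TextExtractor and TryInTurnExtractor are hypothetical
names, and the interface is deliberately simpler than a real Tika Parser.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical extractor: returns text or throws an unchecked exception.
interface TextExtractor {
    String extract(byte[] data);
}

// Tries each child in order and returns the first result that does not throw.
final class TryInTurnExtractor implements TextExtractor {
    private final List<TextExtractor> children;

    TryInTurnExtractor(List<TextExtractor> children) {
        this.children = children;
    }

    @Override
    public String extract(byte[] data) {
        RuntimeException last = null;
        for (TextExtractor child : children) {
            try {
                return child.extract(data);
            } catch (RuntimeException e) {
                last = e; // remember the failure and fall back to the next child
            }
        }
        if (last == null) {
            throw new IllegalStateException("no child extractors configured");
        }
        throw last; // every child failed: surface the most recent failure
    }
}

final class TryInTurnDemo {
    public static void main(String[] args) {
        TextExtractor strict = data -> { throw new IllegalArgumentException("not valid text"); };
        TextExtractor lenient = data -> "strings-style fallback";
        TextExtractor chain = new TryInTurnExtractor(Arrays.asList(strict, lenient));
        System.out.println(chain.extract(new byte[0]));
    }
}
```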



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-23 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222512#comment-14222512
 ] 

Chris A. Mattmann commented on TIKA-1445:
-

Yep I like the idea too. Time to figure out how to implement and get some 
cycles to do so :)



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-19 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217685#comment-14217685
 ] 

Dave Meikle commented on TIKA-1445:
---

bq. Hey Guys, to be honest, the way I see that we solve the ServiceLoading 
problem is somehow to move away from it. Relying on the JVM to implicitly 
decide which parser to load based on ClassLoading is not scalable IMO. At 
worst, even capturing an ordered preference file that isn't ServiceLoading is 
1000x better IMO than relying on the JVM and the classpath. We need somehow to 
bring this logic into Tika (still thinking about how and will try to prototype 
something).

+1 - I think this is an example of something we will probably hit more and more 
as we further extend Tika, i.e. wanting multiple parsers to have an interest in 
and then parse content of the same mime type. Moving away from the re-ordering 
approach seems like the only way to go here.

_ServiceLoading_ per se is not a problem; indeed, it is a nice way to make it 
simple for external providers to be added. But I think we need to think about 
Parsers in a pipeline and allow users to customise the parsers that participate 
in the pipeline through positive exclusions via config.

The above is a big change, and I think if we went with something like this it 
would need to be a 2.X of Tika. 

I suspect the problem with clashing Metadata entries is not really there, as 
most parsers look for different keys, and in cases where they process common 
ones (e.g. title, size, description) they should hopefully be getting the 
same value anyway.  I think we could send the same Metadata object through 
the 'pipeline', adding any unique new value in for a key.

Will join the party and try to flesh out thoughts on a branch.

bq. 3) It is a good idea to identify which parser produced each content with a 
div tag.

+1 - this will be really helpful.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217965#comment-14217965
 ] 

Tim Allison commented on TIKA-1445:
---

How about using the order of parsers as specified in TikaConfig?  That should 
accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite 
parser to use.  I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one 
you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)
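
To make the difference between use cases 1 and 2 concrete, here is a small
sketch that only reports which parsers would run for a given media type. Use
case 3 is left out. NamedParser and CompositeStrategies are hypothetical, and
matching on a media-type prefix is a simplification chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser model: a name plus the media-type prefix it handles.
final class NamedParser {
    final String name;
    final String typePrefix;

    NamedParser(String name, String typePrefix) {
        this.name = name;
        this.typePrefix = typePrefix;
    }

    boolean supports(String mediaType) {
        return mediaType.startsWith(typePrefix);
    }
}

final class CompositeStrategies {
    // Use case 1: only the first applicable parser runs; the rest are ignored.
    static List<String> firstMatch(List<NamedParser> parsers, String mediaType) {
        List<String> ran = new ArrayList<>();
        for (NamedParser p : parsers) {
            if (p.supports(mediaType)) {
                ran.add(p.name);
                break;
            }
        }
        return ran;
    }

    // Use case 2: every applicable parser runs, in the configured order.
    static List<String> additive(List<NamedParser> parsers, String mediaType) {
        List<String> ran = new ArrayList<>();
        for (NamedParser p : parsers) {
            if (p.supports(mediaType)) {
                ran.add(p.name);
            }
        }
        return ran;
    }

    public static void main(String[] args) {
        List<NamedParser> configured = Arrays.asList(
                new NamedParser("TesseractOCRParser", "image/"),
                new NamedParser("JpegParser", "image/jpeg"),
                new NamedParser("HtmlParser", "text/html"));
        System.out.println(firstMatch(configured, "image/jpeg"));
        System.out.println(additive(configured, "image/jpeg"));
    }
}
```

In this toy configuration the first-match strategy never reaches the JPEG
metadata parser, which is the situation this issue describes.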



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216351#comment-14216351
 ] 

Tim Allison commented on TIKA-1445:
---

[~gagravarr], thank you for explaining the original design decision.  I knew 
there must be a good reason.  My idea was to create one list of non-o.a.t 
parsers and one list of o.a.t parsers and then prioritize the non-o.a.t. in a 
joint list, but within each list, the parsers would be in the order they were 
when loaded.  Is it common for people to have more than the out-of-the-box 
o.a.t.p.Parser services file and then maybe one user-defined one?



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216365#comment-14216365
 ] 

Tim Allison commented on TIKA-1445:
---

Copied from dev discussion to record points on this issue.  Will not duplicate 
in future.  Sorry!

On issue 1: The proposal is that we'd send in a fresh Metadata object to each 
parser and then combine that information into a new Metadata object either via 
add or set.  If we go this route, we'll lose the restrictions that Properties 
may have originally held (e.g. one value as in TikaCoreProperties.TITLE).
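
The two ways of combining per-parser results can be sketched with plain maps
standing in for Metadata objects (an assumption of this example; Tika's own
Metadata class is not used). MetadataMerge is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MetadataMerge {
    // "add": append the parser's values after any already collected for a key.
    // A property meant to hold one value can end up with several.
    static void mergeByAdd(Map<String, List<String>> combined,
                           Map<String, List<String>> fresh) {
        for (Map.Entry<String, List<String>> e : fresh.entrySet()) {
            combined.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .addAll(e.getValue());
        }
    }

    // "set": the most recent parser's values replace earlier ones for a key.
    static void mergeBySet(Map<String, List<String>> combined,
                           Map<String, List<String>> fresh) {
        for (Map.Entry<String, List<String>> e : fresh.entrySet()) {
            combined.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
    }
}
```

If two parsers each report a title, merging by add leaves two titles in the
combined result, which is the lost single-value restriction described above;
merging by set keeps only the title from whichever parser ran last.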

On Issue 2:
I think we're talking about different things.  Yes, we'll definitely need to 
reset or spool the stream depending on its length.  My concern was more with 
the handlers.  If the first parser calls endDocument() and we don't shield 
that, then if someone uses the BodyContentHandler, then they might not see 
contents from the second/third parser because the initial parser ended the 
document.  I need to test this concern, but I think that this was the root of 
TIKA-1124.
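
One way to shield the handler, sketched with the JDK's SAX classes: wrap the
caller's handler so that each child parser's startDocument() and endDocument()
calls are swallowed, and let the composite open and close the document exactly
once. DocumentBoundaryShield, RecordingHandler and ShieldDemo are hypothetical
names. To stay short, the sketch forwards only character data; a fuller version
would forward element events too.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Passes text through to the real handler but swallows startDocument() and
// endDocument(), so a child parser cannot close the shared document early.
final class DocumentBoundaryShield extends DefaultHandler {
    private final ContentHandler target;

    DocumentBoundaryShield(ContentHandler target) {
        this.target = target;
    }

    @Override
    public void startDocument() { /* the composite opens the document once */ }

    @Override
    public void endDocument() { /* the composite closes the document once */ }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        target.characters(ch, start, length);
    }
}

// Records what reaches the outer handler; used only for the demonstration.
final class RecordingHandler extends DefaultHandler {
    final StringBuilder text = new StringBuilder();
    int endDocumentCalls;

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        endDocumentCalls++;
    }
}

final class ShieldDemo {
    // Simulates one child parser emitting a complete document of its own.
    private static void childParse(ContentHandler handler, String content) throws SAXException {
        handler.startDocument();
        handler.characters(content.toCharArray(), 0, content.length());
        handler.endDocument();
    }

    static String run() {
        RecordingHandler outer = new RecordingHandler();
        ContentHandler shield = new DocumentBoundaryShield(outer);
        try {
            outer.startDocument();
            childParse(shield, "metadata pass;");
            childParse(shield, "ocr pass");
            outer.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return outer.text + "|endDocument calls=" + outer.endDocumentCalls;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```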



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216444#comment-14216444
 ] 

Nick Burch commented on TIKA-1445:
--

I think it's fairly common for people to have 4-5 parser services files, and 
whatever we do needs to accept that as a normal use case. Pretty much anyone 
depending on tika-parsers is going to have at least 2.

Think of the case of
{code:title=tika-parsers.jar:META-INF/services/org.apache.tika.parser.Parser}
org.apache.tika.parser.gdal.GDALParser
org.apache.tika.parser.html.HtmlParser
org.apache.tika.parser.image.ImageParser
{code}
{code:title=my-tika-extension.jar:META-INF/services/org.apache.tika.parser.Parser}
com.example.tika.ocr.customocrparser
org.apache.tika.parser.image.ImageParser
{code}

Under your plan, given that the JVM could return the two service files to you 
in any order, how do you decide which of GDALParser or ImageParser goes second 
after the OCR one? In one parser file, Image comes first, in the other it's 
second. Which wins? How do we make it deterministic, and not just based on 
which jar the JVM spots first?
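
To illustrate the question with the two listings above, here is a small sketch in plain Java. `loadInOrder` is a hypothetical helper, not Tika's or the JDK's service loader, and keeping a repeated class name at its first position is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ServiceFileOrderSketch {
    static final List<String> TIKA_PARSERS = Arrays.asList(
            "org.apache.tika.parser.gdal.GDALParser",
            "org.apache.tika.parser.html.HtmlParser",
            "org.apache.tika.parser.image.ImageParser");

    static final List<String> EXTENSION = Arrays.asList(
            "com.example.tika.ocr.customocrparser",
            "org.apache.tika.parser.image.ImageParser");

    /**
     * Concatenates listings in the order the files are seen.
     * A repeated name keeps its first position (an assumption of this sketch).
     */
    static List<String> loadInOrder(List<List<String>> serviceFiles) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> file : serviceFiles) {
            seen.addAll(file);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Same two files, two possible discovery orders.
        System.out.println(loadInOrder(Arrays.asList(TIKA_PARSERS, EXTENSION)));
        System.out.println(loadInOrder(Arrays.asList(EXTENSION, TIKA_PARSERS)));
    }
}
```

In the first order GDALParser comes before ImageParser; in the second, ImageParser comes before GDALParser. Nothing in either file changed, only which one was seen first.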



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216451#comment-14216451
 ] 

Chris A. Mattmann commented on TIKA-1445:
-

Hey Guys, to be honest, the way I see that we solve the ServiceLoading problem 
is somehow to move away from it. Relying on the JVM to implicitly decide which 
parser to load based on ClassLoading is not scalable IMO. At worst, even 
capturing an ordered preference file that isn't ServiceLoading is 1000x better 
IMO than relying on the JVM and the classpath. We need somehow to bring this 
logic into Tika (still thinking about how and will try to prototype something). 

Further, as for the use case of 4-5 service files being common - I guess I'm 
the outlier, b/c I've never ever created or used more than the default one?
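
One way an ordered preference file could work, sketched in plain Java. Everything here is an assumption for illustration: `order`, `rank` and the idea of reading the preference list from a file are hypothetical, and nothing like this exists in Tika as far as this thread shows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class PreferenceOrderSketch {

    /** Position in the preference list, or "last" for names that are not listed. */
    private static int rank(String name, List<String> preference) {
        int i = preference.indexOf(name);
        return i < 0 ? Integer.MAX_VALUE : i;
    }

    /** Orders discovered class names by preference; unlisted names follow, sorted by name. */
    static List<String> order(Collection<String> discovered, List<String> preference) {
        List<String> result = new ArrayList<>(discovered);
        result.sort((a, b) -> {
            int byRank = Integer.compare(rank(a, preference), rank(b, preference));
            return byRank != 0 ? byRank : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        List<String> preference = Arrays.asList(
                "org.apache.tika.parser.ocr.TesseractOCRParser",
                "org.apache.tika.parser.image.ImageParser");
        // Whatever order the classpath yields, the result is the same.
        System.out.println(order(Arrays.asList(
                "org.apache.tika.parser.gdal.GDALParser",
                "org.apache.tika.parser.image.ImageParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser"), preference));
    }
}
```

The result depends only on the preference list and the set of names found, not on the order in which the JVM happened to find them.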



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216466#comment-14216466
 ] 

Nick Burch commented on TIKA-1445:
--

Anyone using tika-parser OOTB has two parsers services files - built-in and 
vorbis. Anyone adding a third party parser under a non-ASLv2 license off the 
wiki will get a third. Anyone adding their own custom parsers following the 
instructions on the website will get a few more. 

My hunch is that most users won't care at all about what order the parsers are 
asked "hey, can you handle this file type?" in. My second hunch is that users 
who do care will typically only care about it for a handful of formats, e.g. 
"for jpeg try ocr then image, everything else default is fine". 

We also need to support those users who currently say "I don't care what you 
find on the classpath, I only ever want you to use these 5 parsers and in this 
explicit order I'm passing you now".

I can describe the problem, but I'm not sure on the right solution at this 
point...
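
Not a solution, but the "handful of formats" case could look roughly like this in plain Java. The sketch assumes a per-media-type override map that falls back to whatever the default lookup would have chosen. `chainFor` and the short parser labels ("ocr", "image", "html") are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PerTypeOverrideSketch {

    /** Explicit chain for the type if one is configured, otherwise the single default parser. */
    static List<String> chainFor(String mediaType,
                                 Map<String, List<String>> overrides,
                                 Map<String, String> defaults) {
        List<String> explicit = overrides.get(mediaType);
        if (explicit != null) {
            return explicit;
        }
        String fallback = defaults.get(mediaType);
        return fallback == null
                ? Collections.<String>emptyList()
                : Collections.singletonList(fallback);
    }

    static List<String> demo(String mediaType) {
        // "for jpeg try ocr then image, everything else default is fine"
        Map<String, List<String>> overrides = new HashMap<>();
        overrides.put("image/jpeg", Arrays.asList("ocr", "image"));

        Map<String, String> defaults = new HashMap<>();
        defaults.put("image/jpeg", "image");
        defaults.put("text/html", "html");
        return chainFor(mediaType, overrides, defaults);
    }

    public static void main(String[] args) {
        System.out.println(demo("image/jpeg"));
        System.out.println(demo("text/html"));
    }
}
```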



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14216960#comment-14216960
 ] 

Chris A. Mattmann commented on TIKA-1445:
-

Hi Nick:

I think we need to be careful to define users. In my case, users aren't 
developers (who I think you are talking about when discussing adding new 
parsers above). My users simply want metadata and parsing that currently are 
partitioned amongst multiple Parsers in Tika, for the same MIME/MediaType. I 
could make one super Parser that combines them together; use the services 
trick per class to declare priority parsers, or delegates, or whatever. I think 
a much more modular and thus more easily maintainable way would be to provide a 
mechanism in which we allow multiple Parsers to be called for the same 
MediaType and to fill the Metadata object and Content stream.

That said, I don't have a solution yet, but I am trying to think of one. Glad 
to have the conversation with you guys here. It's a tough problem.




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217100#comment-14217100
 ] 

Lewis John McGibbney commented on TIKA-1445:


We can run many extractors against one MediaType with Any23. In this case we 
produce triples output. In the case of Tika, if we were to start with a 
scenario where we were *just* populating the Metadata container then I think it 
would be an excellent start.
I'm going to investigate how we currently chain the extractors together in 
Any23 tonight and will make best efforts to report it here. [~p_ansell] can 
maybe help out here as well, as he has been influential in refactoring Any23 
extractor behavior in the past.  



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-18 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14217407#comment-14217407
 ] 

Lewis John McGibbney commented on TIKA-1445:


OK so in Any23, if we were to take the following example where we are focusing 
on a *single document extraction* e.g. (0) then it can be said that for any 
given document, when we run (1) the extraction we:
 * from all registered extractors, filter the extractors by MimeType (2) 
 * from all matching extractors for the given MimeType, create the extractor (3)
 * loop through the matching extractors and actually run (4) each extractor on 
the local document source as an InputStream (5) for instance.
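
The three steps above can be sketched in plain Java as follows. This is not Any23 code: `Extractor`, `Registration` and `extract` are hypothetical names, and the output is just a list of strings rather than triples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

final class ExtractorChainSketch {

    /** Hypothetical extractor: appends whatever it finds to a shared output list. */
    interface Extractor {
        void run(byte[] document, List<String> output);
    }

    /** Hypothetical registry entry: supported MIME types plus a factory. */
    static final class Registration {
        final Set<String> mimeTypes;
        final Supplier<Extractor> factory;

        Registration(Supplier<Extractor> factory, String... mimeTypes) {
            this.factory = factory;
            this.mimeTypes = new HashSet<>(Arrays.asList(mimeTypes));
        }
    }

    static List<String> extract(List<Registration> registry, String mimeType, byte[] document) {
        // Filter the registered extractors by MIME type.
        List<Registration> matching = new ArrayList<>();
        for (Registration r : registry) {
            if (r.mimeTypes.contains(mimeType)) {
                matching.add(r);
            }
        }
        // Create an extractor instance for each match.
        List<Extractor> extractors = new ArrayList<>();
        for (Registration r : matching) {
            extractors.add(r.factory.get());
        }
        // Run every extractor against the same document.
        List<String> output = new ArrayList<>();
        for (Extractor e : extractors) {
            e.run(document, output);
        }
        return output;
    }

    static List<String> demo(String mimeType) {
        List<Registration> registry = Arrays.asList(
                new Registration(() -> (doc, out) -> out.add("size=" + doc.length),
                        "image/png", "image/jpeg"),
                new Registration(() -> (doc, out) -> out.add("text=(ocr placeholder)"),
                        "image/png"),
                new Registration(() -> (doc, out) -> out.add("title=(html placeholder)"),
                        "text/html"));
        return extract(registry, mimeType, new byte[] {1, 2, 3});
    }

    public static void main(String[] args) {
        System.out.println(demo("image/png"));
    }
}
```

For Tika, the shared output list would correspond to the Metadata container that all matching parsers populate.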

We also have Extraction Context and Extraction Reporting layers within Any23 
which may be of use to Tika. To be honest I find the reports and context 
objects extremely useful for obtaining metrics from extraction... maybe we 
could do the same for Tika?

There are some improvements which can be made to SingleDocumentExtraction 
within Any23; however, that conversation is not relevant here. Hopefully the 
high-level overview of the chaining extraction algorithm within Any23 is of 
some value to this conversation.

(0) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java
(1) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L205
(2) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L223
(3) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L252
(4) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L440
(5) 
https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L465



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-17 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14214668#comment-14214668
 ] 

Tim Allison commented on TIKA-1445:
---

This might muddy results, initially, but users could choose to turn off/not 
load parsers that they didn't want.  It would be a significant change over what 
we're currently doing.

How will we handle:
1) Two parsers both set a value in the Metadata object?  Will the second 
overwrite the value of the first?
2) Content:  How will we know when a document ends?  AutoDetectParser would 
wrap the handler in an EndDocumentShieldingContentHandler and then call 
endDocument when done?
3) Will the user be able to parse the output from the handler to figure out 
which parser is responsible for which content?  Let's say a user wants to pull 
the electronic text out of a PDF _and_ render the page as an image and then run 
it through OCR, would we have something like <div parser="o.a.t.p.PDFParser"> 
or similar?
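
For question 3, a minimal sketch of what per-parser wrapping might produce, in plain Java with string building only. The `parser` attribute name comes from the question above. `wrap`, `body` and the escaping helper are hypothetical, and a real implementation would emit SAX events rather than strings.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ParserDivSketch {

    /** Minimal XML escaping for text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps one parser's text in a div that names the parser. */
    static String wrap(String parserName, String text) {
        return "<div parser=\"" + escape(parserName) + "\">" + escape(text) + "</div>";
    }

    /** Builds one body from per-parser text, in insertion order. */
    static String body(Map<String, String> textByParser) {
        StringBuilder sb = new StringBuilder("<body>");
        for (Map.Entry<String, String> e : textByParser.entrySet()) {
            sb.append(wrap(e.getKey(), e.getValue()));
        }
        return sb.append("</body>").toString();
    }

    static String demo() {
        Map<String, String> textByParser = new LinkedHashMap<>();
        textByParser.put("o.a.t.p.PDFParser", "electronic text");
        textByParser.put("o.a.t.p.ocr.TesseractOCRParser", "text from OCR");
        return body(textByParser);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A consumer could then select content by parser name. Plain XHTML does not define a `parser` attribute, so a class value might be a safer carrier.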

If we go this route, we'd want to make sure we don't have literally duplicate 
parsers (as we do now).

This sounds more complicated than having parent parsers know which children 
they control and how to control them, but, it might make sense.

Aside from OCR, what other use cases do we have where we might want multiple 
parsers operating on the same doc type?



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-17 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215170#comment-14215170
 ] 

Luis Filipe Nassif commented on TIKA-1445:
--

+1 to respect the order of parsers in the service file, instead of sorting the 
full class names.

1) Creating a service loading of ImageMetadataParsers, afaik, can have the same 
problem of different parsers trying to set the same metadata values. Metadata 
values are multivalued, so can we simply add the values produced by different 
parsers?

2) Yes, I think CompositeParser should append the content produced by the 
different supported parsers. If users do not want all the parsers, they should 
customize the parser service loading file.  

3) It is a good idea to identify which parser produced each content with a 
div tag.



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-17 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215292#comment-14215292
 ] 

Nick Burch commented on TIKA-1445:
--

 +1 to respect the order of parsers in the service file, instead of sorting 
 the full class names.

The problem is that you can have multiple service files on your classpath. How 
do we respect the order of parsers in that case, when the order we get the 
service files in can be random due to the JVM's behaviour? 

(It was this non-determinism of service files that led us to initially add 
explicit sorting of parsers, so we'd have consistent behaviour between multiple 
runs)



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-17 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14215303#comment-14215303
 ] 

Chris A. Mattmann commented on TIKA-1445:
-

Hey [~talli...@apache.org]:

Here are my replies (also I moved this convo to the dev list since I think it's 
super important!):

{noformat}
#1 We will use a default policy of “append” which allows the Metadata
object to append values to the same key, rather than replace them.
We could also couple this with X-Parsed-By, which is an ordered
list of what Parser parsed what so that we can reconstruct what
Parser contributed what field. If it’s multi-valued, we can also
add fields for Offsets, etc.  An alternative here would also be to
prefix metadata keys in this CompositeParser by the X-Parsed-By
parser name, to avoid conflicts. Users would be able to switch the
policy from “append” to “overwrite” in which this isn’t a problem,
and we simply allow the last parser to input into a conflicting key
to be the one that takes precedence. One option with overwrite would
be to allow in this policy for providing a precedence order of
Parsers (e.g., the current service list could be a precedence order).

That said, how sure are we that this is a *real* problem? Some
parsers parse the same MediaType but contribute vastly different
and non overlapping keys to the metadata object?

#2 I like your suggestion - or the alternative as I suggested would
be to reset the stream to the beginning after each parser, or
alternatively keep a clone of the original stream as a copy, and
then clone it for each called Parser attempt?

#3 I like your idea about wrapping content provided by handlers
with the parser attribute. Very neat, let’s try that!

{noformat}
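
A small sketch of the "append" versus "overwrite" policy from reply #1, in plain Java. The X-Parsed-By key is the one named in the reply. The `Policy` enum, the `merge` helper and the map-based metadata are hypothetical stand-ins rather than Tika API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MergePolicySketch {
    enum Policy { APPEND, OVERWRITE }

    static final String PARSED_BY = "X-Parsed-By";

    /** Folds one parser's key/value pairs into the shared, multi-valued target. */
    static void merge(Map<String, List<String>> target, String parserName,
                      Map<String, String> fromParser, Policy policy) {
        for (Map.Entry<String, String> e : fromParser.entrySet()) {
            List<String> values = target.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            if (policy == Policy.OVERWRITE) {
                values.clear(); // the most recent parser takes precedence
            }
            values.add(e.getValue());
        }
        // Always appended, so the order of contributing parsers can be reconstructed.
        target.computeIfAbsent(PARSED_BY, k -> new ArrayList<>()).add(parserName);
    }

    static Map<String, List<String>> demo(Policy policy) {
        Map<String, String> fromOcr = new LinkedHashMap<>();
        fromOcr.put("comment", "from-ocr");
        Map<String, String> fromImage = new LinkedHashMap<>();
        fromImage.put("comment", "from-image");
        fromImage.put("width", "640");

        Map<String, List<String>> target = new LinkedHashMap<>();
        merge(target, "TesseractOCRParser", fromOcr, policy);
        merge(target, "ImageParser", fromImage, policy);
        return target;
    }

    public static void main(String[] args) {
        System.out.println(demo(Policy.APPEND));
        System.out.println(demo(Policy.OVERWRITE));
    }
}
```

Under "overwrite" the call order is the precedence order, which is where a configurable parser ordering would plug in.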




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213858#comment-14213858
 ] 

Chris A. Mattmann commented on TIKA-1445:
-

Tim, I wonder if it's possible to clone the original InputStream provided and 
to simply reset it to its original state after each Parser is run so that they 
can simply augment rather than replace what's there. I honestly think we should 
run all sets of matching Parsers for a given or detected MediaType. Thoughts?
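
A sketch of the reset idea using only java.io. It assumes the document fits within a fixed mark limit, so larger inputs would need spooling to a temporary file instead. `StreamConsumer` and `runAll` are hypothetical stand-ins for parsers and the code that drives them.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StreamReplaySketch {

    /** Hypothetical stand-in for a parser: reads from the stream and reports something. */
    interface StreamConsumer {
        String consume(InputStream in) throws IOException;
    }

    /** Feeds the same bytes to every consumer by marking before and resetting after each one. */
    static List<String> runAll(InputStream source, int maxBytes, List<StreamConsumer> consumers) {
        // Wrap only if the stream cannot already be marked.
        InputStream in = source.markSupported() ? source : new BufferedInputStream(source);
        List<String> results = new ArrayList<>();
        try {
            for (StreamConsumer consumer : consumers) {
                in.mark(maxBytes);   // remember the start of the document
                results.add(consumer.consume(in));
                in.reset();          // rewind for the next consumer
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return results;
    }

    static List<String> demo() {
        // SequenceInputStream does not support mark/reset, so the wrapper is exercised.
        InputStream source = new SequenceInputStream(
                new ByteArrayInputStream("hel".getBytes(StandardCharsets.US_ASCII)),
                new ByteArrayInputStream("lo".getBytes(StandardCharsets.US_ASCII)));
        List<StreamConsumer> consumers = Arrays.<StreamConsumer>asList(
                in -> { int n = 0; while (in.read() != -1) { n++; } return n + " bytes"; },
                in -> "first byte " + (char) in.read());
        return runAll(source, 1024, consumers);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The second consumer starts at the first byte even though the first consumer read to the end of the stream.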



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212246#comment-14212246
 ] 

Tim Allison commented on TIKA-1445:
---

The AutoDetectParser was doing its regular lookup for which parser supported x 
file type.  No luck in that.

Now, there is unfortunately something approaching luck in how we're handling 
the case where multiple parsers support a given file type.  Our current 
algorithm, if I understand it correctly, is to sort parsers in reverse 
alphabetical order by their package+class name (with a special case to prefer 
non-o.a.t. parsers) and then pick the first parser that claims that it will 
parse the given file type.  

From the DefaultParser:
{noformat}
List<Parser> parsers =
        loader.loadStaticServiceProviders(Parser.class);
Collections.sort(parsers, new Comparator<Parser>() {
    public int compare(Parser p1, Parser p2) {
        String n1 = p1.getClass().getName();
        String n2 = p2.getClass().getName();
        boolean t1 = n1.startsWith("org.apache.tika.");
        boolean t2 = n2.startsWith("org.apache.tika.");
        if (t1 == t2) {
            return n1.compareTo(n2);
        } else if (t1) {
            return -1;
        } else {
            return 1;
        }
    }
});
{noformat}

and 

{noformat}
if (loader != null) {
    // Add dynamic parser service (they always override static ones)
    MediaTypeRegistry registry = getMediaTypeRegistry();
    List<Parser> parsers =
            loader.loadDynamicServiceProviders(Parser.class);
    Collections.reverse(parsers); // best parser last
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            map.put(registry.normalize(type), parser);
        }
    }
}
{noformat}

The luck so far is that, for example, the 
org.apache.tika.parser.gdal.GDALParser parser (which supports jpeg and gif) 
happens to sort after the org.apache.tika.parser.jpeg.JpegParser, the 
org.apache.tika.parser.image.ImageParser and the other o.a.t.p.image.* parsers. 
If you run the GDALParser on /test-documents/testJPEG_EXIF.jpg, you get no 
metadata. :(
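
To make the effect concrete, here is a sketch in plain Java that works on class-name strings only. The sort mirrors the comparator quoted from DefaultParser. The map-building step assumes that a later parser simply overwrites an earlier one for the same type, and the type lists are trimmed-down assumptions, not the parsers' full lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParserSelectionSketch {

    /** org.apache.tika.* names first, then ascending by class name within each group. */
    static List<String> sortLikeDefaultParser(Collection<String> classNames) {
        List<String> sorted = new ArrayList<>(classNames);
        sorted.sort((n1, n2) -> {
            boolean t1 = n1.startsWith("org.apache.tika.");
            boolean t2 = n2.startsWith("org.apache.tika.");
            if (t1 == t2) {
                return n1.compareTo(n2);
            }
            return t1 ? -1 : 1;
        });
        return sorted;
    }

    /** Walks the sorted list; a later parser replaces an earlier one for the same type. */
    static Map<String, String> buildTypeMap(List<String> sorted,
                                            Map<String, List<String>> supportedTypes) {
        Map<String, String> winners = new HashMap<>();
        for (String parser : sorted) {
            for (String type : supportedTypes.getOrDefault(parser, Collections.<String>emptyList())) {
                winners.put(type, parser);
            }
        }
        return winners;
    }

    static Map<String, String> demo(boolean withTesseract) {
        Map<String, List<String>> supported = new LinkedHashMap<>();
        supported.put("org.apache.tika.parser.jpeg.JpegParser",
                Arrays.asList("image/jpeg"));
        supported.put("org.apache.tika.parser.gdal.GDALParser",
                Arrays.asList("image/jpeg", "image/png"));
        supported.put("org.apache.tika.parser.image.ImageParser",
                Arrays.asList("image/png"));
        if (withTesseract) {
            supported.put("org.apache.tika.parser.ocr.TesseractOCRParser",
                    Arrays.asList("image/jpeg", "image/png"));
        }
        return buildTypeMap(sortLikeDefaultParser(supported.keySet()), supported);
    }

    public static void main(String[] args) {
        System.out.println(demo(false));
        System.out.println(demo(true));
    }
}
```

Without the OCR parser, "jpeg" sorting after "gdal" is the only reason JpegParser wins for image/jpeg. Once a class whose package name sorts later is added, it takes over every type it claims.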

Depending on what the community thinks, we may want to open a separate issue 
and change DefaultParser's method of selecting a parser so that it:

1) selects non-o.a.t. parsers first
2) respects the order of parsers in the services files

This wouldn't change the behavior, but it would allow users to select parser 
preference by a means other than relying on reverse alphabetical order.




[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-14 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212258#comment-14212258
 ] 

Tim Allison commented on TIKA-1445:
---

This is what we're currently doing in CompositeParser#getParsers(ParseContext 
context) 

{noformat}
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with 
o.a.t.p.hdf.HDFParser@488a5770 for application/x-hdf
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with 
o.a.t.p.image.ImageParser@72729f44 for image/x-ms-bmp
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with 
o.a.t.p.image.ImageParser@72729f44 for image/png
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with 
o.a.t.p.image.ImageParser@72729f44 for image/gif
clobbering: o.a.t.p.image.ImageParser@72729f44 with 
o.a.t.p.image.ImageParser@72729f44 for image/x-ms-bmp
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with 
o.a.t.p.jpeg.JpegParser@4336640f for image/jpeg
clobbering: o.a.t.p.microsoft.TNEFParser@27e33742 with 
o.a.t.p.microsoft.TNEFParser@27e33742 for application/vnd.ms-tnef
clobbering: o.a.t.p.gdal.GDALParser@677556a0 with 
o.a.t.p.netcdf.NetCDFParser@3640e283 for application/x-netcdf
clobbering: o.a.t.p.image.ImageParser@72729f44 with 
o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/x-ms-bmp
clobbering: o.a.t.p.jpeg.JpegParser@4336640f with 
o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/jpeg
clobbering: o.a.t.p.image.ImageParser@72729f44 with 
o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/png
clobbering: o.a.t.p.image.TiffParser@570bd519 with 
o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/tiff
clobbering: o.a.t.p.image.ImageParser@72729f44 with 
o.a.t.p.ocr.TesseractOCRParser@5dd72248 for image/gif
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.image-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.spreadsheet-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.chart-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.formula
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.text-web
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.text
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.formula-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.spreadsheet
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.text-master
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.text-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.graphics
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.graphics-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.presentation
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.image
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.presentation-template
clobbering: o.a.t.p.odf.OpenDocumentParser@49d388f4 with 
o.a.t.p.odf.OpenDocumentParser@49d388f4 for 
application/vnd.oasis.opendocument.chart
clobbering: o.a.t.p.pkg.CompressorParser@5ec47109 with 
o.a.t.p.pkg.CompressorParser@5ec47109 for application/gzip

{noformat}


[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-13 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14211277#comment-14211277
 ] 

Tyler Palsulich commented on TIKA-1445:
---

[~talli...@apache.org], what was the system before the Tesseract Parser? Were 
we just getting lucky that metadata was extracted?



[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-27 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185574#comment-14185574
 ] 

Tim Allison commented on TIKA-1445:
---

I played with this a bit with a png test file.

The problem there is that besides the TesseractOCRParser, the GDALParser and 
the ImageParser both process png files.  So, there's no way to guarantee that 
the other parser actually parses Metadata.

One hack would be to hardcode checking the ImageParser or the JpegParser only 
to see if there is a match.

A better option would be something along the lines of what we do with the 
service loading pattern with AutoDetectReader.

The user could specify ImageMetadataParsers in a service listing, and we would 
try each one in turn to see if there is a match on type.


 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-27 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14185644#comment-14185644
 ] 

Tyler Palsulich commented on TIKA-1445:
---

bq. Doh! Send in a DefaultHandler instead of BodyContentHandler to the 
otherParser
I made the same mistake.

I think our ideas are very similar, but I offloaded the dynamic loading to 
{{DefaultParser.getAllParsersFor}}, since it already has service loading. My 
logic for getting the underlying DefaultParser from the AutoDetectParser is 
somewhat hacky, though. +1 to the expanded tests and always parsing with the 
otherParser!
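
To illustrate the handler point quoted above, here is a small sketch using only the JDK's SAX classes. {{otherParse}} is a hypothetical stand-in for the otherParser, and the metadata key and text are made up; the assumption is that the second parser is wanted only for its metadata. A plain {{DefaultHandler}} ignores every content event, so the second parser's text never reaches the output, while a collecting handler would pick it up.

```java
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class MetadataOnlyDemo {
    // Hypothetical stand-in for the otherParser: it fills in metadata
    // and also emits some text through the handler.
    static void otherParse(ContentHandler handler, Map<String, String> metadata)
            throws SAXException {
        metadata.put("width", "100"); // illustrative key, not a real Tika property
        char[] text = "text from the other parser".toCharArray();
        handler.startDocument();
        handler.characters(text, 0, text.length);
        handler.endDocument();
    }

    // A handler that keeps character data, roughly what a body handler does.
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }
}
```

Calling {{MetadataOnlyDemo.otherParse(new DefaultHandler(), metadata)}} still populates the metadata map but discards the text.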

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.8

 Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
 TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-24 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14183873#comment-14183873
 ] 

Tyler Palsulich commented on TIKA-1445:
---

I've been trying my hand at this for some time now. One idea I had was to 
create a temporary file from the input InputStream, then create new input 
streams from that file to run each Parser on.

But before this OCR Parser, we only ran one Parser on the image anyway. So, 
what if there was a way to get the second best default parser for the image? 
One option is to hard code the exact working Parsers, but in my opinion we 
should load them dynamically. That would require getting a 
{{List<Parser>}}, instead of just the best Parser for a given MediaType 
({{CompositeParser.getParsers(ParseContext)}}). 

If we only chose the second best Parser, we wouldn't have to merge the Metadata 
results, since the OCRParser doesn't add Metadata. But, it might call the 
ContentHandler.
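
The temporary file idea from the first paragraph could look roughly like the sketch below. It uses only {{java.nio.file}}; {{StreamParser}} and {{TempFileFanOut}} are hypothetical names for illustration, and a plain map stands in for Metadata. In real Tika code the existing temporary-resource helpers would probably be the better fit.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser that consumes a stream.
interface StreamParser {
    void parse(InputStream in, Map<String, String> metadata) throws IOException;
}

final class TempFileFanOut {
    // Spools the input to a temporary file once, then gives every parser
    // its own fresh stream over that file.
    static void parseWithAll(InputStream input, List<StreamParser> parsers,
                             Map<String, String> metadata) throws IOException {
        Path tmp = Files.createTempFile("image-", ".tmp");
        try {
            // createTempFile already created the file, so replace it.
            Files.copy(input, tmp, StandardCopyOption.REPLACE_EXISTING);
            for (StreamParser parser : parsers) {
                try (InputStream fresh = Files.newInputStream(tmp)) {
                    parser.parse(fresh, metadata);
                }
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Each parser can consume its stream completely without affecting the next one, at the cost of one extra copy of the image on disk.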

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1445.Mattmann.101214.patch.txt


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-10-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169090#comment-14169090
 ] 

Hong-Thai Nguyen commented on TIKA-1445:


Interesting question!
For me, parser selection and parser priority should be decided at runtime by 
configuration, not inside a parser.
Image parsing is an interesting case of competing parsers (Tesseract vs. the 
classical image parsers). We have two problems here:
1. When several parsers can handle the same MIME type, which one is selected?
2. When we have several parsers, can we apply all of them and merge the results 
(metadata and handler)?

* For case 1, if we use an override configuration of parsers at runtime, we can 
declare several parsers with a matching MIME type, and the later one in the 
list will be selected. We may extend the CLI/web service to inject this kind of 
configuration.
* For case 2, we don't have a solution for now. We may extend CompositeParser 
to accept a 'many' parsers mode and call the matching parsers in a chain. 
Merging the results is another problem. We can accept that a value is 
overridden when another parser writes the same metadata name. The ideal 
solution is (again) a nested structure in our metadata that stores each 
parser's result separately.
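
The two merging strategies for case 2 can be sketched as follows. This is only an illustration under stated assumptions: {{MetadataMerger}} is a hypothetical class, plain string maps stand in for Tika's Metadata, and the nested structure is approximated by prefixing each name with the parser that produced it.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MetadataMerger {
    // Override strategy: parsers are visited in map order, and a later
    // parser's value replaces an earlier one with the same name.
    static Map<String, String> mergeOverride(
            Map<String, Map<String, String>> resultsByParser) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> result : resultsByParser.values()) {
            merged.putAll(result);
        }
        return merged;
    }

    // Nested strategy, flattened: every name is prefixed with its parser,
    // so no parser can overwrite another parser's value.
    static Map<String, String> mergeNamespaced(
            Map<String, Map<String, String>> resultsByParser) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> parser : resultsByParser.entrySet()) {
            for (Map.Entry<String, String> field : parser.getValue().entrySet()) {
                merged.put(parser.getKey() + ":" + field.getKey(), field.getValue());
            }
        }
        return merged;
    }
}
```

Passing a {{LinkedHashMap}} keyed by parser name keeps the call order, which is what decides the winner in the override strategy.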

 Figure out how to add Image metadata extraction to Tesseract parser
 ---

 Key: TIKA-1445
 URL: https://issues.apache.org/jira/browse/TIKA-1445
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.7

 Attachments: TIKA-1445.Mattmann.101214.patch.txt


 Now that Tesseract is the default image parser in Tika for many image types, 
 consider how to add back in the metadata extraction capabilities by the other 
 Image parsers.


