[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894779#comment-13894779
 ] 

Grant Ingersoll commented on TIKA-93:
-------------------------------------

I'm noodling around with producing a patch for this and have a few questions 
for the group:

# Where in Tika do people usually put these kinds of "downstream" tasks?  
Presumably we would need to work with the mime type detection process to know 
that the input is something that is binary and potentially OCR-able.  I would 
imagine we would want something that inserts between Detection and Parsing.  
I'd also suggest we make it pluggable, so that we can support other OCR 
solutions.
# Is anyone aware of anything in PDFBox that lets you tell whether a document 
is an image-based PDF?
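
To make the first question concrete, here is a rough Java sketch of what a 
pluggable hook between detection and parsing might look like. Everything 
named here (OcrSketch, OcrEngine, TesseractCliEngine, isOcrCandidate, 
tesseractCommand) is hypothetical and not existing Tika API. The only 
external fact relied on is the Tesseract command line form 
"tesseract <image> <outputbase> -l <lang>", which writes <outputbase>.txt. 
The MIME check is a deliberate simplification: it would not catch 
image-based PDFs, which is exactly the open second question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class OcrSketch {

    /** Hypothetical plug-in point so other OCR back ends can be swapped in. */
    public interface OcrEngine {
        String recognize(Path image) throws IOException, InterruptedException;
    }

    /** Assumption: only image/* types are OCR candidates (ignores image-based PDFs). */
    static boolean isOcrCandidate(String mimeType) {
        return mimeType != null
                && mimeType.toLowerCase(Locale.ROOT).startsWith("image/");
    }

    /** Builds: tesseract <image> <outputBase> -l <lang>. Output goes to <outputBase>.txt. */
    static List<String> tesseractCommand(Path image, Path outputBase, String lang) {
        return Arrays.asList(
                "tesseract", image.toString(), outputBase.toString(), "-l", lang);
    }

    /** One possible engine: shells out to a tesseract binary assumed to be on the PATH. */
    static final class TesseractCliEngine implements OcrEngine {
        private final String lang;

        TesseractCliEngine(String lang) {
            this.lang = lang;
        }

        @Override
        public String recognize(Path image) throws IOException, InterruptedException {
            Path dir = Files.createTempDirectory("ocr");
            Path outputBase = dir.resolve("out");
            Process process = new ProcessBuilder(tesseractCommand(image, outputBase, lang))
                    .inheritIO() // let tesseract's own messages go to our console
                    .start();
            int exit = process.waitFor();
            if (exit != 0) {
                throw new IOException("tesseract exited with status " + exit);
            }
            // Tesseract appends ".txt" to the output base name.
            Path textFile = dir.resolve("out.txt");
            return new String(Files.readAllBytes(textFile), StandardCharsets.UTF_8);
        }
    }
}
```

A caller would run detection first, check isOcrCandidate on the detected 
type, and only then hand the file to whichever OcrEngine is configured.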





> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.



