[ 
https://issues.apache.org/jira/browse/NIFI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248878#comment-15248878
 ] 

Jeremy Dyer commented on NIFI-1718:
-----------------------------------

[~dgoldenberg] I came to create a jira for a NiFi Tesseract processor today and 
stumbled across this one. Seems I'm a few days late. I have already created a 
purely Tesseract-based processor that accounts for all of the bullet points you 
listed (plus the ability to pass in raw configuration key/values), but it 
doesn't use Tika as you have described here. I would be glad to contribute what 
I have, but I wanted to run it by you first since you specifically called out 
Tika and I'm not using it. Would it be a big deal if my implementation didn't 
use Tika explicitly, or do you need Tika for something else?

Just for reference, here is a quick screen recording of what I have so far:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer

> Processor(s) to perform OCR
> ---------------------------
>
>                 Key: NIFI-1718
>                 URL: https://issues.apache.org/jira/browse/NIFI-1718
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>
> This ticket is a follow-up to NIFI-1717.
> Apache Tika by default performs OCR on image files such as PNG, BMP, JPEG, 
> GIF, etc. using Tesseract, assuming that it is installed and properly 
> configured.
> Design issue: should ExtractMediaAttributes processor allow Tika to perform 
> OCR or should OCR be handled elsewhere, whether by a processor or by a 
> service?  Could both models be allowed, where ExtractMediaAttributes supports 
> OCR but there's also a separate PerformOCR processor and/or service?
> If OCR is supported on the ExtractMediaAttributes processor, it'd be best if 
> it supported the following OCR related options (which are exposed by Tika's 
> TesseractOCRConfig class):
> * tesseractPath - Path to tesseract installation folder, if not on system 
> path.
> * language - Language ID (e.g. "eng"); language dictionary to be used.
> * pageSegMode - Tesseract page segmentation mode, defaults to 1.
> * minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> * maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to 
> Integer.MAX_VALUE.
> * timeout - Maximum time (in seconds) to wait for the OCR process 
> termination; defaults to 120.
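
For illustration only, the options listed above could be modeled roughly as
follows. This is a minimal sketch, not Tika's TesseractOCRConfig and not an
actual NiFi processor: the class name OcrOptionsSketch, the asProperties()
helper, the inclusive size bounds, and the use of "eng" as a default language
are all assumptions made for the example. Only the pageSegMode, file size, and
timeout defaults come from the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the OCR options in the ticket; not a Tika or NiFi API. */
final class OcrOptionsSketch {
    String tesseractPath = "";                  // empty: assume tesseract is on the system path
    String language = "eng";                    // assumed default; the ticket gives "eng" only as an example
    int pageSegMode = 1;                        // default stated in the ticket
    long minFileSizeToOcr = 0;                  // default stated in the ticket
    long maxFileSizeToOcr = Integer.MAX_VALUE;  // default stated in the ticket
    int timeoutSeconds = 120;                   // default stated in the ticket

    /** Size gate: true when the file size lies within [min, max] (inclusive bounds assumed). */
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }

    /** Flattens the options into key/value pairs, keyed by the names used in the ticket. */
    Map<String, String> asProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("tesseractPath", tesseractPath);
        props.put("language", language);
        props.put("pageSegMode", Integer.toString(pageSegMode));
        props.put("minFileSizeToOcr", Long.toString(minFileSizeToOcr));
        props.put("maxFileSizeToOcr", Long.toString(maxFileSizeToOcr));
        props.put("timeout", Integer.toString(timeoutSeconds));
        return props;
    }

    public static void main(String[] args) {
        OcrOptionsSketch options = new OcrOptionsSketch();
        options.minFileSizeToOcr = 1024; // skip files smaller than 1 KiB
        System.out.println(options.asProperties());
        System.out.println("OCR a 10-byte file?   " + options.shouldOcr(10));
        System.out.println("OCR a 50,000-byte file? " + options.shouldOcr(50_000));
    }
}
```

A processor or service built along these lines would check shouldOcr() before
handing a file to Tesseract, so files outside the size window are skipped
without starting an OCR process.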



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
