[ https://issues.apache.org/jira/browse/NIFI-1718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248878#comment-15248878 ]
Jeremy Dyer commented on NIFI-1718:
-----------------------------------

[~dgoldenberg] I came to create a JIRA for a NiFi Tesseract processor today and stumbled across this one. Seems I'm a few days late.

I have already created a purely Tesseract-based processor that accounts for all of the bullet points you listed (plus the ability to pass in raw configuration key/value pairs), but it doesn't use Tika as you have described here. I would be glad to contribute what I have, but I wanted to run it by you first since you specifically called out Tika and I'm not using it. Would it be a big deal if my implementation didn't use Tika explicitly, or do you need Tika for something else?

Just for reference, here is a quick screen recording of what I have so far:
https://www.linkedin.com/pulse/nifi-ocr-using-apache-read-childrens-books-jeremy-dyer

> Processor(s) to perform OCR
> ---------------------------
>
>                 Key: NIFI-1718
>                 URL: https://issues.apache.org/jira/browse/NIFI-1718
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>
> This ticket is a follow-up to NIFI-1717.
> Apache Tika by default performs OCR on image files such as PNG, BMP, JPEG,
> GIF, etc. using Tesseract, assuming that it is installed and properly
> configured.
> Design issue: should the ExtractMediaAttributes processor allow Tika to
> perform OCR, or should OCR be handled elsewhere, whether by a processor or by
> a service? Could both models be allowed, where ExtractMediaAttributes
> supports OCR but there's also a separate PerformOCR processor and/or service?
> If OCR is supported on the ExtractMediaAttributes processor, it'd be best if
> it supported the following OCR-related options (which are exposed by Tika's
> TesseractOCRConfig class):
> * tesseractPath - Path to tesseract installation folder, if not on system
> path.
> * language - Language ID (e.g. "eng"); language dictionary to be used.
> * pageSegMode - Tesseract page segmentation mode, defaults to 1.
> * minFileSizeToOcr - Minimum file size to submit file to OCR, defaults to 0.
> * maxFileSizeToOcr - Maximum file size to submit file to OCR, defaults to
> Integer.MAX_VALUE.
> * timeout - Maximum time (in seconds) to wait for the OCR process
> termination; defaults to 120.
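
For illustration only, the options listed in the ticket could be modelled roughly as below. This is a minimal sketch, not Tika's TesseractOCRConfig and not the processor discussed in the comment: the class name OcrOptions, its field names, the shouldOcr helper, the empty-string default for tesseractPath, the "eng" default for language, and the inclusive size bounds are all assumptions. Only the defaults for pageSegMode, minFileSizeToOcr, maxFileSizeToOcr and timeout come from the ticket.

```java
/**
 * Hypothetical holder for the OCR options named in NIFI-1718.
 * This is NOT Tika's TesseractOCRConfig; names and behavior are assumptions.
 */
final class OcrOptions {
    // Assumption: an empty path means "tesseract is on the system path".
    String tesseractPath = "";
    // Assumption: "eng" is only the ticket's example value, used here as a default.
    String language = "eng";
    // Defaults below are the ones stated in the ticket.
    int pageSegMode = 1;
    long minFileSizeToOcr = 0;
    long maxFileSizeToOcr = Integer.MAX_VALUE;
    int timeoutSeconds = 120;

    /** Returns true when the file size falls inside the configured range (bounds assumed inclusive). */
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr && fileSizeBytes <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        OcrOptions options = new OcrOptions();
        options.minFileSizeToOcr = 1024; // skip very small files in this example
        System.out.println("OCR a 100-byte file? " + options.shouldOcr(100));
        System.out.println("OCR a 1 MiB file? " + options.shouldOcr(1024L * 1024L));
    }
}
```

A processor built along these lines would presumably expose each field as a configurable property and apply the size check before handing a file to Tesseract. Whether that logic lives in ExtractMediaAttributes or in a separate OCR processor is the open design question raised above.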