Dmitry Goldenberg created NIFI-1718:
---------------------------------------

             Summary: Processor(s) to perform OCR
                 Key: NIFI-1718
                 URL: https://issues.apache.org/jira/browse/NIFI-1718
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Core Framework
            Reporter: Dmitry Goldenberg


This ticket is a follow-up to NIFI-1717.

By default, Apache Tika performs OCR on image files such as PNG, BMP, JPEG, GIF, etc. using Tesseract, provided that Tesseract is installed and properly configured.

Design issue: should the ExtractMediaAttributes processor allow Tika to perform OCR, or should OCR be handled elsewhere, whether by a processor or by a service? Could both models be allowed, where ExtractMediaAttributes supports OCR but there is also a separate PerformOCR processor and/or service?

If OCR is supported on the ExtractMediaAttributes processor, it would be best if it supported the following OCR-related options (which are exposed by Tika's TesseractOCRConfig class):

* tesseractPath - Path to the Tesseract installation folder, if not on the system path.
* language - Language ID (e.g. "eng"); the language dictionary to be used.
* pageSegMode - Tesseract page segmentation mode; defaults to 1.
* minFileSizeToOcr - Minimum file size to submit a file to OCR; defaults to 0.
* maxFileSizeToOcr - Maximum file size to submit a file to OCR; defaults to Integer.MAX_VALUE.
* timeout - Maximum time (in seconds) to wait for the OCR process to terminate; defaults to 120.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
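
To make the option list in the ticket concrete, here is a minimal sketch of how a processor might collect those six settings and apply the defaults stated above. It deliberately avoids calling Tika or NiFi APIs: OcrOptions, fromProperties, and shouldOcr are hypothetical names invented for illustration, the property keys simply mirror the option names in the ticket, the "eng" default for language is an assumption (the ticket gives it only as an example), and treating the size limits as inclusive byte counts is also an assumption.

```java
import java.util.Map;

// Hypothetical holder for the OCR options listed in the ticket.
// Not part of NiFi or Tika; field names mirror the ticket's option names.
final class OcrOptions {
    final String tesseractPath;  // empty means "rely on the system path"
    final String language;       // assumed default "eng"; the ticket only cites it as an example
    final int pageSegMode;       // ticket default: 1
    final int minFileSizeToOcr;  // ticket default: 0
    final int maxFileSizeToOcr;  // ticket default: Integer.MAX_VALUE
    final int timeoutSeconds;    // ticket default: 120

    private OcrOptions(Map<String, String> p) {
        tesseractPath = p.getOrDefault("tesseractPath", "");
        language = p.getOrDefault("language", "eng");
        pageSegMode = intOf(p, "pageSegMode", 1);
        minFileSizeToOcr = intOf(p, "minFileSizeToOcr", 0);
        maxFileSizeToOcr = intOf(p, "maxFileSizeToOcr", Integer.MAX_VALUE);
        timeoutSeconds = intOf(p, "timeout", 120);

        // Reject combinations that could never select a file or never finish.
        if (minFileSizeToOcr < 0 || minFileSizeToOcr > maxFileSizeToOcr) {
            throw new IllegalArgumentException("minFileSizeToOcr must be between 0 and maxFileSizeToOcr");
        }
        if (timeoutSeconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
    }

    // Builds the options from user-supplied properties; missing keys fall back to defaults.
    static OcrOptions fromProperties(Map<String, String> props) {
        return new OcrOptions(props);
    }

    private static int intOf(Map<String, String> p, String key, int fallback) {
        String raw = p.get(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    // Assumption: sizes are in bytes and both limits are inclusive.
    boolean shouldOcr(long fileSize) {
        return fileSize >= minFileSizeToOcr && fileSize <= maxFileSizeToOcr;
    }
}
```

With an empty property map, every field takes the default from the list above, so shouldOcr accepts any file from 0 up to Integer.MAX_VALUE in size. A real processor would then hand these values to Tika's TesseractOCRConfig, and the same holder could back either an OCR-capable ExtractMediaAttributes or a separate PerformOCR processor.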