Dmitry Goldenberg created NIFI-1718:
---------------------------------------

             Summary: Processor(s) to perform OCR
                 Key: NIFI-1718
                 URL: https://issues.apache.org/jira/browse/NIFI-1718
             Project: Apache NiFi
          Issue Type: New Feature
          Components: Core Framework
            Reporter: Dmitry Goldenberg


This ticket is a follow-up to NIFI-1717.

Apache Tika by default performs OCR on image files such as PNG, BMP, JPEG, GIF, 
etc. using Tesseract, assuming that it is installed and properly configured.

Design issue: should the ExtractMediaAttributes processor allow Tika to perform 
OCR, or should OCR be handled elsewhere, whether by a processor or by a service?  
Could both models be allowed, where ExtractMediaAttributes supports OCR but 
there's also a separate PerformOCR processor and/or service?

If OCR is supported in the ExtractMediaAttributes processor, it'd be best if it 
supported the following OCR-related options (which are exposed by Tika's 
TesseractOCRConfig class):

* tesseractPath - Path to the Tesseract installation folder, if it is not on 
the system path.
* language - Language ID (e.g. "eng"); selects the language dictionary to use.
* pageSegMode - Tesseract page segmentation mode; defaults to 1.
* minFileSizeToOcr - Minimum file size to submit to OCR; defaults to 0.
* maxFileSizeToOcr - Maximum file size to submit to OCR; defaults to 
Integer.MAX_VALUE.
* timeout - Maximum time (in seconds) to wait for the OCR process to terminate; 
defaults to 120.
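
To make the option list concrete, here is a minimal sketch of how a processor 
might hold these settings and decide whether a file is eligible for OCR. The 
class OcrOptions and its method shouldOcr are hypothetical names used only for 
illustration; they are not part of NiFi or Tika. The defaults are the ones 
listed above, except for "eng" as the default language and the inclusive size 
bounds, which are assumptions.

```java
// Hypothetical holder for the OCR options listed in this ticket.
// Not a NiFi or Tika class; names and structure are illustrative only.
class OcrOptions {
    String tesseractPath = "";                  // empty: rely on the system path
    String language = "eng";                    // assumed default language ID
    String pageSegMode = "1";                   // Tesseract page segmentation mode
    long minFileSizeToOcr = 0;                  // bytes
    long maxFileSizeToOcr = Integer.MAX_VALUE;  // bytes
    int timeoutSeconds = 120;                   // max wait for the OCR process

    // Returns true when the file size falls inside the configured window.
    // Treating both bounds as inclusive is an assumption of this sketch.
    boolean shouldOcr(long fileSizeBytes) {
        return fileSizeBytes >= minFileSizeToOcr
                && fileSizeBytes <= maxFileSizeToOcr;
    }

    public static void main(String[] args) {
        OcrOptions options = new OcrOptions();
        options.minFileSizeToOcr = 1024;        // skip very small images
        System.out.println("512-byte file eligible:  " + options.shouldOcr(512));
        System.out.println("4096-byte file eligible: " + options.shouldOcr(4096));
    }
}
```

In a real processor each field would presumably be exposed as a configurable 
property and passed through to Tika's TesseractOCRConfig, rather than being 
checked by hand as done here.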


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
