[ 
https://issues.apache.org/jira/browse/NIFI-1717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Goldenberg updated NIFI-1717:
------------------------------------
    Description: 
This would be a continuation of the work that Joe Skora did on the 
ExtractMediaAttributes processor.

The design discussions so far have centered around the following.

1. The processor will continue to use Apache Tika to extract metadata from 
incoming files, content from the incoming files, or both, as configured.
2. The extracted metadata shall be added as values of attributes on the 
FlowFile.
3. The extracted text shall be added as a value of the field "text".
4. There need to be configuration options to let the user tell the processor 
what needs to be extracted and for which cases. Building on the filename and 
MIME type filters provided by Joe:

* INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files 
get their content extracted, by file name
* INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files 
get their metadata extracted, by file name
* INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files 
get their content extracted, by MIME type
* INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files 
get their metadata extracted, by MIME type
* EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files 
do NOT get their content extracted, by file name
* EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files 
do NOT get their metadata extracted, by file name
* EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files 
do NOT get their content extracted, by MIME type
* EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files 
do NOT get their metadata extracted, by MIME type

Per Joe's point, an exclusion shall trump an inclusion rule.
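
As a rough illustration of the precedence rule above, the following is a 
minimal sketch only, not the processor's actual implementation. It assumes 
(none of this is settled in the ticket) that each filter is a Java regular 
expression matched against the whole file name or MIME type, and that an unset 
include filter accepts everything. The class name ExtractionFilterSketch and 
its accepts() method are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: one instance per include/exclude property pair,
// e.g. INCLUDE_CONTENT_FILENAME_FILTER + EXCLUDE_CONTENT_FILENAME_FILTER.
final class ExtractionFilterSketch {
    private final Pattern include; // null means "no include filter configured"
    private final Pattern exclude; // null means "no exclude filter configured"

    ExtractionFilterSketch(String includeRegex, String excludeRegex) {
        this.include = includeRegex == null ? null : Pattern.compile(includeRegex);
        this.exclude = excludeRegex == null ? null : Pattern.compile(excludeRegex);
    }

    // Returns true if extraction should run for the given file name or MIME type.
    boolean accepts(String value) {
        // An exclusion trumps an inclusion, so check it first.
        if (exclude != null && exclude.matcher(value).matches()) {
            return false;
        }
        // Assumption: with no include filter, everything not excluded is accepted.
        return include == null || include.matcher(value).matches();
    }
}
```

With an include pattern of ".*\.pdf" and an exclude pattern of "draft.*", 
"report.pdf" would be accepted while "draft-report.pdf" would be rejected, 
even though it also matches the include pattern. How the file name and MIME 
type results are combined for a single file is left open here.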

Apache Tika has integrated OCR support via Tesseract.  If Tesseract is 
installed and properly configured, Tika performs OCR on image files such as 
PNG, BMP, JPEG, and GIF.

A separate ticket, NIFI-1718, is meant to address how OCR should be handled, 
as it is an expensive operation that may require special configuration and 
handling.

  was:
This would be a continuation of the work that Joe Skora did on the 
ExtractMediaAttributes processor.

The design discussions so far have centered around the following.

1. The processor will continue to use Apache Tika to extract metadata from 
incoming files, content from the incoming files, or both, as configured.
2. The extracted metadata shall be added as values of attributes on the 
FlowFile.
3. The extracted text shall be added as a value of the field "text".
4. There need to be configuration options to let the user tell the processor 
what needs to be extracted and for which cases. Building on the filename and 
MIME type filters provided by Joe:

* INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files 
get their content extracted, by file name
* INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files 
get their metadata extracted, by file name
* INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files 
get their content extracted, by MIME type
* INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files 
get their metadata extracted, by MIME type
* EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input files 
do NOT get their content extracted, by file name
* EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input files 
do NOT get their metadata extracted, by file name
* EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input files 
do NOT get their content extracted, by MIME type
* EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input files 
do NOT get their metadata extracted, by MIME type

Per Joe's point, an exclusion shall trump an inclusion rule.

Apache Tika has integrated OCR support via Tesseract.  If Tesseract is 
installed and properly configured, Tika performs OCR on image files such as 
PNG, BMP, JPEG, and GIF.

A separate ticket will address how OCR should be handled, as it is an 
expensive operation that may require special configuration and handling.


> Processor to extract metadata attributes and content from incoming files
> ------------------------------------------------------------------------
>
>                 Key: NIFI-1717
>                 URL: https://issues.apache.org/jira/browse/NIFI-1717
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Core Framework
>            Reporter: Dmitry Goldenberg
>
> This would be a continuation of the work that Joe Skora did on the 
> ExtractMediaAttributes processor.
> The design discussions so far have centered around the following.
> 1. The processor will continue to use Apache Tika to extract metadata from 
> incoming files, content from the incoming files, or both, as configured.
> 2. The extracted metadata shall be added as values of attributes on the 
> FlowFile.
> 3. The extracted text shall be added as a value of the field "text".
> 4. There need to be configuration options to let the user tell the processor 
> what needs to be extracted and for which cases. Building on the filename and 
> MIME type filters provided by Joe:
> * INCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input 
> files get their content extracted, by file name
> * INCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input 
> files get their metadata extracted, by file name
> * INCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input 
> files get their content extracted, by MIME type
> * INCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input 
> files get their metadata extracted, by MIME type
> * EXCLUDE_CONTENT_FILENAME_FILTER - defines any patterns for which input 
> files do NOT get their content extracted, by file name
> * EXCLUDE_METADATA_FILENAME_FILTER - defines any patterns for which input 
> files do NOT get their metadata extracted, by file name
> * EXCLUDE_CONTENT_MIMETYPE_FILTER - defines any patterns for which input 
> files do NOT get their content extracted, by MIME type
> * EXCLUDE_METADATA_MIMETYPE_FILTER - defines any patterns for which input 
> files do NOT get their metadata extracted, by MIME type
> Per Joe's point, an exclusion shall trump an inclusion rule.
> Apache Tika has integrated OCR support via Tesseract.  If Tesseract is 
> installed and properly configured, Tika performs OCR on image files such as 
> PNG, BMP, JPEG, and GIF.
> A separate ticket, NIFI-1718, is meant to address how OCR should be handled, 
> as it is an expensive operation that may require special configuration and 
> handling.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
