GitHub user jskora commented on the pull request:
https://github.com/apache/nifi/pull/252#issuecomment-211868734
@joewitt On [NIFI-1717|https://issues.apache.org/jira/browse/NIFI-1717] and
[NIFI-1718|https://issues.apache.org/jira/browse/NIFI-1718] Dmitry Goldenberg
and I discussed using Tika to extract content (OCR) from documents and images.
@markap14 also suggested removing the filters.
I don't know where the OCR changes stand; those tickets have been quiet for
a couple of weeks. I think that's a tougher capability to test, and as pointed
out on [NIFI-1717|https://issues.apache.org/jira/browse/NIFI-1717] and
[NIFI-1718|https://issues.apache.org/jira/browse/NIFI-1718] it is an expensive
process that may need special consideration.
As for the filters, I like having them in the processor, especially since
this one includes filename and mimetype filters. If consensus is to remove
them, I can update the PR for that, but I think they are effective for this
purpose as it currently is.
I don't think we should hold this for the OCR, but if you want the filters
removed let me know. It'd be nice to get the metadata functionality in.