Github user mpetronic commented on the pull request:
https://github.com/apache/nifi/pull/112#issuecomment-152697600
Joe, thanks for getting this processor going. I need it. :) I've pulled
this in and am giving it a try. I have some additional thoughts on
functionality.
1. Should it have a "Recurse subdirectories" option? I mention this
because, in my setup, I have to scan files from an NFS share, which is not
very fast, especially if you recurse many levels of subdirs that you don't
really need to look at. That's a special case, I know, but it is a valid use
case, and we could eliminate some latency by not requiring a full recursive
scan all the time.
2. Should it have the option to specify a seeded last modified time? Say
there is a directory full of files from days or weeks ago, but you only want
to start pulling them in from, say, one day ago or some specific date/time,
and not pick up all the previous files.
3. If there are empty directories in the path you are scanning, they get
listed in the "filename" attribute, just like an actual file would be. I think
it would be nice to have another attribute that indicates whether the leaf
node is a file or a directory, as downstream processors could more easily use
that value to decide how to act.
4. Should it expose each file's actual last modified timestamp in the
FlowFile Attribute Map Content?
I guess for all other types of filtering, like wildcards and such, the
right 'NiFi' thing to do is use a downstream "UpdateAttribute" processor to
massage the list. Correct? Maybe this also applies to #2 above, then?
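
To make suggestions 1 through 4 a bit more concrete, here is a rough sketch
in plain java.nio. It is not the processor's actual code, and everything in
it is an assumption on my part: the class name ListingSketch, the method
list(), its two parameters (recurse, minLastModifiedMillis), and the attribute
names "entry.type" and "file.lastModifiedTime" are all made up for
illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch only: plain java.nio, not the processor's actual code.
class ListingSketch {

    /**
     * Lists entries under root, one attribute map per entry.
     *
     * @param recurse               false = only direct children (suggestion 1)
     * @param minLastModifiedMillis entries older than this are skipped (suggestion 2)
     */
    static List<Map<String, String>> list(Path root, boolean recurse,
                                          long minLastModifiedMillis) throws IOException {
        // A depth of 1 visits only the root and its direct children.
        int maxDepth = recurse ? Integer.MAX_VALUE : 1;

        List<Path> entries;
        try (Stream<Path> walk = Files.walk(root, maxDepth)) {
            entries = walk.filter(p -> !p.equals(root)) // the root itself is not an entry
                          .sorted()                     // stable order for readability
                          .collect(Collectors.toList());
        }

        List<Map<String, String>> listing = new ArrayList<>();
        for (Path entry : entries) {
            BasicFileAttributes attrs = Files.readAttributes(entry, BasicFileAttributes.class);
            long modified = attrs.lastModifiedTime().toMillis();
            if (modified < minLastModifiedMillis) {
                continue; // older than the seeded timestamp
            }
            Map<String, String> attributes = new LinkedHashMap<>();
            attributes.put("filename", entry.getFileName().toString());
            attributes.put("path", entry.getParent().toString());
            // Assumed attribute name: tells downstream processors file vs. directory (suggestion 3).
            attributes.put("entry.type", attrs.isDirectory() ? "directory" : "file");
            // Assumed attribute name: the entry's own last modified time (suggestion 4).
            attributes.put("file.lastModifiedTime", String.valueOf(modified));
            listing.add(attributes);
        }
        return listing;
    }
}
```

Calling ListingSketch.list(dir, false, 0L) would return only the direct
children of dir, while passing true walks the whole tree. With maxDepth set to
1 the walk never descends into subdirectories, which is where the latency
saving on a slow NFS share would come from.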
Maybe the following should/would be part of the code review process, but I
will note it here just in case. I'm new to this OSS process, but since I see
this as a pull request, I assumed it was ready to go, yet some stuff seems to
be missing:
1. There is no description of the processor.
2. The 'path' attribute description of "The path on the system from which
to pull or push files" is misleading, IMO. Maybe "The path on the system where
this processor will scan files and directories to build the file list."