[ 
https://issues.apache.org/jira/browse/NIFI-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983850#comment-14983850
 ] 

ASF GitHub Bot commented on NIFI-631:
-------------------------------------

Github user mpetronic commented on the pull request:

    https://github.com/apache/nifi/pull/112#issuecomment-152697600
  
    Joe, thanks for getting this processor going. I need it. :) I've pulled 
this in and am giving it a try. I have some additional thoughts on 
functionality.
    
    1. Should it have a "Recurse sub-directories <yes|no>" option? I mention 
this because, in my setup, I have to scan files from an NFS share, which is 
not fast, especially when recursing many levels of subdirs that I don't 
really need to look at. That's a special case, I know, but it is a valid use 
case, and we could eliminate some latency by not requiring a full recursive 
scan every time.
    2. Should it have an option to specify a seeded last modified time? Say 
there is a directory full of files from the past days or weeks, but you only 
want to start pulling them in from, say, one day ago or some specific 
date/time, and not pick up all the previous files.
    3. If there are empty directories in the path you are scanning, they get 
listed in the "filename", just like an actual file would be listed. I think it 
would be nice to have another attribute that indicates whether the leaf node 
is a file or a directory, as downstream processors could more easily use that 
to decide how to act on the value.
    4. Should it expose each file's actual last modified timestamp in the 
FlowFile Attribute Map Content? 
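
    To make items 1-4 concrete, here is a rough sketch of what I have in 
mind. It is plain java.nio code, not the code in this pull request and not the 
NiFi processor API. The class name ListingSketch and the attribute names 
"file.type" and "file.lastModifiedTime" are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of the pull request.
final class ListingSketch {

    /**
     * Lists the entries under root as attribute maps.
     *
     * @param recurse               item 1: descend into sub-directories or not
     * @param minLastModifiedMillis item 2: skip entries modified before this time
     */
    static List<Map<String, String>> list(Path root, boolean recurse,
                                          long minLastModifiedMillis) throws IOException {
        // A depth of 1 visits only the direct children of root.
        int maxDepth = recurse ? Integer.MAX_VALUE : 1;

        List<Path> entries;
        try (Stream<Path> walk = Files.walk(root, maxDepth)) {
            // Files.walk also yields the start path itself, so drop it.
            entries = walk.filter(p -> !p.equals(root)).collect(Collectors.toList());
        }

        List<Map<String, String>> listing = new ArrayList<>();
        for (Path entry : entries) {
            BasicFileAttributes attrs = Files.readAttributes(entry, BasicFileAttributes.class);
            long lastModified = attrs.lastModifiedTime().toMillis();
            if (lastModified < minLastModifiedMillis) {
                continue; // older than the seeded timestamp
            }
            Map<String, String> attributes = new LinkedHashMap<>();
            attributes.put("filename", entry.getFileName().toString());
            attributes.put("path", entry.getParent().toString());
            // Item 3: say whether the entry is a file or a directory (assumed name).
            attributes.put("file.type", attrs.isDirectory() ? "directory" : "file");
            // Item 4: expose the entry's own last modified time (assumed name).
            attributes.put("file.lastModifiedTime", Long.toString(lastModified));
            listing.add(attributes);
        }
        return listing;
    }
}
```

    Note that this sketch lists every directory it sees, not only empty 
ones, and applies the timestamp cutoff to directories as well as files. A real 
implementation would need to decide both of those.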
    
    I guess for all other types of filtering, like wildcards and such, the 
right 'NiFi' thing to do is to use a downstream "UpdateAttribute" processor to 
massage the list. Correct? Maybe this also applies to #2 above, then?
    
    Maybe the following should/would be part of the code review process, but 
I will note it here just in case. I'm new to this OSS process, but since I see 
this as a pull request, I assumed it was ready to go. It seems some stuff is 
missing, though?
    
    1. There is no description of the processor.
    2. The 'path' attribute description of "The path on the system from which 
to pull or push files" is misleading, IMO. Maybe "The path on the system where 
this processor will scan files and directories to build the file list."


> Create ListFile and FetchFile processors
> ----------------------------------------
>
>                 Key: NIFI-631
>                 URL: https://issues.apache.org/jira/browse/NIFI-631
>             Project: Apache NiFi
>          Issue Type: Improvement
>            Reporter: Mark Payne
>            Assignee: Joe Skora
>         Attachments: 
> 0001-NIFI-631-Initial-implementation-of-FetchFile-process.patch
>
>
> This pair of Processors will provide several benefits over the existing 
> GetFile processor:
> 1. Currently, GetFile will continually pull the same files if the "Keep 
> Source File" property is set to true. There is no way to pull the file and 
> leave it in the directory without continually pulling the same file. We could 
> implement state here, but it would either be a huge amount of state to 
> remember everything pulled or it would have to always pull the oldest file 
> first so that we can maintain just the Last Modified Date of the last file 
> pulled plus all files with the same Last Modified Date that have already been 
> pulled.
> 2. If pulling from network-attached storage such as NFS, this would allow a 
> single processor to run ListFile and then distribute those FlowFiles to the 
> cluster so that the cluster can share the work of pulling the data.
> 3. There are use cases when we may want to pull a specific file (for example, 
> in conjunction with ProcessHttpRequest/ProcessHttpResponse) rather than just 
> pull all files in a directory. GetFile does not support this.
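
The state described in item 1 of the quoted description (the Last Modified 
Date of the last file pulled, plus the files already pulled with that same 
date) can be sketched as follows. This is only an illustration under assumed 
names (ListingState, selectNew), not the implementation in the attached patch, 
and it assumes the caller supplies a map from file path to last modified time 
in milliseconds.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration only; not taken from the patch.
final class ListingState {

    // Newest last modified time emitted so far; -1 means nothing emitted yet.
    private long latestTimestamp = -1L;
    // Paths already emitted whose last modified time equals latestTimestamp.
    private final Set<String> seenAtLatest = new HashSet<>();

    /** Returns the paths not yet emitted, oldest first, and updates the state. */
    List<String> selectNew(Map<String, Long> lastModifiedByPath) {
        // Process oldest first so the state only ever moves forward in time.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(lastModifiedByPath.entrySet());
        sorted.sort(Map.Entry.comparingByValue());

        List<String> fresh = new ArrayList<>();
        for (Map.Entry<String, Long> entry : sorted) {
            long timestamp = entry.getValue();
            if (timestamp < latestTimestamp) {
                continue; // older than what was already emitted
            }
            if (timestamp == latestTimestamp && seenAtLatest.contains(entry.getKey())) {
                continue; // same timestamp and already emitted
            }
            if (timestamp > latestTimestamp) {
                // A newer timestamp makes the old "same date" set irrelevant.
                latestTimestamp = timestamp;
                seenAtLatest.clear();
            }
            seenAtLatest.add(entry.getKey());
            fresh.add(entry.getKey());
        }
        return fresh;
    }
}
```

The state stays small because only one timestamp and the paths sharing it are 
remembered. The trade-off is that a file which shows up later with a last 
modified time older than the stored timestamp is never picked up.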



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
