[ 
https://issues.apache.org/jira/browse/NIFI-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025153#comment-16025153
 ] 

Bryan Bende commented on NIFI-3979:
-----------------------------------

After thinking about this some more, the PR I submitted is not correct. The 
challenges here stem from two constraints: for performance reasons we can't 
keep track of every filename we have listed, and we can't reliably compare the 
system time where NiFi is running against the timestamps of the listings, 
since those come from an external file system.

The current behavior is that during an execution of the processor, we purposely 
leave out the entries with the latest timestamp, and then include them in the 
next listing. The reason is that we don't know whether more entries with that 
timestamp are still coming in. If we included the latest ones now, we would 
either skip over the additional ones in the next iteration, or have to list 
the latest ones a second time as duplicates. So our current implementation 
favors no duplicates and no missed data, with the limitation that the latest 
entries lag behind by one execution.
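
To make the trade-off concrete, here is a minimal sketch of the "hold back the 
latest timestamp" idea. It is a simplified model and not the ListHDFS code: the 
Entry type, the function name, and the single last_emitted_ts state value are 
all hypothetical. In this model, held-back entries are released only once a 
later run sees a newer timestamp; the actual processor's rule for releasing 
them may differ.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    """Hypothetical listing entry: a path plus the remote modification time."""
    path: str
    mtime: int  # timestamp reported by the external file system


def list_holding_back_latest(
    entries: List[Entry], last_emitted_ts: Optional[int]
) -> Tuple[List[Entry], Optional[int]]:
    """Return (entries to emit, new last-emitted timestamp)."""
    # Only entries newer than anything emitted so far are candidates.
    candidates = [
        e for e in entries
        if last_emitted_ts is None or e.mtime > last_emitted_ts
    ]
    if not candidates:
        return [], last_emitted_ts

    # Hold back everything that carries the latest timestamp, because more
    # entries with that same timestamp may still arrive.
    latest = max(e.mtime for e in candidates)
    to_emit = [e for e in candidates if e.mtime < latest]
    if not to_emit:
        return [], last_emitted_ts
    return to_emit, max(e.mtime for e in to_emit)
```

With files a (timestamp 1), b and c (timestamp 2), the first run emits only a. 
b and c stay held back until a file with a later timestamp shows up, which is 
the lag described above.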

A potential solution, although somewhat complex to implement, might be to keep 
track of a count of the number of entries with the latest timestamp. Say the 
processor runs and there are 10 files with the latest timestamp: we include 
all of them in the current listing and set a variable to 10. If the next 
execution finds that there are now 11 files with that previous timestamp, we 
list all 11 again, since we don't know which ones were already listed. This 
leads to duplicate listings in the edge case where files are written with the 
same timestamp on each side of an execution, but in the common case it would 
allow us to always list the latest files.
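
The count-based idea could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not a proposed patch: the Entry type, 
the function name, and the (timestamp, count) state tuple are hypothetical, 
and the sketch ignores how the state would be persisted and what happens when 
files are deleted between runs.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    """Hypothetical listing entry: a path plus the remote modification time."""
    path: str
    mtime: int  # timestamp reported by the external file system


# State carried between runs: (latest timestamp listed, entries seen at it).
State = Tuple[Optional[int], int]


def list_with_count(
    entries: List[Entry], state: State
) -> Tuple[List[Entry], State]:
    """Return (entries to emit, new state) for one execution."""
    prev_ts, prev_count = state
    if not entries:
        return [], state

    # Entries strictly newer than the previous latest timestamp are always new.
    newer = [e for e in entries if prev_ts is None or e.mtime > prev_ts]
    # Entries sharing the previous latest timestamp may include late arrivals.
    same = [e for e in entries if prev_ts is not None and e.mtime == prev_ts]

    to_emit = list(newer)
    if len(same) > prev_count:
        # The count grew, but we cannot tell which entries are new, so all
        # entries at that timestamp are listed again (possible duplicates).
        to_emit = same + newer

    if newer:
        latest = max(e.mtime for e in newer)
        new_state = (latest, sum(1 for e in newer if e.mtime == latest))
    else:
        new_state = (prev_ts, len(same))
    return to_emit, new_state
```

Starting from the state (None, 0), a run over a (timestamp 1), b and c 
(timestamp 2) emits all three and stores (2, 2). If d then appears with 
timestamp 2, the next run re-emits b, c, and d, which is the duplicate edge 
case; a run with no changes emits nothing.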

We could also change nothing and just document the behavior of this processor, 
noting that it is expected to be scheduled fairly frequently, every few 
seconds or minutes rather than every few hours.

> ListHDFS always skips files with latest timestamp
> -------------------------------------------------
>
>                 Key: NIFI-3979
>                 URL: https://issues.apache.org/jira/browse/NIFI-3979
>             Project: Apache NiFi
>          Issue Type: Bug
>    Affects Versions: 1.1.0, 1.2.0, 1.1.1
>            Reporter: Bryan Bende
>            Assignee: Bryan Bende
>            Priority: Minor
>             Fix For: 1.3.0
>
>
> In NIFI-3213 there was a fix made for ListFile to correct a problem where it 
> was never listing the latest file.
> The same problem exists in ListHDFS.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
