[ https://issues.apache.org/jira/browse/NIFI-3979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025153#comment-16025153 ]
Bryan Bende commented on NIFI-3979:
-----------------------------------

After thinking about this some more, the PR I submitted is not correct.

The challenges here stem from two facts: for performance reasons we can't keep track of every filename we have listed, and we can't really compare the system time where NiFi is running to the timestamps of the listings, since those come from an external file system.

The current behavior is that during an execution of the processor, we purposely leave out the entries with the latest timestamp, and then include them in the next listing. This was done because we don't know whether more entries with that timestamp are still coming in. If we include the latest ones now, then in the next iteration we either skip over any additional ones, or we have to list the latest ones again and produce duplicates. So our current implementation favors no duplicates and no missed data, with the limitation that the latest entries lag behind by one execution.

A potential solution, although somewhat complex to implement, might be to keep a count of the number of entries with the latest timestamp. Say the processor runs and there are 10 files with the latest timestamp: we include all of them in the current listing and set a variable to 10. If on the next execution we determine there are now 11 files with that previous timestamp, we list them all again, since we don't know which ones were already listed. This leads to duplicate listings in the edge case where files are written with the same timestamp on each side of an execution, but in the common case it would allow us to always list the latest files.

We could also change nothing, and just document the behavior of this processor and that it is expected to be scheduled fairly frequently, every few seconds or minutes, and not hours.
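To make the count-based idea concrete, here is a minimal Python sketch of how it might work. It is not the ListHDFS implementation: `Entry`, `ListingState`, and `list_new_entries` are hypothetical names invented for illustration, and a listing is assumed to be a simple list of (path, modification timestamp) pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical entry shape: (path, modification timestamp in milliseconds).
Entry = Tuple[str, int]


@dataclass
class ListingState:
    """State kept between executions (an assumption, not NiFi's actual state)."""
    latest_ts: int = -1     # newest timestamp that has already been listed
    latest_count: int = 0   # how many entries carried that timestamp at the time


def list_new_entries(entries: List[Entry],
                     state: ListingState) -> Tuple[List[Entry], ListingState]:
    """Return the entries to emit this run, plus the state for the next run."""
    newer = [e for e in entries if e[1] > state.latest_ts]
    at_latest = [e for e in entries if e[1] == state.latest_ts]

    emitted: List[Entry] = []
    if len(at_latest) > state.latest_count:
        # More entries now share the previously listed timestamp. We do not
        # know which ones were already listed, so all of them are listed
        # again (this is the duplicate edge case).
        emitted.extend(at_latest)
    emitted.extend(newer)

    if newer:
        newest = max(ts for _, ts in newer)
        count = sum(1 for _, ts in newer if ts == newest)
        new_state = ListingState(newest, count)
    else:
        # Nothing newer arrived; only refresh the count for the same timestamp.
        new_state = ListingState(state.latest_ts, len(at_latest))
    return emitted, new_state


if __name__ == "__main__":
    state = ListingState()
    files = [("a", 100), ("b", 200), ("c", 200)]
    emitted, state = list_new_entries(files, state)   # lists a, b, c immediately
    files.append(("d", 200))                          # late arrival, same timestamp
    emitted, state = list_new_entries(files, state)   # re-lists b, c and adds d
    print([path for path, _ in emitted], state)
```

In this sketch the latest files are always listed on the run that first sees them, at the cost of re-listing every entry at the latest timestamp whenever that count grows. Counting alone cannot detect the case where one file at that timestamp is removed and another is added between runs.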
> ListHDFS always skips files with latest timestamp
> -------------------------------------------------
>
> Key: NIFI-3979
> URL: https://issues.apache.org/jira/browse/NIFI-3979
> Project: Apache NiFi
> Issue Type: Bug
> Affects Versions: 1.1.0, 1.2.0, 1.1.1
> Reporter: Bryan Bende
> Assignee: Bryan Bende
> Priority: Minor
> Fix For: 1.3.0
>
>
> In NIFI-3213 there was a fix made for ListFile to correct a problem where it
> was never listing the latest file.
> The same problem exists in ListHDFS.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)