[ 
https://issues.apache.org/jira/browse/NIFI-3213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025275#comment-16025275
 ] 

Bryan Bende commented on NIFI-3213:
-----------------------------------

I came across this JIRA while looking at a similar issue with ListHDFS, and 
I'm wondering about a couple of things...

I believe the reason for the original logic was for the following scenario:
- file1 written with time1
- processor performs listing
- file2 written with time1

Since we are only tracking timestamps and not which files were listed, if we 
include file1 in the listing then we will miss file2 on the next execution 
because we are looking for things newer than time1. If we include time1 on 
both sides, then we get file1 listed twice because we don't know we listed it 
the first time. So instead we were leaving file1 out and getting both files 
next time, which has the drawback of a delay, but won't miss anything or 
produce duplicates.
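
The scenario above can be sketched roughly as follows. This is a simplified 
illustration, not the actual NiFi code: the class HoldBackListing, its list 
method, and the name-to-timestamp map are all hypothetical. It keeps only one 
timestamp as state and holds back the newest timestamp in each listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of timestamp-only listing; not the NiFi implementation.
public class HoldBackListing {
    // Newest timestamp emitted so far; the only state kept between runs.
    private long lastEmittedTimestamp = -1L;

    /** Returns the file names to emit for one listing, given name -> modification time. */
    public List<String> list(Map<String, Long> directory) {
        // Group files newer than the last emitted timestamp, oldest first.
        TreeMap<Long, List<String>> byTimestamp = new TreeMap<>();
        for (Map.Entry<String, Long> e : directory.entrySet()) {
            if (e.getValue() > lastEmittedTimestamp) {
                byTimestamp.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
            }
        }
        // Hold back the newest timestamp: more files may still arrive with it.
        if (!byTimestamp.isEmpty()) {
            byTimestamp.remove(byTimestamp.lastKey());
        }
        // Emit everything else and remember the newest timestamp emitted.
        List<String> emitted = new ArrayList<>();
        for (Map.Entry<Long, List<String>> e : byTimestamp.entrySet()) {
            emitted.addAll(e.getValue());
            lastEmittedTimestamp = e.getKey();
        }
        return emitted;
    }
}
```

In this sketch, a first listing that sees only file1 at time1 emits nothing. 
Once file2 (also time1) and some newer file are present, the next listing 
emits file1 and file2 together and holds back the newer file. Nothing is 
missed or duplicated, but the newest files always wait for a later run.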

With this change we are doing the following:

final long currentListingTimestamp = System.nanoTime();

Then later using that value:

else if (latestListingTimestamp >= currentListingTimestamp - LISTING_LAG_NANOS) {
    orderedEntries.remove(latestListingTimestamp);
}

What if the directory we are listing is a remote directory where the timestamps 
don't really correspond with NiFi's timestamps?

Is latestListingTimestamp in milliseconds while currentListingTimestamp, which 
we compare it against, is in nanoseconds?

I'm concerned that we may never enter that else-if branch in cases where we 
should.
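
To illustrate the concern, here is a minimal sketch that assumes the file 
timestamp is in epoch milliseconds (as returned by File.lastModified()). The 
class ListingLagCheck, both methods, and the 100 ms lag are hypothetical; the 
real LISTING_LAG_NANOS value is not shown here. Note also that 
System.nanoTime() has an arbitrary origin and is only meant for measuring 
elapsed time, so a real mixed comparison could come out either way depending 
on the JVM and host.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the unit mismatch; not the NiFi implementation.
public class ListingLagCheck {
    // Assumed lag window for illustration only.
    static final long LAG_MILLIS = 100L;

    /** Both arguments are epoch milliseconds, so the comparison is meaningful. */
    static boolean withinLag(long latestFileMillis, long nowMillis) {
        return latestFileMillis >= nowMillis - LAG_MILLIS;
    }

    /** Mixed units: an epoch-millisecond file time against a nanosecond-scale value. */
    static boolean withinLagMixedUnits(long latestFileMillis, long nowNanos) {
        return latestFileMillis >= nowNanos - TimeUnit.MILLISECONDS.toNanos(LAG_MILLIS);
    }
}
```

For a file modified 10 ms before "now", withinLag returns true. If the same 
instant is expressed in nanoseconds and passed to withinLagMixedUnits, the 
millisecond value is far smaller than the nanosecond value and the check 
returns false, so the branch would be skipped.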



> ListFile always skips files with the latest timestamp in an iteration even if 
> the files have existed a while ago
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-3213
>                 URL: https://issues.apache.org/jira/browse/NIFI-3213
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.0.0, 0.5.0, 0.6.0, 0.5.1, 0.7.0, 0.6.1, 1.1.0, 0.7.1
>            Reporter: Koji Kawamura
>            Assignee: Koji Kawamura
>             Fix For: 1.2.0
>
>
> NIFI-1484 added a few lines of code to avoid emitting files that have 
> the latest timestamp within a listing iteration, because they may still be 
> being written at that time.
> This has little effect if the ListFile processor is scheduled with a short 
> run schedule, such as a few ms, but it has a negative effect if a user 
> schedules it with a longer run schedule such as "1 day" or with the cron 
> scheduler. For example, a user would expect to process a list of files on a 
> daily basis. Even if a file was saved a few hours ago, the processor will 
> skip it, because the file has the latest timestamp within the iteration.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
