[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536881#comment-16536881
 ] 

Joseph Witt commented on NIFI-3332:
-----------------------------------

[~ijokarumawak] [~bende] I think the timing issue present in ListFile (and, from 
the sound of it, ListSFTP and any other List* that assumes the timestamp will be 
safe) comes down to how the source system sets lastModifiedDate.  There are two 
cases.

The lastModifiedDate of a file in some systems will be set once data is made 
available/visible in a given directory.  Our current implementation would work 
well against that as it presumes the lastModifiedDate is a sort of meaningful 
state tracker.

The lastModifiedDate of a file in some systems will be set on file creation and 
will not change while the file is being written to or eventually made visible in 
the directory we pull from.  In that case our current implementation will miss 
any file whose lastModifiedDate is older than the lastModifiedDate of a file 
we've already pulled.  How can that be?  If the producer takes longer to 
generate file A than file B, we see file B first and pull it, and then we never 
see file A because we've already moved on to newer last modified dates.
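
To make that concrete, here is a minimal sketch of the failure.  It is not the 
actual List* code (which, as noted, I've not verified); the class name, the 
single `watermark` field, and the map standing in for a directory listing are 
all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of timestamp-only listing; not NiFi's implementation.
class WatermarkListingSketch {
    // Assumed processor state: the newest lastModifiedDate listed so far.
    private long watermark = Long.MIN_VALUE;

    /** Lists files strictly newer than the watermark, then advances the watermark. */
    List<String> list(Map<String, Long> visibleFiles) {
        List<String> listed = new ArrayList<>();
        long newest = watermark;
        // TreeMap copy only gives a stable, name-sorted iteration order.
        for (Map.Entry<String, Long> e : new TreeMap<>(visibleFiles).entrySet()) {
            if (e.getValue() > watermark) {
                listed.add(e.getKey());
                newest = Math.max(newest, e.getValue());
            }
        }
        watermark = newest;
        return listed;
    }
}
```

If file B (timestamp 200) is visible on the first call, it is listed and the 
watermark becomes 200.  When file A then becomes visible still carrying its 
creation timestamp of 100, the second call returns nothing, so A is never 
picked up.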

Do you agree our implementation has this issue?  I've not verified the logic in 
the code or read all these threads yet.  If yes, we should file a JIRA.  Well, 
based on the above comments, we should anyway.

Suggested approach:
- The current model is very fast, relatively easy to implement, and state 
management is trivial.  We should make that an option users can select based on 
the behavior of data producers.

- We should introduce another model, which is more work to implement and 
involves non-trivial state management, possibly including some database or k/v 
store.
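
A rough sketch of what that second model could look like, assuming we remember 
every file already listed together with the timestamp it had at the time.  The 
class and the in-memory map are hypothetical; a real processor would need a 
persistent (and probably bounded) store, which is where the database or k/v 
store comes in.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of tracking-based listing; not an existing NiFi class.
class TrackingListingSketch {
    // Assumed state: file name -> lastModifiedDate seen when it was listed.
    // This grows with every file, which is the cost of this model.
    private final Map<String, Long> alreadyListed = new HashMap<>();

    /** Lists every file that is new, or whose timestamp changed since it was listed. */
    List<String> list(Map<String, Long> visibleFiles) {
        List<String> listed = new ArrayList<>();
        for (Map.Entry<String, Long> e : new TreeMap<>(visibleFiles).entrySet()) {
            Long previous = alreadyListed.get(e.getKey());
            if (previous == null || !previous.equals(e.getValue())) {
                listed.add(e.getKey());
                alreadyListed.put(e.getKey(), e.getValue());
            }
        }
        return listed;
    }
}
```

Because the decision no longer depends on timestamps being ordered, a file that 
shows up late with an old lastModifiedDate is still listed exactly once.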

In many ways this is why the original GetFile proc was built to lump listing and 
fetching into one.  It makes the logic a lot easier, at the expense of the 
primary state management being 'we took the file'.

Thanks

> Bug in ListXXX causes matching timestamps to be ignored on later runs
> ---------------------------------------------------------------------
>
>                 Key: NIFI-3332
>                 URL: https://issues.apache.org/jira/browse/NIFI-3332
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Joe Skora
>            Assignee: Koji Kawamura
>            Priority: Critical
>             Fix For: 1.4.0
>
>         Attachments: Test-showing-ListFile-timestamp-bug.log, 
> Test-showing-ListFile-timestamp-bug.patch, listfiles.png
>
>
> The new state implementation for the ListXXX processors based on 
> AbstractListProcessor creates a race condition when processor runs occur 
> while a batch of files is being written with the same timestamp.
> The changes to state management dropped tracking of the files processed for a 
> given timestamp.  Without the record of files processed, the remainder of the 
> batch is ignored on the next processor run since their timestamp is not 
> greater than the one timestamp stored in processor state.  With the file 
> tracking it was possible to process files that matched the timestamp exactly 
> and exclude the previously processed files.
> A basic timeline goes as follows.
>   T0 - system creates or receives batch of files with Tx timestamp where Tx 
> is more than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs picking up 1st half of Tx batch and stores Tx timestamp 
> in processor state.
>   T3 - system writes 2nd half of Tx batch to ListFile source directory.
>   T4 - ListFile runs ignoring any files with T <= Tx, eliminating 2nd half Tx 
> timestamp batch.
> I've attached a patch[1] for TestListFile.java that adds an instrumented unit 
> test that demonstrates the problem and a log[2] of the output from one such 
> run.  The test writes 3 files each in two batches with processor runs after 
> each batch.  Batch 2 writes files with timestamps older than, equal to, and 
> newer than the timestamp stored when batch 1 was processed, but only the newer 
> file is picked up.  The older file is correctly ignored, but the file with the 
> matching timestamp should have been processed.
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log
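
The file tracking described in the issue above can be sketched roughly as 
follows.  This is an illustration only, not the attached patch or the 
AbstractListProcessor code; the class and field names are hypothetical.  The 
state holds the latest timestamp plus the names already listed at exactly that 
timestamp.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical model: watermark plus the names already listed at the watermark.
class LatestTimestampTrackingSketch {
    private long watermark = Long.MIN_VALUE;
    private Set<String> namesAtWatermark = new HashSet<>();

    /** Lists newer files, plus files matching the watermark that were not listed before. */
    List<String> list(Map<String, Long> visibleFiles) {
        List<String> listed = new ArrayList<>();
        long newest = watermark;
        Set<String> namesAtNewest = new HashSet<>(namesAtWatermark);
        for (Map.Entry<String, Long> e : new TreeMap<>(visibleFiles).entrySet()) {
            long ts = e.getValue();
            boolean newer = ts > watermark;
            boolean sameButUnseen = ts == watermark && !namesAtWatermark.contains(e.getKey());
            if (!newer && !sameButUnseen) {
                continue; // older file, or already listed at this timestamp
            }
            listed.add(e.getKey());
            if (ts > newest) {
                newest = ts;                  // a newer timestamp resets the tracked names
                namesAtNewest = new HashSet<>();
            }
            if (ts == newest) {
                namesAtNewest.add(e.getKey());
            }
        }
        watermark = newest;
        namesAtWatermark = namesAtNewest;
        return listed;
    }
}
```

With this state, the second half of the Tx batch written at T3 is still listed 
at T4, while the first half is not listed again.  It does not help with files 
that appear later with a timestamp older than the stored one.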



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
