[ 
https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16143900#comment-16143900
 ] 

ASF subversion and git services commented on NIFI-3332:
-------------------------------------------------------

Commit e68ff153e81ddb82d1136d44a96bdb7a70da86d1 in nifi's branch 
refs/heads/master from [~ijokarumawak]
[ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=e68ff15 ]

NIFI-3332: ListXXX to not miss files with the latest processed timestamp

Before this fix, ListXXX processors could miss files that have the same 
timestamp as the latest processed timestamp of the previous cycle. Because 
only timestamps were tracked, it was not possible to determine whether a file 
had already been processed.

However, storing every processed identifier, as we used to, would not 
perform well.
Instead, this commit makes ListXXX store only the identifiers that have the 
latest timestamp in a cycle, to minimize the amount of state data stored.
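
As an illustration of the approach described above, here is a minimal sketch, 
not the actual AbstractListProcessor code. The class name 
LatestTimestampTracker, the method listNew, and the identifier-to-timestamp map 
are hypothetical and exist only for this example. The state kept between cycles 
is the latest timestamp plus the identifiers listed with exactly that timestamp.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; names and structure are assumptions, not NiFi's API. */
public class LatestTimestampTracker {
    // State carried between cycles: the latest timestamp listed so far and
    // the identifiers that were listed with exactly that timestamp.
    private long latestTimestamp = -1L;
    private Set<String> latestIdentifiers = new HashSet<>();

    /** Returns identifiers not yet listed; entries maps identifier to timestamp. */
    public List<String> listNew(Map<String, Long> entries) {
        List<String> result = new ArrayList<>();
        long newLatest = latestTimestamp;
        Set<String> newLatestIds = new HashSet<>(latestIdentifiers);

        for (Map.Entry<String, Long> e : entries.entrySet()) {
            String id = e.getKey();
            long ts = e.getValue();
            if (ts < latestTimestamp) {
                continue; // older than stored state: already handled
            }
            if (ts == latestTimestamp && latestIdentifiers.contains(id)) {
                continue; // same timestamp and already listed in an earlier cycle
            }
            result.add(id);
            if (ts > newLatest) {
                // A newer timestamp replaces the tracked identifier set.
                newLatest = ts;
                newLatestIds = new HashSet<>();
            }
            if (ts == newLatest) {
                newLatestIds.add(id);
            }
        }
        latestTimestamp = newLatest;
        latestIdentifiers = newLatestIds;
        return result;
    }
}
```

In this sketch the stored identifier set is bounded by the number of entries 
sharing the single latest timestamp, rather than growing with every entry ever 
listed.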

NIFI-3332: ListXXX to not miss files with the latest processed timestamp

- Fixed TestAbstractListProcessor to use appropriate time precision.
  Without this fix, a test can fail arbitrarily if the generated timestamp
  does not have the desired time unit value, e.g. generated '10:51:00' where
  second precision is tested.
- Fixed TestFTP.basicFileList to use millisecond time precision explicitly
  because FakeFtpServer's time precision is in minutes.
- Changed junit dependency scope to 'provided' as it is needed by
  ListProcessorTestWatcher which is shared among different modules.

This closes #1975.

Signed-off-by: Bryan Bende <bbe...@apache.org>


> Bug in ListXXX causes matching timestamps to be ignored on later runs
> ---------------------------------------------------------------------
>
>                 Key: NIFI-3332
>                 URL: https://issues.apache.org/jira/browse/NIFI-3332
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Joe Skora
>            Assignee: Koji Kawamura
>            Priority: Critical
>         Attachments: listfiles.png, Test-showing-ListFile-timestamp-bug.log, 
> Test-showing-ListFile-timestamp-bug.patch
>
>
> The new state implementation for the ListXXX processors based on 
> AbstractListProcessor creates a race condition when processor runs occur 
> while a batch of files is being written with the same timestamp.
> The changes to state management dropped tracking of the files processed for a 
> given timestamp.  Without the record of files processed, the remainder of the 
> batch is ignored on the next processor run since their timestamp is not 
> greater than the one timestamp stored in processor state.  With the file 
> tracking it was possible to process files that matched the timestamp exactly 
> and exclude the previously processed files.
> A basic timeline goes as follows.
>   T0 - system creates or receives batch of files with Tx timestamp where Tx 
> is more than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs picking up 1st half of Tx batch and stores Tx timestamp 
> in processor state.
>   T3 - system writes 2nd half of Tx batch to ListFile source directory.
>   T4 - ListFile runs ignoring any files with T <= Tx, eliminating the 2nd 
> half of the Tx batch.
> I've attached a patch[1] for TestListFile.java that adds an instrumented unit 
> test that demonstrates the problem and a log[2] of the output from one such 
> run.  The test writes 3 files each in two batches with processor runs after 
> each batch.  Batch 2 writes files with timestamps older than, equal to, and 
> newer than the timestamp stored when batch 1 was processed, but only the newer 
> file is picked up.  The older file is correctly ignored, but the file with the 
> matching timestamp should have been processed.
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log
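
The T0 to T4 timeline in the issue description above can be reproduced with a 
small sketch that tracks only a single timestamp. This is an illustration under 
assumptions, not the NiFi implementation: the class TimestampOnlyTracker and 
its listNew method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of timestamp-only state; not NiFi's actual code. */
public class TimestampOnlyTracker {
    // The only state kept between runs is the latest timestamp seen.
    private long latestTimestamp = -1L;

    /** Returns identifiers with a timestamp strictly newer than the stored one. */
    public List<String> listNew(Map<String, Long> entries) {
        List<String> result = new ArrayList<>();
        long newLatest = latestTimestamp;
        for (Map.Entry<String, Long> e : entries.entrySet()) {
            // Entries with T <= stored timestamp are skipped, including
            // files written later with the same timestamp Tx.
            if (e.getValue() > latestTimestamp) {
                result.add(e.getKey());
                newLatest = Math.max(newLatest, e.getValue());
            }
        }
        latestTimestamp = newLatest;
        return result;
    }
}
```

With this sketch, a run that sees the first half of a batch stamped Tx stores 
Tx, and a later run that sees the second half with the same Tx returns nothing 
for it, which matches step T4.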



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
