[ https://issues.apache.org/jira/browse/NIFI-3332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874171#comment-15874171 ]
Koji Kawamura commented on NIFI-3332:
-------------------------------------

[~jskora] [~mosermw]

I tried to bring back the old implementation and was able to make it pass the test Joe Skora provided. While doing so, however, I realized that the current behavior is expected. AbstractListProcessor handles this with a different approach, using two timestamps.

The other JIRAs explain why it is implemented the way it is now. Basically, we wanted to avoid storing an unbounded list of filenames in state, which overwhelms managed state. So we moved to the two-timestamp approach with NIFI-1484, and NIFI-1588 applied the same logic to ListHDFS.

Generally, detecting updated files by their timestamps can be troublesome, as this JIRA reports. When dealing with large files or many files, which take time to move physically, it is good practice to write the data into a separate temporary directory first and then move the files into the final directory that a program like NiFi watches. Moving (renaming) a file is usually done by changing metadata at the filesystem level and should complete quickly.

AbstractListProcessor and ListHDFS implement a listing lag (100 ms). Files whose updated timestamp is the latest within a listing activity are held back for one more cycle before being picked up. By default, this lets the processor wait up to the lag time for other files in the same batch operation. If the batch operation finishes within the lag time, it should be handled properly.

As we won't be able to go back to the old implementation, we need to address this JIRA differently. One possible improvement would be to make the lag time configurable, plus more documentation on how to alleviate the corner case as described above.

What do you think?
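For illustration, the staging practice mentioned in the comment (write to a temporary directory, then rename into the watched directory) might look roughly like the sketch below. This is only an assumption about how a producer could be written, not NiFi code: the class name StagedFileWriter, the method writeThenMove, and the directory and file names are all hypothetical. It relies on the staging and watched directories being on the same filesystem, so that the move is a metadata-only rename. It does not change how the listing processor compares timestamps.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical producer-side helper; not part of NiFi.
public class StagedFileWriter {

    /**
     * Writes the content to a file under stagingDir, then moves it into watchedDir.
     * The watched directory never sees a partially written file, because the file
     * only appears there once the (slow) write has already finished.
     */
    public static Path writeThenMove(Path stagingDir, Path watchedDir,
                                     String fileName, byte[] content) throws IOException {
        Files.createDirectories(stagingDir);
        Files.createDirectories(watchedDir);

        // Slow part: write the whole file outside the watched directory.
        Path staged = stagingDir.resolve(fileName);
        Files.write(staged, content);

        // Fast part: rename into the watched directory.
        Path target = watchedDir.resolve(fileName);
        try {
            return Files.move(staged, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback when an atomic rename is not possible (for example, across
            // filesystems). This may copy the data and is no longer guaranteed to be quick.
            return Files.move(staged, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the real staging and watched locations.
        Path base = Files.createTempDirectory("nifi-3332-sketch");
        Path moved = writeThenMove(base.resolve("staging"), base.resolve("incoming"),
                "batch-0001.dat", "payload".getBytes(StandardCharsets.UTF_8));
        System.out.println("Moved to " + moved);
    }
}
```

A producer that writes a whole batch this way would call writeThenMove once per file. Whether the batch then lands within the listing lag still depends on how quickly those renames complete.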

> Bug in ListXXX causes matching timestamps to be ignored on later runs
> ---------------------------------------------------------------------
>
>                 Key: NIFI-3332
>                 URL: https://issues.apache.org/jira/browse/NIFI-3332
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 0.7.1, 1.1.1
>            Reporter: Joe Skora
>            Assignee: Koji Kawamura
>            Priority: Critical
>         Attachments: Test-showing-ListFile-timestamp-bug.log,
>                      Test-showing-ListFile-timestamp-bug.patch
>
>
> The new state implementation for the ListXXX processors based on
> AbstractListProcessor creates a race condition when processor runs occur
> while a batch of files is being written with the same timestamp.
>
> The changes to state management dropped tracking of the files processed for a
> given timestamp. Without the record of files processed, the remainder of the
> batch is ignored on the next processor run since their timestamp is not
> greater than the one timestamp stored in processor state. With the file
> tracking it was possible to process files that matched the timestamp exactly
> and exclude the previously processed files.
>
> A basic timeline goes as follows.
>   T0 - system creates or receives batch of files with Tx timestamp where Tx
>        is more than the current timestamp in processor state.
>   T1 - system writes 1st half of Tx batch to the ListFile source directory.
>   T2 - ListFile runs picking up 1st half of Tx batch and stores Tx timestamp
>        in processor state.
>   T3 - system writes 2nd half of Tx batch to ListFile source directory.
>   T4 - ListFile runs ignoring any files with T <= Tx, eliminating 2nd half Tx
>        timestamp batch.
>
> I've attached a patch[1] for TestListFile.java that adds an instrumented unit
> test that demonstrates the problem and a log[2] of the output from one such run.
> The test writes 3 files each in two batches with processor runs after each
> batch. Batch 2 writes files with timestamps older than, equal to, and newer
> than the timestamp stored when batch 1 was processed, but only the newer file
> is picked up. The older file is correctly ignored, but the file with the
> matching timestamp should have been processed.
>
> [1] Test-showing-ListFile-timestamp-bug.patch
> [2] Test-showing-ListFile-timestamp-bug.log

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)