Github user ijokarumawak commented on the issue: https://github.com/apache/nifi/pull/2361 @adamlamar I tried to find the API documentation but couldn't find the exact statement. Thanks, now it does makes sense to defer updating `currentTimestamp` after all listed entities are examined. Actually that makes me want to ask another related question. Do we risk making duplication by updating `currentKeys` in the middle of the loop? ## Simulation For example, ListS3 listed following entities at the 1st onTrigger. ### The 1st onTrigger simulation at t1 This should track `currentTimestamp` as `t1`, and `currentKeys` as `b1.txt`. |name|last modified|listed at| |-----|-------------|--------| |a1.txt|t1 - x|t1| |b1.txt|t1|t1| ### The 2nd onTrigger simulation at t2 If S3 is being updated at the same time, there may be additional entities having the same t1 timestamp, but were not listed at the 1st onTrigger. In such case, those will be listed at the next onTrigger. Following table shows the expected effect of tracking `currentKeys`. `b1.txt` will not be listed at t2 because it's already listed at t1 and kept in `currentKeys`. |name|last modified|listed at| |-----|-------------|--------| |a1.txt|t1 - x|t1| |b1.txt|t1|t1| |c1.txt|t1|t2| |d1.txt|t1|t2| ### The 2nd onTrigger simulation at t2 with newer timestamp entry preceding in lexicographical order Based on the above scenario, let's think about an edge case. New entries having later lastModified timestamp can be added at the same time at the time of the 2nd onTrigger ran. This might break the current implementation that updates `currentKeys` in the middle of the loop because entities are returned in lexicographical order. What if there was `a2.txt` having later timestamp than t1? |name|last modified|listed at| |-----|-------------|--------| |a1.txt|t1 - x|t1| |a2.txt|t2|t2| |b1.txt|t1|t1 and *t2*| |c1.txt|t1|t2| |d1.txt|t1|t2| With current implementation, `b1.txt` would be listed again at t2. ## Suggestion - Like `maxTimestamp` (representing the latest timestamp at current onTrigger) and `currentTimestamp` (representing the latest timestamp at the last onTrigger), use separate variables to track the keys having the latest timestamp at the last run and current run. - Probably renaming variables would make code more readable. - Update only `maxTimestamp` and the keys with the latest timestamp of current iteration inside the loop, leave the variables which tracks the previous onTrigger state as it is. Then, after the loop, update the variables to track `previous` onTrigger state. Above approach would reflect background better, and also provide cleaner easily understandable code. I may be overly concerning details, but am feeling this can be better a bit more. Thanks for your patience!
---