[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

ijokarumawak Sun, 31 Dec 2017 00:14:47 -0800

Github user ijokarumawak commented on the issue:

    https://github.com/apache/nifi/pull/2361
  
    @adamlamar I tried to find the API documentation but couldn't find the 
exact statement. Thanks, now it does makes sense to defer updating 
`currentTimestamp` after all listed entities are examined.
    
    Actually that makes me want to ask another related question. Do we risk 
making duplication by updating `currentKeys` in the middle of the loop?
    
    ## Simulation
    
    For example, ListS3 listed following entities at the 1st onTrigger. 
    
    ### The 1st onTrigger simulation at t1
    
    This should track `currentTimestamp` as `t1`, and `currentKeys` as `b1.txt`.
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |b1.txt|t1|t1|
    
    ### The 2nd onTrigger simulation at t2
    
    If S3 is being updated at the same time, there may be additional entities 
having the same t1 timestamp, but were not listed at the 1st onTrigger. In such 
case, those will be listed at the next onTrigger. Following table shows the 
expected effect of tracking `currentKeys`. `b1.txt` will not be listed at t2 
because it's already listed at t1 and kept in `currentKeys`.
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |b1.txt|t1|t1|
    |c1.txt|t1|t2|
    |d1.txt|t1|t2|
    
    ### The 2nd onTrigger simulation at t2 with newer timestamp entry preceding 
in lexicographical order
    
    Based on the above scenario, let's think about an edge case. New entries 
having later lastModified timestamp can be added at the same time at the time 
of the 2nd onTrigger ran. This might break the current implementation that 
updates `currentKeys` in the middle of the loop because entities are returned 
in lexicographical order. What if there was `a2.txt` having later timestamp 
than t1?
    
    |name|last modified|listed at|
    |-----|-------------|--------|
    |a1.txt|t1 - x|t1|
    |a2.txt|t2|t2|
    |b1.txt|t1|t1 and *t2*|
    |c1.txt|t1|t2|
    |d1.txt|t1|t2|
    
    With current implementation, `b1.txt` would be listed again at t2.
    
    ## Suggestion
    
    - Like `maxTimestamp` (representing the latest timestamp at current 
onTrigger) and `currentTimestamp` (representing the latest timestamp at the 
last onTrigger), use separate variables to track the keys having the latest 
timestamp at the last run and current run.
    - Probably renaming variables would make code more readable. 
    - Update only `maxTimestamp` and the keys with the latest timestamp of 
current iteration inside the loop, leave the variables which tracks the 
previous onTrigger state as it is. Then, after the loop, update the variables 
to track `previous` onTrigger state.
    
    Above approach would reflect background better, and also provide cleaner 
easily understandable code. I may be overly concerning details, but am feeling 
this can be better a bit more. Thanks for your patience!

---

[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

Reply via email to