[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-11-03 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/2361
  
Thanks, @adamlamar!


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-10-31 Thread adamlamar
Github user adamlamar commented on the issue:

https://github.com/apache/nifi/pull/2361
  
Hi @ijokarumawak, I'm sorry I wasn't able to take this across the finish 
line. Thanks a bunch for continuing the effort! 


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-10-31 Thread ijokarumawak
Github user ijokarumawak commented on the issue:

https://github.com/apache/nifi/pull/2361
  
Hi @adamlamar , I hope this message finds you well. Since some other users 
asked about this issue, I went ahead and took over the remaining concerns 
around updating `currentKeys` during list loop. And submitted another PR #3116 
based on your commits.

When it gets merged, this PR will be closed automatically. If you have any 
comments, please keep discussing on the new PR. Thanks again for your 
contribution!


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-01-10 Thread ijokarumawak
Github user ijokarumawak commented on the issue:

https://github.com/apache/nifi/pull/2361
  
Thank you, @adamlamar I understand that.


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-01-10 Thread adamlamar
Github user adamlamar commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@ijokarumawak I've made some progress, but unfortunately just trying to 
find time! No tech questions at this point, but thanks for checking in.


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-01-08 Thread ijokarumawak
Github user ijokarumawak commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@adamlamar How is it going? Looking forward to review the updated PR. Just 
wanted to check if you have any issues. Thanks!


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-01-03 Thread adamlamar
Github user adamlamar commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@ijokarumawak Roger, I will expand the scope of this JIRA to include those 
other fixes. Should be doable in a single JIRA, was just unsure how you all 
prefer to move forward. Thanks again for all your help - will likely be several 
days before I'm able to push new commits.


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-01-03 Thread ijokarumawak
Github user ijokarumawak commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@adamlamar How long do you think you need to address multiple bugs you are 
aware of? If those can be addressed in the same ListS3 processor, then I'd 
prefer to have all (as much as we can) in this JIRA/PR, as we can reduce 
testing effort and overall review cycle. If it's too complicated to be done at 
once, then please submit different JIRAs to beak those into smaller pieces. 
Thank you!


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2018-01-02 Thread adamlamar
Github user adamlamar commented on the issue:

https://github.com/apache/nifi/pull/2361
  
> Do we risk making duplication by updating currentKeys in the middle of 
the loop?

Yes, I think we do! I identified a similar (possibly the same) bug, and I 
agree with all of your suggestions. The question in my mind is whether we 
should fix all of these issues in this JIRA or defer to another. As far as the 
original JIRA goes, I believe the current commit will address the issue. I also 
did a fair bit of manual testing so I would be comfortable moving forward with 
this change as-is.

Before refactoring, I'd like to put some additional unit tests in place for 
safety. Its clear from the discussion that there is some meat here and I'd 
really like to enumerate a few cases I've seen while testing in unit tests.

So its up to you - would you prefer that I start the unit tests and address 
(potentially) multiple bugs in this PR, or should we merge this and create 
another JIRA?


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2017-12-31 Thread ijokarumawak
Github user ijokarumawak commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@adamlamar I tried to find the API documentation but couldn't find the 
exact statement. Thanks, now it does makes sense to defer updating 
`currentTimestamp` after all listed entities are examined.

Actually that makes me want to ask another related question. Do we risk 
making duplication by updating `currentKeys` in the middle of the loop?

## Simulation

For example, ListS3 listed following entities at the 1st onTrigger. 

### The 1st onTrigger simulation at t1

This should track `currentTimestamp` as `t1`, and `currentKeys` as `b1.txt`.

|name|last modified|listed at|
|-|-||
|a1.txt|t1 - x|t1|
|b1.txt|t1|t1|

### The 2nd onTrigger simulation at t2

If S3 is being updated at the same time, there may be additional entities 
having the same t1 timestamp, but were not listed at the 1st onTrigger. In such 
case, those will be listed at the next onTrigger. Following table shows the 
expected effect of tracking `currentKeys`. `b1.txt` will not be listed at t2 
because it's already listed at t1 and kept in `currentKeys`.

|name|last modified|listed at|
|-|-||
|a1.txt|t1 - x|t1|
|b1.txt|t1|t1|
|c1.txt|t1|t2|
|d1.txt|t1|t2|

### The 2nd onTrigger simulation at t2 with newer timestamp entry preceding 
in lexicographical order

Based on the above scenario, let's think about an edge case. New entries 
having later lastModified timestamp can be added at the same time at the time 
of the 2nd onTrigger ran. This might break the current implementation that 
updates `currentKeys` in the middle of the loop because entities are returned 
in lexicographical order. What if there was `a2.txt` having later timestamp 
than t1?

|name|last modified|listed at|
|-|-||
|a1.txt|t1 - x|t1|
|a2.txt|t2|t2|
|b1.txt|t1|t1 and *t2*|
|c1.txt|t1|t2|
|d1.txt|t1|t2|

With current implementation, `b1.txt` would be listed again at t2.

## Suggestion

- Like `maxTimestamp` (representing the latest timestamp at current 
onTrigger) and `currentTimestamp` (representing the latest timestamp at the 
last onTrigger), use separate variables to track the keys having the latest 
timestamp at the last run and current run.
- Probably renaming variables would make code more readable. 
- Update only `maxTimestamp` and the keys with the latest timestamp of 
current iteration inside the loop, leave the variables which tracks the 
previous onTrigger state as it is. Then, after the loop, update the variables 
to track `previous` onTrigger state.

Above approach would reflect background better, and also provide cleaner 
easily understandable code. I may be overly concerning details, but am feeling 
this can be better a bit more. Thanks for your patience!


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2017-12-30 Thread adamlamar
Github user adamlamar commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@ijokarumawak From the https://docs.aws.amazon.com/AmazonS3/latest/API/v2-RESTBucketGET.html#v2-RESTBucketGET-requests;>AWS
 S3 API documentation (see the `continuation-token` section):

> Amazon S3 lists objects in UTF-8 character encoding in lexicographical 
order

I really wish we could take the approach you suggested (would certainly 
make things easier), but since the entries are in lexicographical/alphabetical 
order, we must iterate over the entire listing before updating 
`currentTimestamp`. Otherwise we risk skipping keys newer than 
`currentTimestamp` but older than keys in the middle of the list. The 
lexicographical ordering also matches my experience when using the API.

Unfortunately this does also mean that duplicates are possible when a 
listing fails, like the `IOException` scenario you mentioned. This is an 
existing limitation in ListS3.

I appreciate your help getting this reviewed! :)


---


[GitHub] nifi issue #2361: NIFI-4715: ListS3 produces duplicates in frequently update...

2017-12-29 Thread adamlamar
Github user adamlamar commented on the issue:

https://github.com/apache/nifi/pull/2361
  
@ijokarumawak I did as you suggested and pulled `persistState` out in the 
case when no new keys have been listed, but this actually caused unit tests to 
fail. This is because `currentTimestamp` never changes during the main loop, so 
even though `commit` calls `persistState`, the value of `currentTimestamp` 
doesn't change until the main loop exits. Which is why `persistState` is 
required in both exit paths.

Instead, I took a slightly different approach with the change just pushed. 
Since `currentTimestamp` is the current value persisted to the state manager, 
`maxTimestamp` is the highest timestamp seen in the main loop, and 
`currentKeys` is tied to `maxTimestamp` (not `currentTimestamp`), I removed the 
`persistState` call in `commit`, and did `persistState` at the end of 
`onTrigger` only. While this does continue to `persistState` on each exit, it 
reduces the number of `persistState` calls to once per `onTrigger` rather than 
once per 1000 keys iterated (which was done previously in `commit`).

I did a bunch of manual testing with concurrent `PutS3Object` and `ListS3` 
and always got the correct number of listed keys, even when uploading 20k+ 
objects using 10 threads. I tried a few strategies to skip `persistState` if 
nothing had changed, but in manual testing it always produced the wrong number 
of keys, although sometimes only off by 1. The current code should be quite an 
improvement to the load on the state manager, even if it isn't ideal.

I also introduced `totalListCount` which helps tighten up the log messages 
a bit. Previously it would "successfully list X objects" followed by "no new 
objects to list" in a single `onTrigger` (this was apparent in the unit test 
output). `totalListCount` also avoids an unnecessary `yield`.

There's a lot going on in this one - let me know if you have any other 
questions!


---