[ 
https://issues.apache.org/jira/browse/NIFI-2631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15450272#comment-15450272
 ] 

James Wing commented on NIFI-2631:
----------------------------------

[~jgresock], this looks like a great idea, thanks for contributing.  I have a 
couple of thoughts:

I was able to reproduce the large bucket problem you brought up here all too 
easily.  I aimed ListS3 at a bucket of log files and started it.  ListS3 ran 
for about 1 minute solid without producing flowfiles or bulletins, then dumped 
~90,000 flowfiles in the success queue.  90,000 objects really isn't big in S3 
terms, so aligning the ListS3 processor to the paging pattern of the S3 API 
seems ideal.

Also, I noticed a [comment in 
NIFI-840|https://issues.apache.org/jira/browse/NIFI-840?focusedCommentId=15270156&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15270156],
 where ListS3 was added, that suggests this was a known issue at the time.  
From [~adamonduty]:

bq. ListS3 also persists state after each trigger. I didn't think about this 
until after submitting this PR, but really we should be committing flowfiles 
and persisting state after every batch (1000 objects by default) to avoid 
unbounded memory growth on large buckets. So in some cases the initial restore 
is only one of potentially many roundtrips to the state manager per trigger.

But now that I'm on board, I'm curious: why should this be optional behavior?  
If users have small buckets, they probably will not notice or be inconvenienced 
by a small number of session.commit() calls.  Anyone with a large bucket will 
almost certainly be frustrated by a long delay followed by a massive dump of 
flowfiles.  The optional parameter would safely preserve the current behavior, 
but I'm not a big fan of adding potentially confusing options for what now 
seems very normal.  What is the case for not committing each page of results 
from S3?
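
To make the per-page idea concrete, here is a minimal sketch of committing after each page of a listing rather than once at the end. It is not the ListS3 code and does not use the NiFi or AWS SDK APIs: Page, Lister, Session, fakeBucket, and listAndCommitPerPage are all hypothetical stand-ins invented for illustration, and persisting state per batch (as suggested in the NIFI-840 comment) is not modeled.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every type here is a hypothetical stand-in, not a NiFi or AWS SDK class. */
final class PagedCommitSketch {

    /** One page of listing results plus a flag saying whether more pages remain. */
    static final class Page {
        final List<String> keys;
        final boolean truncated;

        Page(List<String> keys, boolean truncated) {
            this.keys = keys;
            this.truncated = truncated;
        }
    }

    /** Hypothetical paged lister: returns the page at the given index. */
    interface Lister {
        Page fetch(int pageIndex);
    }

    /** Hypothetical session: buffers emitted keys until commit() is called. */
    static final class Session {
        final List<String> pending = new ArrayList<>();
        final List<String> committed = new ArrayList<>();
        int commits = 0;

        void emit(String key) {
            pending.add(key);
        }

        void commit() {
            committed.addAll(pending);
            pending.clear();
            commits++;
        }
    }

    /** Emits and commits one page at a time; returns the largest uncommitted buffer seen. */
    static int listAndCommitPerPage(Lister lister, Session session) {
        int maxPending = 0;
        int pageIndex = 0;
        Page page;
        do {
            page = lister.fetch(pageIndex++);
            for (String key : page.keys) {
                session.emit(key);
            }
            maxPending = Math.max(maxPending, session.pending.size());
            session.commit(); // downstream sees this page before the next one is fetched
        } while (page.truncated);
        return maxPending;
    }

    /** Fake in-memory bucket that serves totalKeys keys in pages of pageSize. */
    static Lister fakeBucket(int totalKeys, int pageSize) {
        return pageIndex -> {
            int start = pageIndex * pageSize;
            int end = Math.min(start + pageSize, totalKeys);
            List<String> keys = new ArrayList<>();
            for (int i = start; i < end; i++) {
                keys.add("logs/object-" + i);
            }
            return new Page(keys, end < totalKeys);
        };
    }

    public static void main(String[] args) {
        Session session = new Session();
        int maxPending = listAndCommitPerPage(fakeBucket(2500, 1000), session);
        System.out.println(session.committed.size() + " keys, " + session.commits
                + " commits, max pending " + maxPending);
    }
}
```

In this sketch the uncommitted buffer never grows beyond one page, which is the property that matters for large buckets. A small bucket simply pays for one or two extra commit() calls.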

> ListS3 improvements: "Use versions" and "Commit mode"
> -----------------------------------------------------
>
>                 Key: NIFI-2631
>                 URL: https://issues.apache.org/jira/browse/NIFI-2631
>             Project: Apache NiFi
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Joseph Gresock
>            Assignee: Joseph Gresock
>            Priority: Minor
>             Fix For: 1.1.0, 0.8.0
>
>
> Our team needs to be able to list individual versions in S3.  We also ran 
> into a use case where a bucket with many objects (over 1 million in our case) 
> seemed to cause ListS3 to run forever.  The S3 list command finished in a few 
> minutes, but we believe it was taking a very long time for NiFi to commit all 
> the flow files at once.
> To handle this use case, we added a Commit Mode property to ListS3 that 
> allows you to specify that you want to commit "Per page" vs. "Once".  This has 
> proven to correctly emit the flow files as the S3 paging progresses.
> We also implemented support for S3 List Versions, which includes the 
> "s3.version" and "s3.isLatest" attributes if applicable.  The "s3.version" 
> attribute can in turn be used in the FetchS3 processor.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
