[ https://issues.apache.org/jira/browse/NIFI-4715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Kawamura updated NIFI-4715:
--------------------------------
    Attachment: ListS3_Duplication.xml

> ListS3 produces duplicates in frequently updated buckets
> --------------------------------------------------------
>
>                 Key: NIFI-4715
>                 URL: https://issues.apache.org/jira/browse/NIFI-4715
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.2.0, 1.3.0, 1.4.0
>         Environment: All
>            Reporter: Milan Das
>            Assignee: Koji Kawamura
>            Priority: Major
>         Attachments: List-S3-dup-issue.xml, ListS3_Duplication.xml, screenshot-1.png
>
>
> ListS3 state is implemented with a HashSet, which is not thread safe. When ListS3 operates in multi-threaded mode, it sometimes lists the same file from the S3 bucket more than once, as if the HashSet data were getting corrupted.
> currentKeys = new HashSet<>(); // needs a thread-safe implementation, e.g. currentKeys = ConcurrentHashMap.newKeySet();
> *{color:red}+Update+{color}*:
> This is not a HashSet issue.
> Root cause: a file is uploaded to S3 while a ListS3 run is in progress.
> In onTrigger, maxTimestamp is initialized to 0L, which causes the keys to be cleared by the code below.
> When the lastModified time of an S3 object is the same as the currentTimestamp for an already-listed key, the object should be skipped. Because the key has been cleared, the same file is loaded again.
> I think the fix should be to initialize maxTimestamp with currentTimestamp rather than 0L:
> {code}
> long maxTimestamp = currentTimestamp;
> {code}
> The following block is what clears the keys:
> {code:title=org.apache.nifi.processors.aws.s3.ListS3.java|borderStyle=solid}
> if (lastModified > maxTimestamp) {
>     maxTimestamp = lastModified;
>     currentKeys.clear();
>     getLogger().debug("clearing keys");
> }
> {code}
> Update: 01/03/2018
> There is one more flavor of the same defect.
> Suppose file1 is modified at 1514987611000 on S3 and currentTimestamp = 1514987311000 in state.
> 1. file1 will be picked up and the current state will be updated to currentTimestamp=1514987311000 (but the OS system time is 1514987611000).
> 2. Next cycle, for file2 with lastModified 1514987611000, the keys will be cleared because lastModified > maxTimestamp (=currentTimestamp=1514987311000). currentTimestamp will be saved as 1514987611000.
> 3. Next cycle, with currentTimestamp=1514987611000, "file1 modified at 1514987611000" will be picked up again because file1 is no longer in the keys.
> I think the solution is that currentTimestamp needs to be persisted as the current system timestamp.
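For illustration, below is a minimal, self-contained sketch of the listing/de-duplication logic described in this issue, with the proposed fix (seeding maxTimestamp from the persisted currentTimestamp) applied. It is not the actual ListS3 source: the ObjectSummary class and the listNewObjects() method are simplified stand-ins introduced only for this example, and the thread-safe key set follows the suggestion made earlier in the report.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the listing/de-duplication logic discussed in this issue.
// NOT the real ListS3 source; ObjectSummary and listNewObjects() are simplified
// stand-ins used only to demonstrate the proposed fix.
public class ListingStateSketch {

    // Simplified stand-in for an S3 object summary (key + lastModified millis).
    static final class ObjectSummary {
        final String key;
        final long lastModified;
        ObjectSummary(String key, long lastModified) {
            this.key = key;
            this.lastModified = lastModified;
        }
    }

    // Persisted listing state: the newest lastModified seen so far and the keys
    // listed at that timestamp. A concurrent set is used, as suggested in the
    // original report, in case the state is touched from multiple threads.
    private long currentTimestamp = 0L;
    private final Set<String> currentKeys = ConcurrentHashMap.newKeySet();

    // One onTrigger-style pass over a bucket listing; returns only objects that
    // have not been emitted before.
    List<ObjectSummary> listNewObjects(List<ObjectSummary> bucketListing) {
        // Proposed fix: start from the persisted timestamp instead of 0L, so a
        // listing containing only already-seen objects does not clear the keys.
        long maxTimestamp = currentTimestamp;
        List<ObjectSummary> newObjects = new ArrayList<>();

        for (ObjectSummary summary : bucketListing) {
            long lastModified = summary.lastModified;

            // Older than anything already emitted: skip.
            if (lastModified < currentTimestamp) {
                continue;
            }
            // Same timestamp as the persisted state and already listed: skip.
            if (lastModified == currentTimestamp && currentKeys.contains(summary.key)) {
                continue;
            }

            // Strictly newer object: advance the watermark and reset the key set.
            if (lastModified > maxTimestamp) {
                maxTimestamp = lastModified;
                currentKeys.clear();
            }
            currentKeys.add(summary.key);
            newObjects.add(summary);
        }

        currentTimestamp = maxTimestamp;
        return newObjects;
    }
}
{code}

With maxTimestamp seeded from the persisted currentTimestamp, a listing that contains only already-seen objects no longer clears currentKeys, so an object whose lastModified equals currentTimestamp is skipped by the equality check instead of being emitted again.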